Re^5: Nonrepeating characters in an RE

(As reference frame, for single 'neocene' template, parent code reports ~400 ms at my PC and 5.32 Perl)

I think the task relates to items classification in a list (characters in string in this case), but unfortunately I don't see how to efficiently do this in Perl, without overhead of regexes. Not even with PDL to the rescue -- I'm mentioning it because below there are (slow) PDL solution and (fast) another "array language" solution, but PDL seems to lack primitive i.e. built-in functions, or I didn't find them.

Simple transliteration when string (argument) and "searchlist" are the same, with "replacelist" being any agreed-upon sequence, would do. The fact that "searchlist" items won't be unique, i.e. classes will not be named/numbered consecutively (i.e. some of "replacelist" items will be skipped) doesn't matter if template is transliterated in the same manner:

no warnings 'misc';             # "Replacement list is longer..."
my $tpl = 'abccbccbddb';
my $tpl_fixed = eval qq( \$tpl =~ tr!$tpl!a-z!r );
for ( qw/Mississippi Mossossoppo Panama/ ) {
    say $tpl_fixed eq eval qq( tr!$_!a-z!r )
        ? 'yes'
        : 'no'
}
[download]

and it takes (ahem) ~7s for 'neocene' and '354984si.ngl' (too many string evals?)

With PDL, the closest thing which comes to mind is that "equality table" of template and matching words should be the same. E.g., for 'neocene' such equality table will be:

1 0 0 0 0 1 0
0 1 0 0 1 0 1
0 0 1 0 0 0 0
0 0 0 1 0 0 0
0 1 0 0 1 0 1
1 0 0 0 0 1 0
0 1 0 0 1 0 1
[download]

and it's relatively easy to build. Let's pack all words into huge 2D table of bytes (I think it's easy to adjust this solution for shorts, in case of UTF-16 encoded non-ASCII input), padding words with BAD values -- BADs are not equal to themselves and so we'll get correct equality tables, stored in 3D array.

use strict;
use warnings;
use feature 'say';
use Data::Dump 'dd';
use Time::HiRes 'time';
use PDL;
use List::Util 'reduce';

use constant DICTIONARY => './354984si.ngl';
use constant TEMPLATE   => 'neocene';

open my $fh, '<', DICTIONARY;
chomp( my @dictionary = <$fh> );
close $fh;

my $width = reduce 
    { $a > length $b ? $a : length $b } 0, @dictionary, TEMPLATE;
my $height = @dictionary + 1;

my $words = zeroes byte, $width, $height;
my $data = pack "(A$width)*", @dictionary, TEMPLATE;
${ $words-> get_dataref } = $data;
$words-> upd_data;

my $t = time;

$words-> inplace-> setvaltobad( 32 );
my $table = $words-> dummy( 0, $width ) 
             == 
            $words-> dummy( 1, $width );
$words-> inplace-> setbadtoval( 32 );
$table-> badflag( 0 );

my $mask = ( $table == $table-> slice( '', '', -1 ))
    -> clump( 2 )
    -> bandover
    -> slice([ 0, -2 ])
    -> dummy( 0, $width );

my @found = split ' ', 
    ${ $words-> where( $mask )-> get_dataref };

say time - $t;
dd \@found;

__END__

1.04205107688904
["kaitaka", "lauhala", "metreme", "neocene", "tempete"]
[download]

And that's too slow... (fun, though). Note, this time includes tables preparation for all words, and therefore it won't be so bad if there are many templates, not just one. Plus, to my surprise (they happen with PDL) crude profiling shows ~30% of time is spent on bandover call, which is strange.

On the other hand, in J REPL:

   i.~ 'Mississippi'
0 1 2 2 1 2 2 1 8 8 1
   (i.~~.) 'Mississippi'
0 1 2 2 1 2 2 1 3 3 1
[download]

2nd example is to demonstrate it's easy to have nice consecutive "class numbers", but it hinders performance somewhat and not worth any effort as I said. Then:

   # words =: 'b' freads 'path_to/354984si.ngl'
354984
   template =: 'neocene'
   ] pattern =: i.~ template
0 1 2 3 1 0 1
   find =: dyad def '(#~ (x & -:) @ i.~ @ >) y'
   pattern find words
+-------+-------+-------+-------+-------+
|kaitaka|lauhala|metreme|neocene|tempete|
+-------+-------+-------+-------+-------+
   25 (6!:2) 'pattern find words'
0.0499341
[download]

The last funny command gets sentence execution time averaged over 25 runs, which is, therefore, just ~50 ms. BTW, if template is not to be anchored (OP disappeared :)) and exact brief remains guesswork), implementing sliding window is not too difficult; as case is sometimes with "array language" it leads to huge amount of redundant work to fill rectangular arrays (in addition to redundancy which already was there), but even then execution remains faster (but with lower ratio) than with Perl's regex, even though re-engine aborts early to avoid useless work.

Comment on Re^5: Nonrepeating characters in an RE Select or Download Code


"be consistent"
	PerlMonks