Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Re^5: Nonrepeating characters in an RE

by vr (Curate)
on Aug 22, 2022 at 07:37 UTC ( [id://11146290]=note: print w/replies, xml ) Need Help??


in reply to Re^4: Nonrepeating characters in an RE
in thread Nonrepeating characters in an RE

(As reference frame, for single 'neocene' template, parent code reports ~400 ms at my PC and 5.32 Perl)

I think the task relates to items classification in a list (characters in string in this case), but unfortunately I don't see how to efficiently do this in Perl, without overhead of regexes. Not even with PDL to the rescue -- I'm mentioning it because below there are (slow) PDL solution and (fast) another "array language" solution, but PDL seems to lack primitive i.e. built-in functions, or I didn't find them.

Simple transliteration when string (argument) and "searchlist" are the same, with "replacelist" being any agreed-upon sequence, would do. The fact that "searchlist" items won't be unique, i.e. classes will not be named/numbered consecutively (i.e. some of "replacelist" items will be skipped) doesn't matter if template is transliterated in the same manner:

no warnings 'misc'; # "Replacement list is longer..." my $tpl = 'abccbccbddb'; my $tpl_fixed = eval qq( \$tpl =~ tr!$tpl!a-z!r ); for ( qw/Mississippi Mossossoppo Panama/ ) { say $tpl_fixed eq eval qq( tr!$_!a-z!r ) ? 'yes' : 'no' }

and it takes (ahem) ~7s for 'neocene' and '354984si.ngl' (too many string evals?)

With PDL, the closest thing which comes to mind is that "equality table" of template and matching words should be the same. E.g., for 'neocene' such equality table will be:

1 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 1 0 1

and it's relatively easy to build. Let's pack all words into huge 2D table of bytes (I think it's easy to adjust this solution for shorts, in case of UTF-16 encoded non-ASCII input), padding words with BAD values -- BADs are not equal to themselves and so we'll get correct equality tables, stored in 3D array.

use strict; use warnings; use feature 'say'; use Data::Dump 'dd'; use Time::HiRes 'time'; use PDL; use List::Util 'reduce'; use constant DICTIONARY => './354984si.ngl'; use constant TEMPLATE => 'neocene'; open my $fh, '<', DICTIONARY; chomp( my @dictionary = <$fh> ); close $fh; my $width = reduce { $a > length $b ? $a : length $b } 0, @dictionary, TEMPLATE; my $height = @dictionary + 1; my $words = zeroes byte, $width, $height; my $data = pack "(A$width)*", @dictionary, TEMPLATE; ${ $words-> get_dataref } = $data; $words-> upd_data; my $t = time; $words-> inplace-> setvaltobad( 32 ); my $table = $words-> dummy( 0, $width ) == $words-> dummy( 1, $width ); $words-> inplace-> setbadtoval( 32 ); $table-> badflag( 0 ); my $mask = ( $table == $table-> slice( '', '', -1 )) -> clump( 2 ) -> bandover -> slice([ 0, -2 ]) -> dummy( 0, $width ); my @found = split ' ', ${ $words-> where( $mask )-> get_dataref }; say time - $t; dd \@found; __END__ 1.04205107688904 ["kaitaka", "lauhala", "metreme", "neocene", "tempete"]

And that's too slow... (fun, though). Note, this time includes tables preparation for all words, and therefore it won't be so bad if there are many templates, not just one. Plus, to my surprise (they happen with PDL) crude profiling shows ~30% of time is spent on bandover call, which is strange.

On the other hand, in J REPL:

i.~ 'Mississippi' 0 1 2 2 1 2 2 1 8 8 1 (i.~~.) 'Mississippi' 0 1 2 2 1 2 2 1 3 3 1

2nd example is to demonstrate it's easy to have nice consecutive "class numbers", but it hinders performance somewhat and not worth any effort as I said. Then:

# words =: 'b' freads 'path_to/354984si.ngl' 354984 template =: 'neocene' ] pattern =: i.~ template 0 1 2 3 1 0 1 find =: dyad def '(#~ (x & -:) @ i.~ @ >) y' pattern find words +-------+-------+-------+-------+-------+ |kaitaka|lauhala|metreme|neocene|tempete| +-------+-------+-------+-------+-------+ 25 (6!:2) 'pattern find words' 0.0499341

The last funny command gets sentence execution time averaged over 25 runs, which is, therefore, just ~50 ms. BTW, if template is not to be anchored (OP disappeared :)) and exact brief remains guesswork), implementing sliding window is not too difficult; as case is sometimes with "array language" it leads to huge amount of redundant work to fill rectangular arrays (in addition to redundancy which already was there), but even then execution remains faster (but with lower ratio) than with Perl's regex, even though re-engine aborts early to avoid useless work.

Replies are listed 'Best First'.
Re^6: Nonrepeating characters in an RE
by rsFalse (Chaplain) on Oct 27, 2022 at 23:29 UTC
    I tried to compare binary strings which encode "equality tables", halves (triangles). I generate these strings through eval-group inside the regex:
    #!/usr/bin/perl -l use warnings; use strict; my $template = 'neocene'; my $equality_triangle_template = ''; $template =~ m/(.).*?(.)(?{ $equality_triangle_template .= $1 eq $2 ? +1 : 0 })(*FAIL)/; my @dict; @dict = <>; # <-- 354984si.ngl chomp @dict; my @good = grep { my $equality_triangle_word = ''; m/(.).*?(.)(?{ $equality_triangle_word .= $1 eq $2 ? 1 : 0 })(*FAI +L)/; $equality_triangle_word eq $equality_triangle_template; } grep length == length $template, @dict; print for @good; ^Z kaitaka lauhala metreme neocene tempete
    P.S.One can use (.).*(.) instead of (.).*?(.), omitting question mark - non-greediness modifier (possibly it is only slightly faster).

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11146290]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (2)
As of 2024-04-20 06:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found