Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Re^2: Nonrepeating characters in an RE

by LanX (Saint)
on Aug 17, 2022 at 19:59 UTC ( [id://11146202]=note: print w/replies, xml ) Need Help??


in reply to Re: Nonrepeating characters in an RE
in thread Nonrepeating characters in an RE

TIMTOInterpretation (and I would never call you stupid ;-)

But I like this one, so:

use v5.12; use warnings; my $re = templ2regex("Mississippi"); for my $inp (qw/Mississippi Mossossoppo Miiiiiiippi/) { if ( my @matches = ( $inp =~ /$re/ ) ) { say "Match: $inp re: $re" if is_uniq(@matches); } } # Mississippi -> (.)(.)(.)\3\2\3\3\2(.)\4\2 sub templ2regex { my ($template) = @_; my ( $re, %grp, $cnt ); $re .= $grp{$_} // do { $grp{$_} = "\\" . ++$cnt; "(.)" } for split //, $template; return $re; } # @_ elements all distinct? sub is_uniq { my %uniq; @uniq{@_} = (); return @_ == keys %uniq; }

=>

Match: Mississippi re: (.)(.)(.)\3\2\3\3\2(.)\4\2 Match: Mossossoppo re: (.)(.)(.)\3\2\3\3\2(.)\4\2

if you really want to have it all in one regex, consider putting the uniq test into a (?{...}) block which does a *FAIL

Both are not canonical regexes but IMHO far better to maintain.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

update

refactoring: better documentation and variable names

Replies are listed 'Best First'.
Re^3: Nonrepeating characters in an RE
by LanX (Saint) on Aug 18, 2022 at 18:30 UTC
    > if you really want to have it all in one regex, consider putting the uniq test into a (?{...}) block which does a *FAIL

    For completeness, here we go.

    But the need for use re 'eval'; surprised me, and might reduce the general usability.

    use v5.12; use warnings; use re 'eval'; my $re = templ2regex("Mississippi"); for my $inp (qw/Mississippi Mossossoppo Miiiiiiippi/) { if ( my @matches = ( $inp =~ /$re/ ) ) { say "Match: $inp re: $re"; } } # Mississippi -> (.)(.)(.)\3\2\3\3\2(.)\4\2 sub templ2regex { my ($template) = @_; my ( $re, %grp, $cnt ); # not sure if that's better readable than before $re = join "", map { $grp{$_} // do { $grp{$_} = "\\" . ++$cnt; "(.)" } } split //, $template; $re .= '(?(?{not is_uniq( @{^CAPTURE} ) }) (*FAIL) )'; return $re; } # @_ elements all distinct? sub is_uniq { my %uniq; @uniq{@_} = (); return @_ == keys %uniq; }

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      Here's a variation of your all-in-one regex code here that's designed to run under Perl version 5.8. (I think it'll even run under 5.6, but I can't test this.) It's been tested under Strawberries 5.8.9.5 (32-bit) and 5.30.3.1 (64-bit).

      I was quite pleasantly surprised by its speed. However, the code runs about three times more slowly under 5.30 than under 5.8. I assume this is due to the many modifications made to the regex engine over the years to accommodate Unicode. Any comment on this would be of interest. I added an optimization to avoid the "uniqueness" test if the generated template regex has only one capture group. This is of trivial benefit, but was such low-hanging fruit that I thought it would be a shame not to enjoy it.

      Win8 Strawberry 5.8.9.5 (32) Sat 08/20/2022 10:53:36 C:\@Work\Perl\monks >perl use 5.008; # i think code should work with 5.6, but this is untested use strict; use warnings; use autodie; use List::MoreUtils qw(uniq); use Data::Dump qw(dd pp); # for debug # word-per-line dictionary. use constant DICTIONARY => 'C:/@Work/moby/mwords/354984si.ngl'; open my $fh, '<', DICTIONARY; chomp(my @dictionary = <$fh>); # remove all newlines close $fh; printf "%d words in dictionary \n", scalar @dictionary; my @templates = ('', ' ', qw( neocene abcdbab 0123101 &*?+*&* Mississippi abccbccbddb 0000 )); printf "%d scan templates \n", scalar @templates; my $start = time; my $template_scans; TEMPLATE: for my $template (@templates) { my $rx_template = templ2regex($template); $rx_template = qr{ \A $rx_template \z }xms; # add appropriate bou +nds # print "template '$template' $rx_template: \n"; <STDIN>; # for de +bug print "template '$template': \n"; my $count = 0; WORD: for my $word (@dictionary) { chomp $word; next WORD unless $word =~ $rx_template; print " '$word' \n"; ++$count; } # end while WORD print " ($count words) \n"; ++$template_scans; } # end for TEMPLATE printf "%d template scans. total time for all scans: %d secs. \n", $template_scans, time - $start; sub templ2regex { my ($template, ) = @_; # empty template never matches. # (what is proper match behavior of empty template?) return qr{ (?!) }xms if $template eq ''; my %t2g; # map template char => capture group number my $n_cg = 1; # number of current capture group my $rs = # regex string join ' ', map $t2g{$_} ? "\\$t2g{$_}" : scalar($t2g{$_} = $n_cg++, '(. +)'), split '', $template ; --$n_cg; # undo last post-increment: now true highest capture gro +up use re 'eval'; # print "=== \$template '$template' \$rs $rs \n"; <STDIN>; # for +debug return qr{ $rs }xms if $n_cg < 2; # no uniqueness test needed my $cap_vars = join ', ', map "\$$_", 1 .. $n_cg; # capture vars + active $rs .= "\n(?(?{ $n_cg != uniq $cap_vars }) (?!))"; # add uniquene +ss test # print "=== \$template '$template' \$rs $rs \n"; <STDIN>; # for +debug return qr{ $rs }xms; } # end sub templ2regex() ^Z 354984 words in dictionary 9 scan templates template '': (0 words) template ' ': '2' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z' (27 words) template 'neocene': 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'abcdbab': 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '0123101': 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '&*?+*&*': 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'Mississippi': 'mississippi' (1 words) template 'abccbccbddb': 'mississippi' (1 words) template '0000': 'mmmm' 'oooo' (2 words) 9 template scans. total time for all scans: 4 secs.

      Update: Fixed dropped word/too many words in "... pleasantly surprised its speed. I notice, however, that the code ..." in second paragraph.


      Give a man a fish:  <%-{-{-{-<

        I was quite pleasantly surprised its speed. I notice, however, that the code runs about three times more slowly under 5.30 than under 5.8. I assume this is due to the many modifications made to the regex engine over the years to accommodate Unicode. Any comment on this would be of interest.

        3x slower seems an awful lot. Are both perls built with the same options?

        Unicode support could be part of it: it would be worth redoing the timings under 5.30 supplying either /a or /aa as a flag on the regexp. However for these patterns (ASCII input and pattern, no use of \w-style classes) I'd expect the cost between versions to be low, and the benefit of the /a flags to be zero.

        There have also been numerous small changes added in recent years as a result of bugs (mostly found by fuzzers) that had the potential to be security holes. Such changes almost always make things a tiny bit slower, and as you accumulate more and more of them they add up. So it might also be informative to run the regexp tests from 5.30 under 5.8 (which might need some adaptation) to see a sample of the bugs 5.30 doesn't have.

        If you're doing timings, I'd also be interested how my code from 11146164 compares - I'd expect it to win a lot by avoiding the embedded code block, and give back a fraction of that by doing more backtracking. (You should add at least the /s flag to my qr{} for proper comparison.)

        # (what is proper match behavior of empty template?)

        Since we're assuming implicit anchors, I think it should just match the empty string. That wasn't a case I attempted to handle though, and I doubt the OP cares about it.

        > Here's a variation of your all-in-one regex

        > However, the code runs about three times more slowly under 5.30 than under 5.8.

        Please benchmark the older conservative inside-out version too.

        I have a hunch the delay is due to security checks connected to use re 'eval'

        So not embedding the code could avoid these problems.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

        (As reference frame, for single 'neocene' template, parent code reports ~400 ms at my PC and 5.32 Perl)

        I think the task relates to items classification in a list (characters in string in this case), but unfortunately I don't see how to efficiently do this in Perl, without overhead of regexes. Not even with PDL to the rescue -- I'm mentioning it because below there are (slow) PDL solution and (fast) another "array language" solution, but PDL seems to lack primitive i.e. built-in functions, or I didn't find them.

        Simple transliteration when string (argument) and "searchlist" are the same, with "replacelist" being any agreed-upon sequence, would do. The fact that "searchlist" items won't be unique, i.e. classes will not be named/numbered consecutively (i.e. some of "replacelist" items will be skipped) doesn't matter if template is transliterated in the same manner:

        no warnings 'misc'; # "Replacement list is longer..." my $tpl = 'abccbccbddb'; my $tpl_fixed = eval qq( \$tpl =~ tr!$tpl!a-z!r ); for ( qw/Mississippi Mossossoppo Panama/ ) { say $tpl_fixed eq eval qq( tr!$_!a-z!r ) ? 'yes' : 'no' }

        and it takes (ahem) ~7s for 'neocene' and '354984si.ngl' (too many string evals?)

        With PDL, the closest thing which comes to mind is that "equality table" of template and matching words should be the same. E.g., for 'neocene' such equality table will be:

        1 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 1 0 1

        and it's relatively easy to build. Let's pack all words into huge 2D table of bytes (I think it's easy to adjust this solution for shorts, in case of UTF-16 encoded non-ASCII input), padding words with BAD values -- BADs are not equal to themselves and so we'll get correct equality tables, stored in 3D array.

        use strict; use warnings; use feature 'say'; use Data::Dump 'dd'; use Time::HiRes 'time'; use PDL; use List::Util 'reduce'; use constant DICTIONARY => './354984si.ngl'; use constant TEMPLATE => 'neocene'; open my $fh, '<', DICTIONARY; chomp( my @dictionary = <$fh> ); close $fh; my $width = reduce { $a > length $b ? $a : length $b } 0, @dictionary, TEMPLATE; my $height = @dictionary + 1; my $words = zeroes byte, $width, $height; my $data = pack "(A$width)*", @dictionary, TEMPLATE; ${ $words-> get_dataref } = $data; $words-> upd_data; my $t = time; $words-> inplace-> setvaltobad( 32 ); my $table = $words-> dummy( 0, $width ) == $words-> dummy( 1, $width ); $words-> inplace-> setbadtoval( 32 ); $table-> badflag( 0 ); my $mask = ( $table == $table-> slice( '', '', -1 )) -> clump( 2 ) -> bandover -> slice([ 0, -2 ]) -> dummy( 0, $width ); my @found = split ' ', ${ $words-> where( $mask )-> get_dataref }; say time - $t; dd \@found; __END__ 1.04205107688904 ["kaitaka", "lauhala", "metreme", "neocene", "tempete"]

        And that's too slow... (fun, though). Note, this time includes tables preparation for all words, and therefore it won't be so bad if there are many templates, not just one. Plus, to my surprise (they happen with PDL) crude profiling shows ~30% of time is spent on bandover call, which is strange.

        On the other hand, in J REPL:

        i.~ 'Mississippi' 0 1 2 2 1 2 2 1 8 8 1 (i.~~.) 'Mississippi' 0 1 2 2 1 2 2 1 3 3 1

        2nd example is to demonstrate it's easy to have nice consecutive "class numbers", but it hinders performance somewhat and not worth any effort as I said. Then:

        # words =: 'b' freads 'path_to/354984si.ngl' 354984 template =: 'neocene' ] pattern =: i.~ template 0 1 2 3 1 0 1 find =: dyad def '(#~ (x & -:) @ i.~ @ >) y' pattern find words +-------+-------+-------+-------+-------+ |kaitaka|lauhala|metreme|neocene|tempete| +-------+-------+-------+-------+-------+ 25 (6!:2) 'pattern find words' 0.0499341

        The last funny command gets sentence execution time averaged over 25 runs, which is, therefore, just ~50 ms. BTW, if template is not to be anchored (OP disappeared :)) and exact brief remains guesswork), implementing sliding window is not too difficult; as case is sometimes with "array language" it leads to huge amount of redundant work to fill rectangular arrays (in addition to redundancy which already was there), but even then execution remains faster (but with lower ratio) than with Perl's regex, even though re-engine aborts early to avoid useless work.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11146202]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (5)
As of 2024-04-19 15:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found