Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Re^2: Nonrepeating characters in an RE (updated)

by AnomalousMonk (Archbishop)
on Aug 17, 2022 at 18:48 UTC ( [id://11146199]=note: print w/replies, xml ) Need Help??


in reply to Re: Nonrepeating characters in an RE
in thread Nonrepeating characters in an RE

An interesting statement of the problem. Here's a non-regex approach. (I think a regex approach would be so complex as to be more trouble than it's worth. (Update: No: LanX's regex solution here is IMHO quite simple and maintainable. It's also much faster!))

Withdrawn code. Nothing really wrong with it. Just don't like it.

Win8 Strawberry 5.8.9.5 (32) Wed 08/17/2022 14:20:47 C:\@Work\Perl\monks >perl use strict; use warnings; use autodie; use Data::Dump qw(dd pp); # word-per-line dictionary. use constant DICTIONARY => 'C:/@Work/moby/mwords/354984si.ngl'; use constant TOKENS => pack('C*', 0 .. 255); open my $fh, '<', DICTIONARY; TEMPLATE: for my $template (qw(neocene abcdbab Mississippi abccbccbddb)) { my $et = essence($template); printf "template '$template' (tokenized %s): \n", pp $et; my $count = 0; WORD: while (my $word = <$fh>) { chomp $word; my $ew = essence($word); next WORD unless $et eq $ew; print " '$word' \n"; ++$count; } # end while WORD printf " (%d words) \n", $count; seek $fh, 0, 0; # for next dictionary pass } # end for TEMPLATE close $fh; sub essence { my ($string, ) = @_; my $essence; my $t = 0; my %seen; $essence .= $seen{$_} ? $seen{$_} : ($seen{$_} = substr TOKENS, $t +++, 1) for split '', $string; return $essence; } ^Z template 'neocene' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'abcdbab' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'Mississippi' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"): 'mississippi' (1 words) template 'abccbccbddb' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"): 'mississippi' (1 words)

Update: Hmmm... On second thought...
In the essence() function, the statement
    $essence .= $seen{$_} ? ... : ... for split '', $string;
is buggy. If the tokens string TOKENS contains the equivalent of a '0' character which is ever actually used as a token, $seen{$_} will be evaluated as false if the same character is seen again and tokenization will be incorrect. I think a proper statement would be
    $essence .= exists $seen{$_} ? ... : ... for split '', $string;
but I haven't tested this.

Update: Niftier code. Avoids use of substr. Probably faster.

Win8 Strawberry 5.8.9.5 (32) Wed 08/17/2022 18:25:54 C:\@Work\Perl\monks >perl use strict; use warnings; use autodie; use Data::Dump qw(dd pp); # word-per-line dictionary. use constant DICTIONARY => 'C:/@Work/moby/mwords/354984si.ngl'; open my $fh, '<', DICTIONARY; chomp(my @dictionary = <$fh>); # remove all newlines close $fh; printf "%d words in dictionary \n", scalar @dictionary; my @templates = qw( neocene abcdbab 0123101 &*?+*&* Mississippi abccbccbddb 0000 ); printf "%d scan templates \n", scalar @templates; my $start = time; my $template_scans; TEMPLATE: for my $template (@templates) { my $tt = tokenize($template); printf "template '$template' (tokenized %s): \n", pp $tt; my $count = 0; WORD: for my $word (@dictionary) { chomp $word; my $tw = tokenize($word); next WORD unless $tt eq $tw; print " '$word' \n"; ++$count; } # end while WORD print " ($count words) \n"; ++$template_scans; } # end for TEMPLATE printf "%d template scans. total time for all scans: %d secs. \n", $template_scans, time - $start; INIT { # begin pre-initialized closure for tokenize() # @tokens array must not contain any false char. my @tokens = grep $_, map chr, 0 .. 0xff; # should be enough :) sub tokenize { my ($string, ) = @_; use warnings FATAL => qw(uninitialized); # guard $t out of range my $t = 0; # must not access beyond @tokens array my $toks; my %seen; $toks .= $seen{$_} ||= $tokens[$t++] for split '', $string; return $toks; # # also works. a bit slower, probably because it # # uses 2 intermediate arrays instead of 1. # my %seen; # return join '', map $seen{$_} ||= $tokens[$t++], split '', $string +; } } # end closure for tokenize() ^Z 354984 words in dictionary 7 scan templates template 'neocene' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'abcdbab' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '0123101' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '&*?+*&*' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'Mississippi' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"): 'mississippi' (1 words) template 'abccbccbddb' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"): 'mississippi' (1 words) template '0000' (tokenized "\0\0\0\0"): 'mmmm' 'oooo' (2 words) 7 template scans. total time for all scans: 68 secs.
Also run under Strawberry 5.30.3.1. (Runs about 16% faster under 5.30 for some reason.)

(Kaitaka, lauhala and etcetera? They all seem to be real words, although not all English.)

WRT the template "Mississippi": if a lower-case "m" had also been present in this template, it would have been treated as a separate character token. Maybe a little more template pre-processing is needed.


Give a man a fish:  <%-{-{-{-<

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11146199]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (4)
As of 2024-04-16 05:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found