Re^2: Nonrepeating characters in an RE (updated)

An interesting statement of the problem. Here's a non-regex approach. (I think a regex approach would be so complex as to be more trouble than it's worth. (Update: No: LanX's regex solution here is IMHO quite simple and maintainable. It's also much faster!))

Withdrawn code. Nothing really wrong with it. Just don't like it.

Win8 Strawberry 5.8.9.5 (32)  Wed 08/17/2022 14:20:47
C:\@Work\Perl\monks
>perl

use strict;
use warnings;

use autodie;

use Data::Dump qw(dd pp);

# word-per-line dictionary.
use constant DICTIONARY => 'C:/@Work/moby/mwords/354984si.ngl';

use constant TOKENS => pack('C*', 0 .. 255);

open my $fh, '<', DICTIONARY;

TEMPLATE:
for my $template (qw(neocene abcdbab Mississippi abccbccbddb)) {

    my $et = essence($template);
    printf "template '$template' (tokenized %s): \n", pp $et;

    my $count = 0;
    WORD:
    while (my $word = <$fh>) {
        chomp $word;
        my $ew = essence($word);
        next WORD unless $et eq $ew;
        print "  '$word' \n";
        ++$count;
        }  # end while WORD


    printf "  (%d words) \n", $count;

    seek $fh, 0, 0;  # for next dictionary pass

    }  # end for TEMPLATE

close $fh;

sub essence {

    my ($string,
        ) = @_;

    my $essence;
    my $t = 0;
    my %seen;
    $essence .= $seen{$_} ? $seen{$_} : ($seen{$_} = substr TOKENS, $t
+++, 1)
        for split '', $string;

    return $essence;

    }

^Z
template 'neocene' (tokenized "\0\1\2\3\1\0\1"):
  'kaitaka'
  'lauhala'
  'metreme'
  'neocene'
  'tempete'
  (5 words)
template 'abcdbab' (tokenized "\0\1\2\3\1\0\1"):
  'kaitaka'
  'lauhala'
  'metreme'
  'neocene'
  'tempete'
  (5 words)
template 'Mississippi' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"):
  'mississippi'
  (1 words)
template 'abccbccbddb' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"):
  'mississippi'
  (1 words)
[download]

Update: Hmmm... On second thought...
In the essence() function, the statement
$essence .= $seen{$_} ? ... : ... for split '', $string;
is buggy. If the tokens string TOKENS contains the equivalent of a '0' character which is ever actually used as a token, $seen{$_} will be evaluated as false if the same character is seen again and tokenization will be incorrect. I think a proper statement would be
$essence .= exists $seen{$_} ? ... : ... for split '', $string;
but I haven't tested this.

Update: Niftier code. Avoids use of substr. Probably faster.

Win8 Strawberry 5.8.9.5 (32)  Wed 08/17/2022 18:25:54
C:\@Work\Perl\monks
>perl

use strict;
use warnings;

use autodie;

use Data::Dump qw(dd pp);

# word-per-line dictionary.
use constant DICTIONARY => 'C:/@Work/moby/mwords/354984si.ngl';

open my $fh, '<', DICTIONARY;
chomp(my @dictionary = <$fh>);  # remove all newlines
close $fh;

printf "%d words in dictionary \n", scalar @dictionary;

my @templates = qw(
    neocene abcdbab 0123101 &*?+*&* Mississippi abccbccbddb 0000
    );

printf "%d scan templates \n", scalar @templates;

my $start = time;
my $template_scans;

TEMPLATE:
for my $template (@templates) {

    my $tt = tokenize($template);
    printf "template '$template' (tokenized %s): \n", pp $tt;

    my $count = 0;
    WORD:
    for my $word (@dictionary) {
        chomp $word;
        my $tw = tokenize($word);
        next WORD unless $tt eq $tw;
        print "  '$word' \n";
        ++$count;
        }  # end while WORD

    print "  ($count words) \n";

    ++$template_scans;

    }  # end for TEMPLATE

printf "%d template scans. total time for all scans: %d secs. \n",
        $template_scans, time - $start;

INIT {  # begin pre-initialized closure for tokenize()

# @tokens array must not contain any false char.
my @tokens = grep $_, map chr, 0 .. 0xff;  # should be enough :)

sub tokenize {

    my ($string,
        ) = @_;

    use warnings FATAL => qw(uninitialized);  # guard $t out of range
    my $t = 0;  # must not access beyond @tokens array

    my $toks;
    my %seen;
    $toks .= $seen{$_} ||= $tokens[$t++] for split '', $string;

    return $toks;

  # # also works.  a bit slower, probably because it
  # # uses 2 intermediate arrays instead of 1.
  # my %seen;
  # return join '', map $seen{$_} ||= $tokens[$t++], split '', $string
+;

    }

}  # end closure for tokenize()

^Z
354984 words in dictionary
7 scan templates
template 'neocene' (tokenized "\0\1\2\3\1\0\1"):
  'kaitaka'
  'lauhala'
  'metreme'
  'neocene'
  'tempete'
  (5 words)
template 'abcdbab' (tokenized "\0\1\2\3\1\0\1"):
  'kaitaka'
  'lauhala'
  'metreme'
  'neocene'
  'tempete'
  (5 words)
template '0123101' (tokenized "\0\1\2\3\1\0\1"):
  'kaitaka'
  'lauhala'
  'metreme'
  'neocene'
  'tempete'
  (5 words)
template '&*?+*&*' (tokenized "\0\1\2\3\1\0\1"):
  'kaitaka'
  'lauhala'
  'metreme'
  'neocene'
  'tempete'
  (5 words)
template 'Mississippi' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"):
  'mississippi'
  (1 words)
template 'abccbccbddb' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"):
  'mississippi'
  (1 words)
template '0000' (tokenized "\0\0\0\0"):
  'mmmm'
  'oooo'
  (2 words)
7 template scans. total time for all scans: 68 secs.
[download]

Also run under Strawberry 5.30.3.1. (Runs about 16% faster under 5.30 for some reason.)

(Kaitaka, lauhala and etcetera? They all seem to be real words, although not all English.)

WRT the template "Mississippi": if a lower-case "m" had also been present in this template, it would have been treated as a separate character token. Maybe a little more template pre-processing is needed.

Give a man a fish: <%-{-{-{-<

Comment on Re^2: Nonrepeating characters in an RE (updated) Select or Download Code


Perl Monk, Perl Meditation
	PerlMonks