Using a hash reference in a regular expression

johngg has asked for the wisdom of the Perl Monks concerning the following question:

The responses to s_guarav1091's question, matching a regular expression, set me off on a search for a more general solution to validating and disambiguating words; I have probably re-invented the wheel again but it was fun. What I wanted to achieve was a way to generate a regular expression from a list of words or choices and a way of pulling out the completed word or choice given an unambiguous string to match. I got something working within a single script so then had a go at making the solution re-usable by putting the code in a module.

The difficulty then was how to generate the completed word or choice from information that had been constructed in the module when the matching was being done back in the calling script. I thought that it might be possible to take a reference to a hash from within the regular expression so that it still had access to the hash data when control had returned to the caller. This worked like a charm but, looking at the regular expression returned, I'm not sure how. Here is the module

# Disambiguate.pm
#
# Package used to validate a response against a list of choices
# such that an unambiguous but incomplete string will be completed
# but ambiguous or incorrect strings will fail. E.g. the choices
# of "clear" and "close" would mean that a response of "c" or "cl"
# is ambiguous but "cle", "clea" or "clear" all unambiguously mean
# "clear".
#

#       ============
package Disambiguate;
#       ============

# Turn on strictures and warnings.
#
use strict;
use warnings;

# Use Carp and Exporter. There is just the one subroutine to export.
#
use Carp;
use Exporter;
our @ISA = qw(Exporter);
our @EXPORT = qw(disambiguate);

# Code reference scoped to the package that generates a regular
# expression that will match any number of characters of a
# choice from the minimum unambiguous string upwards.
#
#   --------------
my $rcRegexForThis = sub
#   --------------
{
    # Get list reference of choice fragments, e.g. for "clear"
    # above they would be "cle", "a" and "r". Initialise regular
    # expression string.
    #
    my $rlFragments = shift;
    my $regexStr = q();

    # While there is more than one fragment left, "pop" off the
    # rightmost and place it and any previously popped-off and
    # processed fragments in a non-capturing group with a quantifier
    # of "?" (0 or 1 of). Thus first time through we get "(?:r)?" and
    # the next we get "(?:a(?:r)?)?".
    #
    while (scalar @$rlFragments > 1)
    {
        my $fragment = pop @$rlFragments;
    $regexStr = '(?:' . $fragment . $regexStr . ')?'
    }

    # Prepend the last (leftmost) fragment, which is the minimum
    # unambiguous one, to the regular expression, giving in our
    # example "cle(?:a(?:r)?)?". Return this string.
    #
    $regexStr = $rlFragments->[0] . $regexStr;
    return $regexStr;
};

# Exported subroutine that takes as arguments: a reference to the
# scalar in the calling script that will receive the disambiguated
# choice if the match is successful; a list of choices used to
# generate and compile a regular expression against which a choice
# will be validated. The subroutine returns the compiled regular
# expression to do the validation match.
#
#   ------------
sub disambiguate
#   ------------
{
    # Get scalar reference and list of choices. Validate.
    #
    my($rsChoice, @choices) = @_;
    croak "disambiguate(): arg. 1 not a SCALAR reference" unless
       ref $rsChoice eq "SCALAR";
    croak "disambiguate(): no choices supplied" unless
       scalar @choices;
    my %seen = ();
    foreach my $choice (@choices)
    {
        croak
           "disambiguate(): duplicate choice \"$choice\" found" if
           $seen{$choice} ++;
    }

    # Initialise a hash for choice fragments, to be keyed by choice,
    # e.g. "cle", "a" and "r" for the choice "clear" as above. And
    # also one for unambiguous stubs, e.g. "cle", "clea" and "clear"
    # where there will be elements for each, all with a value of
    # "clear". Initialise a list that will hold a regular expression
    # string for each choice.
    #
    my %fragments = ();
    my %unambiguities = ();
    my @choiceRegexen = ();

    # Iterate over the choices, split'ing into characters.
    #
    foreach my $choice (@choices)
    {
        my @chars = split //, $choice;

        # Initialise scalar to hold shortest unambiguous string for
    # this choice. Add a character at a time in a loop, grep'ing
    # the choices that match the possibly unambiguous string; if
    # you get more than one match, it is ambiguous.
    #
        my $unambiguous = q();
        while (@chars)
        {
            $unambiguous .= shift @chars;
            last unless 1 < grep {m/^$unambiguous/} @choices;
        }

        # We now have an unambiguous string (which could be the
    # entire choice) so push it and any remaining characters
    # onto the "fragments" hash for this choice. Update the
    # "unambiguities" hash for each unambiguous string up to
        # the full choice.
    #
        push @{$fragments{$choice}}, $unambiguous, @chars;
        $unambiguities{$unambiguous} = $choice;
        while (@chars)
        {
            $unambiguous .= shift @chars;
            $unambiguities{$unambiguous} = $choice;
        }

        # Call &$rcRegexForThis passing the fragments by reference
        # and push the resulting regular expression string onto
    # the list.
        #
    push @choiceRegexen,
       $rcRegexForThis->($fragments{$choice});
    }

    # We have now processed all of the choices so we can begin to
    # construct the compiled regular expression that will be
    # returned to the calling script. First build up the text.
    #
    # The first part is a code block that takes a reference to
    # the %unambiguities hash. Next comes anchor to the start
    # of the string and open a memory group. Then joining all of
    # the individual regular expressions for each choice with the
    # "|" symbol creates an alternation between the choices. After
    # that we close the memory group and anchor to end of string.
    # Finally, if the match was against an unambiguous string and
    # was successful, execute the code block to assign the full
    # name of the choice to the dereferenced scalar supplied in
    # the calling script by looking up what we matched ($1) in
    # the hash reference ($^R) from the result of the last code
    # block, a reference to %unambiguities.
    #
    my $regexText = '(?{\%unambiguities})';
    $regexText .= '^(';
    $regexText .= join '|', @choiceRegexen;
    $regexText .= ')$';
    $regexText .= '(?{$$rsChoice = $^R->{$1}})';

    # Declare and, use'ing re 'eval', compile the constructed
    # regular expression and return it to the calling script.
    #
    my $rxDisambiguate;
    {
        use re q(eval);
    $rxDisambiguate = qr{$regexText};
    }
    return $rxDisambiguate;
}

1;
[download]

and here is a script illustrating the module's use

# Turn on strictures and warnings. Get the module to sort
# out ambiguous answers.
#
use strict;
use warnings;

use Disambiguate;

# Initialise a list of choices, declare a scalar that will
# receive the disambiguated and completed choice.
#
my @choices = qw(
   alias
   allow
   appear
   apply
   begin
   clean
   clear
   clone
   close
   compare);
my $youChose;

# Call disambiguate() which is exported by Disambiguate.pm to
# generate a compiled regular expression to validate choices.
# Print it out just to see what is generated.
#
my $rxDisambiguate = disambiguate(\$youChose, @choices);
print "\nCompiled regular expression is\n\n$rxDisambiguate\n";

# Test choices in an infinite loop, Ctrl-D to exit. Use Ctrl-Z
# on Windows?
#
while(1)
{
    # Prompt user, drop out of loop if Ctrl-D entered. Chomp
    # answer given.
    #
    print
       "\nChoose from following, partial unambiguous entry OK\n",
       "   @choices\n",
       "Please choose (or Ctrl-D to exit) ? ";
    last if eof STDIN;
    chomp(my $ans = <STDIN>);

    # Does answer match regular expression returned by disambiguate()
    # above. If it does, completed choice was placed in $youChose by
    # the regular expression.
    #
    if($ans =~ $rxDisambiguate)
    {
        print "You chose: $youChose\n";
    }

    # It didn't match. Grep from @choices those that start with $ans.
    # If more than one grep'ed then $ans is ambiguous; if none then
    # $ans was not recognised. If only one was grep'ed then something
    # has gone badly wrong as the match above should have succeeded.
    #
    else
    {
        my @couldBe = grep {m/^$ans/} @choices;
        if(scalar @couldBe > 1)
        {
            print "\"$ans\" is ambiguous, ",
           "could be any one of - @couldBe\a\n";
        }
        elsif(scalar @couldBe == 0)
        {
            print "\"$ans\" not recognised\a\n";
        }
        else
        {
            die "It should not be possible to get here\n";
        }
    }
}
print "     \n";
[download]

Running the script gives output like this

Compiled regular expression is

(?-xism:(?{\%unambiguities})^(ali(?:a(?:s)?)?|all(?:o(?:w)?)?|appe(?:a
+(?:r)?)?|appl(?:y)?|b(?:e(?:g(?:i(?:n)?)?)?)?|clean|clear|clon(?:e)?|
+clos(?:e)?|co(?:m(?:p(?:a(?:r(?:e)?)?)?)?)?)$(?{$$rsChoice = $^R->{$1
+}}))

Choose from following, partial unambiguous entry OK
   alias allow appear apply begin clean clear clone close compare
Please choose (or Ctrl-D to exit) ? a
"a" is ambiguous, could be any one of - alias allow appear apply

Choose from following, partial unambiguous entry OK
   alias allow appear apply begin clean clear clone close compare
Please choose (or Ctrl-D to exit) ? al
"al" is ambiguous, could be any one of - alias allow

Choose from following, partial unambiguous entry OK
   alias allow appear apply begin clean clear clone close compare
Please choose (or Ctrl-D to exit) ? ali
You chose: alias

Choose from following, partial unambiguous entry OK
   alias allow appear apply begin clean clear clone close compare
Please choose (or Ctrl-D to exit) ? b
You chose: begin

Choose from following, partial unambiguous entry OK
   alias allow appear apply begin clean clear clone close compare
Please choose (or Ctrl-D to exit) ? d
"d" not recognised

Choose from following, partial unambiguous entry OK
   alias allow appear apply begin clean clear clone close compare
Please choose (or Ctrl-D to exit) ? ^D
[download]

The thing that is puzzling me is the compiled regular expression returned to the calling script. It starts (?-xism:(?{\%unambiguities}) ... but how does the calling script know anything about %unambiguities which was lexically scoped in the module? I expected to see something like HASH(0x8170c28) there instead. Would any Monks be able to throw some light on this.

Cheers,

JohnGG

Comment on Using a hash reference in a regular expression Select or Download Code


Pathologically Eclectic Rubbish Lister
	PerlMonks