Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

regex problem

by Anonymous Monk
on Feb 09, 2002 at 20:18 UTC ( [id://144388]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Greetings. Hopefully this is not too off-topic.

I have to automatically create regexes that do _not_ match certain words.
The problem is, I cannot use a construct like:

$_ !~ m/(foo|bar)/
for the application does something like:
$_ =~ m/$pattern/
and the only thing I can manipulate is $pattern.

So my question is: Is there an algorithm that can construct these regexes from a given wordlist?
Is that possible and if yes, how expensive would the regex be(performancewise) ?

Thanks for your thoughts!

Replies are listed 'Best First'.
Word Exclusion Regex (was Re: regex problem)
by japhy (Canon) on Feb 09, 2002 at 21:46 UTC
    Well, a general principle is this:
    # $re = exclude(@words); sub exclude { my %words; push @{ $words{ quotemeta substr($_, 0, 1) } }, quotemeta substr($_, 1) for @_; my $first = "[^@{[ join '', keys %words ]}]*"; my $rest = join "|", map "$_(?!" . join("|", @{ $words{$_} }) . ")", keys %words; return qr/^$first(?:(?:$rest)$first)*$/; } my $re = exclude(qw( this that those )); # print $re; # for debugging purposes for ("I like this", "give me that one", "these rock!") { print "$_ => " . /$re/; }

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      As impressive as this is (and I haven't got it entirely figured out yet) there are a couple bugs. $first contain extra spaces when the group includes words that start with different letters... localize $" or just do a boring join to fix that. Also, words with multiple occurances of the first letter ('aabc' instead of 'abc') get excluded even when they shouldn't.

      The following output shows several incorrect cases using an exclude list of qw(dog cat pig):

      (?-xism:^[^p c d]*(?:(?:p(?!ig)|c(?!at)|d(?!og))[^p c d]*)*$) dog => cat => pig => owl => 1 ddog => ccat => ppig => pdog => pcat => elephant => ppppcatgggg =>

      -Blake

        Oops, the original version used join() when creating $first. I don't know why I changed it. As for the other complaint, the regex is designed to ensure the words don't appear at all. If you only wanted a regex that didn't match a string that is a set of words, it would look much simpler: /^(?!(?:cat|dog|pig)$)/. That's not what I was going for.

        _____________________________________________________
        Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
        s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: regex problem
by Juerd (Abbot) on Feb 09, 2002 at 21:24 UTC
    Use a negative look-ahead assertion, see perlre for information about that.

    $pattern = '(?!.*foo)';


    It is, however, very expensive. If changing the code is possible, do so.

    2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$

      Yikes! /(?!.*foo)/ matches ALL strings! I'm not sure this pattern is meaningful. /bar(?!foo)/ matches strings containing 'bar' that aren't followed by 'foo'. The problem is that matching a negative pattern is a subtle problem. There's a reason for !~. The person who posed this question is S.O.L. The application needs to branch to handle negative patterns. That is the easiest and most readable solution. The other thing to consider is how this pattern is formed at all. If the poster is getting these patterns off the command line or through CGI, it's unwise to directly run that pattern. Remember that Perl regexes can (now) execute arbitrary Perl code. For instance:
         # don't run this
         $pattern = '(?{`rm -rf *`})'; 
         $str =~ /$pattern/
      
      The beauty here is that no match is required for the perl code to be run.

      What's my point? The poster needs better control over incoming patterns anyway, so adding a little branching logic for negative matches shouldn't be burdensome.

        Your regex there is safe, unless you've turned on use re 'eval'. Perl stops you from executing regexes with evaluations in them from variables.

        _____________________________________________________
        Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who could use a job
        s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: regex problem
by djantzen (Priest) on Feb 09, 2002 at 21:22 UTC

    Well, there are two differences between your examples -- the use of variables and the switch from !~ to =~. Is the latter a typo? If not, then you can accomplish the same thing by using [^$pattern]. In any case, your real question is whether regexes can be constructed, and the answer is yes. It's simply a matter of writing it as a normal string, and using perlfunc:eval to compile at runtime. (Note, you may have to do some character escaping (using '\') in your original string if it gets at all complicated.) For example: my $string = "$foo|$bar"; for ( list ) { eval !~ /$string/; }

    As far as expense goes, I can't see any reason why this would be particularly taxing.

    Update Juerd++ for the correction regarding [^$pattern]. My mistake.

      If not, then you can accomplish the same thing by using [^$pattern].

      [^$pattern] is misleading. [] create a character class, so [^$characters] would be a good example. That doesn't negate a regex, though.

      It's simply a matter of writing it as a normal string, and using perlfunc:eval to compile at runtime.

      That requires parsing. Parsing is bad, because it's too easy to do it the wrong way. This too doesn't really answer the question.

      eval !~ /$string/;

      Unless you meant eval($_) !~ /$string/;, and I'm sure you didn't, that's wrong.

      2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://144388]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (5)
As of 2024-04-25 10:03 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found