Deriving Regular Expressions

I recently had cause to derive a set of regular expressions from raw data and then use those regexps to extract a subset of the data -- sort of a uniq on steroids, if you will.

As usual there were some surprising complications along the way. In this case it was my ignorance of how the quotemeta() function actually works. This is the same function that is applied when you escape meta-characters in double-quoted strings with the \Q and \E delimiters.

For the record, the documentation for quotemeta() reads, in part:

Returns the value of EXPR with all non-"word" characters backslashed. (That is, all characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings.) This is the internal function implementing the \Q escape in double-quoted strings.

(For more detail, see "Gory details of parsing quoted constructs" in perlop.)
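To see that behaviour on the example string used in the rest of this post, here is a quick check. It is only a sketch; the variable names are mine and not part of any original script.

```perl
use strict;
use warnings;

my $str = '  ab+cd   (12)34';

# quotemeta() backslashes every non-word character, spaces included
my $quoted = quotemeta($str);
print "$quoted\n";

# \Q...\E in a double-quoted string goes through the same function
my $same = "\Q$str\E";
print $quoted eq $same ? "identical\n" : "different\n";
```

Note that every space comes back as a backslash followed by a space, which is what causes the trouble described below.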

Now consider the following problem. Derive patterns, at varying degrees of generality, that would match the following string (without the quotes): '  ab+cd   (12)34'

Here are some answers, in order of increasing generality:

   qr/^  ab\+cd   \(12\)34$/
   qr/^\s{2}[a-z]{2}\+[a-z]{2}\s{3}\(\d{2}\)\d{2}$/
   qr/^\s+[a-z]+\+[a-z]+\s+\(\d+\)\d+$/
   qr/^\s+\S+\s+\S+$/
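As a sanity check, the four patterns can be run against the sample string, and against a second, made-up string that only the more general ones should accept. The second string and the variable names are my own, chosen purely for illustration.

```perl
use strict;
use warnings;

my $str   = '  ab+cd   (12)34';
my $other = ' xyz+q (7)890';     # hypothetical second sample

my @patterns = (
    qr/^  ab\+cd   \(12\)34$/,
    qr/^\s{2}[a-z]{2}\+[a-z]{2}\s{3}\(\d{2}\)\d{2}$/,
    qr/^\s+[a-z]+\+[a-z]+\s+\(\d+\)\d+$/,
    qr/^\s+\S+\s+\S+$/,
);

# Report, per pattern, whether each sample matches
for my $pat (@patterns) {
    printf "%-5s %-5s %s\n",
        ($str   =~ $pat ? 'match' : 'miss'),
        ($other =~ $pat ? 'match' : 'miss'),
        $pat;
}
```

All four should match the original string, while only the last two should match the second one.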

In my case, at one point I was interested in preserving the sequence lengths of alphanumerics but collapsing whitespace. However, the strings were likely to contain special characters, like the '+' and parentheses in the example above, so I needed to escape those first rather than worry about escaping meta-characters later in the process. Unbeknownst to me, however, quotemeta() escapes all non-word characters -- and that includes whitespace. My first quick approach looked like the following. Order is important here -- if we replaced digits and spaces first, the 'd' and 's' in the resulting \d and \s would be mangled by the alpha-character replacement later:

   $str = '  ab+cd   (12)34';
   $str = quotemeta($str);
   $str =~ s/[a-z]/\[a-z\]/ig;
   $str =~ s/\d/\\d/g;
   $str =~ s/\s+/\\s\+/g;
   $pat = qr/$str/i;
   print "$pat\n";

   Output:
   (?i-xsm:\\s+\\s+[a-z][a-z]\+[a-z][a-z]\\s+\\s+\\s+\(\d\d\)\d\d)

Oops. What happened there? All of the whitespace clusters are now a literal backslash followed by one or more 's' characters. Not only that but there are more of them than there should be. That won't do.

Had I read the quotemeta documentation I would have known that the first time I escaped the string the spaces were each individually escaped since they aren't word characters. Drat.

Hence the solution that worked for me:

   $str = '  ab+cd   (12)34';
   $str = quotemeta($str);
   $str =~ s/\\ / /g;
   $str =~ s/[a-z]/\[a-z\]/ig;
   $str =~ s/\d/\\d/g;
   $str =~ s/\s+/\\s\+/g;
   $pat = qr/$str/i;
   print "$pat\n";

   Output:
   (?i-xsm:\s+[a-z][a-z]\+[a-z][a-z]\s+\(\d\d\)\d\d)

Much better. That's what I was looking for. Applied to the original data this new pattern would match, whereas the other would not.

This same problem applies to any non-word characters in a string. In this case it happened to be whitespace. The crux of the problem is this: "escape all special characters in what is to eventually become a regular expression -- if a character is normally interpreted as literal in a regexp then do not escape it." Are there more effective ways of deriving regexp patterns out there? I'm interested in hearing about them.
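One possible alternative, offered only as a sketch and not as what I actually ran: walk the string once, token by token, and escape only the leftover characters. That sidesteps both the ordering problem and the escaped-whitespace problem, because quotemeta() is never applied to whitespace or alphanumerics. The name derive_pattern is hypothetical, and the {n} quantifiers are borrowed from the idea in boo_radley's reply.

```perl
use strict;
use warnings;

# Hypothetical helper: build a pattern from one sample string in a single pass.
sub derive_pattern {
    my ($sample) = @_;
    my $re = '';
    # \G with /gc resumes where the previous match ended
    while ($sample =~ /\G(?:(\s+)|([a-z]+)|(\d+)|(.))/gcis) {
        if    (defined $1) { $re .= '\s+' }                          # collapse whitespace
        elsif (defined $2) { $re .= '[a-z]{' . length($2) . '}' }    # keep letter-run length
        elsif (defined $3) { $re .= '\d{' . length($3) . '}' }       # keep digit-run length
        else               { $re .= quotemeta($4) }                  # escape everything else
    }
    return qr/\A$re\z/i;
}

my $pat = derive_pattern('  ab+cd   (12)34');
print "matches original\n"  if '  ab+cd   (12)34' =~ $pat;
print "matches variation\n" if ' xy+zw (99)00'    =~ $pat;
```

Because alphas and digits are separate alternatives in the tokenizing regex, they stay distinct without any care about the order of substitutions.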

Matt

Update: Fixed as per japhy's suggestion. That's what I get for trying to simplify things. In this case it was important to keep alphas distinct from numerics.


Replies are listed 'Best First'.
Re: Deriving Regular Expressions by japhy (Canon) on Apr 18, 2002 at 21:27 UTC
You might want to do `s/\d/\\d/g` before you do `s/\w/\\w/g`... `\w` includes `\d`. As for the process of generating regexes from strings, I don't go near it.

Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a (from-home) job
`s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`
Re: Deriving Regular Expressions (boo) by boo_radley (Parson) on Apr 19, 2002 at 00:39 UTC
I can't help but think japhy's right, but...

   (?i-xsm:\s+[a-z][a-z]\+[a-z][a-z]\s+\(\d\d\)\d\d)

look at all those repeated elements! yeck. here's a rework which allows the use of {n}, or optionally (add a true parameter when calling the script) breaks all groups down into \d+ or [a-z]+ groups. Finally, double checks that the produced pattern does match the original string... I was pondering making some handler for \W characters, but that's probably too much for such a tchotchke...

   $anynum=$ARGV[0];
   chomp($orig=<STDIN>);
   $i = quotemeta($orig);
   $i=~ s/[a-z]/l/g;
   $i =~ s/\d/n/g;
   while ($i=~/(l+)/g){
      if ($anynum) {
         $i=~s/($1)/'[a-z]+'/e;
      } else {
         $i=~s/($1)/"[a-z]{".length ($1)."}"/e;
      }
   }
   while ($i=~/(n+)/g){
      if ($anynum) {
         $i=~s/($1)/"\\d+"/e;
      } else {
         $i=~s/($1)/"\\d{".length ($1)."}"/e;
      }
   }
   $i=~s/\s+/s\+/g;
   $i=qr/$i/;
   print "pattern is : $i\n";
   print $orig=~/$i/; #double check
Re: Deriving Regular Expressions by I0 (Priest) on Apr 19, 2002 at 05:38 UTC
   # my favorite cause for generating regexes from strings is to match balanced text:
   $_ = "sin(atan2(sin(1),cos(1))),atan2(1,1),cos(atan2(sin(2),cos(2)))\n";
   (my $re=$_)=~s/((\()|(\))|.)/${[')','']}[!$3]\Q$1\E${['(','']}[!$2]/gs;
   $re= join'|',map{quotemeta}eval{/$re/};
   die $@ if $@ =~ /unmatched/;
   print while s/(\w+\(($re)\))/$1/ee;

