Common Regex Gotchas

Perl novices often stumble over a few gotchas when first learning regular expressions. Learning the whys and the workarounds could save you hours of frustration.

Greediness

Perl's regex engine likes to match the longest string possible, by default. This is described as greediness. Most people don't think that way, at least when looking at text. Given the following string and regex, what will be in $1?

my $data = "<tag>this is a line of code</tag>
<explanation>this is where I wax poetic about my code</explanation>
<tag>this is another example of code</tag>";
if ($data =~ /<tag>(.*)<\/tag>/s) {
    print "I found =>$1<=\n";
}

If you said "this is a line of code", you're thinking the same thing most people do. Unfortunately, that's not the way Perl thinks:

I found =>this is a line of code</tag>
<explanation>this is where I wax poetic about my code</explanation>
<tag>this is another example of code<=

The secret lies in the mysterious asterisk (match zero or more of the preceding). When the engine hits .*, it jumps ahead to the end of the string (the /s flag lets . match newlines too) and then tries to match the next character in the pattern -- the < character. Nothing is left to match there, so the engine backtracks one character at a time. It fails on >, g, a, t, and /, until it finally reaches the final < in $data, where the rest of the pattern matches as well.

Knowing that, you now understand the danger of greediness (and, hopefully, also why parsing HTML with a regex can be tricky). The solution is very simple:

if ($data =~ /<tag>(.*?)<\/tag>/s) {
    print "I found =>$1<=\n";
}

Using the ? after a normally-greedy quantifier (* or +) tells the engine not to grab the longest string, but to stop at the first point where the rest of the pattern can match.
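To see the two behaviours side by side, here is a small sketch that runs both quantifiers against the same sample data as above. The variable names ($greedy, $lazy, @all) are only for illustration, and the /g variant at the end is an extra that the text above does not cover.

```perl
use strict;
use warnings;

# Same sample data as above: two <tag> elements with other markup between.
my $data = "<tag>this is a line of code</tag>\n"
         . "<explanation>this is where I wax poetic about my code</explanation>\n"
         . "<tag>this is another example of code</tag>";

# Greedy: .* runs to the end of the string and backtracks to the LAST </tag>.
my ($greedy) = $data =~ /<tag>(.*)<\/tag>/s;

# Non-greedy: .*? grows one character at a time and stops at the FIRST </tag>.
my ($lazy) = $data =~ /<tag>(.*?)<\/tag>/s;

# With /g in list context, the non-greedy form returns every <tag> body.
my @all = $data =~ /<tag>(.*?)<\/tag>/sg;

printf "greedy captured %d characters\n", length $greedy;
print  "lazy captured:  $lazy\n";
print  "all matches:    $_\n" for @all;
```

The greedy capture swallows the explanation element as well, while the non-greedy one stops at the first closing tag.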

Specifying Too Much

This gotcha is more stylistic, but it can come back to haunt you later. Remember that regular expressions can be somewhat vague -- you don't have to specify the entire line, if you're only looking for a certain portion. Suppose that you want to find the word Serial, followed by a colon and then a nine-digit number. The data lines might look like this:

my $line = "Name: Some Soldier, Rank: Leftenant, Serial: 426879824, Boots: black";

A regex novice might bite off more than he could chew with the following:

$line =~ /^\w*: \w*\s*\w*, \w*: \w*, \w+: (\d)*, \w*: \w*$/;

If all you're interested in is the Serial number, only ask for that. It'll make your regex simpler, and it will handle deviations from what you think the line ought to look like. (That happens more often than you want to think.)

$line =~ /[Ss]erial: (\d{9})/;

Caveat: There are good reasons to break my Rule of Simplicity. Performance is one, and error handling is another. Be sure that the code works first, though, then try to make it tricky.
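As a rough illustration of how the two patterns cope with messy input, the sketch below runs both against three lines. The second and third lines are invented deviations, not data from the original post.

```perl
use strict;
use warnings;

my @lines = (
    "Name: Some Soldier, Rank: Leftenant, Serial: 426879824, Boots: black",
    "Rank: Leftenant, serial: 426879824",               # made-up: fields missing, lowercase s
    "Name: Some Soldier, Serial: 12345, Boots: black",  # made-up: too few digits
);

my (@rigid, @simple);
for my $line (@lines) {
    # The over-specified pattern only matches the exact layout it was written for.
    push @rigid, ($line =~ /^\w*: \w*\s*\w*, \w*: \w*, \w+: (\d)*, \w*: \w*$/ ? 1 : 0);

    # The focused pattern asks only for the serial number itself.
    my ($serial) = $line =~ /[Ss]erial: (\d{9})/;
    push @simple, $serial;
}

for my $i (0 .. $#lines) {
    printf "line %d: rigid=%s simple=%s\n",
        $i + 1,
        $rigid[$i] ? "match" : "no match",
        defined $simple[$i] ? $simple[$i] : "(none)";
}
```

The focused pattern still finds the serial number when fields go missing, and both patterns reject the line whose number is too short.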

Special Characters

Don't forget that certain characters (like ., *, /, +, and ?) have special meanings within regular expressions. If you don't have a Unixy background (where escaping characters with a backslash is a little more common), you might write something like this, and stare at it in confusion for a while:

$line =~ /<title>(.*?)</title>/;

Hmm. Check the perlre page for the skinny on exactly which characters have special meaning. Also be aware that choosing alternate delimiters can help out, as well as being more visually appealing:

$line =~ m!<title>(.*?)</title>!;

One other caveat is that, within a character class, these rules often don't apply:

my $line = "a.b.cd*f.";
$line =~ /([^.*]{2})/;
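Here is a minimal sketch of what that character class actually matches, next to the escaped form needed outside a class. The \Q...\E part is an addition not mentioned above: it is Perl's standard way to escape metacharacters in an interpolated string, and $literal is just an illustrative name.

```perl
use strict;
use warnings;

my $line = "a.b.cd*f.";

# Inside [...], '.' and '*' are literal, so this asks for two consecutive
# characters that are neither a dot nor an asterisk.
my ($pair) = $line =~ /([^.*]{2})/;

# Outside a class, the asterisk must be escaped to be taken literally.
my ($star) = $line =~ /(d\*f)/;

# \Q...\E escapes every metacharacter in the interpolated string.
my $literal = "d*f.";
my $found   = $line =~ /\Q$literal\E/ ? 1 : 0;

print "pair:  $pair\n";
print "star:  $star\n";
print "found: $found\n";
```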

Simple Substitutions

Want to make sure user input is completely uppercased? Here's one approach:

my $input = "foo bar baz";
$input =~ s/(\w+)/uc($1)/ge;

While that works, it's serious overkill. Even a less picky approach is sub-optimal:

$input =~ s/([A-Za-z]+)/uc($1)/ge;

Don't forget about the friendly tr/// operator -- it's made for simple substitutions like this. (Of course, if you're working with a locale other than simple English text, you're out of luck.)

$input =~ tr/a-z/A-Z/;

Regular expressions give you a lot of power at the cost of some speed. Don't get out the chainsaw when a penknife will do.
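For plain ASCII input the approaches are interchangeable, as this small sketch shows. It also calls the built-in uc directly on the whole string, which the examples above only use inside the substitution. The variable names are illustrative.

```perl
use strict;
use warnings;

my $input = "foo bar baz";

# Chainsaw: run uc() on every word through an evaluated substitution.
(my $via_subst = $input) =~ s/(\w+)/uc($1)/ge;

# Penknife: transliterate lowercase ASCII letters to uppercase.
(my $via_tr = $input) =~ tr/a-z/A-Z/;

# The built-in, applied to the whole string at once.
my $via_uc = uc $input;

print "$via_subst\n$via_tr\n$via_uc\n";
```

The (my $copy = $input) =~ ... idiom copies the string first, so $input itself stays unchanged.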

Update: a few small corrections.


Replies are listed 'Best First'.

Re: Common Regex Gotchas
by Desdinova (Friar) on Mar 14, 2001 at 22:41 UTC

#!/usr/local/bin/perl -w
use strict;
use Benchmark;
my $count =500000;
## Method number one
sub One {
   my $data = 'for bar baz';
   $data = uc $data;

}

## Method number two
sub Two {
   my $data = 'for bar baz';
   $data =~ tr/a-z/A-Z/;
}
## Method number Three
sub Three {
   my $data = 'for bar baz';
   $data =~ s/([A-Za-z]+)/uc($1)/ge;
}
## We'll test each one, with simple labels
timethese (
  $count,
  {'Method One UC' => '&One',
   'Method Two TR' => '&Two',
   'Method Three s'=> '&Three'
   }
);

exit;

Benchmark: timing 500000 iterations of Method One UC, Method Three s, Method Two TR...
Method One UC:  1 wallclock secs ( 1.42 usr +  0.00 sys =  1.42 CPU) @ 352112.68/s (n=500000)
Method Three s: 16 wallclock secs (17.03 usr +  0.00 sys = 17.03 CPU) @ 29359.95/s (n=500000)
Method Two TR:  1 wallclock secs ( 2.04 usr +  0.00 sys =  2.04 CPU) @ 245098.04/s (n=500000)

Benchmarking your code

UPDATE:

Xxaxx

This Node

# Replace dashes with s///
my $data = 'for-bar-baz';
$data =~ s/-/_/g;
print $data;

# Replace dashes with tr///
$data = 'for-bar-baz';
$data =~ tr/-/_/;
print $data;

Benchmark: timing 500000 iterations of Method One TR, Method Two s...
Method One TR:  2 wallclock secs ( 1.87 usr +  0.00 sys =  1.87 CPU) @ 267379.68/s (n=500000)
Method Two s:  5 wallclock secs ( 4.84 usr +  0.00 sys =  4.84 CPU) @ 103305.79/s (n=500000)

Update 2:

petral


Re: Re: Common Regex Gotchas

by Anonymous Monk on May 08, 2001 at 23:26 UTC

If I am not mistaken, the Benchmark module is plagued by "$& and friends". That means it makes regexes slow by default, so the benchmarks you take are disproportionate and useless: the single inefficient use of $& ruins any optimizations perl can make on the substitution.


Re: Re: Re: Common Regex Gotchas

by chipmunk (Parson) on May 09, 2001 at 00:30 UTC

The real problem here is the use of /e on the substitution, when this would work just as well and be much more efficient: s/(\w+)/\U$1/g;
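A quick sketch of that suggestion: the \U escape uppercases the interpolated text inside the replacement itself, so no /e is needed. The \u line is an extra illustration of the related escape that uppercases only the next character.

```perl
use strict;
use warnings;

my $text = "foo bar baz";

# With /e, the replacement is Perl code that runs for every match.
(my $with_e = $text) =~ s/(\w+)/uc($1)/ge;

# With \U, the replacement is an ordinary interpolated string.
(my $with_U = $text) =~ s/(\w+)/\U$1/g;

# \u uppercases just the first character of what follows.
(my $capitalized = $text) =~ s/(\w+)/\u$1/g;

print "$with_e\n$with_U\n$capitalized\n";
```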


Re: Re: Re: Re: Common Regex Gotchas

by dws (Chancellor) on May 09, 2001 at 00:32 UTC

Re: Re: Re: Re: Re: Common Regex Gotchas

by chipmunk (Parson) on May 09, 2001 at 00:37 UTC

Re: Common Regex Gotchas
by John M. Dlugosz (Monsignor) on Jul 06, 2001 at 07:47 UTC

tr/a-z/A-Z/

uc

\U


Re: Common Regex Gotchas -- "(:?"
by shenme (Priest) on Sep 29, 2005 at 18:28 UTC

When extending the regex syntax to include features like zero-width negative look-ahead the authors tried very hard to use syntax that avoided duplicating any 'real' regex code. So they started all the new syntax with '(?'. It turns out that this makes typos a bit too easy, and far too quiet.

I came across the following in a CPAN module:

^(:?(:?\(\d\d\d\))?\s*\d\d)?\d[-.\s]?\d\d\d\d$

The writer intended to use "(?:", the clustering grouping. This is used when you need to avoid capturing the matched subexpression. For instance you might want to say that a complex inner match is optional, e.g.

... ( contains \s+ (?:this|that)? \s+ item ) ...

But tyops happen. What is the result if you reverse the ':' and '?' characters? Nothing drastic, usually.

In "(:? pattern )" the original meaning of '?' is used - the ':' character becomes an optionally matched character. The parentheses also revert to their original meaning of capturing groups.

So usually the only result is that the regex is a bit slower and captures more substrings. It might also allow a stray ':' input character. If you weren't monitoring how many captures come back from a successful match you might never notice the typo.
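The effect is easy to reproduce. The sketch below uses a cut-down, hypothetical pattern (not the one from the CPAN module) to show both symptoms: the extra capture and the tolerated stray colon.

```perl
use strict;
use warnings;

my $typo  = qr/^(:?\d\d\d)?\d\d\d\d$/;   # '(:?' is a capturing group with an optional ':'
my $fixed = qr/^(?:\d\d\d)?\d\d\d\d$/;   # '(?:' is a non-capturing group

# Both versions accept ordinary input...
my $typo_ok  = "5551234" =~ $typo  ? 1 : 0;
my $fixed_ok = "5551234" =~ $fixed ? 1 : 0;

# ...but the typo version also accepts a leading colon and captures it.
my ($captured)  = ":5551234" =~ $typo;
my $fixed_colon = ":5551234" =~ $fixed ? 1 : 0;

print "typo accepts plain input:  $typo_ok\n";
print "fixed accepts plain input: $fixed_ok\n";
print "typo captured:             $captured\n";
print "fixed accepts stray colon: $fixed_colon\n";
```

A test case with a stray colon, or a check on the number of captures, is enough to expose the typo.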

But note that this typo could occur with any single character "(?X" syntax. You might notice it right away if your "(#? comment )" caused syntax errors. And you should notice it when your input matching tests fail on "fore(=?fend)". But otherwise these typos will silently fail.

Now this is a minor gotcha. Except that it is found in 15 nodes here, with another node mentioning it in an aside, and another node discovering the typo in a book. I wonder if it is in your code?

perlre - Extended Patterns


Regex Lab 4 Studying
by chanio (Priest) on Jul 06, 2005 at 06:32 UTC

The Regex Coach

Working on Linux would make practicing regular expressions easier, since nearly everything there works this way. But on Windows, besides the great Bundle::PPT (a bundle that installs the Perl Power Tools, which emulate Linux shell commands), there are very few ways of dealing with regexes. And one-liners are good, if you can be sure that the rest of the code is not affecting the regex being tested.

This Regex Coach was programmed in Common Lisp, and it explains a lot about any regular expression that I try...

It not only handles matching, substitution and splitting, but also gives a short description of the regex under test. It shows the mistakes and draws a tree with the possible paths considered by the evaluation. There is even a step-by-step traversal of the main decisions that the regex engine makes, highlighting the characters involved in each one. It also lets you mask part of the text to be filtered, to see the effects of the regular expression. As each step is done, a part of the text is highlighted, as well as the part of the code involved at that moment. People should try this freeware. And thank the author, to encourage him to keep on improving it... If it is possible!


Wherever I lay my KNOPPIX disk, a new FREE LINUX nation could be established


Re: Common Regex Gotchas
by Anonymous Monk on May 28, 2001 at 17:50 UTC

$line =~ /^\w*: \w*\s*\w*, \w*: \w*, \w+: (\d)*, \w*: \w*$";

This ends the regexp with a `"' and starts it with a `/'. You could use m"..." or /.../ instead.


Re: Re: Common Regex Gotchas

by Anonymous Monk on Jan 19, 2002 at 09:55 UTC

You're allowed to do that with regexes? I just learned you could use [], (), "", <>, etc., with qq...

I should read more on Perl.


Re: Common Regex Gotchas
by Anonymous Monk on Nov 27, 2001 at 03:23 UTC

I found the Greedy section to be quite confusing. First of all, I think you probably have a couple of slash-s'es in your HTML that are causing it not to print, which makes the last couple of paragraphs very difficult to read and understand. Also, I'm still confused about why there is a match at all in the first example. Why doesn't the engine continue backwards past the whitespace and look for a </tag> string? Finally, why does the last example (still in the Greedy section) work? If, when creating the example string, I carriage return after the </tag>, there shouldn't be a whitespace to match on, right? Finally, finally, thanks for putting this together... it's really speeding my ramp along...


Re: Re: Common Regex Gotchas

by Anonymous Monk on Nov 27, 2001 at 03:36 UTC

Also, I'm still confused about why there is a match at all in the first example. Why doesn't the engine continue backwards past the whitespace and look for a <\/tag> string?

Finally, why does the last example (still in the Greedy section) work? If, when creating the example string, I carriage return after the <\/tag>, there shouldn't be a whitespace to match on, right?

Finally, finally, thanks for putting this together... it's really speeding my ramp along...


Re: Re: Re: Common Regex Gotchas

by chromatic (Archbishop) on Nov 27, 2001 at 04:47 UTC

Why doesn't the engine continue backwards past the whitespace and look for a <\/tag> string?

Because the engine prefers the longest match that starts at the leftmost possible position. When it hits .*, it jumps all the way to the end of the string and then backtracks, trying to match the next necessary character. Because it's backtracking, it matches </tag> at the end of the string. That fits the pattern, so it doesn't continue backtracking to find a shorter match.

If, when creating the example string, I carriage return after the <\/tag>, there shouldn't be a whitespace to match on, right?

The /s flag allows the '.' token to match newlines. Adding the minimal token '?' avoids the jump-to-end-then-backtrack behavior. It works like you'd expect, trying to match as few characters as possible.
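A short sketch of the /s point, reusing the sample data from the Greediness section: without the flag, the dot cannot cross a newline, so even the greedy pattern stays on the first line. The variable names are illustrative.

```perl
use strict;
use warnings;

my $data = "<tag>this is a line of code</tag>\n"
         . "<explanation>this is where I wax poetic about my code</explanation>\n"
         . "<tag>this is another example of code</tag>";

# Without /s, '.' does not match "\n", so the greedy .* is confined to line one.
my ($without_s) = $data =~ /<tag>(.*)<\/tag>/;

# With /s, '.' matches newlines too, and the greedy .* spans all three lines.
my ($with_s) = $data =~ /<tag>(.*)<\/tag>/s;

print "without /s: $without_s\n";
printf "with /s:    %d characters, %d newlines\n",
    length $with_s, ($with_s =~ tr/\n//);
```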

Does that clear it up? I've also touched up the formatting somewhat.


Re: Common Regex Gotchas
by mrpeabody (Friar) on May 04, 2004 at 04:11 UTC

If you don't have a Unixy background (where escaping characters with a forwardslash is a little more common),


Re: Common Regex Gotchas
by Anonymous Monk on Feb 11, 2005 at 09:57 UTC

Thanks for the great but simple explanations - you write really well. I'm a relative beginner at regex's so I really appreciate it.


Re: Common Regex Gotchas
by rovf (Priest) on Jul 07, 2008 at 08:59 UTC

Maybe it's worth including the following, which has bitten me once too: one common use is to get a regexp from an external source - for instance, your command line - into your program, so you end up having somewhere something like:

if($line =~ /$pattern/) { ... }

//

split
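The rest of this reply did not survive, so the following is only a guess at the hazard it has in mind. One documented trap with interpolated patterns is that a pattern which evaluates to the empty string does not mean "match anything": Perl reuses the last successfully matched regex instead. The sketch below shows that behaviour and one common workaround. The variable names and test strings are illustrative.

```perl
use strict;
use warnings;

# A successful match earlier in the program.
my $earlier = "abc" =~ /b/ ? 1 : 0;

# Hypothetical user-supplied pattern that turns out to be empty.
my $pattern = "";

# The empty pattern falls back to the last successful regex (/b/),
# which "xyz" does not contain.
my $plain = "xyz" =~ /$pattern/ ? 1 : 0;

# Wrapping the interpolation in a group keeps the pattern non-empty,
# so it behaves like a true empty pattern and matches.
my $wrapped = "xyz" =~ /(?:$pattern)/ ? 1 : 0;

print "earlier match succeeded: $earlier\n";
print "plain   /\$pattern/:      $plain\n";
print "wrapped /(?:\$pattern)/:  $wrapped\n";
```

Checking the length of $pattern before using it is another option.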

--
Ronald Fischer <ynnor@mm.st>

