PerlMonks
What does this snippet print?

    my $data = "<tag>this is a line of code</tag> <explanation>this is where I wax poetic about my code</explanation> <tag>this is another example of code</tag>";

    if ($data =~ /<tag>(.*)<\/tag>/s) {
        print "I found =>$1<=\n";
    }

If you said "this is a line of code", you're thinking the same thing most people do. Unfortunately, that's not the way Perl thinks:
I found =>this is a line of code</tag> <explanation>this is where I wax poetic about my code</explanation> <tag>this is another example of code<=
The secret lies in the mysterious asterisk (match zero or more of the preceding). When the engine hits it, .* grabs everything up to the end of the string (with /s, the dot matches newlines too), and only then does the engine try to match the next character in the pattern -- the < character. Since nothing is left to match, the match fails, and the engine backtracks a character. This continues through >, g, a, t, and /, until it finally reaches the final < in $data, where the rest of the pattern can match.
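To see the effect of that backtracking, here is a small sketch that reuses $data from above. The variable names $greedy, $match_end, and $swallowed are made up for illustration. It checks where the greedy match ended and how many closing tags ended up inside the capture.

```perl
use strict;
use warnings;

my $data = "<tag>this is a line of code</tag> "
         . "<explanation>this is where I wax poetic about my code</explanation> "
         . "<tag>this is another example of code</tag>";

# Greedy match: .* runs to the end of the string, then backtracks.
my ($greedy) = $data =~ /<tag>(.*)<\/tag>/s;

# $+[0] is the offset just past the end of the last successful match.
my $match_end = $+[0];

# Count the closing tags that were swallowed by the capture.
my $swallowed = () = $greedy =~ /<\/tag>/g;

print "match ended at offset $match_end of ", length($data), "\n";
print "capture contains $swallowed closing tag(s)\n";
```

The match ends at the very end of the string, and the capture contains one closing tag that most people expect to terminate it.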
Knowing that, you now understand the danger of greediness (and, hopefully, also why parsing HTML with a regex can be tricky). The solution is very simple:
    if ($data =~ /<tag>(.*?)<\/tag>/s) {
        print "I found =>$1<=\n";
    }

Using the ? after a normally-greedy quantifier (* or +) tells the engine to match as few characters as possible: instead of grabbing the longest string, it stops at the first point where the rest of the pattern can match.
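As a rough illustration, the non-greedy pattern can be combined with /g in list context to collect every tagged snippet. The second pattern below, using a negated character class, is not from the original post. It is a common alternative that only works on the assumption that the tagged text never contains a < character.

```perl
use strict;
use warnings;

my $data = "<tag>this is a line of code</tag> "
         . "<explanation>this is where I wax poetic about my code</explanation> "
         . "<tag>this is another example of code</tag>";

# Non-greedy: each capture stops at the nearest closing tag.
my @lazy = $data =~ /<tag>(.*?)<\/tag>/sg;

# Alternative sketch (an assumption, not the author's fix):
# forbid '<' inside the capture so it cannot run past a tag.
my @negated = $data =~ /<tag>([^<]*)<\/tag>/g;

print "lazy:    =>$_<=\n" for @lazy;
print "negated: =>$_<=\n" for @negated;
```

Both lists hold the same two strings for this input; they would differ if a snippet contained a nested tag.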
    my $line = "a.b.cd*f.";
    $line =~ /([^.*]{2})/;

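Inside a character class, . and * are ordinary characters, so the pattern above looks for two consecutive characters that are neither a dot nor an asterisk. A quick check, with $pair as an illustrative name:

```perl
use strict;
use warnings;

my $line = "a.b.cd*f.";

# Inside [...] the '.' and '*' are literal, so [^.*] means
# "any character except a dot or an asterisk".
my ($pair) = $line =~ /([^.*]{2})/;

print "matched =>$pair<=\n";   # prints: matched =>cd<=
```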
    my $input = "foo bar baz";
    $input =~ s/(\w+)/uc($1)/ge;

While that works, it's serious overkill. Even a less picky approach is sub-optimal:

    $input =~ s/([A-Za-z]+)/uc($1)/ge;

Don't forget about the friendly tr/// operator -- it's made for simple substitutions like this. (Of course, if you're working with a locale different from simple English text, you're out of luck.)

    $input =~ tr/a-z/A-Z/;

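A minimal sketch comparing the two approaches on the same ASCII input. The names $via_regex, $via_tr, and $lowercase are illustrative. It also shows that tr/// returns the number of characters it matched, which makes it handy for counting.

```perl
use strict;
use warnings;

my $input = "foo bar baz";

# Copy $input first, then transform the copy with each approach.
(my $via_regex = $input) =~ s/([A-Za-z]+)/uc($1)/ge;
(my $via_tr    = $input) =~ tr/a-z/A-Z/;

print "regex: $via_regex\n";
print "tr:    $via_tr\n";

# With an empty replacement list, tr/// leaves the string alone
# and just returns how many characters matched.
my $lowercase = ($input =~ tr/a-z//);
print "lowercase letters in input: $lowercase\n";
```

For plain ASCII text the two results are identical; the caveat about locales applies to the tr/// version.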
Regular expressions give you a lot of power at the cost of some speed. Don't get out the chainsaw when a penknife will do.
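If you want to measure the difference on your own machine, the core Benchmark module can compare the two. This is only a sketch: the test string and the %candidates hash are made up for illustration, and the numbers will vary with Perl version and hardware.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample input; substitute something representative of your data.
my $text = "foo bar baz " x 100;

my %candidates = (
    regex => sub { my $copy = $text; $copy =~ s/([a-z]+)/uc($1)/ge; $copy },
    tr    => sub { my $copy = $text; $copy =~ tr/a-z/A-Z/;          $copy },
);

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, \%candidates);
```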
Update: a few small corrections.