Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Re: Re: Case in regular expressions

by dbwiz (Curate)
on Sep 13, 2003 at 09:33 UTC ( [id://291229]=note: print w/replies, xml ) Need Help??


in reply to Re: Case in regular expressions
in thread Case in regular expressions

In the second edition of his book, Friedl drily announces that readers of the first edition need not worry any longer about the /i modifier, since the issue has been already fixed.

Here is a test, showing that the difference, if any, is rather small. Getting less than 0.20 sec difference with ten thousand iterations on a one-million-char string, I would choose the /i modifier any time.

It all depends on your version of Perl and your machine speed, but if you have a recent release of both, you can safely use the /i modifier without losing much sleep.

#!/usr/bin/perl -w use strict; use Benchmark qw(timethese); for my $size (10_000, 100_000, 1_000_000) { my $string = 'a' x $size . ' While '; print "string size $size\n", ; my $x; timethese(10_000, { i_modifier => sub { $x = 1 if $string =~ m/\bwhile\b/i; }, char_class => sub { $x = 1 if $string =~ m/\b[Ww][Hh][Ii][Ll][Ee]\b/; } }); } __END__ Perl 5.6.1 ========== string size 10000 Benchmark: timing 10000 iterations of char_class, i_modifier... char_class: 1 wallclock secs ( 0.62 usr + 0.00 sys = 0.62 CPU) i_modifier: 1 wallclock secs ( 0.62 usr + 0.00 sys = 0.62 CPU) string size 100000 Benchmark: timing 10000 iterations of char_class, i_modifier... char_class: 6 wallclock secs ( 6.05 usr + 0.01 sys = 6.06 CPU) i_modifier: 6 wallclock secs ( 6.05 usr + 0.01 sys = 6.06 CPU) string size 1000000 Benchmark: timing 10000 iterations of char_class, i_modifier... char_class: 64 wallclock secs (62.17 usr + 0.31 sys = 62.48 CPU) i_modifier: 63 wallclock secs (61.87 usr + 0.25 sys = 62.12 CPU) ActiveState Perl 5.8.0 (optimized for Pentium architecture) =========================================================== string size 10000 Benchmark: timing 10000 iterations of char_class, i_modifier... char_class: 1 wallclock secs ( 0.51 usr + 0.00 sys = 0.51 CPU) i_modifier: 1 wallclock secs ( 0.52 usr + 0.00 sys = 0.52 CPU) string size 100000 Benchmark: timing 10000 iterations of char_class, i_modifier... char_class: 5 wallclock secs ( 5.12 usr + 0.02 sys = 5.14 CPU) i_modifier: 5 wallclock secs ( 5.23 usr + 0.02 sys = 5.25 CPU) string size 1000000 Benchmark: timing 10000 iterations of char_class, i_modifier... char_class: 51 wallclock secs (50.74 usr + 0.15 sys = 50.89 CPU) i_modifier: 51 wallclock secs (50.91 usr + 0.15 sys = 51.06 CPU)

Replies are listed 'Best First'.
Re: Re: Re: Case in regular expressions
by davido (Cardinal) on Sep 13, 2003 at 16:31 UTC
    Thank you so much for the update. Having upgraded my version of Perl, and my computer several times since first acquiring MRE, it looks like its time to acquire the updated book too. ;)

    Your example my not produce as much of a "Worst Case Scenario" as Friedl's. He scanned a portion of the source code of his version of the 'C' compiler, which at the time was about a 1.1mb file, and certanly 'while' appeared earlier than the last position in the string, and probably appeared multiple times, amid a lot of other line noise and false starts, so to speak. But your example definately does show that the efficiency cost of /i has been greatly reduced. Your effort paid off.

    I do know that the perldocs suggest that the $& penalty has been reduced in its scope and its severity to the point that it's a lot safer to use it. For one thing, it only affects the current regexp. $` and $' apparently are still much more costly.

    I was thinking about the issue more again last night. It seems to me that under the older implementations, where /i was significantly more costly, its cost was roughly exponential to the size of the string it was being used on. Frankly, I have no idea what the actual big "O" notation would be for /i under the old implementations. But if I'm roughly accurate in asserting that the efficiency penalty was exponentially greater as a string grew in size, it makes sense to split strings up into smaller components. If scanning a 1.1mb string took 1.5 days, I'll bet that scanning eleven 100k strings would take only a fraction of that amount of time since the regexp engine simply wouldn't have as much to keep track of in each scan... it wouldn't get as bogged down in its own churning.

    I believe that concept can be more generally applied to regular expressions. It is probably nearly always quicker to match 1mb as ten smaller strings than as one 1mb string, even with the additional overhead of cranking up the engine 10 times. This is all just personal theory, as I have yet to benchmark it. But when I do, I'll post my findings. Obviously there has to be some point at which it's just not beneficial to make the string any smaller. And at some point you also have to say, this is Perl, not hand-optimized machine code. Move on.

    In the first edition of MRE, Friedl did suggest that he had no idea why the /i modifier had to be so costly. It was apparent to him that there was copying (of the string being scanned for matches) that simply didn't need to be there.

    Also missing from the first edition are some of the newer, more experimental Regexp components, such as (?> .... ). I had to turn to the perldocs to figure out what it meant when I saw Abigail II use it the other day in a post.

    Generally, it is safe to refer to the perldocs as the most up to date authority. The problem with respect to Regular Expressions is that Friedl's book is so much better than any of the online documentation, it is tempting to refer to it instead, and this time it tripped me up.

    Thanks again for the update.

    Dave

    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://291229]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (3)
As of 2024-04-19 17:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found