Re: Re: Case in regular expressions

In the second edition of his book, Friedl drily announces that readers of the first edition need not worry any longer about the /i modifier, since the issue has been already fixed.

Here is a test, showing that the difference, if any, is rather small. Getting less than 0.20 sec difference with ten thousand iterations on a one-million-char string, I would choose the /i modifier any time.

It all depends on your version of Perl and your machine speed, but if you have a recent release of both, you can safely use the /i modifier without losing much sleep.

#!/usr/bin/perl -w
use strict;
use Benchmark qw(timethese);

for my $size (10_000, 100_000, 1_000_000) {
    my $string = 'a' x $size  . ' While ';
    print "string size $size\n", ;

    my $x;
    timethese(10_000, {
        i_modifier => sub {
            $x = 1 if $string =~  m/\bwhile\b/i;
        },
        char_class => sub {
            $x = 1 if $string =~ m/\b[Ww][Hh][Ii][Ll][Ee]\b/;
        }
    });
}

__END__
Perl 5.6.1
==========
string size 10000
Benchmark: timing 10000 iterations of char_class, i_modifier...
char_class:  1 wallclock secs ( 0.62 usr +  0.00 sys =  0.62 CPU)
i_modifier:  1 wallclock secs ( 0.62 usr +  0.00 sys =  0.62 CPU)

string size 100000
Benchmark: timing 10000 iterations of char_class, i_modifier...
char_class:  6 wallclock secs ( 6.05 usr +  0.01 sys =  6.06 CPU)
i_modifier:  6 wallclock secs ( 6.05 usr +  0.01 sys =  6.06 CPU)

string size 1000000
Benchmark: timing 10000 iterations of char_class, i_modifier...
char_class: 64 wallclock secs (62.17 usr +  0.31 sys = 62.48 CPU)
i_modifier: 63 wallclock secs (61.87 usr +  0.25 sys = 62.12 CPU)

ActiveState Perl 5.8.0 (optimized for Pentium architecture)
===========================================================
string size 10000
Benchmark: timing 10000 iterations of char_class, i_modifier...
char_class:  1 wallclock secs ( 0.51 usr +  0.00 sys =  0.51 CPU)
i_modifier:  1 wallclock secs ( 0.52 usr +  0.00 sys =  0.52 CPU)

string size 100000
Benchmark: timing 10000 iterations of char_class, i_modifier...
char_class:  5 wallclock secs ( 5.12 usr +  0.02 sys =  5.14 CPU)
i_modifier:  5 wallclock secs ( 5.23 usr +  0.02 sys =  5.25 CPU)

string size 1000000
Benchmark: timing 10000 iterations of char_class, i_modifier...
char_class: 51 wallclock secs (50.74 usr +  0.15 sys = 50.89 CPU)
i_modifier: 51 wallclock secs (50.91 usr +  0.15 sys = 51.06 CPU)
[download]

Comment on Re: Re: Case in regular expressions Download Code

Replies are listed 'Best First'.

Re: Re: Re: Case in regular expressions
by davido (Cardinal) on Sep 13, 2003 at 16:31 UTC

Your example my not produce as much of a "Worst Case Scenario" as Friedl's. He scanned a portion of the source code of his version of the 'C' compiler, which at the time was about a 1.1mb file, and certanly 'while' appeared earlier than the last position in the string, and probably appeared multiple times, amid a lot of other line noise and false starts, so to speak. But your example definately does show that the efficiency cost of /i has been greatly reduced. Your effort paid off.

I do know that the perldocs suggest that the $& penalty has been reduced in its scope and its severity to the point that it's a lot safer to use it. For one thing, it only affects the current regexp. $` and $' apparently are still much more costly.

I was thinking about the issue more again last night. It seems to me that under the older implementations, where /i was significantly more costly, its cost was roughly exponential to the size of the string it was being used on. Frankly, I have no idea what the actual big "O" notation would be for /i under the old implementations. But if I'm roughly accurate in asserting that the efficiency penalty was exponentially greater as a string grew in size, it makes sense to split strings up into smaller components. If scanning a 1.1mb string took 1.5 days, I'll bet that scanning eleven 100k strings would take only a fraction of that amount of time since the regexp engine simply wouldn't have as much to keep track of in each scan... it wouldn't get as bogged down in its own churning.

I believe that concept can be more generally applied to regular expressions. It is probably nearly always quicker to match 1mb as ten smaller strings than as one 1mb string, even with the additional overhead of cranking up the engine 10 times. This is all just personal theory, as I have yet to benchmark it. But when I do, I'll post my findings. Obviously there has to be some point at which it's just not beneficial to make the string any smaller. And at some point you also have to say, this is Perl, not hand-optimized machine code. Move on.

In the first edition of MRE, Friedl did suggest that he had no idea why the /i modifier had to be so costly. It was apparent to him that there was copying (of the string being scanned for matches) that simply didn't need to be there.

Also missing from the first edition are some of the newer, more experimental Regexp components, such as (?> .... ). I had to turn to the perldocs to figure out what it meant when I saw Abigail II use it the other day in a post.

Generally, it is safe to refer to the perldocs as the most up to date authority. The problem with respect to Regular Expressions is that Friedl's book is so much better than any of the online documentation, it is tempting to refer to it instead, and this time it tripped me up.

Thanks again for the update.

Dave

"If I had my life to do over again, I'd be a plumber." -- Albert Einstein

[reply]


We don't bite newbies here... much
	PerlMonks