RolandGunslinger has asked for the wisdom of the Perl Monks concerning the following question:
Re: Case in regular expressions
by hardburn (Abbot) on Sep 12, 2003 at 17:31 UTC
Use the i modifier.
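A minimal sketch of the modifier in use; the sample string and pattern are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical input; any mixed-case string will do.
my $string = "The Quick Brown Fox";

# Without /i the match is case-sensitive, so 'quick' does not match 'Quick'.
print "case-sensitive match\n"   if $string =~ /quick/;

# With /i, 'quick' also matches 'Quick', 'QUICK', and so on.
print "case-insensitive match\n" if $string =~ /quick/i;
```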
See also perlre.
---- Note: All code is untested, unless otherwise stated
Re: Case in regular expressions
by davido (Cardinal) on Sep 12, 2003 at 18:00 UTC
In many situations this is the right way to do it. But as Friedl points out in Mastering Regular Expressions (the Owl book), the /i modifier can be extremely costly if you're scanning through a lot of text. See the section called "Perl Efficiency Issues" for details. The cost can be negligible on just a line or two of text, but as Friedl illustrates, in searching case insensitively for while m/./gi in a 1MB file (read as a single line), the match with /i took a day and a half to complete, whereas without /i, the match completed in 12 seconds.
Again, paraphrasing Friedl... My suggestion is that if you have to match on a small string of text in a case-insensitive way, use /i. But if the string is likely to be quite large, and efficiency matters to you, find an alternative to the /i modifier. Here is one possible alternative:
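A minimal sketch of what such an alternative might look like, assuming the approach is to lowercase a copy of $string once and then match it against a lowercase pattern without /i. The variable names and sample text here are illustrative only:

```perl
use strict;
use warnings;

# Hypothetical input standing in for a large block of text.
my $string = "Some text with a WHILE loop in it";

# Lowercase a copy once, then match without /i.
# The pattern itself must be written entirely in lowercase.
my $lower = lc $string;
my $found = $lower =~ /while/ ? 1 : 0;

print "found\n" if $found;
```

The original $string is left untouched, at the price of holding a second copy in memory.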
Admittedly this method makes a copy of $string. You could avoid that if you didn't mind converting $string itself with lc or uc. But the point is that the /i modifier can actually cause multiple copies of the same string to be made and later discarded. In a worst-case scenario, a 1MB string that Friedl used had over 600MB of data being copied around by the regexp engine as it tried to match while applying the /i modifier. In a real-world case, the penalty of using /i is much smaller. But just as we take notice any time $&, $`, and $' are used, take notice whenever you use /i.
Thanks to Friedl's Mastering Regular Expressions, we don't all have to test the /i modifier on huge files to verify its efficiency; we can take his word for it. He's done all the research on the subject we need.
/i is a tool, and is there to be used, just as $&, $`, and $' are. Clearly its use is not deprecated. But it is a tool that comes at perhaps a higher efficiency cost than unsuspecting users might imagine. Understand the ramifications, and then plan your code accordingly.
Dave "If I had my life to do over again, I'd be a plumber." -- Albert Einstein
by dbwiz (Curate) on Sep 13, 2003 at 09:33 UTC
In the second edition of his book, Friedl drily announces that readers of the first edition need not worry any longer about the /i modifier, since the issue has already been fixed.
Here is a test, showing that the difference, if any, is rather small.
Getting less than 0.20 sec difference with ten thousand iterations on a one-million-char string, I would choose the /i modifier any time. It all depends on your version of Perl and your machine speed, but if you have a recent release of both, you can safely use the /i modifier without losing much sleep.
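A comparison of the kind described above could be sketched with the core Benchmark module. The test data, pattern, and iteration count below are assumptions chosen for illustration, not a reconstruction of the original test, and the timings will vary with the Perl version and the machine:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Assumed test data: one million filler characters with the
# target word, in mixed case, at the very end.
my $string = ('x' x 1_000_000) . 'While';

# The lowercase copy is made once, outside the timed code.
my $lower = lc $string;

# Iteration count is kept small here; raise it for steadier numbers.
my $results = timethese(50, {
    with_i    => sub { my $hit = $string =~ /while/i },
    without_i => sub { my $hit = $lower  =~ /while/  },
});
```

timethese prints one timing line per label, so the two variants can be compared directly.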
by davido (Cardinal) on Sep 13, 2003 at 16:31 UTC
Your example may not produce as much of a "worst-case scenario" as Friedl's. He scanned a portion of the source code of his version of the C compiler, which at the time was about a 1.1MB file, and certainly 'while' appeared earlier than the last position in the string, and probably appeared multiple times, amid a lot of other line noise and false starts, so to speak. But your example definitely does show that the efficiency cost of /i has been greatly reduced. Your effort paid off.
I do know that the perldocs suggest that the $& penalty has been reduced in its scope and its severity to the point that it's a lot safer to use it. For one thing, it only affects the current regexp. $` and $' apparently are still much more costly.
I was thinking about the issue again last night. It seems to me that under the older implementations, where /i was significantly more costly, its cost grew roughly exponentially with the size of the string it was being used on. Frankly, I have no idea what the actual big-O notation would be for /i under the old implementations. But if I'm roughly accurate in asserting that the efficiency penalty grew exponentially with string size, it makes sense to split strings up into smaller components. If scanning a 1.1MB string took 1.5 days, I'll bet that scanning eleven 100KB strings would take only a fraction of that time, since the regexp engine simply wouldn't have as much to keep track of in each scan... it wouldn't get as bogged down in its own churning.
I believe that concept can be more generally applied to regular expressions. It is probably nearly always quicker to match 1MB as ten smaller strings than as one 1MB string, even with the additional overhead of cranking up the engine 10 times. This is all just personal theory, as I have yet to benchmark it. But when I do, I'll post my findings. Obviously there has to be some point at which it's just not beneficial to make the string any smaller.
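One way the splitting theory might be put to the test is sketched below. The filler text, chunk count, and iteration count are assumptions for illustration; the chunks are cut on line boundaries so that a match confined to one line cannot straddle two chunks. No particular outcome is implied:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Assumed test data: about 1.1MB of filler lines plus one mixed-case target.
my $whole = ("some filler text here\n" x 50_000) . "While\n";

# Split into ten pieces on line boundaries (split /^/ keeps the newlines).
my @lines     = split /^/, $whole;
my $per_chunk = int(@lines / 10) + 1;
my @chunks;
push @chunks, join('', splice(@lines, 0, $per_chunk)) while @lines;

# Count all case-insensitive matches, once over the whole string
# and once chunk by chunk.
my $results = timethese(20, {
    one_string => sub { my $n = () = $whole =~ /while/gi },
    ten_chunks => sub {
        my $n = 0;
        $n += () = $_ =~ /while/gi for @chunks;
    },
});
```

A pattern that can match across line breaks would need a different splitting strategy, since matches spanning a chunk boundary would be lost.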
And at some point you also have to say, this is Perl, not hand-optimized machine code. Move on.
In the first edition of MRE, Friedl did suggest that he had no idea why the /i modifier had to be so costly. It was apparent to him that there was copying (of the string being scanned for matches) that simply didn't need to be there.
Also missing from the first edition are some of the newer, more experimental regexp constructs, such as (?> .... ). I had to turn to the perldocs to figure out what it meant when I saw Abigail II use it the other day in a post. Generally, it is safe to refer to the perldocs as the most up-to-date authority. The problem with respect to regular expressions is that Friedl's book is so much better than any of the online documentation that it is tempting to refer to it instead, and this time it tripped me up. Thanks again for the update.
Dave "If I had my life to do over again, I'd be a plumber." -- Albert Einstein
by Cody Pendant (Prior) on Sep 13, 2003 at 05:30 UTC
It's not fixed in any Perl 5 version? Will it be fixed in Perl 6?
by bradcathey (Prior) on Sep 13, 2003 at 01:18 UTC
Re: Case in regular expressions
by tcf22 (Priest) on Sep 12, 2003 at 17:32 UTC
- Tom