PerlMonks  

Thank you so much for the update. Having upgraded my version of Perl, and my computer, several times since first acquiring MRE, it looks like it's time to acquire the updated book too. ;)

Your example may not produce as much of a "Worst Case Scenario" as Friedl's. He scanned a portion of the source code of his version of the C compiler, which at the time was about a 1.1MB file, and certainly 'while' appeared earlier than the last position in the string, and probably appeared multiple times, amid a lot of other line noise and false starts, so to speak. But your example definitely does show that the efficiency cost of /i has been greatly reduced. Your effort paid off.

I do know that the perldocs suggest that the $& penalty has been reduced in its scope and its severity to the point that it's a lot safer to use it. For one thing, it only affects the current regexp. $` and $' apparently are still much more costly.
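
For anyone who would rather sidestep the question entirely, here is a minimal sketch of three ways to get at the matched text without touching $&, $` or $'. The sample string is made up for illustration, and the third option assumes Perl 5.10 or later, where the /p modifier and the ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} variables are available.

```perl
use strict;
use warnings;
use 5.010;    # /p and ${^MATCH} need 5.10 or later

# Hypothetical sample text, for illustration only.
my $text = "say While you can";

# Option 1: capture the match explicitly with parentheses.
my ($captured) = $text =~ /(while)/i;

# Option 2: rebuild the match from the offsets in @- and @+.
my $by_offset;
if ($text =~ /while/i) {
    $by_offset = substr($text, $-[0], $+[0] - $-[0]);
}

# Option 3: ask for the match variables on this one regexp with /p.
my ($pre, $match, $post);
if ($text =~ /while/ip) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "captured=$captured by_offset=$by_offset match=$match\n";
```

All three give the same matched text. Which one is cheapest on a given Perl version is something to measure rather than assume.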

I was thinking about the issue more again last night. It seems to me that under the older implementations, where /i was significantly more costly, its cost grew roughly exponentially with the size of the string it was being used on. Frankly, I have no idea what the actual big-O complexity of /i would be under the old implementations. But if I'm roughly accurate in asserting that the efficiency penalty grew exponentially as a string grew in size, it makes sense to split strings up into smaller components. If scanning a 1.1MB string took 1.5 days, I'll bet that scanning eleven 100KB strings would take only a fraction of that amount of time, since the regexp engine simply wouldn't have as much to keep track of in each scan... it wouldn't get as bogged down in its own churning.

I believe that concept can be more generally applied to regular expressions. It is probably nearly always quicker to match 1MB as ten smaller strings than as one 1MB string, even with the additional overhead of cranking up the engine 10 times. This is all just personal theory, as I have yet to benchmark it. But when I do, I'll post my findings. Obviously there has to be some point at which it's just not beneficial to make the string any smaller. And at some point you also have to say, this is Perl, not hand-optimized machine code. Move on.
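
A rough sketch of how that theory could be put to the test, using the core Benchmark module. The test data is invented for illustration (generated filler lines with an occasional 'While'), and the chunks are simply the individual lines, so no match can straddle a chunk boundary. With arbitrary fixed-size chunks that would have to be handled separately. No timings are claimed here; the numbers will depend on the Perl version and the data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: mostly filler, with 'While' on every 50th line.
my @lines;
for my $i (1 .. 20_000) {
    if ($i % 50 == 0) {
        push @lines, "  While (x) { y(); }\n";
    }
    else {
        push @lines, "int v$i = $i; /* filler */\n";
    }
}
my $whole = join '', @lines;

# Count case-insensitive matches in one large string.
sub count_whole {
    my ($text) = @_;
    my $count = () = $text =~ /\bwhile\b/gi;
    return $count;
}

# Count the same matches one chunk (here: one line) at a time.
sub count_chunked {
    my ($chunks) = @_;
    my $total = 0;
    for my $chunk (@$chunks) {
        my $count = () = $chunk =~ /\bwhile\b/gi;
        $total += $count;
    }
    return $total;
}

# Both approaches must agree before the timings mean anything.
die "counts differ" unless count_whole($whole) == count_chunked(\@lines);

# Run each approach a fixed number of times and print a comparison table.
cmpthese(50, {
    whole   => sub { count_whole($whole) },
    chunked => sub { count_chunked(\@lines) },
});
```

Varying the chunk size (single lines, 1KB, 100KB) would show where, if anywhere, smaller stops being better.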

In the first edition of MRE, Friedl did suggest that he had no idea why the /i modifier had to be so costly. It was apparent to him that there was copying (of the string being scanned for matches) that simply didn't need to be there.

Also missing from the first edition are some of the newer, more experimental Regexp components, such as (?> .... ). I had to turn to the perldocs to figure out what it meant when I saw Abigail II use it the other day in a post.
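
For what it's worth, a small sketch of what (?> ... ) does, as I understand it from the perldocs: once the group has matched, the engine will not backtrack into it to give characters back. The test string and patterns are just an illustration.

```perl
use strict;
use warnings;

my $text = "aaab";

# Ordinary group: a+ first grabs "aaa", then backtracks to "aa"
# so that the trailing "ab" can match.
my $plain_matches = $text =~ /a+ab/ ? 1 : 0;

# Atomic group: (?>a+) grabs every "a" it can and refuses to give any
# back, so the "a" that must follow can never be found.
my $atomic_matches = $text =~ /(?>a+)ab/ ? 1 : 0;

print "plain:  ", ($plain_matches  ? "match" : "no match"), "\n";
print "atomic: ", ($atomic_matches ? "match" : "no match"), "\n";
```

The point of the construct is to cut off backtracking that can never succeed, which is one way to keep a pattern from churning on long near-misses.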

Generally, it is safe to refer to the perldocs as the most up-to-date authority. The problem with respect to Regular Expressions is that Friedl's book is so much better than any of the online documentation that it is tempting to refer to it instead, and this time it tripped me up.

Thanks again for the update.

Dave

"If I had my life to do over again, I'd be a plumber." -- Albert Einstein


In reply to Re: Re: Re: Case in regular expressions by davido
in thread Case in regular expressions by RolandGunslinger
