Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Re: In search of an efficient query abstractor

by ikegami (Patriarch)
on Dec 07, 2008 at 17:59 UTC ( [id://728756]=note: print w/replies, xml ) Need Help??


in reply to In search of an efficient query abstractor

The regexes do a lot of backtracking and stuff.

Which one? I don't see anywhere that backtracking could occur. Everything there's a choice, only one path can possibly match.

$query =~ s{ (?<![\w.+-]) [+-]? (?: \d+ (?:[.]\d*)? |[.]\d+ ) (?:e[+-]?\d+)? \b } {N}gx; # Float/real into N

The trailing \b is buggy.

"12345 " -> "N " "12345. " -> "N. "

It matched "12345." until the "\b" was reached, then it backtracked to release the "." from the match. Oops, I guess there is backtracking, but only because of the bug.

Replies are listed 'Best First'.
Re^2: In search of an efficient query abstractor
by xaprb (Scribe) on Dec 07, 2008 at 20:36 UTC

    Hmmm. Good catch, although I thought I had a test case to ensure that was handled correctly. I'll check that :) You seem to know more about regexes than I do. Why would you guess this particular regex is slow? Line-by-line profiling proves that it is consuming the vast majority of the time. It consumes almost 300 CPU seconds, and the next most expensive line in this code consumes 89 seconds, on an 8GB file.

    The next-most-worst offender is

    $query =~ s/(?<=\w_)\d+(_\d+)?\b/$1 ? "N_N" : "N"/eg;

    followed by

    $query =~ s/\s{2,}/ /g;

      Are you doing the whole 8GB file at once? If your string starts with "a b c d", $query =~ s/\s{2,}/ /g; needs to copy 32GB of text. Just for the first 10 characters.

        No, it's done one entry at a time. Each entry is a header with some commented lines, followed by a query. There are special cases, but it generally looks like

        # Time: 071015 21:43:52 # User@Host: root[root] @ localhost [] # Query_time: 2 Lock_time: 0 Rows_sent: 1 Rows_examined: 0 use test; select sleep(2) from n;

      At least for the last case, Perl has an optimized version:

      $query =~ tr[ ][]s;

      and it should be faster or at least as fast as the s/// version. Another version to try would be s/\s+/ /g - there is no need to use the counting variant of {2,}, and skipping might be slower than just writing the output "replacement".

        $query =~ tr[ \n\t\r\f][ ]s; turns out to be a lot faster than any s/// variant. That change moves this line from #4 badness to #28 badness. Still having trouble with the floats, though.

Re^2: In search of an efficient query abstractor
by xaprb (Scribe) on Dec 07, 2008 at 21:00 UTC

    So it occurs to me there's an OBVIOUS reason why this regex might be the most costly: as other commenters have mentioned, I'm doing this on a bunch of stuff that might be embedded inside of strings, and what have you. So even if I leave all the regexes as they are, I think I can just reorder them and make things potentially a lot faster -- even if I do still have multiple passes through the string. I'll take a look at a strategy of "first, throw away as much as I can, then work with what's left."

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://728756]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (3)
As of 2024-04-25 13:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found