Re: In search of an efficient query abstractor

The regexes do a lot of backtracking and stuff.

Which one? I don't see anywhere that backtracking could occur. Everything there's a choice, only one path can possibly match.

   $query =~ s{
              (?<![\w.+-])
              [+-]?
              (?:
                \d+
                (?:[.]\d*)?
                |[.]\d+
              )
              (?:e[+-]?\d+)?
              \b
             }
             {N}gx;                             # Float/real into N
[download]

The trailing \b is buggy.

"12345 "  -> "N "
"12345. " -> "N. "
[download]

It matched "12345." until the "\b" was reached, then it backtracked to release the "." from the match. Oops, I guess there is backtracking, but only because of the bug.

Comment on Re: In search of an efficient query abstractor Select or Download Code

Replies are listed 'Best First'.
Re^2: In search of an efficient query abstractor by xaprb (Scribe) on Dec 07, 2008 at 20:36 UTC
Hmmm. Good catch, although I thought I had a test case to ensure that was handled correctly. I'll check that :) You seem to know more about regexes than I do. Why would you guess this particular regex is slow? Line-by-line profiling proves that it is consuming the vast majority of the time. It consumes almost 300 CPU seconds, and the next most expensive line in this code consumes 89 seconds, on an 8GB file. The next-most-worst offender is `$query =~ s/(?<=\w_)\d+(_\d+)?\b/$1 ? "N_N" : "N"/eg;` followed by `$query =~ s/\s{2,}/ /g;`	[reply] [d/l] [select]
Re^3: In search of an efficient query abstractor by ikegami (Patriarch) on Dec 07, 2008 at 20:46 UTC
Are you doing the whole 8GB file at once? If your string starts with "a b c d", `$query =~ s/\s{2,}/ /g;` needs to copy 32GB of text. Just for the first 10 characters.	[reply] [d/l]
Re^4: In search of an efficient query abstractor by xaprb (Scribe) on Dec 07, 2008 at 20:57 UTC
No, it's done one entry at a time. Each entry is a header with some commented lines, followed by a query. There are special cases, but it generally looks like `# Time: 071015 21:43:52 # User@Host: root[root] @ localhost [] # Query_time: 2 Lock_time: 0 Rows_sent: 1 Rows_examined: 0 use test; select sleep(2) from n;` [download]	[reply] [d/l]
Re^3: In search of an efficient query abstractor by Corion (Patriarch) on Dec 07, 2008 at 20:50 UTC
At least for the last case, Perl has an optimized version: `$query =~ tr[ ][]s;` [download] and it should be faster or at least as fast as the `s///` version. Another version to try would be `s/\s+/ /g` - there is no need to use the counting variant of `{2,}`, and skipping might be slower than just writing the output "replacement".	[reply] [d/l] [select]
Re^4: In search of an efficient query abstractor by xaprb (Scribe) on Dec 08, 2008 at 01:19 UTC
`$query =~ tr[ \n\t\r\f][ ]s;` turns out to be a lot faster than any s/// variant. That change moves this line from #4 badness to #28 badness. Still having trouble with the floats, though.	[reply] [d/l]
Re^2: In search of an efficient query abstractor by xaprb (Scribe) on Dec 07, 2008 at 21:00 UTC
So it occurs to me there's an OBVIOUS reason why this regex might be the most costly: as other commenters have mentioned, I'm doing this on a bunch of stuff that might be embedded inside of strings, and what have you. So even if I leave all the regexes as they are, I think I can just reorder them and make things potentially a lot faster -- even if I do still have multiple passes through the string. I'll take a look at a strategy of "first, throw away as much as I can, then work with what's left."	[reply]


Keep It Simple, Stupid
	PerlMonks