PerlMonks  

Re: line by line match on an array of strings

by pilcrow (Sexton)
on Jan 09, 2008 at 18:05 UTC ( [id://661438]=note )


in reply to line by line match on an array of strings

Is there a more elegant solution?

Substituting speed for elegance, there are many quick wins for this sort of thing. Precompile your regexen with qr//, so each iteration doesn't compile them anew. Break out of the loop after the first successful match, if appropriate (the List::MoreUtils any approach does this, I think). Optimize the regex list and its application: can you profitably combine them, add anchors or assertions like \b or ^, etc.? Keep track of regex hit counts and re-sort your regex list now and then to apply the most common matches first, if appropriate.
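A minimal sketch of those quick wins, assuming a small made-up pattern list (the original poster's patterns were not shown). The names @patterns, @rules, first_match, resort_rules and $combined are hypothetical and exist only for illustration:

```perl
use strict;
use warnings;

# Hypothetical patterns; stand-ins for the real list.
my @patterns = ('\bfoo\b', '^bar', 'baz\d+');

# Compile each pattern once, up front, and pair it with a hit counter.
my @rules = map { +{ re => qr/$_/, hits => 0 } } @patterns;

# Alternatively, fold everything into one alternation so each line is
# scanned by a single regex (only useful if you don't need to know
# which pattern matched).
my $combined = do {
    my $alt = join '|', map { "(?:$_)" } @patterns;
    qr/$alt/;
};

# Return the first rule that matches $line, or undef if none does.
sub first_match {
    my ($line) = @_;
    for my $rule (@rules) {
        if ($line =~ $rule->{re}) {
            $rule->{hits}++;
            return $rule;    # stop at the first successful match
        }
    }
    return;
}

# Move the most frequently matching rules to the front of the list.
sub resort_rules {
    @rules = sort { $b->{hits} <=> $a->{hits} } @rules;
}
```

Calling resort_rules every few thousand lines, rather than after every match, keeps the sorting cost negligible. Whether any of this pays off depends on the real patterns and data, so it is worth benchmarking before and after.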

The biggest win typically comes from rethinking the problem, of course. Without really knowing what you're attempting, it looks as though you might be trying to do some comparatively simple token matches. Something like

while (<INPUT>) {
    # /x lets the pattern contain whitespace for readability
    if (/\b keyword \s+ (\w+)/x and exists $keywords{ $1 }) {
        # .. do something with the token in $1
    }
}
might do the trick. We'd need to see specific examples to give more specific advice. -Mike

Re^2: line by line match on an array of strings
by WoodyWeaver (Monk) on Jan 09, 2008 at 21:38 UTC
    Just trying to restate what Mike was saying.

    > however with a few hundred thousand lines to search, and an array of a few hundred it is far too slow.

    It could be slow because it has to do a lot of work at each step, which is where optimizing the regex helps.

    I think it is slow because your looping is of order (a few hundred thousand) TIMES (a few hundred).

    It would be much better if the looping were of order (a few hundred thousand) times a big constant. You might be able to get there with 'precompile your regexen' (wonderful phrase), or imho more likely, if your line can be broken into a small number of tokens, just do a dispatch table on tokens broken out from the line.
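A rough sketch of the dispatch-table idea, assuming lines split cleanly on whitespace. The keywords (start, stop), the %seen counter and process_line are all made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical keywords and handlers; real ones would do real work.
my %seen;
my %dispatch = (
    start => sub { $seen{start}++ },
    stop  => sub { $seen{stop}++ },
);

# Split a line into whitespace-separated tokens and run the handler
# for each known token. Returns the number of tokens handled.
sub process_line {
    my ($line) = @_;
    my $handled = 0;
    for my $token (split ' ', $line) {
        my $handler = $dispatch{$token} or next;   # unknown token: skip
        $handler->($token);
        $handled++;
    }
    return $handled;
}
```

Each token costs one hash lookup no matter how many keywords are in the table, so the work per line depends on the number of tokens in the line rather than on the size of the keyword list.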

    --woody
