Matching simple patterns - is there a faster way?

mrguy123 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Matching simple patterns - is there a faster way? by tobyink (Canon) on Aug 09, 2012 at 07:38 UTC
Realistically if you're downloading thousands of HTML pages and matching them against a regexp, it's unlikely to be the regexp that's slowing you down. The first place to look at optimizing is the HTTP layer itself. When you request the pages are you sending an `Accept-Encoding` header? In many cases, this will allow the pages to be transmitted with compression. (Many of the Perl HTTP libs will automatically handle the decompression for you.) If you're using LWP::UserAgent, perhaps try switching to Furl::HTTP which is somewhat faster. It doesn't have quite as many features as LWP::UserAgent, but it might have enough for you. Could you try requesting pages in parallel rather than one at a time? If so, take a look at AnyEvent::HTTP. `perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'`	[reply] [d/l]
Re^2: Matching simple patterns - is there a faster way? by mrguy123 (Hermit) on Aug 09, 2012 at 08:09 UTC
You're indeed correct that the HTML download is the bottleneck, and I am doing a lot of work to optimize that (including parallel processing) However, if I can make the regexes a bit faster it will also help, and to be honest I am also curious about the answer to this question	[reply]
Re: Matching simple patterns - is there a faster way? by Erez (Priest) on Aug 09, 2012 at 07:26 UTC
is regex matching the fastest way to go? Depends. In Perl it probably is, being a well implemented algorithm. In general it's not, since the regex algorithm isn't as fast as the one you find in Unix grep and similar, but is more robust and allows for a lot of clever and powerful matching. But that's obviously the wrong question, given that a: this looks like a premature optimization, and b: Perl isn't that fast to begin with, due to a lot of stuff that goes behind the scenes which allows for all the flexibility and DWIMability. A better, I believe, course would be to profile the entire program run, see what are the bottlenecks and then figure how to better iron them out, rather than assuming the regex engine will be your biggest problem, rather than, say, Perl's IO. Principle of Least Astonishment: Any language that doesn’t occasionally surprise the novice will pay for it by continually surprising the expert	[reply]
Re^2: Matching simple patterns - is there a faster way? by mrguy123 (Hermit) on Aug 09, 2012 at 07:55 UTC
Hi Erez, you're correct that the bottleneck is not the regex matching (it is the HTML downloading). However, if I can make the regex matching a bit faster it will be helpful Also, as I mentioned before, this is I think an interesting question that I would like to hear the Monks opinion on BTW, I enjoyed your talk on "The view from under the bridge" in the Israeli Perl Workshop earlier this year	[reply]
Re: Matching simple patterns - is there a faster way? by kcott (Archbishop) on Aug 09, 2012 at 06:50 UTC
You may find index to be a suitable alternative. You can use Benchmark to see if this, or other alternatives, are faster. -- Ken	[reply]
Re: Matching simple patterns - is there a faster way? by flexvault (Monsignor) on Aug 09, 2012 at 09:43 UTC
mrguy123, I benchmarked the code replacing the 'print' with '{ }': `Rate index regex regex 1562500/s -- -34% index 2380952/s 52% --` [download] Your not testing print, are you? What is funny, is that I ran the code 3 times, and the exact same result occurred all 3 times. I usually take the middle one! "Well done is better than well said." - Benjamin Franklin	[reply] [d/l]
Re^2: Matching simple patterns - is there a faster way? by mrguy123 (Hermit) on Aug 09, 2012 at 11:49 UTC
Hi, please see my benchmark test use strict; use Benchmark qw(cmpthese); # A big, sorted list (475254 elements) my @list = ('a'..'zzzz'); ##Put the list into a string my $string = join(/ /, @list); $string .= "\n"; cmpthese(0, { # Looking for a random item in @list with grep index => sub { # A random element in the list my $pattern = 'fqsu fqsv fqsw fqsx fqsy'; my $result = index($string, $pattern); }, # Looking for a random item in @list linearly (with foreach) regex => sub { my $pattern = 'fqsu fqsv fqsw fqsx fqsy'; $string =~ /$pattern/; }, } ); [download] I also did another test where I read a proper HTML file and received similar results. Can you please tell me what you did different? Everybody seems to think I'm lazy I don't mind, I think they're crazy	[reply] [d/l]
Re^3: Matching simple patterns - is there a faster way? by flexvault (Monsignor) on Aug 09, 2012 at 14:15 UTC
mrguy123, As requested ( I typed this in from the code ): `sub index { if ( index('yabba dabba doo','abba' ) >= 0 ) { } } sub regex { $_ = 'yabba dabba doo'; if ( /abba/ ) { } } cmpthese ( 500_000, { regex => sub { &regex }, index => sub { &index }, }, );` [download] I have found 'index' usually faster than 'regex' in this type of situation, but 'regex' has more power when needed. Regards...Ed "Well done is better than well said." - Benjamin Franklin	[reply] [d/l]
Re^2: Matching simple patterns - is there a faster way? by mrguy123 (Hermit) on Aug 09, 2012 at 12:15 UTC
Hi flexvault, I am assuming you did the benchmark on the 'yabba dabba doo' example, which makes perfect sense because that is what I posted. However, the real data is much larger and that may cause the difference in the benchmark You can see my code example below for regex/index matching with larger amount of data	[reply]
Re: Matching simple patterns - is there a faster way? by Anonymous Monk on Aug 09, 2012 at 07:29 UTC
Any ideas? The regex engine is always slower, but it really doesn't matter, 1000s of files isn't that much So for now regex matching is still faster. I am doing more testing to make sure this is the correct result. Post your code, your benchmark might be flawed ( Compare to this benchmark Re: regex alternation )	[reply]


We don't bite newbies here... much
	PerlMonks