Speed of Perl Regex Engine

Clovis_Sangrail has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks,

I use Perl to generate daily audit reports from sets of Journal Log Files produced by GT.M, an implementation of the MUMPS database/language. Each Journal line of interest includes a Username, a Global Variable and a description of the transaction on it. The report just presents a listing and count of the Global Variable modifications, broken out by Username.

The customer wanted the capability to ignore some Globals that were not of interest. They can edit a file of such Globals, and I read that file and build an Inclusive-Or type of Regex that I pass to the Perl program as a commandline parameter. The program matches the Global Variable name from each Journal line against that Regex, and skips it if found.

But I did not realize just how popular this capability would be! I figured there would only ever be a few such Globals to skip, but the Customer has entered 54 of them so far, and they say there will be more! The Regex that I give to the Perl program is now about 750 characters long, and some of the bigger banks being audited produce over a million lines of Journal each day.

The reports for those banks do take noticeably longer to produce than when the system first went online, and I don't have much knowledge of or feel for the performance of the Perl Regex engine. Is it linear, like will it take ten times as long to match against a 600-character Regex than against a 60-character one?

I realize that this is just the sort of thing that enterprising Perl students study via test programs, and I may do that sort of thing. But I also do want to be able to tell the folks who sign my check that I am asking around, too.

Replies are listed 'Best First'.
Re: Speed of Perl Regex Engine by moritz (Cardinal) on Nov 28, 2012 at 16:23 UTC
"The reports for those banks do take noticeably longer to produce than when the system first went online"

That sounds as if lots of stuff might have been changed in between. Run a profiler over the script(s) and see where the time is actually spent.

"I don't have much knowledge of or feel for the performance of the Perl Regex engine. Is it linear, like will it take ten times as long to match against a 600-character Regex than against a 60-character one?"

In general, it doesn't depend much on the length of regex, but on the amount of backtracking and searching that the regex engine has to do. If it's just a big alternation of constant strings, and you use perl 5.10.0 or newer, the trie optimization in the regex engine should handle that case very well (sub-linear even). If your regex grows too big, try increasing ${^RE_TRIE_MAXBUF} -- but only if it's the regex that's actually slow.

And as already mentioned, if you can solve your problem through a hash lookup, that would be even better.

Perl 6 - the future is here, just unevenly distributed
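For what it's worth, here is a minimal sketch of the "big alternation of constant strings" case described above. The global names and the `build_skip_re` helper are invented for illustration, and the `quotemeta` call and `\A...\z` anchors are assumptions about what an exact-name skip list would need. The `${^RE_TRIE_MAXBUF}` value is an arbitrary example, not a recommendation.

```perl
use strict;
use warnings;

# Hypothetical skip list; the real one would come from the customer's file.
my @skip = qw( ^TMP ^SCRATCH ^AUDITLOG );

# Build one anchored alternation of literal strings, compiled once.
# quotemeta keeps characters such as '^' from being read as regex syntax.
sub build_skip_re {
    my @names = @_;
    my $alt = join '|', map { quotemeta } @names;
    return qr/\A(?:$alt)\z/;
}

# Only worth touching if a very large alternation is shown to be slow.
# It must be set before the pattern is compiled.
${^RE_TRIE_MAXBUF} = 1 << 20;

my $skip_re = build_skip_re(@skip);

for my $global (qw( ^TMP ^ACCOUNT ^TMPX )) {
    printf "%-10s %s\n", $global, $global =~ $skip_re ? 'skip' : 'keep';
}
```

Because the pattern is anchored at both ends, a name that merely starts with a skipped name (`^TMPX` here) is kept.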
Re: Speed of Perl Regex Engine by runrig (Abbot) on Nov 28, 2012 at 16:17 UTC
Does it need to be a regex? If you can live with an exact match, then I'd use a hash:

`my %want_global = map { ($_ => 1) } qw( THIS THAT ANOTHER ); ... if ( $want_global{$global} ) { ... }`

If you really need to have regexes, then put them in an array instead of a single regex; it will generally perform better than a joined single regex (and put the most likely things to match first, if possible):

`my @wanted_re = ( qr/^AB/, qr/^CD/, ); sub want_global { my $g = shift; for my $re (@wanted_re) { return 1 if $g =~ /$re/; } return; }`

Update: I see that maybe you want 'unwanted global' logic...whatever...the above still applies...adjust to suit needs.
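A rough sketch of how the hash approach might look when the names come from the customer's editable file. The file contents, the `load_skip_list` helper and the support for `#` comment lines are all assumptions made for the example; an in-memory filehandle stands in for the real file so the snippet runs on its own.

```perl
use strict;
use warnings;

# Stand-in for the customer's editable file (invented names).
my $skip_file = "TMP\nSCRATCH\n\n# comment line\nAUDITLOG  \n";

# Read one name per line into a hash for constant-time lookups.
sub load_skip_list {
    my ($fh) = @_;
    my %skip;
    while ( my $line = <$fh> ) {
        $line =~ s/^\s+|\s+$//g;                  # trim whitespace and newline
        next if $line eq '' or $line =~ /^#/;     # ignore blanks and comments
        $skip{$line} = 1;
    }
    return \%skip;
}

open my $fh, '<', \$skip_file or die "Cannot open skip list: $!";
my $skip = load_skip_list($fh);
close $fh;

# Count modifications per global, skipping the listed ones.
my %count;
for my $global (qw( ACCOUNT TMP ACCOUNT LEDGER SCRATCH )) {
    next if $skip->{$global};    # one hash lookup per journal line
    $count{$global}++;
}
print "$_: $count{$_}\n" for sort keys %count;
```

The cost of each lookup does not grow with the number of entries in the skip list, which is the appeal over a pattern that keeps getting longer.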
Re: Speed of Perl Regex Engine by Clovis_Sangrail (Beadle) on Nov 28, 2012 at 18:19 UTC
Thanks runrig & moritz. I'm glad that you suggested the use of a hash, I had said the same to my managers, and now I can say independent experts recommend it too. (I'd never heard of "map" before...) It'd be interesting to run the profiler, I've never done that.
Re: Speed of Perl Regex Engine by flexvault (Monsignor) on Nov 28, 2012 at 21:26 UTC
Clovis_Sangrail,

I'm not a big regex user, so my comments may not reflect what others have experienced.

A few years ago (2003), we saw an explosion in spam on our email machines to more than 100K emails per day per machine. We were using MailScanner to process the email, and found that it couldn't keep up with the quantity we were receiving. So I wrote a preprocessor with Perl, and the quickest and dirtiest trick was to search on 'unique' phrases in the body of the email to identify email that was 'known' spam before passing the result to MailScanner. The original was about 300 lines of script. Since then it's grown to 5000++ lines and was split into 2 persistent scripts. The average email machine now processes more than 1,000,000 emails per day.

I use 'Time::HiRes' to time the 'while' loop that tests for spam identified within the body of the email. The basic test is:

`use Time::HiRes qw(gettimeofday);
my $stime = gettimeofday();
$body = lc($body);    ## All whitespace and punctuation has been removed
foreach my $var ( @BD_data ) {
    my $sz = index( $body, $var );
    if ( $sz >= 0 ) { . . . last; }
}
my $looptime = gettimeofday() - $stime;    ## This value is logged!`

In testing I tried to use a regex, figuring I could include the 'lc' as part of the regex. All benchmarks showed the regex to be much slower than using 'lc' with 'index'.

Why this is important to you is that the '$body' averaged 10KB and the '@BD_data' usually had more than 1K elements. And the clients on the email machines that had problems were banks, and the '$looptime' rarely exceeded 100ms. '@BD_data' is ordered by the frequency of spam activity, so the most common 'spam' term is first.

So my suggestion is to try using 'index' and see if it helps.

Good Luck!

"Well done is better than well said." - Benjamin Franklin
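If anyone wants to try the 'index' versus regex comparison themselves, something along these lines might serve as a starting point. It uses the core Benchmark module; the phrases, the body and the two helper subs are invented test data rather than anything from the setup described above, so the numbers it prints say nothing about real mail or journal files.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Invented test data: 200 phrases and a body of roughly 10 KB that contains
# only the last phrase, so the index loop has to walk the whole list.
my @phrases = map { "phrase$_" } 1 .. 200;
my $body    = ( 'x' x 10_000 ) . $phrases[-1];

# One alternation of literal strings, compiled once.
my $alt = join '|', map { quotemeta } @phrases;
my $re  = qr/$alt/;

# Approach 1: one index() call per phrase.
sub found_by_index {
    for my $p (@phrases) {
        return 1 if index( $body, $p ) >= 0;
    }
    return 0;
}

# Approach 2: a single match against the combined pattern.
sub found_by_regex { return $body =~ $re ? 1 : 0 }

# Run each candidate for about one CPU second and print a comparison table.
cmpthese( -1, {
    index_loop => \&found_by_index,
    one_regex  => \&found_by_regex,
} );
```

Results will depend on the perl version, the data and where the first hit falls, so it is worth rerunning with a sample of the real input before drawing conclusions.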
Re: Speed of Perl Regex Engine by Jenda (Abbot) on Nov 29, 2012 at 16:43 UTC
You said you build the regex and pass it to the program as a parameter ... how do you use it then?

`my $regexp = $ARGV[1]; ... next if ($global =~ /$regexp/); ...`

`my $regexp = $ARGV[1]; ... next if ($global =~ /$regexp/o); ...`

`my $regexp = qr/$ARGV[1]/; ... next if ($global =~ $regexp); ...`

The first version compiles the regexp each time you use it, the other two just once. For a longer regexp this may make a big difference.

Jenda
Enoch was right! Enjoy the last years of Rome.
Re^2: Speed of Perl Regex Engine by runrig (Abbot) on Nov 29, 2012 at 19:30 UTC
I believe in later versions of Perl (>= 5.6), this is no longer true. As long as `$regexp` doesn't change, `/$regexp/` isn't compiled again, making "/o" (mostly) obsolete.
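Either way, compiling the command-line pattern once with `qr//` makes the intent explicit and gives a natural place to catch a malformed pattern. A small sketch, where `compile_pattern`, the fallback pattern and the sample globals are all hypothetical:

```perl
use strict;
use warnings;

# Compile a pattern string once; a bad pattern is reported with some context
# instead of surfacing later inside the main loop.
sub compile_pattern {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };
    die "Invalid pattern '$pattern': $@" unless defined $re;
    return $re;
}

# Use the first command-line argument, or a made-up pattern if none is given.
my $regexp = compile_pattern( @ARGV ? $ARGV[0] : '^(?:TMP|SCRATCH)$' );

for my $global (qw( TMP ACCOUNT SCRATCH )) {
    next if $global =~ $regexp;    # reuses the compiled object, no /o needed
    print "$global\n";
}
```

With a 750-character pattern assembled from a hand-edited file, the validation step may matter as much as the compile-once behaviour.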

