http://qs321.pair.com?node_id=811525

kyle has asked for the wisdom of the Perl Monks concerning the following question:

My general question is how to improve the performance of a regular expression. I have a fairly simple log analysis program. I've profiled it, and it spends more time matching my regular expressions than doing anything else. This may be because my input is rather large, and my "analysis" is to feed it to Text::CSV.

In any case, it occurred to me that I have no idea how to improve matching performance. I'm assuming there is not something like Devel::NYTProf for regular expressions. I don't know much about how the engine works under the hood, so I can't make any guesses about what might be taking the most time. I know that it's possible to write an expression that takes a year and a day to match, but I don't know the particulars.
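One tool that may come close is the core `re` pragma's debug mode, which prints the compiled form of a pattern and a trace of each match attempt to STDERR. The snippet below is only a sketch: the sample line and the cut-down request pattern are made up for illustration and are not the real log format.

```perl
use strict;
use warnings;
use re 'debug';    # dumps the compiled program and a match trace to STDERR

# Hypothetical sample input and a cut-down pattern, for illustration only.
my $line = '"GET /index.html HTTP/1.1"';
my ( $method, $url, $protocol )
    = $line =~ m{ \" ([A-Z]+) \s (.*?) \s (HTTP/1\.\d) \" }xms;

print "method=$method url=$url protocol=$protocol\n";
```

The trace is verbose, so it is probably most useful on a single representative line rather than a whole log file. Long runs of repeated attempts in the trace would hint at where backtracking happens.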

The actual expressions I'm using are shown below. My input files are Apache logs in a somewhat customized format.

    # Things like ( (?:internal_ip){0} . ) were originally Perl 5.10
    # named captures like (?<internal_ip> . ), but I don't
    # expect to have 5.10 on the target.

    # [04/Oct/2009:06:25:20 -0500]
    my $time_rx = qr{
        \[
        ( (?:day){0}  \d\d ) /
        ( (?:mon){0}  Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec ) /
        ( (?:year){0} \d{4} ) :
        ( (?:hour){0} \d\d ) :
        ( (?:min){0}  \d\d ) :
        ( (?:sec){0}  \d\d ) \s
        ( (?:tz){0}   [+-] \d{4} )
        \]
    }xms;
    my @time_rx_fields = qw( day mon year hour min sec tz );

    my $request_rx = qr{
        \"
        ( (?:method){0}   [A-Z]+ ) \s
        ( (?:url){0}      .*? ) \s
        ( (?:protocol){0} HTTP/1.\d )
        \"
    }xms;
    my @request_rx_fields = qw( method url protocol );

    my $ip_rx = qr{ [12]?\d?\d (?: \. [12]?\d?\d ){3} }xms;

    my $pid_log_rx = qr{
        ^
        ( (?:user){0} \S+ ) \s
        $time_rx \s
        $request_rx \s
        ( (?:status){0} \d{3} ) \s
        ( (?:bytes){0} - | \d+ ) \s
        \[ ( (?:pid){0} \d+ ) \]
        \s* $
    }xms;
    my @pid_log_rx_fields = (
        'user', @time_rx_fields, @request_rx_fields,
        'status', 'bytes', 'pid'
    );

    my $frontend_log_rx = qr{
        ^
        ( (?:external_ip){0} $ip_rx ) \s* , \s*
        ( (?:internal_ip){0} $ip_rx ) \s
        ( (?:group){0} \S+ ) \s
        ( (?:user){0} \S+ ) \s
        $time_rx \s
        $request_rx \s
        ( (?:status){0} \d{3} ) \s
        ( (?:bytes){0} - | \d+ ) \s
        ( (?:ms){0} \d+ )
        \s* $
    }xms;
    my @frontend_log_rx_fields = (
        'external_ip', 'internal_ip', 'group', 'user',
        @time_rx_fields, @request_rx_fields,
        'status', 'bytes', 'ms'
    );

    # Those get used here:
    sub line_in {
        my ( $fh, $rx, $fields_ref ) = @_;
        return if ref $fields_ref ne ref [];

        my $line = <$fh>;
        return if ! defined $line;

        my @captures = $line =~ $rx;
        return if ! @captures;

        return {
            line => $line,
            map { $fields_ref->[$_] => $captures[$_] } 0 .. $#{$fields_ref}
        };
    }

I'd welcome any suggestions on this particular problem, but I'm really interested in the more general question of how I could answer this question myself. Are there heuristics you've learned? Is there a tutorial on this topic? Any guidance would be appreciated!
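One way to approach the general question is to time candidate variants of a sub-pattern against each other with the core Benchmark module. The sketch below compares the lazy `.*?` URL capture from `$request_rx` with a hypothetical `\S+` alternative. The sample line and both variable names are invented for illustration, and `\S+` is only equivalent if URLs in the real logs never contain whitespace.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample request field; real input would come from the log.
my $sample = '"GET /some/long/path?with=query&more=stuff HTTP/1.1"';

# Variant 1: lazy dot, as in the original $request_rx.
my $lazy_rx     = qr{ \" ([A-Z]+) \s (.*?) \s (HTTP/1\.\d) \" }xms;
# Variant 2 (assumption): URLs contain no whitespace, so \S+ suffices.
my $nonspace_rx = qr{ \" ([A-Z]+) \s (\S+) \s (HTTP/1\.\d) \" }xms;

# Sanity check: both variants must capture the same fields before timing.
my @lazy     = $sample =~ $lazy_rx;
my @nonspace = $sample =~ $nonspace_rx;
die "variants disagree\n" if "@lazy" ne "@nonspace";

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    lazy     => sub { my @c = $sample =~ $lazy_rx },
    nonspace => sub { my @c = $sample =~ $nonspace_rx },
} );
```

Results on a single short string may not reflect behavior on real log lines, so it would be worth repeating the comparison with a few thousand actual lines before drawing conclusions.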