My general question is how to improve the performance of a regular expression. I have a fairly simple log analysis program, and I've profiled it, and it spends more time matching the regular expressions that I've made for it than anything else. This may be because my input is rather large, and my "analysis" is to feed it to Text::CSV.
In any case, it occurred to me that I have no idea how to improve matching performance. I'm assuming there is not something like Devel::NYTProf for regular expressions. I don't know much about how the engine works under the hood, so I can't make any guesses about what might be taking the most time. I know that it's possible to write an expression that takes a year and a day to match, but I don't know the particulars.
The actual expressions I'm using are in the readmore block below. My input files are Apache logs in a somewhat customized format.
# Things like ( (?:internal_ip){0} . ) were originally Perl 5.10
# named captures like (?<internal_ip> . ), but I don't
# expect to have 5.10 on the target.
# [04/Oct/2009:06:25:20 -0500]
my $time_rx = qr{
\[
( (?:day){0} \d\d )
/
( (?:mon){0} Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec )
/
( (?:year){0} \d{4} )
:
( (?:hour){0} \d\d ) : ( (?:min){0} \d\d ) : ( (?:sec){0} \d\d )
\s
( (?:tz){0} [+-] \d{4} )
\]
}xms;
my @time_rx_fields = qw( day mon year hour min sec tz );
my $request_rx = qr{
\"
( (?:method){0} [A-Z]+ )
\s
( (?:url){0} .*? )
\s
( (?:protocol){0} HTTP/1\.\d )
\"
}xms;
my @request_rx_fields = qw( method url protocol );
my $ip_rx = qr{ [12]?\d?\d (?: \. [12]?\d?\d ){3} }xms;
my $pid_log_rx = qr{
^
( (?:user){0} \S+ )
\s
$time_rx
\s
$request_rx
\s
( (?:status){0} \d{3} )
\s
( (?:bytes){0} - | \d+ )
\s
\[ ( (?:pid){0} \d+ ) \]
\s*
$
}xms;
my @pid_log_rx_fields = (
'user', @time_rx_fields, @request_rx_fields, 'status',
'bytes', 'pid'
);
my $frontend_log_rx = qr{
^
( (?:external_ip){0} $ip_rx )
\s* , \s*
( (?:internal_ip){0} $ip_rx )
\s
( (?:group){0} \S+ )
\s
( (?:user){0} \S+ )
\s
$time_rx
\s
$request_rx
\s
( (?:status){0} \d{3} )
\s
( (?:bytes){0} - | \d+ )
\s
( (?:ms){0} \d+ )
\s*
$
}xms;
my @frontend_log_rx_fields = (
'external_ip',
'internal_ip',
'group',
'user',
@time_rx_fields,
@request_rx_fields,
'status',
'bytes',
'ms'
);
# Those get used here:
sub line_in {
my ( $fh, $rx, $fields_ref ) = @_;
return if ref $fields_ref ne 'ARRAY';
my $line = <$fh>;
return if ! defined $line;
my @captures = $line =~ $rx;
return if ! @captures;
return { line => $line, map { $fields_ref->[$_] => $captures[$_] }
0 .. $#{$fields_ref} };
}
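To show what the `(?:name){0}` labels do in isolation: the group is quantified zero times, so it never matches anything and only documents the capture next to it. A minimal sketch, assuming a made-up timestamp and a cut-down pattern (`$demo_rx`, `@demo_fields`, and `%time` are names invented for this example, not part of my program):

```perl
use strict;
use warnings;

# (?:day){0} matches zero times, so it acts purely as an inline label.
my $demo_rx = qr{
    \[
    ( (?:day){0}  \d\d )
    /
    ( (?:mon){0}  [A-Z][a-z]{2} )
    /
    ( (?:year){0} \d{4} )
}xms;
my @demo_fields = qw( day mon year );

# Hash slice: pair each field name with the capture in the same position.
my %time;
@time{@demo_fields} = '[04/Oct/2009:06:25:20 -0500]' =~ $demo_rx;

print "$time{day} $time{mon} $time{year}\n";
```

The field list has to be kept in the same order as the capture groups by hand, which is the price of not having real named captures.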
I'd welcome any suggestions on this particular problem, but I'm really interested in the more general question of how I could answer this question myself. Are there heuristics you've learned? Is there a tutorial on this topic? Any guidance would be appreciated!
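One thing I can think of trying myself is timing candidate patterns in isolation with the core Benchmark module. This is only a sketch under assumptions: the sample string is made up, the two patterns are simplified versions of my request pattern, and the `\S+` variant assumes URLs in my logs never contain whitespace, which I have not verified.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Hypothetical sample in the shape of the request field; not real log data.
my $sample = '"GET /index.html?x=1 HTTP/1.1" 200 1234';

# Variant A: lazy dot, as in my current request pattern.
my $lazy_rx = qr{ " ([A-Z]+) \s (.*?) \s (HTTP/1\.\d) " }xms;

# Variant B: assumes the URL never contains whitespace.
my $nonspace_rx = qr{ " ([A-Z]+) \s (\S+) \s (HTTP/1\.\d) " }xms;

# Check that both variants capture the same thing before timing them.
my @lazy     = $sample =~ $lazy_rx;
my @nonspace = $sample =~ $nonspace_rx;
die "variants disagree\n" if "@lazy" ne "@nonspace";

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese( -1, {
    lazy     => sub { my @c = $sample =~ $lazy_rx },
    nonspace => sub { my @c = $sample =~ $nonspace_rx },
} );
```

A single short sample like this says little about a large real log, so the numbers would only be a starting point for comparing variants, not a verdict.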