
My general question is how to improve the performance of a regular expression. I have a fairly simple log analysis program. I've profiled it, and it spends more time matching the regular expressions I wrote for it than doing anything else. This may be because my input is rather large, and my "analysis" is just to feed it to Text::CSV.

In any case, it occurred to me that I have no idea how to improve matching performance. I'm assuming there is not something like Devel::NYTProf for regular expressions. I don't know much about how the engine works under the hood, so I can't make any guesses about what might be taking the most time. I know that it's possible to write an expression that takes a year and a day to match, but I don't know the particulars.
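Not a profiler, but for reference: the core `re` pragma can trace what the engine does while compiling and running a pattern. The following is a minimal sketch under assumptions. The sample text and the cut-down pattern are hypothetical, and the block scoping of the pragma relies on Perl 5.10 or later (on older perls the pragma is not lexically scoped, as far as I know).

```perl
use strict;
use warnings;

# Hypothetical sample text, for illustration only.
my $text = 'GET /index.html HTTP/1.1';

my $matched;
{
    # Traces compilation and execution of regexes in this block to STDERR.
    # Assumption: Perl 5.10+, where the pragma is lexically scoped.
    use re 'debug';
    $matched = $text =~ m{ \s (.*?) \s HTTP/1\.\d }xms;
}

print "matched\n" if $matched;
```

The trace is verbose, so it is probably most useful on one short line and one sub-pattern at a time, for example to see how often a lazy `.*?` has to retry before the rest of the pattern succeeds.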

The actual expressions I'm using are below. My input files are Apache logs in a somewhat customized format.

# Things like ( (?:internal_ip){0} . ) were originally Perl 5.10
# named captures like (?<internal_ip> . ), but I don't
# expect to have 5.10 on the target.

# [04/Oct/2009:06:25:20 -0500]
my $time_rx = qr{
    \[
    ( (?:day){0} \d\d )
    /
    ( (?:mon){0} Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec )
    /
    ( (?:year){0} \d{4} )
    :
    ( (?:hour){0} \d\d ) : ( (?:min){0} \d\d ) : ( (?:sec){0} \d\d )
    \s
    ( (?:tz){0} [+-] \d{4} )
    \]
}xms;
my @time_rx_fields = qw( day mon year hour min sec tz );

my $request_rx = qr{
    \"
    ( (?:method){0} [A-Z]+ )
    \s
    ( (?:url){0} .*? )
    \s
    ( (?:protocol){0} HTTP/1\.\d )
    \"
}xms;
my @request_rx_fields = qw( method url protocol );

my $ip_rx = qr{ [12]?\d?\d (?: \. [12]?\d?\d ){3} }xms;

my $pid_log_rx = qr{
    ^
    ( (?:user){0} \S+ )
    \s
    $time_rx
    \s
    $request_rx
    \s
    ( (?:status){0} \d{3} )
    \s
    ( (?:bytes){0} - | \d+ )
    \s
    \[ ( (?:pid){0} \d+ ) \]
    \s*
    $
}xms;
my @pid_log_rx_fields = (
    'user', @time_rx_fields, @request_rx_fields, 'status',
    'bytes', 'pid'
);

my $frontend_log_rx = qr{
    ^
    ( (?:external_ip){0} $ip_rx )
    \s* , \s*
    ( (?:internal_ip){0} $ip_rx )
    \s
    ( (?:group){0} \S+ )
    \s
    ( (?:user){0} \S+ )
    \s
    $time_rx
    \s
    $request_rx
    \s
    ( (?:status){0} \d{3} )
    \s
    ( (?:bytes){0} - | \d+ )
    \s
    ( (?:ms){0} \d+ )
    \s*
    $
}xms;
my @frontend_log_rx_fields = (
    'external_ip',
    'internal_ip',
    'group',
    'user',
    @time_rx_fields,
    @request_rx_fields,
    'status',
    'bytes',
    'ms'
);

# Those get used here:
sub line_in {
    my ( $fh, $rx, $fields_ref ) = @_;

    return if ref $fields_ref ne ref [];

    my $line = <$fh>;
    return if ! defined $line;

    my @captures = $line =~ $rx;
    return if ! @captures;

    return { line => $line, map { $fields_ref->[$_] => $captures[$_] }
                                0 .. $#{$fields_ref} };
}
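To make the calling convention concrete, here is a self-contained usage sketch. It repeats `line_in` so that it runs on its own. The three-field pattern, the field names in `@fields`, and the sample line are hypothetical stand-ins for the real log expressions, chosen only to show how positional captures are paired with names.

```perl
use strict;
use warnings;

# Same shape as line_in above: read one line, match it, and pair each
# positional capture with the field name at the same index.
sub line_in {
    my ( $fh, $rx, $fields_ref ) = @_;

    return if ref $fields_ref ne ref [];

    my $line = <$fh>;
    return if !defined $line;

    my @captures = $line =~ $rx;
    return if !@captures;

    return { line => $line, map { $fields_ref->[$_] => $captures[$_] }
                                0 .. $#{$fields_ref} };
}

# Hypothetical, simplified pattern and sample data, for illustration only.
my $rx     = qr{ ^ (\S+) \s (\d{3}) \s ( - | \d+ ) \s* $ }xms;
my @fields = qw( user status bytes );

my $sample = "alice 200 5120\n";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";

my $rec = line_in( $fh, $rx, \@fields );
print "$_=$rec->{$_}\n" for @fields;

# A second read hits end of input, so line_in returns nothing.
my $eof = line_in( $fh, $rx, \@fields );
```

The order of the names in the field list has to follow the order of the capture groups in the pattern, which is why each `*_rx_fields` array is kept next to its expression.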

I'd welcome any suggestions on this particular problem, but I'm really interested in the more general question of how I could answer this question myself. Are there heuristics you've learned? Is there a tutorial on this topic? Any guidance would be appreciated!
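One generic way to compare candidate patterns empirically is the core Benchmark module. The sketch below is only that: the sample request string is made up, and the two spellings of the url capture (lazy `.*?` versus `\S+`) are hypothetical candidates, not a claim about which one is faster on real logs.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Hypothetical sample request; real input would come from the logs.
my $request = '"GET /some/long/path?with=query HTTP/1.1"';

# Two candidate spellings of the url capture.
my $lazy_rx  = qr{ \" ([A-Z]+) \s (.*?) \s (HTTP/1\.\d) \" }xms;
my $nonsp_rx = qr{ \" ([A-Z]+) \s (\S+) \s (HTTP/1\.\d) \" }xms;

# Sanity check: both candidates must capture the same fields
# before their timings are worth comparing.
my @lazy  = $request =~ $lazy_rx;
my @nonsp = $request =~ $nonsp_rx;
die "patterns disagree\n" if "@lazy" ne "@nonsp";

# Run each candidate for at least one CPU second and print a
# comparison table of rates.
cmpthese( -1, {
    lazy_dot  => sub { my @c = $request =~ $lazy_rx },
    non_space => sub { my @c = $request =~ $nonsp_rx },
} );
```

Timings on a single short string can be misleading, so a more representative run would loop over a sample of real log lines inside each sub.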

In reply to How do I optimize a regular expression? by kyle
