Efficient Log File Parsing with Regular Expressions

by hackdaddy (Hermit)
on Dec 06, 2002 at 23:48 UTC ( [id://218185] )

hackdaddy has asked for the wisdom of the Perl Monks concerning the following question:

I am attempting to parse log files as efficiently as possible in Perl. In the following code snippet, I need to grab the first 18 fields, the next 40 characters, another 40 characters, and then the remaining fields in the string. The fields can be variable, as you can see in the test data string.

Is there a faster way to do this in Perl? Is there a better regular expression to grab the first 18 fields?

Without a loss of speed, can I create a class that blesses the regex and has methods for returning the elements of the log file line? What is the fastest way to process log files without using, for instance, inline C? Any assistance will be greatly appreciated. Thanks.
#!/usr/local/bin/perl -w
use strict;

my $testdata = <<TESTDATA;
-3 1 2 3 4 5 6657 7 8 9 10 11 12 13 14 15 16 20021013000000 NM 1 : SR9550/1-SR9551/1 16S 1 12 LINE WEST 0 0
-3 2 67 0 0 2 6657 2 1 0 0 0 0 4 131 0 0 20021013000000 Test021011 0 0
-3 3 67 0 0 2 6657 2 1 0 0 0 0 4 131 0 0 20021013000000 Test021011a 0 0
-3 4 67 0 9 6 6657 2 1 0 0 0 0 6 131 0 0 20021013000000 NM 1 : SR9550/1-SR9551/1 16S 1 18 LINE EAST 0 0
-3 5 67 0 0 2 6657 2 1 0 0 0 0 4 131 0 0 20021013001500 Test021011 0 0
-3 6 67 0 9 2 6657 2 1 0 0 0 0 6 131 0 0 20021013001500 NM 1 : SR9550/1-SR9551/1 16S 1 12 LINE WEST 0 0
-3 7 67 0 0 2 6657 2 1 0 0 0 0 4 131 0 0 20021013001500 Test021011a 0 0
-3 8 67 0 9 6 6657 2 1 0 0 0 0 6 131 0 0 20021013001500 NM 1 : SR9550/1-SR9551/1 16S 1 18 LINE EAST 0 0
-3 9 67 0 0 2 6657 2 1 0 0 0 0 4 131 0 0 20021013003000 Test021011 0 0
-3 10 67 0 9 2 6657 2 1 0 0 0 0 6 131 0 0 20021013003000 NM 1 : SR9550/1-SR9551/1 16S 1 12 LINE WEST 0 0
-3 11 67 0 0 2 6657 2 1 0 0 0 0 4 131 0 0 20021013003000 Test021011a 0 0
-3 12 67 0 9 6 6657 2 1 0 0 0 0 6 131 0 0 20021013003000 NM 1 : SR9550/1-SR9551/1 16S 1 18 LINE EAST 0 0
TESTDATA

my @data = split( '\n', $testdata );

my $str_18_fields;
my $str_40_chars1;
my $str_40_chars2;
my $str_remain;

my $regex = qr/^-((\S+\s+){18})(.{40})(.{40})(.+)/;

foreach my $line (@data) {
    if ( $line =~ /^-3/ ) {
        $line =~ m/$regex/;
        $str_18_fields = $1;
        $str_40_chars1 = $3;
        $str_40_chars2 = $4;
        $str_remain    = $5;

        $str_40_chars1 =~ s!\|!_!;
        $str_40_chars2 =~ s!\|!_!;

        print "\$str_18_fields = $str_18_fields\n";
        print "\$str_40_chars1 = $str_40_chars1\n";
        print "\$str_40_chars2 = $str_40_chars2\n";
        print "\$str_remain = $str_remain\n\n";
    }
} # end foreach
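As an aside, the sort of wrapper the question asks about (a class that holds the compiled regex and exposes the pieces of a line through methods) might be sketched roughly as below. The class and method names are invented for illustration only and are not from the original post:

    package LogLine;
    use strict;

    # Sketch only: keep the compiled regex in the class and expose the
    # captured pieces through accessor methods. All names are made up.
    my $REGEX = qr/^-((?:\S+\s+){18})(.{40})(.{40})(.+)$/;

    sub new {
        my ( $class, $line ) = @_;
        my @parts = $line =~ $REGEX
            or return;                     # undef if the line does not match
        return bless {
            fields => $parts[0],           # the first 18 fields
            chars1 => $parts[1],           # first 40-character column
            chars2 => $parts[2],           # second 40-character column
            remain => $parts[3],           # the rest of the line
        }, $class;
    }

    sub fields { $_[0]->{fields} }
    sub chars1 { $_[0]->{chars1} }
    sub chars2 { $_[0]->{chars2} }
    sub remain { $_[0]->{remain} }

    1;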

Replies are listed 'Best First'.
Re: Efficient Log File Parsing with Regular Expressions
by BrowserUk (Patriarch) on Dec 07, 2002 at 01:16 UTC

    I did some crude benchmarking and the only speedups for your regex I could find were:

    1. Anchor the end of the regex with a $. 2-3%
    2. Don't capture the inner group, use (?:\S+\s+) 3-4%
    3. use study on $line before the if() statement 2-3%

    That is a total benefit of 8-9% when timing the match/assignment alone, and I could only measure the changes by repeating the matching/assignment 1000 times for each line! Overall, there is not a great deal to be gained. For instance, just including the four print statements in the timing, even when redirecting the output to the null device, increased the time taken to process each line by 300%, which shrinks the small savings described above to less than 1% of the total. The three tweaks are sketched below.
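    Roughly, the three tweaks applied to the original pattern look like this (a sketch only, keeping the variable names from the question):

        # Sketch: trailing $ anchor, non-capturing inner group, and study().
        my $regex = qr/^-((?:\S+\s+){18})(.{40})(.{40})(.+)$/;

        foreach my $line (@data) {
            next unless $line =~ /^-3/;
            study $line;    # give the regex engine a hint before matching
            if ( my ( $str_18_fields, $str_40_chars1, $str_40_chars2, $str_remain )
                     = $line =~ $regex ) {
                # the 18 fields, the two 40-character columns, and the
                # remainder; no capture is wasted on the inner group now
            }
        }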

    You would gain more by

    • not performing the substitutions of '|' for '_', which appears nowhere in your test data.
    • If your real data does need this processing, do it in one pass prior to breaking the line up.
    • not slurping your data into a string and then splitting it to process it line by line.

      I realise this is only a test program, but if you are doing a similar thing in your real program, especially if the file is large, you would be much better off using while($line = <DATAFILE> ) {...} (see the sketch below).
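    A rough sketch of that shape, with the one-pass translation folded in (the filename and filehandle here are placeholders, not from the original post):

        # Sketch: read the file line by line rather than slurping it, and do
        # the '|' -> '_' translation once per line before breaking it apart
        # (note tr/// changes every '|', unlike the per-field s/// above).
        my $regex = qr/^-((?:\S+\s+){18})(.{40})(.{40})(.+)$/;

        open( DATAFILE, '<', 'your.log' ) or die "Can't open your.log: $!";
        while ( my $line = <DATAFILE> ) {
            next unless $line =~ /^-3/;
            $line =~ tr/|/_/;    # one pass over the whole line
            if ( my ( $fields, $chars1, $chars2, $remain ) = $line =~ $regex ) {
                # ... work with the captured pieces here ...
            }
        }
        close DATAFILE;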

    Whatever differences the changes I described make, they are likely to pale into insignificance when compared to whatever you are doing with the data you have extracted from each line. Even just writing this to another file will completely marginalise any savings.


    Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
    Pick up your cloud down the end and "Yes", if you get allocated a grey one they are a bit damp underfoot, but someone has to get them.
    Get used to the wings fast cos it's an 8 hour day...unless the Governor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
    Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

Re: Efficient Log File Parsing with Regular Expressions
by tadman (Prior) on Dec 07, 2002 at 00:27 UTC
    You might want to construct your regular expression so that it captures all of the parameters separately. If you're wary of typing in an expression so large, you can construct it as a string and use that to match.

    With the right tinkering, you can get all of your elements out in a single array. Using $DIGIT references limits how much you can get with each pass, since only 1 through 9 are available and you have many more elements than that.

    I'd do something like this instead:
    my $regex = '^-' . ('(\S+)\s+' x 18) . '(.{40})' . '(.{40})' . '(.*)';
    You should be able to use this as usual:
    if ( my @match = /$regex/o ) {
        # ... use $match[0] through $match[20]
    }
    Of course, if you're worried about speed, you'd spend some time with the Benchmark library testing variations.
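    For instance, a minimal Benchmark sketch along these lines could compare the two patterns (the labels, the iteration budget, and the assumption that @data holds the log lines are all illustrative):

        use Benchmark qw(cmpthese);

        # Sketch: compare the grouped pattern from the question against the
        # fully-capturing one built above. @data holds the log lines.
        my $grouped  = qr/^-((\S+\s+){18})(.{40})(.{40})(.+)/;
        my $separate = '^-' . ('(\S+)\s+' x 18) . '(.{40})' . '(.{40})' . '(.*)';

        cmpthese( -3, {    # run each candidate for roughly 3 CPU seconds
            grouped  => sub { for my $l (@data) { my @m = $l =~ $grouped } },
            separate => sub { for my $l (@data) { my @m = $l =~ /$separate/o } },
        } );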
Re: Efficient Log File Parsing with Regular Expressions
by Enlil (Parson) on Dec 07, 2002 at 00:59 UTC
    This runs a little faster than yours, but you might want to keep trying variations using the Benchmark module. The only real difference between mine and yours is that I am assuming that $str_40_chars1 always begins with something other than a digit (0-9) and that everything before it is either a digit or a space (as your test data seems to indicate). Anyhow, here is the benchmarking that I did:
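    A sketch of the sort of pattern described, written here as an assumption about what was meant rather than as the original listing:

        # Sketch: treat everything up to the first character that is neither a
        # digit nor a space as the leading fields, so the engine never has to
        # count out 18 \S+\s+ groups.
        my $regex = qr/^-([\d\s]+?)([^\d\s].{39})(.{40})(.+)$/;

        foreach my $line (@data) {
            next unless $line =~ /^-3/;
            if ( my ( $str_18_fields, $str_40_chars1, $str_40_chars2, $str_remain )
                     = $line =~ $regex ) {
                # $str_18_fields : the leading run of digits and spaces
                # $str_40_chars1 : first 40-character column (starts with a non-digit)
                # $str_40_chars2 : second 40-character column
                # $str_remain    : whatever is left on the line
            }
        }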
