Pimp My RegEx

heathen has asked for the wisdom of the Perl Monks concerning the following question:

Mighty monks, I'm trying to parse a rather large log file. With small quantities of data things run fine. When pointed at the entire 55mb, 366,000 line log file, the lights dim in the server room! Is this regex optimized? Runs fine on a little data, chokes on a lot. - Yes, I'm slurping in the WHOLE 55mb file at one time. The script is running on 4 CPU SPARC w/ 3+ gig of memory, so 55mb "shouldn't" be a real problem for it. - I'm slurping because a log file entry my span multiple lines. - What could I do better & smarter? Here's a sample script & data that parses 10 lines of the log file.

#!/usr/local/bin/perl

use warnings;
use strict;
use English;
use Data::Dumper;
use Time::HiRes 'time';

my $logfile;
my $dateRegex = qr/\d{4}-\d\d-\d\d \d\d:\d\d:\d\d:\d\d\d/;
my $parse_log_entry = qr/^($dateRegex.*?)(?=$dateRegex)/ms;

{
    # "slurp" the entire log file into memory at once
    local $INPUT_RECORD_SEPARATOR;
    $logfile = <DATA>;
}
my $count = 0;                  # Initialize counter
my $start = time();             # Start the timer
while( $logfile =~ m/$parse_log_entry/g ) {
    $count++;                   # count number of parsed strings
}
my $end = time();               # Stop the timer
my $elapsed = $end - $start;    # How long did that take?
my $average = $elapsed/$count;  # Average processing time

printf "Parsed $count log file entries in %.4f seconds, averaging %.4f
+\n", $elapsed, $average;

exit;


__DATA__
2004-01-05 22:37:48:879 : xscWnpStation_WNP_N_1 : EXCEPTION : Resource
+Limitation CLH0070E CANNOT ROUTE NPANXX OF 801-750-1742 (FAX) msg_id:
+ pOm94rLCG2CC39dj
TRACKINID=200402043229577 : REQ_NO=62320402446422 : REQ_INSTANCE=0020 
+: NNSP=6232 : ONSP=6875 : NLSP=6232 : OLSP=null : MSGTYPE=NOT
2004-01-05 22:38:52:019 : xscWnpStation_WNP_N_1 : dbError : SMGWNP0007
+ Database related error. : in xscPortResponse_PRI: Cannot find a conf
+irmed request for SUP3 msg_id: pOm94vILG2CCIthy
TRACKINID=200402013135093 : REQ_NO=65290402734448 : REQ_INSTANCE=0002 
+: NNSP=6529 : ONSP=6006 : NLSP=null : OLSP=6006 : MSGTYPE=PRI
2004-01-05 22:43:02:239 : xscWnpStation_WNP_N_1 : EXCEPTION : InputDat
+aValidationError MPE2099E RESP NUM ON SUP MUST MATCH RESP NUM ON RESP
+ONSE msg_id: 23f1bb:fa87b0edab:-4347
TRACKINID=200312312021455 : REQ_NO=62320312334661 : REQ_INSTANCE=0050 
+: NNSP=6232 : ONSP=6529 : NLSP=6232 : OLSP=6529 : MSGTYPE=NOT
2004-02-05 22:37:48:879 : xscWnpStation_WNP_N_1 : EXCEPTION : Resource
+Limitation CLH0070E CANNOT ROUTE NPANXX OF 801-750-1742 (FAX) msg_id:
+ pOm94rLCG2CC39dj
TRACKINID=200402043229577 : REQ_NO=62320402446422 : REQ_INSTANCE=0020 
+: NNSP=6232 : ONSP=6875 : NLSP=6232 : OLSP=null : MSGTYPE=NOT
2004-02-05 22:38:52:019 : xscWnpStation_WNP_N_1 : dbError : SMGWNP0007
+ Database related error. : in xscPortResponse_PRI: Cannot find a conf
+irmed request for SUP3 msg_id: pOm94vILG2CCIthy
TRACKINID=200402013135093 : REQ_NO=65290402734448 : REQ_INSTANCE=0002 
+: NNSP=6529 : ONSP=6006 : NLSP=null : OLSP=6006 : MSGTYPE=PRI
2004-02-05 22:43:02:239 : xscWnpStation_WNP_N_1 : EXCEPTION : InputDat
+aValidationError MPE2099E RESP NUM ON SUP MUST MATCH RESP NUM ON RESP
+ONSE msg_id: 23f1bb:fa87b0edab:-4347
TRACKINID=200312312021455 : REQ_NO=62320312334661 : REQ_INSTANCE=0050 
+: NNSP=6232 : ONSP=6529 : NLSP=6232 : OLSP=6529 : MSGTYPE=NOT
2004-02-05 22:43:50:769 : xscWnpStation_WNP_N_1 : dbError : SMGWNP0007
+ Database related error. : (Error Without Msg) In getRecordForSoaByTn
+: tracking_id not found for TN: 7145246400; REQ_NO: 652904017079; OWN
+ER: 6529
2004-02-05 22:44:51:979 : xscWnpStation_WNP_N_1 : EXCEPTION : InputDat
+aValidationError MPE0600E DUE DATE/TIME MUST EQUAL DESIRED DUE DATE/T
+IME TO CONFIRM REQUEST msg_id: pOm94t8TG2CDiqMP
TRACKINID=200402053272708 : REQ_NO=621404024940585 : REQ_INSTANCE=0001
+ : NNSP=6214 : ONSP=9740 : NLSP=null : OLSP=9740 : MSGTYPE=PRO
2004-02-05 22:47:12:879 : xscWnpStation_WNP_N_1 : dbError : SMGWNP0007
+ Database related error. : (Error Without Msg) In getRecordForSoaByTn
+: tracking_id not found for TN: 6193022949; REQ_NO: 652904027361; OWN
+ER: 6529
2004-02-05 22:49:50:059 : xscWnpStation_WNP_N_1 : dbError : SMGWNP0007
+ Database related error. : in xscDB_WnpPortResponse.insert: ORA-01400
+: cannot insert NULL into ("DBADMIN"."WNP_PORT_RESPONSE"."MESSAGE_TIM
+ESTAMP")
 msg_id: pOm94nEIG2CEtKQd
TRACKINID=200401092311869 : REQ_NO=65290401599145 : REQ_INSTANCE=0002 
+: NNSP=6529 : ONSP=6664 : NLSP=null : OLSP=6664 : MSGTYPE=PR2
2004-02-05 22:49:50:079 : xscWnpStation_WNP_N_1 : messageError : SMGWN
+P0009 Internal Message Error. : xscPortResponse_CLH_PR2 exception: xs
+cDB_WnpPortResponse: caught SQL Exception during insertion: ORA-01400
+: cannot insert NULL into ("DBADMIN"."WNP_PORT_RESPONSE"."MESSAGE_TIM
+ESTAMP")
 msg_id: pOm94nEIG2CEtKQd
TRACKINID=200401092311869 : REQ_NO=65290401599145 : REQ_INSTANCE=0002 
+: NNSP=6529 : ONSP=6664 : NLSP=null : OLSP=6664 : MSGTYPE=PR2
[download]

Comment on Pimp My RegEx Download Code

Replies are listed 'Best First'.
Re: Pimp My RegEx by dbwiz (Curate) on May 31, 2005 at 18:43 UTC
I would have said the same thing that dragonchild pointed out, i.e. that since you are matching the beginning of a line, you can apply your regexp line by line. However, I'll tell you something more that will save you 30 to 50% of execution time: Remove "use English;" and change `$INPUT_RECORD_SEPARATOR` into `$/`. This will avoid the malicious influence of the so called evil variables ($`, $&, $'). I have actually benchmarked your script with a few hundred lines of logs, and this change saves about 40% of execution time on my laptop.	[reply] [d/l] [select]
Re^2: Pimp My RegEx by heathen (Acolyte) on May 31, 2005 at 18:56 UTC
I'll take your advice and pull out the "use english" command. I thought the "use english" syntax would be the kind thing to do for the benefit of my coworkers. They cringe when they see a complex regex and a "$/" will have them running to the camel book! Please elaborate on the "evil variables". I'm not really familiar with them and their implications.	[reply]
Re^3: Pimp My RegEx by jhourcle (Prior) on May 31, 2005 at 20:02 UTC
Sometimes, a comment will work just as well, without the overhead. Consider the following: `local $/; # INPUT_RECORD_SEPARATOR` As another option, in the documenation for 'English', there's a suggestion to use: `use English qw( -no_match_vars ) ;` to prevent the problems mentioned. As for the 'evil variables', the following note is in 'perldoc perlre': WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression "(?: ... )" instead.) But if you never use $&, $` or $', then patterns without capturing parentheses will not be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.	[reply] [d/l] [select]
Re^4: Pimp My RegEx by heathen (Acolyte) on May 31, 2005 at 20:24 UTC
Re^3: Pimp My RegEx by radiantmatrix (Parson) on May 31, 2005 at 20:01 UTC
> They cringe when they see a complex regex and a "$/" will have them running to the camel book! Add a comment? `local $/ = $my_input_sep; # Custom input separator (instead of \n)` [download] Yoda would agree with Perl design: there is no `try{}`	[reply] [d/l]
Re^3: Pimp My RegEx by eyepopslikeamosquito (Archbishop) on May 31, 2005 at 20:50 UTC
Further to the other excellent responses, you can check if your script is infected with any of the evil variables with the Devel::SawAmpersand module.	[reply]
Re: Pimp My RegEx by demerphq (Chancellor) on May 31, 2005 at 18:51 UTC
Your regex is really horribly inefficient. Think about what it has to do to match that pattern. A much simpler way to do the same thing is: #!/usr/local/bin/perl use warnings; use strict; use English; use Data::Dumper; use Time::HiRes 'time'; my $logfile; my $count = 0; # Initialize counter my $start = time(); # Start the timer my $fullrec; while( <DATA>) { if (/^\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d:\d\d\d/) { $count++; #if ($fullrec) { # process($fullrec); #} $fullrec = $_; } else { $fullrec .= $_; } } #if (defined($fullrec)) { # process($fullrec); #} my $end = time(); # Stop the timer my $elapsed = $end - $start; # How long did that take? my $average = $elapsed/$count; # Average processing time printf "Parsed $count log file entries in %.4f seconds, averaging %.4f +\n", $elapsed, $average; exit; __DATA__ <SNIPPED DATA> [download] Ie you know that when a line starts with a data that it begins a record, this also implys it denotes the end of the previous record (if such a record exists). Thus you simply need to check each line to see if it begins the record, and then do something with the previous one that you have constructed. This also means that the pattern is anchored and only needs to compare Lr (the length of pattern) chars per line instead of the the Lf (the length of the file) times Lr that your code would do (with the look ahead assertion). So your pattern does something like 5500000023 character lookups, even worse most of those will be char class lookups* so they are inefficient to start with. If you use the line by line approach you are dealing with 36600023 lookups. Thats a LOT less. (Actually these are upper bounds, but i think the point is made) Update:*Fixed as per ikegami's reply. --- $world=~s/war/peace/g	[reply] [d/l]
Re^2: Pimp My RegEx by ikegami (Patriarch) on May 31, 2005 at 18:58 UTC
That should be: `while( <DATA>) { if (/^\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d:\d\d\d/) { $count++; #if (defined($fullrec)) { # process($fullrec); #} $fullrec = $_; } else { $fullrec .= $_; } } #if (defined($fullrec)) { # process($fullrec); #}` [download] or look at my solution to eliminate this redundancy.	[reply] [d/l]
Re^3: Pimp My RegEx by demerphq (Chancellor) on May 31, 2005 at 19:03 UTC
Indeed. Good catch. :-) --- $world=~s/war/peace/g	[reply]
Re^2: Pimp My RegEx by heathen (Acolyte) on May 31, 2005 at 19:07 UTC
Thanks for sharing your wisdom! I can't believe how quickly you guys came up w/ replies. The PerlMonks community is a great resource! I wanted to use file slurp mode because so that could MOVE old log file entries from the current log into a seperate monthly log file. (and then tar/gzip 'em) I thought that by using the slurp mode, I could essentially edit the file in place without having to create a temporary file. The file system these logs live in is often constrained for space. Thanks again!	[reply]
Re^3: Pimp My RegEx by eyepopslikeamosquito (Archbishop) on May 31, 2005 at 21:20 UTC
I thought that by using the slurp mode, I could essentially edit the file in place without having to create a temporary file. No, that is unsound. You can corrupt your log files. Consider: Slurp the log file. Apply regex to file contents. Write the log file with new file contents. Now, if the write fails due to "disk full" (a distinct possibility in your environment) or because you lose power at that instant, or for any old reason, you have just corrupted your log file. Generally, you should write the modified file to a temporary file and rename the temporary file over the original only after checking that the temporary file was created without error.	[reply]
Re^3: Pimp My RegEx by demerphq (Chancellor) on May 31, 2005 at 19:13 UTC
I could essentially edit the file in place without having to create a temporary file. Generally this isn't possible. You can only overwrite the existing parts of the log, not slice them out of the file. The normal way to do what you want is to process your input records as shown here by me and several of the other monks and then process them and output them to the new file, and then delete the old file. You can't really remove from the beginning of a file while something is writing to the end. Anyway, glad to help. :-) --- $world=~s/war/peace/g	[reply]
Re^4: Pimp My RegEx by heathen (Acolyte) on May 31, 2005 at 19:25 UTC
Re: Pimp My RegEx by dragonchild (Archbishop) on May 31, 2005 at 18:35 UTC
You don't need to slurp. Either the line matches the regex `/^$date_regex/` or it doesn't. (I'm assuming there's no chance the middle of a record will match that regex.) `perl -n -e '$count++ if /^REGEX/;}END{print "$count\n";' FILENAME` [download] And, yes, 55mb will blow out a 3G+ RAM. You should expect a 10-1 to 20-1 increase in memory usage over disk size. This is Perl ... In general, if you think something isn't in Perl, try it out, because it usually is. :-) "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"	[reply] [d/l] [select]
Re^2: Pimp My RegEx by Roy Johnson (Monsignor) on May 31, 2005 at 18:50 UTC
I had the same first impression, but the expression could capture more than one line, if the next line didn't start with the date regex. Still, you could read line-by-line and tack each line onto the previous match until a new match was found. Caution: Contents may have been coded under pressure.	[reply]
Re^3: Pimp My RegEx by dragonchild (Archbishop) on May 31, 2005 at 18:52 UTC
I was unclear - you wouldn't need the whole regex anymore. The problem has been restated, so the new regex should be more restrictive. At the very minimum, it would not be capturing across $/. In general, if you think something isn't in Perl, try it out, because it usually is. :-) "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"	[reply]
Re: Pimp My RegEx by ikegami (Patriarch) on May 31, 2005 at 18:55 UTC
The following uses a simpler regexp, and it doesn't slurp the file into memory. It's probably faster: `#!/usr/local/bin/perl use warnings; use strict; my $record_start_regexp = qr/^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d:\d\d\d/; sub do_something { print("[\n"); print($_[0]); print("]\n"); } { local $_; my $data; for (;;) { $_ = <DATA>; last if not defined $_; last if /$record_start_regexp/; } while (defined) { $data .= $_; $_ = <DATA>; if (!defined \|\| /$record_start_regexp/) { do_something($data); $data = ''; } } }` [download]	[reply] [d/l]
Re: Pimp My RegEx by Roy Johnson (Monsignor) on May 31, 2005 at 18:47 UTC
I get better performance with this: `my $parse_log_entry = qr/^$dateRegex/ms; ... my $prevpos = 0; while( $logfile =~ m/$parse_log_entry/g ) { $count++; # count number of parsed strings if ($prevpos) { # print "Do something with $prevpos .. $-[0]\n"; # print substr($logfile, $prevpos, $-[0]-$prevpos), "\n"; } $prevpos = $-[0]; }` [download] You would use the substr expression above to access what you're currently capturing, though you don't indicate that what you're capturing is actually used at all. . Caution: Contents may have been coded under pressure.	[reply] [d/l]
Re^2: Pimp My RegEx by heathen (Acolyte) on May 31, 2005 at 20:30 UTC
You would use the substr expression above to access what you're currently capturing, though you don't indicate that what you're capturing is actually used at all. I made a grossly simplified version of my script to share with the forum. Posting the script in its entirety would have just bored folks to tears! ...but rest assured, I am doing something with the captured text. I'm not sure your solution will work for log entries that span multiple lines.	[reply]
Re^3: Pimp My RegEx by Roy Johnson (Monsignor) on May 31, 2005 at 20:35 UTC
I'm not sure your solution will work for log entries that span multiple lines. Why not? Caution: Contents may have been coded under pressure.	[reply]
Re: Pimp My RegEx by tlm (Prior) on May 31, 2005 at 18:35 UTC
I'm trying to understand the rationale behind your regex. Don't you want something more like `my $parse_log_entry = qr/^($dateRegex.?)(?=(?:^$dateRegex\|\z))/ms;` [download] ? If so, then it I would process the file line-wise, and count only those lines that begin with your date pattern. Or, if you need to do more than just counting, then something along the lines of `my $current; while ( <DATA> ) { if ( /^$dateRegex/ ) { process( $current ) if defined $current; $current = $_; } else { $current .= $_; } } process( $current ) if defined $current;` [download] Update:* Added the more extended second alternative. Update 2: Fixed bug (last line of code was missing). the lowliest monk	[reply] [d/l] [select]
Re^2: Pimp My RegEx by heathen (Acolyte) on May 31, 2005 at 18:46 UTC
For the sake of brevity, I cut out my "payload" and just counted lines. I counted lines just to benchmark the regex times. My rationale was something like this: "find a line that starts with a date, capture all of the text (including new lines) until I can lookahead and find another date."	[reply]
Re^3: Pimp My RegEx by tlm (Prior) on May 31, 2005 at 18:57 UTC
Yes, but as originally written your regex would miss the last log entry, because it fails the positive lookahead assertion. Read more... (1081 Bytes) Regarding the payload, please see the update to my original reply. Update: Added further comments on the error in the original regex. the lowliest monk	[reply] [d/l] [select]
Re^4: Pimp My RegEx by heathen (Acolyte) on May 31, 2005 at 20:19 UTC
Re: Pimp My RegEx by heathen (Acolyte) on May 31, 2005 at 21:11 UTC
Once again, thanks for the suggestions. My last benchmark showed 179,963 log file entries parsed in 7.17 seconds. Not bad! ....now time to go back and add the File I/O operations.	[reply]
Re^2: Pimp My RegEx by graff (Chancellor) on Jun 01, 2005 at 03:51 UTC
If you're using Perl 5.8.0 or (preferably) later, check out PerlIO::gzip, which makes it really simple and easy to have your output written in gzip-compressed form. UPDATE: (2010-10-18) It seems that PerlIO::gzip should be viewed as superseded by PerlIO::via:gzip. (see PerlIO::gzip or PerlIO::via::gzip).	[reply]
Re: Pimp My RegEx by TedPride (Priest) on May 31, 2005 at 19:22 UTC
If disk space is so low that you can't fit a second copy of the file, you need more disk space. Barring that, you could try creating a temporary RAM disk, copying the file into it, reading the file line by line, then writing back an interpreted copy to the original location. Reading from a RAM disk is significantly faster than reading from the hard disk, and the file will only require its actual size in memory instead of its Perl size. (be sure to create a backup, compressed copy of the file first, however - there's always a tiny chance your comp will crash or something will go wrong, and the copy of the file in the RAM disk will go poof)	[reply]
Re: Pimp My RegEx by jcoxen (Deacon) on Jun 01, 2005 at 14:25 UTC
I have to do stuff like this fairly often with a bunch of different format files so I just keep this around as a generic framework. I modify the regex statements, delimiter and number of fields as necessary but it's pretty much the same program every time. Here's what I use - lifted almost completely from various posts here on PM... `#!/usr/bin/perl use warnings; use strict; open INFILE, "< $ARGV[0]" or die "Could not open file $ARGV[0]: $!"; until(eof INFILE) { my $fn = $ARGV[0]; open OUTFILE, "> $fn.filtered" or die "Could not open file $fn.filtered $!"; while(<INFILE>) { my @line = split /\t/; $line[0] =~ s/(.{1,6})./$1/; $line[3] =~ s/(.{1,6})./$1/; $line[6] = $line[6]."\n"; $_ = join("\t", $line[0], $line[1], $line[2], $line[3], $line[ +4], $line[5], $line[6]); print OUTFILE; } }` [download] BTW, this code processed an 8 million line/385 mb file in about 33 minutes the other day on my old, anemic Sun Ultra 5 workstation. No brownouts, no sweat. System load never got above 1.05. Jack	[reply] [d/l]