Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Pimp My RegEx

by heathen (Acolyte)
on May 31, 2005 at 18:22 UTC ( [id://462148]=perlquestion: print w/replies, xml ) Need Help??

heathen has asked for the wisdom of the Perl Monks concerning the following question:

Mighty monks, I'm trying to parse a rather large log file. With small quantities of data things run fine. When pointed at the entire 55mb, 366,000 line log file, the lights dim in the server room! Is this regex optimized? Runs fine on a little data, chokes on a lot. - Yes, I'm slurping in the WHOLE 55mb file at one time. The script is running on 4 CPU SPARC w/ 3+ gig of memory, so 55mb "shouldn't" be a real problem for it. - I'm slurping because a log file entry my span multiple lines. - What could I do better & smarter? Here's a sample script & data that parses 10 lines of the log file.
#!/usr/local/bin/perl use warnings; use strict; use English; use Data::Dumper; use Time::HiRes 'time'; my $logfile; my $dateRegex = qr/\d{4}-\d\d-\d\d \d\d:\d\d:\d\d:\d\d\d/; my $parse_log_entry = qr/^($dateRegex.*?)(?=$dateRegex)/ms; { # "slurp" the entire log file into memory at once local $INPUT_RECORD_SEPARATOR; $logfile = <DATA>; } my $count = 0; # Initialize counter my $start = time(); # Start the timer while( $logfile =~ m/$parse_log_entry/g ) { $count++; # count number of parsed strings } my $end = time(); # Stop the timer my $elapsed = $end - $start; # How long did that take? my $average = $elapsed/$count; # Average processing time printf "Parsed $count log file entries in %.4f seconds, averaging %.4f +\n", $elapsed, $average; exit; __DATA__ 2004-01-05 22:37:48:879 : xscWnpStation_WNP_N_1 : EXCEPTION : Resource +Limitation CLH0070E CANNOT ROUTE NPANXX OF 801-750-1742 (FAX) msg_id: + pOm94rLCG2CC39dj TRACKINID=200402043229577 : REQ_NO=62320402446422 : REQ_INSTANCE=0020 +: NNSP=6232 : ONSP=6875 : NLSP=6232 : OLSP=null : MSGTYPE=NOT 2004-01-05 22:38:52:019 : xscWnpStation_WNP_N_1 : dbError : SMGWNP0007 + Database related error. : in xscPortResponse_PRI: Cannot find a conf +irmed request for SUP3 msg_id: pOm94vILG2CCIthy TRACKINID=200402013135093 : REQ_NO=65290402734448 : REQ_INSTANCE=0002 +: NNSP=6529 : ONSP=6006 : NLSP=null : OLSP=6006 : MSGTYPE=PRI 2004-01-05 22:43:02:239 : xscWnpStation_WNP_N_1 : EXCEPTION : InputDat +aValidationError MPE2099E RESP NUM ON SUP MUST MATCH RESP NUM ON RESP +ONSE msg_id: 23f1bb:fa87b0edab:-4347 TRACKINID=200312312021455 : REQ_NO=62320312334661 : REQ_INSTANCE=0050 +: NNSP=6232 : ONSP=6529 : NLSP=6232 : OLSP=6529 : MSGTYPE=NOT 2004-02-05 22:37:48:879 : xscWnpStation_WNP_N_1 : EXCEPTION : Resource +Limitation CLH0070E CANNOT ROUTE NPANXX OF 801-750-1742 (FAX) msg_id: + pOm94rLCG2CC39dj TRACKINID=200402043229577 : REQ_NO=62320402446422 : REQ_INSTANCE=0020 +: NNSP=6232 : ONSP=6875 : NLSP=6232 : OLSP=null : MSGTYPE=NOT 2004-02-05 22:38:52:019 : xscWnpStation_WNP_N_1 : dbError : SMGWNP0007 + Database related error. : in xscPortResponse_PRI: Cannot find a conf +irmed request for SUP3 msg_id: pOm94vILG2CCIthy TRACKINID=200402013135093 : REQ_NO=65290402734448 : REQ_INSTANCE=0002 +: NNSP=6529 : ONSP=6006 : NLSP=null : OLSP=6006 : MSGTYPE=PRI 2004-02-05 22:43:02:239 : xscWnpStation_WNP_N_1 : EXCEPTION : InputDat +aValidationError MPE2099E RESP NUM ON SUP MUST MATCH RESP NUM ON RESP +ONSE msg_id: 23f1bb:fa87b0edab:-4347 TRACKINID=200312312021455 : REQ_NO=62320312334661 : REQ_INSTANCE=0050 +: NNSP=6232 : ONSP=6529 : NLSP=6232 : OLSP=6529 : MSGTYPE=NOT 2004-02-05 22:43:50:769 : xscWnpStation_WNP_N_1 : dbError : SMGWNP0007 + Database related error. : (Error Without Msg) In getRecordForSoaByTn +: tracking_id not found for TN: 7145246400; REQ_NO: 652904017079; OWN +ER: 6529 2004-02-05 22:44:51:979 : xscWnpStation_WNP_N_1 : EXCEPTION : InputDat +aValidationError MPE0600E DUE DATE/TIME MUST EQUAL DESIRED DUE DATE/T +IME TO CONFIRM REQUEST msg_id: pOm94t8TG2CDiqMP TRACKINID=200402053272708 : REQ_NO=621404024940585 : REQ_INSTANCE=0001 + : NNSP=6214 : ONSP=9740 : NLSP=null : OLSP=9740 : MSGTYPE=PRO 2004-02-05 22:47:12:879 : xscWnpStation_WNP_N_1 : dbError : SMGWNP0007 + Database related error. : (Error Without Msg) In getRecordForSoaByTn +: tracking_id not found for TN: 6193022949; REQ_NO: 652904027361; OWN +ER: 6529 2004-02-05 22:49:50:059 : xscWnpStation_WNP_N_1 : dbError : SMGWNP0007 + Database related error. : in xscDB_WnpPortResponse.insert: ORA-01400 +: cannot insert NULL into ("DBADMIN"."WNP_PORT_RESPONSE"."MESSAGE_TIM +ESTAMP") msg_id: pOm94nEIG2CEtKQd TRACKINID=200401092311869 : REQ_NO=65290401599145 : REQ_INSTANCE=0002 +: NNSP=6529 : ONSP=6664 : NLSP=null : OLSP=6664 : MSGTYPE=PR2 2004-02-05 22:49:50:079 : xscWnpStation_WNP_N_1 : messageError : SMGWN +P0009 Internal Message Error. : xscPortResponse_CLH_PR2 exception: xs +cDB_WnpPortResponse: caught SQL Exception during insertion: ORA-01400 +: cannot insert NULL into ("DBADMIN"."WNP_PORT_RESPONSE"."MESSAGE_TIM +ESTAMP") msg_id: pOm94nEIG2CEtKQd TRACKINID=200401092311869 : REQ_NO=65290401599145 : REQ_INSTANCE=0002 +: NNSP=6529 : ONSP=6664 : NLSP=null : OLSP=6664 : MSGTYPE=PR2

Replies are listed 'Best First'.
Re: Pimp My RegEx
by dbwiz (Curate) on May 31, 2005 at 18:43 UTC

    I would have said the same thing that dragonchild pointed out, i.e. that since you are matching the beginning of a line, you can apply your regexp line by line.

    However, I'll tell you something more that will save you 30 to 50% of execution time:

    Remove "use English;" and change $INPUT_RECORD_SEPARATOR into $/. This will avoid the malicious influence of the so called evil variables ($`, $&, $').

    I have actually benchmarked your script with a few hundred lines of logs, and this change saves about 40% of execution time on my laptop.

      I'll take your advice and pull out the "use english" command. I thought the "use english" syntax would be the kind thing to do for the benefit of my coworkers. They cringe when they see a complex regex and a "$/" will have them running to the camel book! Please elaborate on the "evil variables". I'm not really familiar with them and their implications.

        Sometimes, a comment will work just as well, without the overhead. Consider the following:

        local $/; # INPUT_RECORD_SEPARATOR

        As another option, in the documenation for 'English', there's a suggestion to use:

        use English qw( -no_match_vars ) ;

        to prevent the problems mentioned.

        As for the 'evil variables', the following note is in 'perldoc perlre':

        WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression "(?: ... )" instead.) But if you never use $&, $` or $', then patterns without capturing parentheses will not be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.

        They cringe when they see a complex regex and a "$/" will have them running to the camel book!

        Add a comment?

        local $/ = $my_input_sep; # Custom input separator (instead of \n)

        Yoda would agree with Perl design: there is no try{}

        Further to the other excellent responses, you can check if your script is infected with any of the evil variables with the Devel::SawAmpersand module.

Re: Pimp My RegEx
by demerphq (Chancellor) on May 31, 2005 at 18:51 UTC

    Your regex is really horribly inefficient. Think about what it has to do to match that pattern. A much simpler way to do the same thing is:

    #!/usr/local/bin/perl use warnings; use strict; use English; use Data::Dumper; use Time::HiRes 'time'; my $logfile; my $count = 0; # Initialize counter my $start = time(); # Start the timer my $fullrec; while( <DATA>) { if (/^\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d:\d\d\d/) { $count++; #if ($fullrec) { # process($fullrec); #} $fullrec = $_; } else { $fullrec .= $_; } } #if (defined($fullrec)) { # process($fullrec); #} my $end = time(); # Stop the timer my $elapsed = $end - $start; # How long did that take? my $average = $elapsed/$count; # Average processing time printf "Parsed $count log file entries in %.4f seconds, averaging %.4f +\n", $elapsed, $average; exit; __DATA__ <SNIPPED DATA>

    Ie you know that when a line starts with a data that it begins a record, this also implys it denotes the end of the previous record (if such a record exists). Thus you simply need to check each line to see if it begins the record, and then do something with the previous one that you have constructed. This also means that the pattern is anchored and only needs to compare Lr (the length of pattern) chars per line instead of the the Lf (the length of the file) times Lr that your code would do (with the look ahead assertion).

    So your pattern does something like 55000000*23 character lookups, even worse most of those will be char class lookups so they are inefficient to start with. If you use the line by line approach you are dealing with 366000*23 lookups. Thats a LOT less. (Actually these are upper bounds, but i think the point is made)

    Update:Fixed as per ikegami's reply.

    ---
    $world=~s/war/peace/g

      That should be:
      while( <DATA>) { if (/^\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d:\d\d\d/) { $count++; #if (defined($fullrec)) { # process($fullrec); #} $fullrec = $_; } else { $fullrec .= $_; } } #if (defined($fullrec)) { # process($fullrec); #}

      or look at my solution to eliminate this redundancy.

        Indeed. Good catch. :-)

        ---
        $world=~s/war/peace/g

      Thanks for sharing your wisdom! I can't believe how quickly you guys came up w/ replies. The PerlMonks community is a great resource!

      I wanted to use file slurp mode because so that could MOVE old log file entries from the current log into a seperate monthly log file. (and then tar/gzip 'em)

      I thought that by using the slurp mode, I could essentially edit the file in place without having to create a temporary file. The file system these logs live in is often constrained for space.

      Thanks again!

        I thought that by using the slurp mode, I could essentially edit the file in place without having to create a temporary file.
        No, that is unsound. You can corrupt your log files. Consider:
        1. Slurp the log file.
        2. Apply regex to file contents.
        3. Write the log file with new file contents.
        Now, if the write fails due to "disk full" (a distinct possibility in your environment) or because you lose power at that instant, or for any old reason, you have just corrupted your log file. Generally, you should write the modified file to a temporary file and rename the temporary file over the original only after checking that the temporary file was created without error.

        I could essentially edit the file in place without having to create a temporary file.

        Generally this isn't possible. You can only overwrite the existing parts of the log, not slice them out of the file. The normal way to do what you want is to process your input records as shown here by me and several of the other monks and then process them and output them to the new file, and then delete the old file. You can't really remove from the beginning of a file while something is writing to the end.

        Anyway, glad to help. :-)

        ---
        $world=~s/war/peace/g

Re: Pimp My RegEx
by dragonchild (Archbishop) on May 31, 2005 at 18:35 UTC
    You don't need to slurp. Either the line matches the regex /^$date_regex/ or it doesn't. (I'm assuming there's no chance the middle of a record will match that regex.)
    perl -n -e '$count++ if /^REGEX/;}END{print "$count\n";' FILENAME

    And, yes, 55mb will blow out a 3G+ RAM. You should expect a 10-1 to 20-1 increase in memory usage over disk size. This is Perl ...


    • In general, if you think something isn't in Perl, try it out, because it usually is. :-)
    • "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"
        I was unclear - you wouldn't need the whole regex anymore. The problem has been restated, so the new regex should be more restrictive. At the very minimum, it would not be capturing across $/.

        • In general, if you think something isn't in Perl, try it out, because it usually is. :-)
        • "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"
Re: Pimp My RegEx
by ikegami (Patriarch) on May 31, 2005 at 18:55 UTC

    The following uses a simpler regexp, and it doesn't slurp the file into memory. It's probably faster:

    #!/usr/local/bin/perl use warnings; use strict; my $record_start_regexp = qr/^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d:\d\d\d/; sub do_something { print("[\n"); print($_[0]); print("]\n"); } { local $_; my $data; for (;;) { $_ = <DATA>; last if not defined $_; last if /$record_start_regexp/; } while (defined) { $data .= $_; $_ = <DATA>; if (!defined || /$record_start_regexp/) { do_something($data); $data = ''; } } }
Re: Pimp My RegEx
by Roy Johnson (Monsignor) on May 31, 2005 at 18:47 UTC
    I get better performance with this:
    my $parse_log_entry = qr/^$dateRegex/ms; ... my $prevpos = 0; while( $logfile =~ m/$parse_log_entry/g ) { $count++; # count number of parsed strings if ($prevpos) { # print "Do something with $prevpos .. $-[0]\n"; # print substr($logfile, $prevpos, $-[0]-$prevpos), "\n"; } $prevpos = $-[0]; }
    You would use the substr expression above to access what you're currently capturing, though you don't indicate that what you're capturing is actually used at all. .

    Caution: Contents may have been coded under pressure.

      You would use the substr expression above to access what you're currently capturing, though you don't indicate that what you're capturing is actually used at all.

      I made a grossly simplified version of my script to share with the forum. Posting the script in its entirety would have just bored folks to tears! ...but rest assured, I am doing something with the captured text.
      I'm not sure your solution will work for log entries that span multiple lines.

        I'm not sure your solution will work for log entries that span multiple lines.
        Why not?

        Caution: Contents may have been coded under pressure.
Re: Pimp My RegEx
by tlm (Prior) on May 31, 2005 at 18:35 UTC

    I'm trying to understand the rationale behind your regex. Don't you want something more like

    my $parse_log_entry = qr/^($dateRegex.*?)(?=(?:^$dateRegex|\z))/ms;
    ? If so, then it I would process the file line-wise, and count only those lines that begin with your date pattern.

    Or, if you need to do more than just counting, then something along the lines of

    my $current; while ( <DATA> ) { if ( /^$dateRegex/ ) { process( $current ) if defined $current; $current = $_; } else { $current .= $_; } } process( $current ) if defined $current;

    Update: Added the more extended second alternative.

    Update 2: Fixed bug (last line of code was missing).

    the lowliest monk

      For the sake of brevity, I cut out my "payload" and just counted lines. I counted lines just to benchmark the regex times. My rationale was something like this: "find a line that starts with a date, capture all of the text (including new lines) until I can lookahead and find another date."

        Yes, but as originally written your regex would miss the last log entry, because it fails the positive lookahead assertion.

        Regarding the payload, please see the update to my original reply.

        Update: Added further comments on the error in the original regex.

        the lowliest monk

Re: Pimp My RegEx
by heathen (Acolyte) on May 31, 2005 at 21:11 UTC
    Once again, thanks for the suggestions. My last benchmark showed 179,963 log file entries parsed in 7.17 seconds. Not bad! ....now time to go back and add the File I/O operations.
      If you're using Perl 5.8.0 or (preferably) later, check out PerlIO::gzip, which makes it really simple and easy to have your output written in gzip-compressed form.

      UPDATE: (2010-10-18) It seems that PerlIO::gzip should be viewed as superseded by PerlIO::via:gzip. (see PerlIO::gzip or PerlIO::via::gzip).

Re: Pimp My RegEx
by TedPride (Priest) on May 31, 2005 at 19:22 UTC
    If disk space is so low that you can't fit a second copy of the file, you need more disk space. Barring that, you could try creating a temporary RAM disk, copying the file into it, reading the file line by line, then writing back an interpreted copy to the original location. Reading from a RAM disk is significantly faster than reading from the hard disk, and the file will only require its actual size in memory instead of its Perl size.

    (be sure to create a backup, compressed copy of the file first, however - there's always a tiny chance your comp will crash or something will go wrong, and the copy of the file in the RAM disk will go poof)

Re: Pimp My RegEx
by jcoxen (Deacon) on Jun 01, 2005 at 14:25 UTC
    I have to do stuff like this fairly often with a bunch of different format files so I just keep this around as a generic framework. I modify the regex statements, delimiter and number of fields as necessary but it's pretty much the same program every time.

    Here's what I use - lifted almost completely from various posts here on PM...

    #!/usr/bin/perl use warnings; use strict; open INFILE, "< $ARGV[0]" or die "Could not open file $ARGV[0]: $!"; until(eof INFILE) { my $fn = $ARGV[0]; open OUTFILE, "> $fn.filtered" or die "Could not open file $fn.filtered $!"; while(<INFILE>) { my @line = split /\t/; $line[0] =~ s/(.{1,6}).*/$1/; $line[3] =~ s/(.{1,6}).*/$1/; $line[6] = $line[6]."\n"; $_ = join("\t", $line[0], $line[1], $line[2], $line[3], $line[ +4], $line[5], $line[6]); print OUTFILE; } }
    BTW, this code processed an 8 million line/385 mb file in about 33 minutes the other day on my old, anemic Sun Ultra 5 workstation. No brownouts, no sweat. System load never got above 1.05.

    Jack

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://462148]
Approved by injunjoel
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2024-04-24 12:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found