PerlMonks  

Re: Efficient Way to Parse a Large Log File with a Large Regex (with Regexp::Assemble)

by grinder (Bishop)
on Apr 13, 2005 at 16:49 UTC ( [id://447501]=note )


in reply to Efficient Way to Parse a Large Log File with a Large Regex

Is creating a regex, like the one discussed above, going to be the most efficient way?

It would be if you used Regexp::Assemble :) The code would look something like

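    use strict;
    use warnings;
    use Regexp::Assemble;

    # Build a single regex from the file of sought IPs (one per line).
    my $re = do {
        my $ip_file = shift || 'file_of_IPs_sought';
        open my $in, '<', $ip_file or die "Cannot open $ip_file: $!";
        my $guts = Regexp::Assemble->new->add(
            map { chomp; quotemeta($_) } <$in>
        )->as_string;
        close $in;
        qr/\b$guts\b/;
    };

    # Print every log line that mentions one of the sought IPs.
    my $log_file = shift || 'logfile';
    open my $log, '<', $log_file or die "Cannot open $log_file: $!";
    /$re/ and print while <$log>;
    close $log;

    # update: if the log arrives on a pipe, replace the four lines above with
    # /$re/ and print while <>;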

The expression will probably turn out to be about the same size as the list of IPs. The more they cluster, the smaller the pattern will be. And 500 patterns will barely have Regexp::Assemble breaking a sweat.
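To make the point about clustering concrete, here is a minimal sketch that assembles a handful of addresses and compares the result with a plain alternation. The four addresses are made up for illustration, and the exact shape of the assembled string depends on the Regexp::Assemble version installed, so no output is shown.

```perl
use strict;
use warnings;
use Regexp::Assemble;

# Hypothetical sample addresses; the first three share a long prefix.
my @ips = qw(10.0.0.1 10.0.0.2 10.0.0.3 192.168.7.9);

# Assemble the escaped addresses into one pattern string.
my $ra = Regexp::Assemble->new;
$ra->add( quotemeta $_ ) for @ips;
my $guts = $ra->as_string;
my $re   = qr/\b$guts\b/;

# A naive alternation of the same addresses, for comparison.
my $naive = join '|', map { quotemeta } @ips;

printf "assembled: %s (%d chars)\n", $guts,  length $guts;
printf "naive:     %s (%d chars)\n", $naive, length $naive;
```

Addresses that share a prefix should collapse into a common stem, which is why a clustered list tends to give a shorter pattern than a scattered one.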

- another intruder with the mooring in the heart of the Perl


Replies are listed 'Best First'.
Re^2: Efficient Way to Parse a Large Log File with a Large Regex (with Regexp::Assemble)
by BrowserUk (Patriarch) on Apr 13, 2005 at 17:12 UTC

    But it will be hugely slower than doing a simple search to find the embedded IP and then look that up in a hash that contains the 500 IPs in question.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?

      It all comes down to the difference between

      /$re/ and print while <>
      and
      while ( <> ) {
          while ( /\b(\d+\.\d+\.\d+\.\d+)\b/g ) {
              if ( exists $ip{$1} ) {
                  print;
                  last;
              }
          }
      }

      Hugely slower? No. A quick benchmark here shows that the regular expression approach is about twice as slow (and we are talking about a problem dominated by disk I/O anyway). One factor depends on how many naked IPs appear on a line. If there are several and only one interests you, the direct regexp will pick it up immediately, whereas the hash approach will have to test each one.
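For anyone who wants to repeat the comparison, a harness along these lines might do. It is only a sketch: the 500 addresses and the in-memory "log" are invented, a plain alternation stands in for the assembled pattern so that only core modules are needed, and the timings will vary with the Perl version and the data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical data: 500 sought IPs (10.1.0.0 .. 10.1.1.249).
my @sought = map { '10.1.' . int( $_ / 250 ) . '.' . ( $_ % 250 ) } 0 .. 499;
my %ip     = map { $_ => 1 } @sought;

# A plain alternation stands in for the Regexp::Assemble pattern.
my $alt = join '|', map { quotemeta } @sought;
my $re  = qr/\b(?:$alt)\b/;

# Hypothetical log lines; roughly half mention a sought IP.
my @log = map {
    my $n = $_ * 7 % 1000;
    'host 10.1.' . int( $n / 250 ) . '.' . ( $n % 250 ) . ' GET /index.html 200'
} 1 .. 2000;

# Approach 1: one big regex per line.
sub by_regex {
    my $hits = 0;
    for my $line (@log) {
        $hits++ if $line =~ $re;
    }
    return $hits;
}

# Approach 2: pull out each dotted quad and look it up in the hash.
sub by_hash {
    my $hits = 0;
    for my $line (@log) {
        pos($line) = undef;    # the lines are reused across runs, so reset //g
        while ( $line =~ /\b(\d+\.\d+\.\d+\.\d+)\b/g ) {
            if ( exists $ip{$1} ) { $hits++; last }
        }
    }
    return $hits;
}

# Run each approach for at least one CPU second and print a comparison table.
cmpthese( -1, { regex => \&by_regex, hash => \&by_hash } );
```

Because everything is held in memory, this measures only the matching cost and says nothing about the disk I/O that dominates the real job.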

      Another consideration is that if you want to extend the approach to search for e.g. 192.168.0.* then you can no longer use the hash approach at all, since what gets matched does not correspond to any key.
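A small sketch of that limitation, using made-up addresses and a hand-built alternation rather than Regexp::Assemble (whose add method also takes pattern strings, so the same wildcard pattern should work there too):

```perl
use strict;
use warnings;

# Hypothetical watch list: two exact addresses plus one whole subnet.
my @patterns = (
    quotemeta('10.0.0.1'),
    quotemeta('172.16.5.9'),
    '192\.168\.0\.\d+',    # stands in for 192.168.0.*
);
my $alt = join '|', @patterns;
my $re  = qr/\b(?:$alt)\b/;

# The hash can only hold literal keys, so the wildcard has no entry.
my %ip = map { $_ => 1 } qw(10.0.0.1 172.16.5.9);

for my $line ( 'a 192.168.0.77 b', 'a 10.0.0.1 b', 'a 192.168.1.77 b' ) {
    # Hash approach: extract the first dotted quad and look it up.
    my ($found) = $line =~ /\b(\d+\.\d+\.\d+\.\d+)\b/;
    printf "%-18s regex=%s hash=%s\n", $line,
        ( $line =~ $re ? 'yes' : 'no' ),
        ( defined $found && exists $ip{$found} ? 'yes' : 'no' );
}
```

The regex catches any address in 192.168.0.*, while the hash lookup can only succeed for addresses that were listed literally.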

      Or else I completely misread the question, in which case consider my solution withdrawn.

      - another intruder with the mooring in the heart of the Perl
