PerlMonks  

Multiple patterns match in a big file and track the counts of each pattern matched

by ansh007 (Novice)
on Nov 28, 2017 at 11:18 UTC ( [id://1204414] )

ansh007 has asked for the wisdom of the Perl Monks concerning the following question:

I have an array, @pat_array, that holds user-given patterns which can contain special characters, say "Error 78?", "(Not available) 77%", etc. I need to match the patterns in a 1 GB+ log file, from a particular line number till the end, and also keep a count of how many times each pattern was found. The code below works, but it takes long, close to 2 minutes, for 3 patterns. I was thinking of a way to avoid the extra for loop while matching the patterns and do it in one shot. (Here, for 3 patterns, I match $_ against three different patterns from @pat_array, as you can see.) I am really, really new to Perl, so please explain when you help me with new code/ideas.

my @pat_array = split('@@@', $InListOfPatterns);
my $num_pat = @pat_array;
my @match_count;
for ( my $i = 0; $i < $num_pat; $i = $i + 1 ) {
    $match_count[$i] = 0;
}
open LOG_READ, '<', "$InLogFilePath" || die "can not open file :$!";
while (<LOG_READ>) {
    chomp;
    if ( $. > $InStartLineNumber ) {
        for ( my $j = 0; $j < $num_pat; $j = $j + 1 ) {
            if ( $_ =~ m/\Q$pat_array[$j]\E/ ) {
                $match_count[$j] = ( $match_count[$j] + 1 );
            }
        }
    }
}
close(LOG_READ);

Thanks in advance. :)


Replies are listed 'Best First'.
Re: Multiple patterns match in a big file and track the counts of each pattern matched
by choroba (Cardinal) on Nov 28, 2017 at 12:31 UTC
    Instead of matching several times, create a larger regex and match just once.

    In order to remember which part matched, I created named capture groups and retrieved their names from the special %- hash. Without your data, I can't test whether it's faster or not.

    #!/usr/bin/perl
    use warnings;
    use strict;

    ...

    my @pat_array = split /@@@/, $InListOfPatterns;
    my $i;
    my $regex = join '|',
                map +($i++, "(?<m$i>$_)")[1],
                map quotemeta, @pat_array;
    open my $LOG, '<', $InLogFilePath or die "can not open file :$!";
    my %matched;
    while (<$LOG>) {
        chomp;
        if ( $. > $InStartLineNumber ) {
            $matched{ (grep defined $-{$_}[0], keys %-)[0] }++ if /$regex/;
        }
    }
    close $LOG;

    for my $pattern (keys %matched) {
        print "$pattern\t$matched{$pattern}\n";
    }

    Beware! You have a precedence issue in your code:

    open LOG_READ,'<',"$InLogFilePath" || die "can not open file :$!";

    This will never die if you can't open the file, it will only die if the file name is empty.
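    The precedence difference can be demonstrated with a minimal sketch; the path below is a made-up one that should not exist:

```perl
use strict;
use warnings;

my $path = "/no/such/dir.$$/file";   # hypothetical path that should not exist

# The buggy line parses as: open LOG_READ, '<', ("$path" || die ...)
# because || binds tighter than the list of arguments - die would only
# run if the file *name* were an empty string, never on a failed open.

# The fix is the low-precedence 'or', which applies to the whole open():
my $ok = open( my $fh, '<', $path ) or print "can not open file: $!\n";
```

    With `or die` in place of `or print`, a failed open now terminates the script as intended.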

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      I tried this. It works, but it takes more than 3 and a half minutes. I guess it's because of the grep, but I am not sure. Thanks for your time.

        Yes, that's probably the reason.

        What about embedded code?

        my %matched;
        my $i;
        my $regex = join '|',
                    map +($i++, "$_(?{\$matched{$i}++})")[1],
                    map quotemeta, @pat_array;
        open my $LOG, '<', $InLogFilePath or die "can not open file :$!";
        while (<$LOG>) {
            use re 'eval';
            chomp;
            /$regex/ if $. > $InStartLineNumber;
        }
        close $LOG;

        for my $pattern (keys %matched) {
            print "$pattern\t$matched{$pattern}\n";
        }

Re: Multiple patterns match in a big file and track the counts of each pattern matched
by hippo (Bishop) on Nov 28, 2017 at 11:27 UTC
    I need to match the patterns in the 1GB+ log file ... but it takes long, close to 2 mins, for 3 patterns

    So, profile it. How much of that 2 mins is just simply reading the massive logfile before you do anything with it? Only once you have determined that the pattern match is the slow part should you consider taking further action regarding it.
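    A quick way to separate read time from match time is to time the two passes with the core Time::HiRes module. This sketch builds a small sample log of its own (the file contents and the pattern are made up for illustration):

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use File::Temp qw(tempfile);

# Build a small sample log so the sketch is self-contained.
my ($out, $file) = tempfile();
print $out "line $_ with Error 78? sometimes\n" for 1 .. 100_000;
close $out;

# Pass 1: just read the file, doing nothing with the lines.
my $t0 = [gettimeofday];
open my $in, '<', $file or die "can not open file: $!";
1 while <$in>;
close $in;
my $read_time = tv_interval($t0);

# Pass 2: read the file and match one literal pattern per line.
my $pat   = 'Error 78?';
my $count = 0;
$t0 = [gettimeofday];
open $in, '<', $file or die "can not open file: $!";
while (<$in>) { $count++ if /\Q$pat\E/ }
close $in;
my $match_time = tv_interval($t0);

printf "read only: %.3fs   read+match: %.3fs   matches: %d\n",
    $read_time, $match_time, $count;
```

    Comparing the two numbers tells you how much of the runtime is pure I/O and how much the matching itself adds.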

    If the pattern match does turn out to be a big contributor to the runtime, use index instead since you are actually matching substrings and not regexen.
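    Since the original code wraps every pattern in \Q...\E, the patterns are effectively literal substrings, so index can do the same job without the regex engine. A sketch with made-up patterns and lines:

```perl
use strict;
use warnings;

# Hypothetical patterns and input lines, standing in for the real data.
my @pat_array = ('Error 78?', '(Not available) 77%');
my @lines = (
    "2017-11-28 Error 78? seen here\n",
    "all good\n",
    "(Not available) 77% of the time\n",
    "Error 78? again\n",
);

my %match_count;
for my $line (@lines) {
    for my $pat (@pat_array) {
        # index() is a plain substring search - no regex engine,
        # and the ? % ( ) in the patterns need no escaping.
        $match_count{$pat}++ if index($line, $pat) >= 0;
    }
}
print "$_: $match_count{$_}\n" for sort keys %match_count;
```

    Whether this beats the regex approach on a 1 GB file is something only a benchmark on the real data can settle.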

      It takes 10 to 11 seconds just to parse the whole file; that's how I knew the matching takes the time. With only one pattern it takes 1 minute+, and it slowly increases from there. I need to make sure it doesn't take much memory or CPU, and it must run within 30-35 seconds.

Re: Multiple patterns match in a big file and track the counts of each pattern matched
by siberia-man (Friar) on Nov 28, 2017 at 12:33 UTC
    I think you can significantly improve performance by separating the logic responsible for skipping from the counting. The next step is to rework the counting code using built-in features of Perl. Finally, you can try to apply this approach in your script with your data; it would be very interesting to know the results. Try the following code (please note that it is not complete, it just shows the concept of the approach described above). I have commented it enough to make clear what happens at each step. Some mandatory parts are omitted to emphasize the key points of the approach; you need to add them in a final version before starting your tests.
    # initialize the array of patterns
    # the same code as you use in your script, just complete the line
    my @pat_array = ...;

    # this is a new hash variable used for counting matches
    # it is used entirely in place of your approach
    my %match_count;

    # skip the first lines
    # simply read them and do nothing with them
    <LOG_READ> for ( 1..$InStartLineNumber );

    # normal work
    # read the rest of the file line by line and do something
    while ( <LOG_READ> ) {
        # read the line, and store it in a variable explicitly
        chomp;
        my $line = $_;

        # walk through the list of patterns,
        # test the line for matching each pattern,
        # and count every successful match in the hash
        map { $line =~ m/\Q$_\E/ and $match_count{$_}++; } @pat_array;
    }

    # The rest of the code handling @pat_array and %match_count

      Thank you so much for such a detailed explanation and the piece of code. It works as expected, but takes a similar time to my code: mine takes 1 min 35 secs and this takes 1 min 32 secs. Can you please help me optimize it down to at least 40 secs? Waiting for your response :)

        Definitely, a 1 GB file is quite huge! Do you really think it is possible to improve the performance further in this case? Anyway, there are two other hints given by other monks: 1) use index, or 2) combine the small regexps into one bigger one. Also, you can move the code creating the regexps out of the loop: create the regexps before looping and use the "compiled" regexps within the loop.
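        The last hint, precompiling with qr// outside the loop, can be sketched as follows (patterns and input lines are made up for illustration):

```perl
use strict;
use warnings;

# Compile each pattern once with qr// before the loop, instead of
# re-processing \Q...\E on every line of the file.
my @pat_array = ('Error 78?', '(Not available) 77%');
my @compiled  = map { qr/\Q$_\E/ } @pat_array;

my @lines = ("Error 78? here\n", "nothing\n", "Error 78? there\n");

my %match_count;
for my $line (@lines) {
    for my $i (0 .. $#compiled) {
        # match against the precompiled regex object
        $match_count{ $pat_array[$i] }++ if $line =~ $compiled[$i];
    }
}
print "$_: $match_count{$_}\n" for sort keys %match_count;
```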
Re: Multiple patterns match in a big file and track the counts of each pattern matched
by duff (Parson) on Nov 29, 2017 at 14:13 UTC

    Your use-case sounds exactly like what Regexp::Assemble was invented for. To quote the docs:

    Regexp::Assemble takes an arbitrary number of regular expressions and assembles them into a single regular expression (or RE) that matches all that the individual REs match.

    As a result, instead of having a large list of expressions to loop over, a target string only needs to be tested against one expression. This is interesting when you have several thousand patterns to deal with. Serious effort is made to produce the smallest pattern possible.

    It is also possible to track the original patterns, so that you can determine which, among the source patterns that form the assembled pattern, was the one that caused the match to occur.
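    A minimal sketch of that tracking, assuming Regexp::Assemble is installed and using made-up patterns and lines (the patterns are added quotemeta'd since they are literals, so a small hash maps the escaped form back to the original):

```perl
use strict;
use warnings;
use Regexp::Assemble;   # CPAN module; assumed installed

# Hypothetical patterns and lines standing in for the real data.
my @pat_array = ('Error 78?', '(Not available) 77%');
my %orig = map { quotemeta($_) => $_ } @pat_array;

# track => 1 lets us ask afterwards which source pattern matched.
my $ra = Regexp::Assemble->new( track => 1 );
$ra->add( quotemeta $_ ) for @pat_array;

my @lines = ("saw Error 78? today\n", "fine\n", "(Not available) 77% now\n");

my %match_count;
for my $line (@lines) {
    if ( $ra->match($line) ) {
        # matched() returns the pattern (as added) that caused the match
        my $src = $ra->matched;
        $match_count{ $orig{$src} // $src }++;
    }
}
print "$_: $match_count{$_}\n" for sort keys %match_count;
```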

Re: Multiple patterns match in a big file and track the counts of each pattern matched
by haukex (Archbishop) on Dec 04, 2017 at 13:18 UTC

    I'm a bit late to the party, but I noticed this thread is still active. Here's my contribution, which is based on and very similar to choroba's idea, but using %+ instead of %-, and adding siberia-man's suggestion of skipping lines of the file before the main loop. The idea of the following code is that since we're constructing the regex ourselves, we know that only one named capture group (?<mN>...) will match at a time, so keys %+ should only ever return one value, from which we extract the digits N. As for why I sort the strings by length, see the tutorial Building Regex Alternations Dynamically. Another thing to note is that if multiple patterns could match on a single line, only the first one is counted; I'm not sure if that's acceptable in your case? It would also be possible to modify the code to find all matches on a single line with the /g modifier.

    use warnings;
    use strict;

    my @pat_array = sort { length $b <=> length $a } qw/ foo ba baz quzz /;
    my $InStartLineNumber = 2; # nr. of lines to skip

    my $i = 0;
    my ($regex) = map {qr/$_/} join '|',
        map { '(?<m'.$i++.'>'.quotemeta.')' } @pat_array; # pre-sorted above
    my @match_count = (0) x @pat_array;

    <DATA> for 1..$InStartLineNumber;
    while (<DATA>) {
        if ($_ =~ $regex) {
            $match_count[ substr( (keys %+)[0], 1 ) ]++;
        }
    }

    for my $i (0..$#pat_array) {
        print $pat_array[$i], ": ", $match_count[$i], "\n";
    }

    __DATA__
    Skip me foo
    Skip me bar
    Hello foo
    World bar
    foo bar
    baz
    foo quz

    Output:

    quzz: 0
    foo: 3
    baz: 1
    ba: 1

    I haven't yet benchmarked this against a big file, but give it a try. The above code assumes that you need your output in @match_count as you showed. If other data structures are acceptable, note the code can be simplified even more by using a single capture group and a hash, as follows. The set-up code and __DATA__ section is the same as the above.

    my ($regex) = map {qr/($_)/} join '|', map {quotemeta} @pat_array;
    my %match_count;
    <DATA> for 1..$InStartLineNumber;
    while (<DATA>) {
        if ($_ =~ $regex) {
            $match_count{$1}++;
        }
    }
    for my $k (sort keys %match_count) {
        print $k, ": ", $match_count{$k}, "\n";
    }

    Output:

    ba: 1
    baz: 1
    foo: 3

    One more thought: You haven't said why you need to skip lines in the file, but if the number of lines you're skipping is large, then of course that will take some time. If the amount of data you want to skip is somehow predictable, you could seek ahead in the file, this would be much faster. For example, say you have already processed a set of lines from the beginning of the file, and now you want to process the rest of the file, then I would suggest that the code which processes the first part of the file should record where it stopped (tell), so you can then seek to that position.
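    The tell/seek idea can be sketched like this; the two-part file here is a made-up stand-in for the real log:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small file: 3 "already processed" lines, then 2 payload lines.
my ($out, $file) = tempfile();
print $out "header line $_\n"  for 1 .. 3;
print $out "payload line $_\n" for 1 .. 2;
close $out;

# First pass: process the early lines and record where we stopped.
open my $in, '<', $file or die "can not open file: $!";
<$in> for 1 .. 3;          # pretend to process the first three lines
my $offset = tell $in;     # byte position just after line 3
close $in;

# Later pass: jump straight to the recorded offset - no line skipping.
open $in, '<', $file or die "can not open file: $!";
seek $in, $offset, 0 or die "seek failed: $!";
my @rest = <$in>;
close $in;
print scalar(@rest), " lines after offset $offset\n";
```

    Unlike skipping line by line, the seek costs the same no matter how far into the file the recorded position is.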

Re: Multiple patterns match in a big file and track the counts of each pattern matched
by Anonymous Monk on Nov 28, 2017 at 18:39 UTC
    Consider File::Map ... map the large file into memory and then treat it as a giant string.
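    A sketch of that approach, assuming File::Map is installed; the file contents and pattern are made up, and each pattern is counted over the whole mapped string with one /g match instead of a per-line loop:

```perl
use strict;
use warnings;
use File::Map qw(map_file);   # CPAN module; assumed installed
use File::Temp qw(tempfile);

# Build a small sample file so the sketch is self-contained.
my ($out, $file) = tempfile();
print $out "Error 78? here\nfine\nError 78? there\n";
close $out;

map_file my $map, $file, '<';   # $map now behaves like one big string

my @pat_array = ('Error 78?');
my %match_count;
for my $pat (@pat_array) {
    # count all occurrences of the literal pattern in the mapped string
    $match_count{$pat} = () = $map =~ /\Q$pat\E/g;
}
print "$_: $match_count{$_}\n" for sort keys %match_count;
```

    Note this counts occurrences rather than matching lines, so the numbers can differ from the per-line loop when a pattern appears twice on one line.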
