Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Count number of occurrences of a list of words in a file

by Azaghal (Novice)
on May 09, 2018 at 15:37 UTC ( [id://1214283]=perlquestion: print w/replies, xml ) Need Help??

Azaghal has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I've got a list of words (~60k words) and a big text file (~40M lines).

I would like to find the number of occurrences of each word from the list in the text file.

So far I've tried to put the list in a hash, go through each line of the text and increment the associated value each time I see it in the text (with an really basic regex).

Problem is, my code seems to be running fine on a small sample, but is taking ages with the real files. How can I improve my code ?

use strict; use warnings; use utf8::all; my $listfile = "list.txt"; open IN, '<', $listfile or die "could not open $listfile"; binmode(IN, ":utf8"); open OUT, '>', "result.txt" or die "could not open out"; binmode(OUT, ":utf8"); my %count; my @lists = (); while (my $line1 = <IN>) { chomp (my $key1 = (split /\t/, $line1)[0]); push(@lists,$key1); } close IN; foreach my $list (@lists) { $count{$list} = 0; } my $textfile = "textfile.txt"; open my $fh, '<', $textfile or die "Could not open '$textfile' $!"; while (my $line = <$fh>) { chomp $line; foreach my $mot (keys (%count)) { chomp $mot; foreach my $str ($line =~ /$mot/g) { $count{$str}++; } } } foreach my $word (reverse sort { $count{$a} <=> $count{$b} } keys %cou +nt) { print OUT "$word\t$count{$word}\n"; }

Replies are listed 'Best First'.
Re: Count number of occurrences of a list of words in a file
by Athanasius (Archbishop) on May 09, 2018 at 16:11 UTC

    Hello Azaghal,

    The bottleneck occurs here, in the inner loop:

    while (my $line = <$fh>) { chomp $line; foreach my $mot (keys (%count)) { chomp $mot; foreach my $str ($line =~ /$mot/g) { $count{$str}++; } } }

    If %count contains 60,000 entries, then the foreach loop performs 60,000 regex tests against each line of the input text file! Fortunately, this is quite unnecessary. I would split each line into words and simply lookup these words in the hash; like this (untested):

    while (my $line = <$fh>) { chomp $line; my @words = split /\W+/, $line; for my $word (@words) { ++$count{$word} if exists $count{$word}; } }

    (You may need to tweak the split regex, depending on the contents of the words in the list file.)

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      How about this then?

      #!/usr/bin/perl use 5.18.3; use warnings; chomp (my @words = <DATA>); my %cnt = map { $_ => 0 } @words; foreach my $tf (glob "*.log") { printf STDERR " %-40s\r", $tf; open my $fh, "<", $tf or next; while (<$fh>) { $cnt{$_}++ for grep { exists $cnt{$_} } m/(\w+)/g; } } printf "%-20s : %6d\n", $_, $cnt{$_} for sort { $cnt{$b} <=> $cnt{$a} +} @words; __END__ tux dromedary camel dream milk druid monk wizard azaghal perl

      I was more wondering about case sensitiveness

      I timed that against the original approach. My code (with 8 words): 19 seconds, OP code: 57 seconds. That difference will exponentially grow with the number of words: with 23 words 20 seconds versus 175 seconds. Total text size was 273 Mb.


      Enjoy, Have FUN! H.Merijn
Re: Count number of occurrences of a list of words in a file (updated)
by AnomalousMonk (Archbishop) on May 09, 2018 at 16:32 UTC

    You're enabling warnings and strictures, and that's very good, but in the OPed code you have two variables that are undeclared (that I can see): @listes and $file (update: now fixed). This code will not compile, and it's very important to present compilable code to the monks lest they grow grumpy. Please see Short, Self-Contained, Correct Example.

    That said, the word counting loop

    while (my $line = <$fh>) { chomp $line; foreach my $mot (keys (%count)) { chomp $mot; foreach my $str ($line =~ /$mot/g) { $count{$str}++; } } }
    jumps out at me. My thought was similar to fleet-fingered Athanasius's here (update: and almost-as-fleet-fingered Tux's here :), but rather than using split to exclude what is not wanted, I'd suggest defining a pattern $rx_word to extract anything that looks like a word that you might want to count. If the extracted word is in the pre-existing %count hash, count it. Something like (untested):
    my $rx_word = qr{ \b [[:alpha:]]+ \b }xms; # a very naïve word! while (my $line = <$fh>) { exists $count{$_} and ++$count{$_} for $line =~ m{ $rx_word }xmsg; }
    Obviously, the proper definition of $rx_word is critical! Only you can determine what this proper definition is. (If you define it right, you don't even need to bother chomp-ing anything.)

    (I would have suggested the technique described in haukex's Building Regex Alternations Dynamically, but I suspect your word list is so big that it would capsize the regex compiler | see Update below. But you might try it anyway; it might be faster if it works at all.)

    Update: I had thought that there was a hard limit to regex alternations that would cause compilation to fail, but it seems there is not — or if there is, it's much greater than 60K words! What I may have been thinking of is a limit to trie-optimization that causes a fallback to literal ordered alternation at some point. Given that that's the case, I would encourage you to try the dynamic alternation technique. I now expect it to work, and the only question is if it has a speed advantage.


    Give a man a fish:  <%-{-{-{-<

      Hi,

      Thanks for your reply !

      I've edited the faulty variables out, I went too fast making them more readable. It should be working code now.

      While the other answers were good, this one is the best for my particular need, as I needed to define quite precisely what should match or not, so that excluding the rest was not a good option, even if it's quicker.

        Thank you for the compliment, and I'm glad that my suggestion was helpful to you.

        I'm curious about your ultimate solution. As pointed out by Veltro here, the methods used by Athanasius, Tux and myself for enumeration of potential words are essentially identical. The differences in approach are between the split/exclusion and regex/extraction (as I would characterize them) methods used for finding candidate "words." Were you able to define a $rx_word regex object that had relatively few false positives (and, of course, absolutely no false negatives)? If so, it would be interesting to know what this definition is if it isn't so specific to your application as to be meaningless to others, or too proprietary.

        It would be of even greater interest to me if you were able to get the Building Regex Alternations Dynamically approach working and if it is advantageous in terms of speed. As I mentioned in my reply (now with more updates!), my expectation was that a 60K word list was too big to be encompassed by a regex alternation; I no longer believe this. If you were able to use this technique and it proved beneficial, I'd like to hear about it!


        Give a man a fish:  <%-{-{-{-<

Re: Count number of occurrences of a list of words in a file
by Veltro (Hermit) on May 09, 2018 at 22:15 UTC

    In the the three solutions that where posted, everyone uses 'exists $hash{key}' and a '$hash{key}=...' and to me these look like two look-ups in the hash to me. Is this efficient? Can this be more efficient?

    Athanasius

    ++$count{$word} if exists $count{$word};

    Tux

    $cnt{$_}++ for grep { exists $cnt{$_} } m/(\w+)/g;

    AnomalousMonk

    exists $count{$_} and ++$count{$_} for $line =~ m{ $rx_word }xmsg;
      Athanasius identified the code slowdown here. Hash lookups are the fastest way here and generally. The 2 lookups are necessary because you have to verify that the word being checked exists in the counting hash. Otherwise, without this check, a new word (not to be searched for) would be erroneously counted in the hash.

      My assumption was that there might be many things in Azaghal's input textfile.txt that look like "words", and he or she only wanted to count the words specified in the list.txt file. If that's the case, one must check that a "word" exists before incrementing it else one will autovivify a "word" that was not previously present. Hence, two hash accesses are necessary.


      Give a man a fish:  <%-{-{-{-<

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1214283]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (6)
As of 2024-04-23 19:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found