Count number of occurrences of a list of words in a file

Azaghal has asked for the wisdom of the Perl Monks concerning the following question:


Re: Count number of occurrences of a list of words in a file
by Athanasius (Archbishop) on May 09, 2018 at 16:11 UTC

The bottleneck occurs here, in the inner loop:

while (my $line = <$fh>) {
    chomp $line;
    foreach my $mot (keys (%count)) {
        chomp $mot;
        foreach my $str ($line =~ /$mot/g) {
            $count{$str}++;
        }
    }
}

If %count contains 60,000 entries, then the foreach loop performs 60,000 regex tests against each line of the input text file! Fortunately, this is quite unnecessary. I would split each line into words and simply lookup these words in the hash; like this (untested):

while (my $line = <$fh>)
{
    chomp $line;

    my @words = split /\W+/, $line;

    for my $word (@words)
    {
        ++$count{$word} if exists $count{$word};
    }
}

(You may need to tweak the split regex, depending on the contents of the words in the list file.)
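A minimal, self-contained sketch of the same lookup, assuming a made-up word list and two made-up input lines in place of the list file and the text file. It also shows why the split regex may need tweaking: the match is case-sensitive, so "Perl" is not counted as "perl".

```perl
use strict;
use warnings;

# Hypothetical word list and input text, for illustration only.
my %count = map { ( $_ => 0 ) } qw(camel perl monk);
my @lines = (
    "The camel met a monk; the monk spoke Perl.\n",
    "perl, perl and more perl\n",
);

for my $line (@lines) {
    chomp $line;
    # split /\W+/ can yield an empty leading field when a line starts
    # with a non-word character; the exists test quietly skips it.
    for my $word (split /\W+/, $line) {
        ++$count{$word} if exists $count{$word};
    }
}

# Prints: camel 1, monk 2, perl 3 ("Perl" with a capital is not counted).
printf "%-6s %d\n", $_, $count{$_} for sort keys %count;
```

Each line now costs one hash lookup per word it contains, regardless of how many entries %count holds.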

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Count number of occurrences of a list of words in a file

by Tux (Canon) on May 09, 2018 at 16:31 UTC

How about this then?

#!/usr/bin/perl

use 5.18.3;
use warnings;

chomp (my @words = <DATA>);
my %cnt = map { $_ => 0 } @words;

foreach my $tf (glob "*.log") {
    printf STDERR " %-40s\r", $tf;
    open my $fh, "<", $tf or next;
    while (<$fh>) {
        $cnt{$_}++ for grep { exists $cnt{$_} } m/(\w+)/g;
        }
    }

printf "%-20s : %6d\n", $_, $cnt{$_} for sort { $cnt{$b} <=> $cnt{$a} } @words;
__END__
tux
dromedary
camel
dream
milk
druid
monk
wizard
azaghal
perl

I was wondering more about case sensitivity.

I timed that against the original approach. My code (with 8 words): 19 seconds; OP code: 57 seconds. The gap widens as the word list grows, because the OP's run time scales roughly with the number of words: with 23 words, 20 seconds versus 175 seconds. Total text size was 273 MB.
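On the case-sensitivity point, one possible approach (an assumption, not something tested in this thread) is to lower-case both the word list and each candidate before the lookup. The word list and sample text below are made up:

```perl
use strict;
use warnings;

# Hypothetical word list; keys are stored lower-cased.
my %cnt = map { ( lc($_) => 0 ) } qw(Perl camel);

my $text = "Perl PERL perl Camel dromedary";   # made-up sample input

# Lower-case each candidate so that "Perl" and "PERL" both count as "perl".
$cnt{$_}++ for grep { exists $cnt{$_} } map { lc $_ } $text =~ m/(\w+)/g;

# Prints: camel : 1, perl : 3
printf "%-6s : %d\n", $_, $cnt{$_} for sort keys %cnt;
```

Whether folding case is appropriate depends on the data; it would merge "Monk" the title with "monk" the word.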

Enjoy, Have FUN! H.Merijn


Re: Count number of occurrences of a list of words in a file (updated)
by AnomalousMonk (Archbishop) on May 09, 2018 at 16:32 UTC

You're enabling warnings and strictures, and that's very good, but in the OPed code you have two variables that are undeclared (that I can see): @listes and $file (update: now fixed). This code will not compile, and it's very important to present compilable code to the monks lest they grow grumpy. Please see Short, Self-Contained, Correct Example.

That said, the word counting loop

while (my $line = <$fh>) {
    chomp $line;
    foreach my $mot (keys (%count)) {
        chomp $mot;
        foreach my $str ($line =~ /$mot/g) {
            $count{$str}++;
        }
    }
}

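is where the time goes, as Athanasius has already explained here (update: and almost-as-fleet-fingered Tux's here :). Rather than using split to exclude non-word characters, I would use a $rx_word regex to extract candidate words and check them against %count (untested):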

my $rx_word = qr{ \b [[:alpha:]]+ \b }xms;  # a very naïve word!

while (my $line = <$fh>) {
    exists $count{$_} and ++$count{$_} for $line =~ m{ $rx_word }xmsg;
    }

Note that because $rx_word never matches a newline, there is no need to chomp the line.

(I would have suggested the technique described in haukex's Building Regex Alternations Dynamically, ~~but I suspect your word list is so big that it would capsize the regex compiler~~ | see Update below. But you might try it anyway; it might be faster ~~if it works at all~~.)

Update: I had thought that there was a hard limit to regex alternations that would cause compilation to fail, but it seems there is not — or if there is, it's much greater than 60K words! What I may have been thinking of is a limit to trie-optimization that causes a fallback to literal ordered alternation at some point. Given that that's the case, I would encourage you to try the dynamic alternation technique. I now expect it to work, and the only question is if it has a speed advantage.
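A minimal sketch of what the dynamic alternation technique could look like here, assuming a tiny made-up word list; the name $rx_words is hypothetical, and nothing below says anything about speed on a 60K-word list:

```perl
use strict;
use warnings;

# Hypothetical word list; in the thread it would come from the list file.
my %count = map { ( $_ => 0 ) } qw(camel camels perl);

# Longest words first, so "camels" is tried before "camel" (the \b anchors
# already guard against partial matches, so this is belt and braces).
# quotemeta escapes any regex metacharacters the words might contain.
my $alternation = join '|',
    map  { quotemeta $_ }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %count;
my $rx_words = qr{ \b (?: $alternation ) \b }x;

my $text = "A camel, two camels, and Perl; perl and perlish things.";  # made-up

# Every match is a key of %count by construction, so no exists test is needed.
$count{$_}++ for $text =~ m{$rx_words}g;

# Prints: camel 1, camels 1, perl 1 ("Perl" and "perlish" do not match).
printf "%-7s %d\n", $_, $count{$_} for sort keys %count;
```

The only way to know whether this beats the hash-lookup versions on real data is to time both.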

Give a man a fish: <%-{-{-{-<


Re^2: Count number of occurrences of a list of words in a file

by Azaghal (Novice) on May 11, 2018 at 15:33 UTC

Hi,

Thanks for your reply !

I've edited the faulty variables out, I went too fast making them more readable. It should be working code now.

While the other answers were good, this one is the best for my particular need, as I needed to define quite precisely what should match or not, so that excluding the rest was not a good option, even if it's quicker.


Re^3: Count number of occurrences of a list of words in a file

by AnomalousMonk (Archbishop) on May 11, 2018 at 19:44 UTC

Thank you for the compliment, and I'm glad that my suggestion was helpful to you.

I'm curious about your ultimate solution. As pointed out by Veltro here, the methods used by Athanasius, Tux and myself for enumeration of potential words are essentially identical. The differences in approach are between the split/exclusion and regex/extraction (as I would characterize them) methods used for finding candidate "words." Were you able to define a $rx_word regex object that had relatively few false positives (and, of course, absolutely no false negatives)? If so, it would be interesting to know what this definition is if it isn't so specific to your application as to be meaningless to others, or too proprietary.

It would be of even greater interest to me if you were able to get the Building Regex Alternations Dynamically approach working and if it is advantageous in terms of speed. As I mentioned in my reply (now with more updates!), my expectation was that a 60K word list was too big to be encompassed by a regex alternation; I no longer believe this. If you were able to use this technique and it proved beneficial, I'd like to hear about it!

Give a man a fish: <%-{-{-{-<


Re: Count number of occurrences of a list of words in a file
by Veltro (Hermit) on May 09, 2018 at 22:15 UTC

In the three solutions that were posted, everyone uses 'exists $hash{key}' followed by '$hash{key}=...', and these look to me like two look-ups in the hash. Is this efficient? Can this be more efficient?

Athanasius

++$count{$word} if exists $count{$word};

Tux

$cnt{$_}++ for grep { exists $cnt{$_} } m/(\w+)/g;

AnomalousMonk

exists $count{$_} and ++$count{$_} for $line =~ m{ $rx_word }xmsg;


Re^2: Count number of occurrences of a list of words in a file

by Cristoforo (Curate) on May 09, 2018 at 22:28 UTC

Athanasius

here


Re^2: Count number of occurrences of a list of words in a file

by AnomalousMonk (Archbishop) on May 09, 2018 at 22:27 UTC

My assumption was that there might be many things in Azaghal's input textfile.txt that look like "words", and he or she only wanted to count the words specified in the list.txt file. If that's the case, one must check that a "word" exists before incrementing it else one will autovivify a "word" that was not previously present. Hence, two hash accesses are necessary.
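A small sketch of the difference, using made-up candidate words. The last part shows one possible alternative (an assumption, not something proposed in this thread): tally every candidate in a scratch hash with a single access each, then copy out only the listed words afterwards, at the cost of holding every distinct token in memory.

```perl
use strict;
use warnings;

my @found = qw(perl python perl);   # made-up candidate words from a text

# Without the exists test, every candidate ends up as a key.
my %unguarded = (perl => 0);
++$unguarded{$_} for @found;

# With the exists test, only the listed words are counted.
my %guarded = (perl => 0);
exists $guarded{$_} and ++$guarded{$_} for @found;

print join(',', sort keys %unguarded), "\n";   # perl,python
print join(',', sort keys %guarded),   "\n";   # perl

# Alternative: count everything once, then pick out the listed words.
my %seen;
++$seen{$_} for @found;
my %wanted = map { ( $_ => $seen{$_} // 0 ) } qw(perl monk);
```

Here %wanted ends up with perl => 2 and monk => 0; whether this is any faster than the exists test would have to be measured.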

Give a man a fish: <%-{-{-{-<


