Re: Count number of occurrences of a list of words in a file (updated)

You're enabling warnings and strictures, and that's very good, but in the OPed code you have two variables that are undeclared (that I can see): @listes and $file (update: now fixed). This code will not compile, and it's very important to present compilable code to the monks lest they grow grumpy. Please see Short, Self-Contained, Correct Example.

That said, the word counting loop

while (my $line = <$fh>) {
    chomp $line;
    foreach my $mot (keys (%count)) {
    chomp $mot;        
    foreach my $str ($line =~ /$mot/g) {
        $count{$str}++;
    }
    }    
}

jumps out at me. My thought was similar to fleet-fingered Athanasius's here (update: and almost-as-fleet-fingered Tux's here :), but rather than using split to exclude what is not wanted, I'd suggest defining a pattern $rx_word to extract anything that looks like a word that you might want to count. If the extracted word is in the pre-existing %count hash, count it. Something like (untested):

my $rx_word = qr{ \b [[:alpha:]]+ \b }xms;  # a very naïve word!

while (my $line = <$fh>) {
    exists $count{$_} and ++$count{$_} for $line =~ m{ $rx_word }xmsg;
    }

Obviously, the proper definition of $rx_word is critical! Only you can determine what this proper definition is. (If you define it right, you don't even need to bother chomp-ing anything.)
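
For anyone who wants to try the extraction approach in isolation, here is a minimal self-contained sketch. The word list and the in-memory "file" are made up for illustration, and $rx_word is still the naïve all-letters definition, so treat it as a starting point rather than a finished solution:

```perl
use strict;
use warnings;

# Hypothetical word list and sample text, for illustration only.
my %count = map { $_ => 0 } qw(cat dog bird);
my $text  = "The cat saw a dog.\nThe dog ignored the cat; catalog, hotdog.\n";

# Naive definition of a "word": a run of letters between word boundaries.
my $rx_word = qr{ \b [[:alpha:]]+ \b }xms;

# Read the sample text through an in-memory filehandle, line by line.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
while (my $line = <$fh>) {
    # Extract every candidate word; count only those already in %count.
    for my $word ($line =~ m{ $rx_word }xmsg) {
        $count{$word}++ if exists $count{$word};
    }
}
close $fh;

printf "%-5s %d\n", $_, $count{$_} for sort keys %count;
```

Because whole candidate words are extracted first and then looked up, "catalog" and "hotdog" do not bump the counts for "cat" or "dog". Each extracted word costs one hash lookup, however long the word list is.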

(I would have suggested the technique described in haukex's Building Regex Alternations Dynamically, ~~but I suspect your word list is so big that it would capsize the regex compiler~~ (see the Update below). But you might try it anyway; it might be faster ~~if it works at all~~.)

Update: I had thought that there was a hard limit to regex alternations that would cause compilation to fail, but it seems there is not — or if there is, it's much greater than 60K words! What I may have been thinking of is a limit to trie-optimization that causes a fallback to literal ordered alternation at some point. Given that that's the case, I would encourage you to try the dynamic alternation technique. I now expect it to work, and the only question is if it has a speed advantage.
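
A rough sketch of what the dynamic alternation might look like, again with a made-up word list and sample text. This is my reading of the general technique, not haukex's code verbatim, and I have not benchmarked it against the extract-and-look-up approach:

```perl
use strict;
use warnings;

# Hypothetical word list and sample text, for illustration only.
my @words = qw(cat dog bird);
my %count = map { $_ => 0 } @words;
my $text  = "The cat saw a dog.\nThe dog ignored the cat; catalog, hotdog.\n";

# Build one alternation from the list: escape metacharacters and put
# longer words first so a short word cannot shadow a longer one.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b } @words;

# Anchor with \b so only whole words match.
my $rx_list = qr{ \b (?: $alternation ) \b }xms;

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
while (my $line = <$fh>) {
    # Every match is by construction a key of %count; no lookup test needed.
    $count{$_}++ for $line =~ m{ $rx_list }xmsg;
}
close $fh;

printf "%-5s %d\n", $_, $count{$_} for sort keys %count;
```

Here the regex itself decides what counts, so the definition of a word boundary (\b in this sketch) plays the role that $rx_word plays in the extraction approach. Whether this beats the hash lookup for a 60K-word list is exactly the open question.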

Give a man a fish: <%-{-{-{-<

Re^2: Count number of occurrences of a list of words in a file by Azaghal (Novice) on May 11, 2018 at 15:33 UTC
Hi, thanks for your reply! I've edited the faulty variables out; I went too fast making them more readable. It should be working code now. While the other answers were good, this one is the best for my particular need: I needed to define quite precisely what should and should not match, so excluding the rest was not a good option, even if it's quicker.
Re^3: Count number of occurrences of a list of words in a file by AnomalousMonk (Archbishop) on May 11, 2018 at 19:44 UTC
Thank you for the compliment, and I'm glad that my suggestion was helpful to you.

I'm curious about your ultimate solution. As pointed out by Veltro here, the methods used by Athanasius, Tux and myself for enumeration of potential words are essentially identical. The differences in approach are between the split/exclusion and regex/extraction (as I would characterize them) methods used for finding candidate "words." Were you able to define a `$rx_word` regex object that had relatively few false positives (and, of course, absolutely no false negatives)? If so, it would be interesting to know what this definition is if it isn't so specific to your application as to be meaningless to others, or too proprietary.

It would be of even greater interest to me if you were able to get the Building Regex Alternations Dynamically approach working and if it is advantageous in terms of speed. As I mentioned in my reply (now with more updates!), my expectation was that a 60K word list was too big to be encompassed by a regex alternation; I no longer believe this. If you were able to use this technique and it proved beneficial, I'd like to hear about it!

Give a man a fish: `<%-{-{-{-<`

