PerlMonks  

Re: Count number of occurrences of a list of words in a file (updated)

by AnomalousMonk (Archbishop)
on May 09, 2018 at 16:32 UTC


in reply to Count number of occurrences of a list of words in a file

You're enabling warnings and strictures, and that's very good, but in the OPed code you have two variables that are undeclared (that I can see): @listes and $file (update: now fixed). This code will not compile, and it's very important to present compilable code to the monks lest they grow grumpy. Please see Short, Self-Contained, Correct Example.

That said, the word counting loop

    while (my $line = <$fh>) {
        chomp $line;
        foreach my $mot (keys (%count)) {
            chomp $mot;
            foreach my $str ($line =~ /$mot/g) {
                $count{$str}++;
            }
        }
    }

jumps out at me. My thought was similar to fleet-fingered Athanasius's here (update: and almost-as-fleet-fingered Tux's here :), but rather than using split to exclude what is not wanted, I'd suggest defining a pattern $rx_word to extract anything that looks like a word that you might want to count. If the extracted word is in the pre-existing %count hash, count it. Something like (untested):

    my $rx_word = qr{ \b [[:alpha:]]+ \b }xms;  # a very naïve word!

    while (my $line = <$fh>) {
        exists $count{$_} and ++$count{$_} for $line =~ m{ $rx_word }xmsg;
    }

Obviously, the proper definition of $rx_word is critical! Only you can determine what this proper definition is. (If you define it right, you don't even need to bother chomp-ing anything.)
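
To make the extraction idea concrete, here is a minimal, self-contained sketch that reads from an in-memory string rather than a real file. The helper name count_words, the word list and the sample text are all invented for the example, and the word pattern (letters optionally joined by an apostrophe or hyphen) is only one possible definition, not a recommendation for your data.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each listed word appears in the
# text readable from $fh. Words not in the list are ignored.
sub count_words {
    my ($fh, $rx_word, @words) = @_;
    my %count = map { $_ => 0 } @words;    # pre-existing keys, all at zero
    while (my $line = <$fh>) {
        # No chomp needed: the pattern never matches a newline.
        exists $count{$_} and ++$count{$_}
            for $line =~ m{ $rx_word }xmsg;
    }
    return \%count;
}

# One possible "word" (an assumption): runs of letters, optionally
# joined by a single apostrophe or hyphen.
my $rx_word = qr{ \b [[:alpha:]]+ (?: ['-] [[:alpha:]]+ )* \b }xms;

my $text = "The cat sat.\nThe cat's mat; a well-fed cat!\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $count = count_words($fh, $rx_word, qw(cat mat cat's well-fed dog));
printf "%-8s %d\n", $_, $count->{$_} for sort keys %$count;
```

Because the pattern decides what a candidate word is, changing $rx_word changes the counts without touching the loop. With the naïve letters-only pattern, "cat's" would be extracted as "cat" and "s" instead.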

(I would have suggested the technique described in haukex's Building Regex Alternations Dynamically, but I suspected your word list was so big that it would capsize the regex compiler; see the Update below. But you might try it anyway; it might be faster if it works at all.)

Update: I had thought that there was a hard limit to regex alternations that would cause compilation to fail, but it seems there is not — or if there is, it's much greater than 60K words! What I may have been thinking of is a limit to trie-optimization that causes a fallback to literal ordered alternation at some point. Given that that's the case, I would encourage you to try the dynamic alternation technique. I now expect it to work, and the only question is if it has a speed advantage.
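
For reference, the dynamic alternation technique might look something like the sketch below. It is an illustration under assumptions, not tested against a 60K word list: the word list and sample text are invented, the name $rx_listed is hypothetical, and the lookarounds that delimit a word are just one choice of delimiter.

```perl
use strict;
use warnings;

# Hypothetical word list; the real one would be read from a file.
my @words = qw(cat cats mat well-fed);

# Longest words first, so that "cats" is tried before "cat" if the engine
# uses plain ordered alternation; quotemeta escapes regex metacharacters.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length $b <=> length $a or $a cmp $b } @words;

# Assumed delimiters: a listed word must not touch other word characters.
my $rx_listed = qr{(?<!\w)(?:$alternation)(?!\w)};

my %count = map { $_ => 0 } @words;
my $text  = "The cat sat.\nTwo cats, one mat; a well-fed cat!\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

while (my $line = <$fh>) {
    # Every match is a listed word by construction, so no exists() test.
    ++$count{$_} for $line =~ m{$rx_listed}g;
}

printf "%-8s %d\n", $_, $count{$_} for sort keys %count;
```

Here the regex itself does the filtering, so the loop body shrinks to a bare increment. Whether that beats the extract-then-look-up approach on a large list is something only a benchmark on the real data can tell.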


Give a man a fish:  <%-{-{-{-<

Re^2: Count number of occurrences of a list of words in a file
by Azaghal (Novice) on May 11, 2018 at 15:33 UTC

    Hi,

    Thanks for your reply!

    I've edited the faulty variables out; I went too fast when renaming them to be more readable. It should be working code now.

    While the other answers were good, this one is the best for my particular need, as I needed to define quite precisely what should match or not, so that excluding the rest was not a good option, even if it's quicker.

      Thank you for the compliment, and I'm glad that my suggestion was helpful to you.

      I'm curious about your ultimate solution. As pointed out by Veltro here, the methods used by Athanasius, Tux and myself for enumeration of potential words are essentially identical. The differences in approach are between the split/exclusion and regex/extraction (as I would characterize them) methods used for finding candidate "words." Were you able to define a $rx_word regex object that had relatively few false positives (and, of course, absolutely no false negatives)? If so, it would be interesting to know what this definition is if it isn't so specific to your application as to be meaningless to others, or too proprietary.

      It would be of even greater interest to me if you were able to get the Building Regex Alternations Dynamically approach working and if it is advantageous in terms of speed. As I mentioned in my reply (now with more updates!), my expectation was that a 60K word list was too big to be encompassed by a regex alternation; I no longer believe this. If you were able to use this technique and it proved beneficial, I'd like to hear about it!


