Re^2: Word Frequency in Particular Sentences

by papidave (Pilgrim)
on Mar 28, 2008 at 11:54 UTC (#676953)

in reply to Re: Word Frequency in Particular Sentences
in thread Word Frequency in Particular Sentences

swampyankee++ for noticing the problem with abbreviations. Short of the ability to parse and comprehend grammar, it's going to be very difficult to separate

"We sold the division to MegaTech, Ltd. in Asia last week, who flipped the sale to someone else."
"We sold the division to MegaTech Industries. In Asia last week, they flipped the sale to someone else."
other than by the fact that a new sentence is supposed to start with an upper-case letter. There may be examples where the word following the abbreviation is a proper noun, however -- in which case it's going to be a very hard nut to crack.

If, however, you only care about the "typical" case (because this is going to be a one-shot tool), you could:

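  1. Split the text on /[.]\s+(?=[A-Z])/ to get sentences (the lookahead keeps the capital letter attached to the sentence it starts; a plain /[.]\s+[A-Z]/ would swallow it).
  2. Grep the sentences for /[aA]sia/, or for /\bAsia\b/ if you don't want the word "Asian" to count.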
  3. Split the sentences that pass on ' ' to get words.
  4. Use the words you get from that split as keys to a hash, and increment a count in each bin.
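
The four steps above might be sketched roughly as follows. This is only an illustration of the "typical case" approach, not a tested tool: the sub name asia_word_counts and the sample text are made up for the example, and it assumes a lookahead in the split pattern (so the capital letter is not consumed) and /\bAsia\b/ as the whole-word filter.

```perl
use strict;
use warnings;

# Hypothetical helper: count words in the sentences that mention Asia.
sub asia_word_counts {
    my ($text) = @_;
    my %count;

    # Step 1: split after a period and whitespace when a capital follows.
    # The lookahead (?=[A-Z]) leaves that capital with the next sentence.
    my @sentences = split /[.]\s+(?=[A-Z])/, $text;

    # Step 2: keep only the sentences that mention Asia as a whole word.
    for my $sentence (grep { /\bAsia\b/ } @sentences) {
        # Step 3: split ' ' breaks the sentence on runs of whitespace.
        for my $word (split ' ', $sentence) {
            # Step 4: each word is a hash key; increment its bin.
            $count{$word}++;
        }
    }
    return \%count;
}

# Sample text (assumed for illustration), based on the second example above.
my $text = "We sold the division to MegaTech Industries. "
         . "In Asia last week, they flipped the sale to someone else.";
my $counts = asia_word_counts($text);
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

Note that punctuation stays attached to the words ("week," and "else." are counted as-is), and that "Ltd. in Asia" is not split only because "in" is lower-case; an abbreviation followed by a proper noun would still be cut in the wrong place.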
