Word Frequency in Particular Sentences

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Word Frequency in Particular Sentences by swampyankee (Parson) on Mar 27, 2008 at 22:04 UTC
First, search CPAN for "PDF" to find a module that parses PDF files. Second, watch out for abbreviations, which usually end with a period, and sentences that end with something other than a period. Third, it's Perl, not PERL; Perl is not an acronym.
emc
Re^2: Word Frequency in Particular Sentences by papidave (Pilgrim) on Mar 28, 2008 at 11:54 UTC
swampyankee++ for noticing the problem with abbreviations. Short of the ability to parse and comprehend grammar, it's going to be very difficult to separate "We sold the division to MegaTech, Ltd. in Asia last week, who flipped the sale to someone else." from "We sold the division to MegaTech Industries. In Asia last week, they flipped the sale to someone else." other than by the fact that a new sentence is supposed to start with an upper-case letter. There may be examples where the word following an abbreviation is a proper noun, however, in which case it's going to be a very hard nut to crack.

If, however, you only care about the "typical" case (because this is going to be a one-shot tool), you could:

1. Split the text on `/[.]\s+[A-Z]/` to get sentences.
2. Grep the sentences for `/[aA]sia/`, or for `/Asia\s/` if you don't want the word "asian" to count.
3. Split the sentences that pass on `' '` to get words.
4. Use the words you get from that split as keys to a hash, and increment a count in each bin.

Q.E.D.
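The four steps above can be sketched roughly as follows. This is a minimal illustration, not a tested tool: the helper name `count_words_in_sentences` and the sample text are made up for the example. It also varies the recipe in two places, both assumptions on my part: the split uses lookarounds so that the period and the capital letter stay attached to their sentences (a plain split on `/[.]\s+[A-Z]/` would swallow them), and the grep uses `\b` word boundaries so that "Asia." and "Asia," count while "Asian" does not.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the words of every sentence that mentions $keyword.
sub count_words_in_sentences {
    my ($text, $keyword) = @_;

    # Step 1: split after a period when whitespace and a capital letter follow.
    # The lookarounds keep the period and the capital in the resulting pieces.
    my @sentences = split /(?<=[.])\s+(?=[A-Z])/, $text;

    # Step 2: keep only the sentences containing the keyword as a whole word.
    my @hits = grep { /\b\Q$keyword\E\b/ } @sentences;

    # Steps 3 and 4: split on whitespace, trim punctuation, count in a hash.
    my %count;
    for my $sentence (@hits) {
        for my $word (split ' ', $sentence) {
            $word =~ s/^\W+|\W+$//g;    # drop leading/trailing punctuation
            next unless length $word;
            $count{$word}++;
        }
    }
    return \%count;
}

my $text = 'We sold the division to MegaTech Industries. '
         . 'In Asia last week, they flipped the sale to someone else. '
         . 'Nobody in Europe noticed the sale.';

my $count = count_words_in_sentences($text, 'Asia');
printf "%d\t%s\n", $count->{$_}, $_ for sort keys %$count;
```

Because the split insists on a capital letter after the period, "MegaTech, Ltd. in Asia" stays inside one sentence, but an abbreviation followed by a proper noun would still be cut in two, as noted above.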
Re: Word Frequency in Particular Sentences by nefigah (Monk) on Mar 27, 2008 at 22:20 UTC
Everybody stand back! :) As was well stated, someone has already written things to get text out of PDFs for you. The second problem of finding sentences with "Asia" in them is more interesting, and should be a good learning exercise for you. So, pretending that you already have a plain text file full of, erm, text available, how would you go about identifying Asian sentences in it? Do you have an idea how you would begin? (Trying to ascertain what you already know/what you have already written.)
I'm a peripheral visionary... I can see into the future, but just way off to the side.
Re^2: Word Frequency in Particular Sentences by Anonymous Monk on Mar 28, 2008 at 02:39 UTC
Let me 'fess up. I was hoping that a similar problem had been solved already by someone and I could simply adapt that. As a well-aged academic economist, I am way past trying to master Perl at any deep level. Still, I will be grateful if you (or someone) could assure me that this is a doable problem in Perl and maybe point to a few functions/regular expressions(?) that may be used in this case. Thanks.
Re^3: Word Frequency in Particular Sentences by roboticus (Chancellor) on Mar 28, 2008 at 03:51 UTC
OK ... here's a small code example to get you started. (You'll still want to hit CPAN for a PDF parsing module, though.)

```perl
#!/usr/bin/perl -w
use strict;
use warnings;

# Tell perl to split records on periods.
$/ = '.';

my %words;

# Read successive records (text up to the next period) from our __DATA__ section
while (<DATA>) {
    # Skip the sentence unless it contains the text "asia"
    next unless m/asia/i;

    # Replace each run of non-letters with a single space
    tr/a-zA-Z/ /cs;

    # Show each sentence we keep
    print "<$_>\n";

    # Increment the counter for each word found
    $words{$_}++ for split;
}

print "\n\n"
    . "Count Word\n"
    . "----- -------------\n";

# Print all words in the sentences that appear more than once.
for (sort keys %words) {
    next unless $words{$_} > 1;
    print "$words{$_}\t$_\n";
}

__DATA__
Now is the time for all good. Men to come to the Asia of their party.
The quick red fox jumped over the calico cat. One fish two fish asiatic fish blue fish.
Zoom. When must we come to asia to see the fox? Dolum ipsum dolor est.
Canem homo mordet. I would guess that few people speak latin in Asia.
Perhaps many more asians speak greek. But how would I know?
```

When run on my machine, it gives us:

```
roboticus~ $ ./re_test.pl
< Men to come to the Asia of their party >
< One fish two fish asiatic fish blue fish >
< When must we come to asia to see the fox Dolum ipsum dolor est >
< I would guess that few people speak latin in Asia >
< Perhaps many more asians speak greek >


Count Word
----- -------------
2	Asia
2	come
4	fish
2	speak
2	the
4	to
roboticus~ $
```

...roboticus
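Two things in that output are worth a second look. `$/` is a fixed string, not a pattern, so a "?" never ends a record, which is why "fox" runs straight into "Dolum ipsum dolor est". The counts are also case-sensitive, so "Asia" and "asia" land in different bins. Here is one possible variation, offered as a sketch rather than a fix to the script above: it reads the text as one string, splits after ".", "!" or "?", and lower-cases the words before counting. The helper names `sentences` and `word_counts` are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split text into sentences after '.', '!' or '?'.
sub sentences {
    my ($text) = @_;
    return grep { /\S/ } split /(?<=[.!?])\s+/, $text;
}

# Hypothetical helper: case-insensitive word counts over sentences matching $re.
sub word_counts {
    my ($text, $re) = @_;
    my %count;
    for my $sentence (grep { $_ =~ $re } sentences($text)) {
        # Lower-case first so that "Asia" and "asia" share one bin.
        $count{$_}++ for lc($sentence) =~ /[a-z]+/g;
    }
    return \%count;
}

my $text = 'When must we come to asia to see the fox? Dolum ipsum dolor est. '
         . 'I would guess that few people speak latin in Asia.';

my $count = word_counts($text, qr/asia/i);

# Most frequent words first, ties broken alphabetically.
print "$count->{$_}\t$_\n"
    for sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
```

Like the original, this still treats every period as a sentence end, so abbreviations such as "Ltd." will split a sentence early.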
Re^3: Word Frequency in Particular Sentences by nefigah (Monk) on Mar 28, 2008 at 06:25 UTC
And here is some code for getting the text out of a PDF, using an excellent little CPAN module called CAM::PDF. (If you don't know how to install CPAN modules, just ask.) This goes through a PDF page by page, grabbing the text, and then saves it all to a text file. Note that if your PDF is huge you may want to modify this to do it in chunks (the 367-page PDF I tested it on only took a few seconds, though).

```perl
#!/usr/bin/perl
use warnings;
use strict;
use CAM::PDF;

my $pdf_path = $ARGV[0] or die "No pdf specified\n";
my $pdf = CAM::PDF->new($pdf_path)
    or die "Cannot read $pdf_path: $CAM::PDF::errstr\n";

# Collect the text of every page
my $text = '';
for my $page (1 .. $pdf->numPages) {
    $text .= $pdf->getPageText($page);
}

# Save it all to a plain text file
open my $file, '>', 'pdftext.txt' or die "Cannot write pdftext.txt: $!\n";
print $file $text;
close $file or die "Cannot close pdftext.txt: $!\n";
```

I'm a peripheral visionary... I can see into the future, but just way off to the side.
Re^4: Word Frequency in Particular Sentences by Anonymous Monk on Mar 28, 2008 at 16:40 UTC
Re^5: Word Frequency in Particular Sentences by planetscape (Chancellor) on Mar 28, 2008 at 22:54 UTC
Re^5: Word Frequency in Particular Sentences by nefigah (Monk) on Mar 28, 2008 at 18:09 UTC
Some notes below your chosen depth have not been shown here

