PerlMonks  

Word Frequency in Particular Sentences

by Anonymous Monk
on Mar 27, 2008 at 21:20 UTC ( [id://676856]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a PDF file. (1) I want to list the sentences containing the word "Asia" from that file. (2) I want to make up a word frequency table from just those sentences. The sentences begin and end with periods. Basically, I am trying to see which words occur most frequently when the word Asia is present. I am a beginner in PERL, so please tell me if there is already a program out there for this somewhere.

Replies are listed 'Best First'.
Re: Word Frequency in Particular Sentences
by swampyankee (Parson) on Mar 27, 2008 at 22:04 UTC

    First, look for pdf in CPAN to find a module that parses PDF files.

    Second, watch out for abbreviations, which usually end with a period, and sentences that end with something other than a period.

    Third, it's Perl, not PERL; Perl is not an acronym.


    emc

    Information about American English usage here and here. Floating point issues? Please read this before posting.

      swampyankee++ for noticing the problem with abbreviations. Short of the ability to parse and comprehend grammar, it's going to be very difficult to separate

      "We sold the division to MegaTech, Ltd. in Asia last week, who flipped the sale to someone else."
      from
      "We sold the division to MegaTech Industries. In Asia last week, they flipped the sale to someone else."
      other than the fact that we are supposed to start a new sentence with an upper-case letter. There may be examples where that following word is a proper noun, however -- in which case it's going to be a very hard nut to crack.

      If, however, you only care about the "typical" case (because this is going to be a one-shot tool), you could:

      1. Split the text on /[.]\s+(?=[A-Z])/ to get sentences (the lookahead keeps the capital letter attached to the sentence it starts).
      2. Grep the sentences for /[aA]sia/, or for /\bAsia\b/ if you don't want the word "asian" to count.
      3. Split the sentences that pass on ' ' to get words.
      4. Use the words you get from that split as keys to a hash, and increment a count in each bin.
      Q.E.D.
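
The four steps above can be sketched roughly as follows. This is only a minimal illustration, not a tested tool: the subroutine name asia_word_counts is made up for this example, the split uses a lookahead so the capital letter is not swallowed, and stripping punctuation and lower-casing the words are extra assumptions that go beyond the steps as listed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): returns a hash
# reference mapping each lower-cased word to its count, taken only from
# sentences that contain the whole word "Asia".
sub asia_word_counts {
    my ($text) = @_;
    my %count;

    # Step 1: split on a period followed by whitespace and a capital letter.
    my @sentences = split /[.]\s+(?=[A-Z])/, $text;

    # Step 2: keep only the sentences that mention Asia as a whole word.
    for my $sentence ( grep { /\bAsia\b/ } @sentences ) {

        # Step 3: split the sentence on whitespace to get words.
        for my $word ( split ' ', $sentence ) {

            # Assumption: trim leading/trailing punctuation such as "week,".
            $word =~ s/^\W+|\W+$//g;
            next unless length $word;

            # Step 4: use the word as a hash key and bump its count.
            $count{ lc $word }++;
        }
    }
    return \%count;
}

my $text = 'We sold the division to MegaTech Industries. In Asia last week, '
         . 'they flipped the sale. Nothing else happened.';

my $counts = asia_word_counts($text);
printf "%-10s %d\n", $_, $counts->{$_} for sort keys %$counts;
```

As noted above, this still treats "MegaTech, Ltd. In Asia ..." as two sentences, so it is only suitable for the "typical" one-shot case.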

Re: Word Frequency in Particular Sentences
by nefigah (Monk) on Mar 27, 2008 at 22:20 UTC

    Everybody stand back! :)

    As was well stated, someone has already written things to get text out of PDFs for you. The second problem of finding sentences with "Asia" in them is more interesting, and should be a good learning exercise for you.

    So, pretending that you already have a plain text file full of, erm, text available, how would you go about identifying asian sentences in it? Do you have an idea how you would begin? (Trying to ascertain what you already know/what you have already written)


    I'm a peripheral visionary... I can see into the future, but just way off to the side.

      Let me 'fess up. I was hoping that a similar problem has been solved already by someone and I could simply adapt that. As a well aged academic economist, I am way past trying to master Perl at any deep level. Still I will be grateful if you (or someone) could assure me that this is a doable problem in Perl and maybe point to a few functions/regular expressions(?) that may be used in this case. Thanks.
        OK ... here's a small code example to get you started. (You'll still want to hit CPAN for a PDF parsing module, though.)

        #!/usr/bin/perl -w
        use strict;
        use warnings;

        # Tell perl to split records on periods.
        $/ = '.';

        my %words;

        # Read successive records (sentences) from our __DATA__ section
        while (<DATA>) {
            # Skip the sentence unless it contains the text "asia"
            next unless m/asia/i;

            # Replace every run of non-letters with a single space
            tr/a-zA-Z/ /cs;

            # Show each sentence we keep
            print "<$_>\n";

            # Increment the counter for each word found
            $words{$_}++ for split;
        }

        print "\n\n"
            . "Count Word\n"
            . "----- -------------\n";

        # Print all words in the kept sentences that appear more than once.
        for (sort keys %words) {
            next unless $words{$_} > 1;
            print "$words{$_}\t$_\n";
        }

        __DATA__
        Now is the time for all good. Men to come to the Asia of their party.
        The quick red fox jumped over the calico cat.
        One fish two fish asiatic fish blue fish. Zoom.
        When must we come to asia to see the fox?
        Dolum ipsum dolor est. Canem homo mordet.
        I would guess that few people speak latin in Asia.
        Perhaps many more asians speak greek. But how would I know?
        When run on my machine, it gives us:

        roboticus~ $ ./re_test.pl
        < Men to come to the Asia of their party >
        < One fish two fish asiatic fish blue fish >
        < When must we come to asia to see the fox Dolum ipsum dolor est >
        < I would guess that few people speak latin in Asia >
        < Perhaps many more asians speak greek >


        Count Word
        ----- -------------
        2       Asia
        2       come
        4       fish
        2       speak
        2       the
        4       to
        roboticus~ $
        ...roboticus
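
In the sample run above, the sentence ending in "fox?" runs into the following one, because only a period ends a record. One possible variation, sketched here under the assumption that sentences end in '.', '!' or '?' followed by whitespace, is to read the text as a whole and split it with a lookbehind. The subroutine name split_sentences is invented for this illustration, and abbreviations such as "Ltd." will still be split incorrectly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): split text into
# sentences wherever '.', '!' or '?' is followed by whitespace.
sub split_sentences {
    my ($text) = @_;

    # The lookbehind keeps the terminating punctuation with its sentence.
    my @sentences = split /(?<=[.!?])\s+/, $text;

    # Trim stray whitespace at either end of each sentence.
    s/^\s+|\s+$//g for @sentences;

    # Drop any empty pieces.
    return grep { length } @sentences;
}

my $text = 'When must we come to asia to see the fox? Dolum ipsum dolor est. '
         . 'I would guess that few people speak latin in Asia.';

# Print only the sentences that mention "asia", ignoring case.
print "<$_>\n" for grep { /asia/i } split_sentences($text);
```

With this split, the "fox?" sentence and the Latin one are kept apart, so only the two sentences that actually mention Asia are printed.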

        And here is some code for getting the text out of a PDF, using an excellent little CPAN module called CAM::PDF. (If you don't know how to install CPAN modules, just ask).

        This goes through a PDF page-by-page, grabbing the text, and then saves it all to a text file. Note that if your PDF is huge you may want to modify this to do it in chunks (the 367 page PDF I tested it on only took a few seconds, though).

        #!/usr/bin/perl
        use warnings;
        use strict;

        use CAM::PDF;

        my $pdf_path = $ARGV[0] or die "No pdf specified";
        my $pdf = CAM::PDF->new($pdf_path)
            or die "Could not read PDF '$pdf_path'";

        # Collect the text of every page
        my $text = '';
        for my $page (1..$pdf->numPages) {
            $text .= $pdf->getPageText($page);
        }

        # Save it all to a text file
        open my $file, '>', 'pdftext.txt'
            or die "Cannot write pdftext.txt: $!";
        print $file $text;
        close $file;
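
For a very large PDF, one way to do it "in chunks" is to write each page's text out as soon as it is extracted, rather than building one big string. The following is only a sketch under that assumption: the subroutine name write_page_text is invented here, and it relies solely on the numPages and getPageText methods already used above. CAM::PDF is loaded only when a file name is given on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): write the text of
# each page to the given filehandle as soon as it has been extracted.
# $pdf may be any object that provides numPages and getPageText.
sub write_page_text {
    my ($pdf, $out) = @_;
    for my $page ( 1 .. $pdf->numPages ) {
        my $text = $pdf->getPageText($page);
        print {$out} $text if defined $text;
    }
    return;
}

if (@ARGV) {
    # Load the module only when there is a PDF to process.
    require CAM::PDF;

    my $pdf_path = $ARGV[0];
    my $pdf = CAM::PDF->new($pdf_path)
        or die "Could not read PDF '$pdf_path'";

    open my $out, '>', 'pdftext.txt'
        or die "Cannot write pdftext.txt: $!";
    write_page_text($pdf, $out);
    close $out or die "Cannot close pdftext.txt: $!";
}
```

Only one page of text is held in memory at a time, and the resulting pdftext.txt is the same kind of file the script above produces.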


        I'm a peripheral visionary... I can see into the future, but just way off to the side.
