 
PerlMonks  

tutelage needed

by ctp (Beadle)
on Jan 01, 2004 at 01:23 UTC [id://318060]

ctp has asked for the wisdom of the Perl Monks concerning the following question:

UPDATE 01/03/04 - Just got home from Disneyland and added some code for whoever is still tuned into this thread.
So far I have:
#!/usr/bin/perl
# midterm part 1
use warnings;

open (TEXTFILE, 'text.txt') or die ("Cannot open file : $!");
$big_string = <TEXTFILE>;
$big_string = lc ($big_string);
$big_string =~ s/\(|\)/ /g;
@words = split (/[,.\s+]/, $big_string);
foreach (@words) {
    push @gt_three_char_words, $_ if /[a-zA-Z]{4,}|[a-zA-Z]{3,}'/;
}
foreach (@gt_three_char_words) {
    $hash{$_}++;
}
foreach (%hash) {
    print $_, "\n";
}
As you can see I haven't implemented the slurp yet, but I did get nnn'n words to work, and I now have a hash with word keys and count values. I went for the foreach instead of the map for now. I'll play with map (and grep) later. It may be all downhill from here, so many thanks to everyone who helped...I learned a ton!
Quoted from the original post:
The problem at hand is to write a script which will read a text file, and list the most common words found, >4 characters, and print out the top ten, each with their number of occurrences, sorted by frequency.

Replies are listed 'Best First'.
Re: tutelage needed
by Zaxo (Archbishop) on Jan 01, 2004 at 02:18 UTC

    Are you familiar with Perl hashes? They are very helpful for questions like this. With them, a string indexes a scalar chunk of data. That can be applied to a wordcount by just incrementing the value keyed by each word you see. After that, sort can be told to pick out the keys with the highest values, and grep to filter out short ones (or else don't add them to the hash in the first place).
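A minimal sketch of that approach (the word list and variable names are invented for illustration):

```perl
use strict;
use warnings;

# toy word list standing in for the words split out of the file
my @words = qw(apple tree apple cat tree apple of);

my %count;
$count{$_}++ for @words;            # increment the value keyed by each word

# grep filters out the short words, sort picks the highest counts first
my @top = sort { $count{$b} <=> $count{$a} }
          grep { length($_) > 3 } keys %count;
```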

    Your split doesn't do exactly what you want; it will split "want, it" into three, with a zero-length 'word' from between the comma and the space. You may want to replace the space in your character class with \s, add more punctuation, and allow it to repeat with the + quantifier.
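That fix might look like this (made-up string; the extended punctuation list is just one possible choice):

```perl
use strict;
use warnings;

my $s = "want, it";

# original-style class: one separator character at a time
my @bad  = split /[,.\s+]/, $s;      # ("want", "", "it") -- empty 'word' in the middle
# \s for whitespace, more punctuation, and + so a run of separators splits once
my @good = split /[,.;:!?\s]+/, $s;  # ("want", "it")
```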

    Check out the length function.

    What is unexpected about the behavior of apostrophes?

    After Compline,
    Zaxo

      Are you familiar with Perl hashes?

      getting there

      That can be applied to a wordcount by just incrementing the value keyed by each word you see

      I planned on needing to use a hash here, but it was the creation and filling of that hash that had me stuck...actually it was just the filling of it, they're pretty easy to create :)

      Your split doesn't do exactly what you want

      ah...good suggestions, I'll go try them out

      Check out the length function.

      I did earlier in my attempts, but I didn't get very usable results. It told me every word was 1 byte long regardless of actual length. Probably not using it right.

      What is unexpected about the behavior of apostrophes?

      well, I consider a word like dog's, or cat's a 4 letter word, but my script doesn't seem to.

        well, I consider a word like dog's, or cat's a 4 letter word, but my script doesn't seem to

        Nope, it sure doesn't. Consider your code that picks out the 4 letter words:

        push @gt_three_char_words, $_ if /[a-zA-Z]{4,}/;

        What types of characters "count" when they're counted this way?
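(To spell out the hint: only letters match [a-zA-Z], so the apostrophe in dog's breaks the run after three letters. A made-up comparison:)

```perl
use strict;
use warnings;

my @words = ("dog's", "cat's", "tree");

# four consecutive letters required; the apostrophe interrupts the run
my @letters_only = grep { /[a-zA-Z]{4,}/ } @words;     # just ("tree")

# letting the apostrophe count as a word character picks up the others too
my @with_apostrophe = grep { /[a-zA-Z']{4,}/ } @words;
```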

        Gary Blackburn
        Trained Killer

Re: tutelage needed
by pg (Canon) on Jan 01, 2004 at 01:31 UTC

    lc() does not modify the string you pass in; what you should do is assign what lc() returns back to $big_string.

    If you add "use warnings" at the beginning of your script, you will see:

    Useless use of lc in void context at foo line bar.

    Update:

    Just to add one piece of sample code:

    use strict;
    use warnings;

    my $a = "AbCdEfG";
    $a = lc($a);
    print $a;

    Update 2:

    How about "use strict", and declare your variables.

    So you expect your file to be a one-liner, don't you? You only read one line from it.

      hmmmm...I had tried it that way, but it didn't work. May have been a compound error. I'll go back and try again.

      Also will turn warnings back on...thanks.

      UPDATE - two folks have mentioned the one line thing, but when I ran the script against a couple pages of text I got a list of a couple hundred words of 4 or more characters. Doesn't that s switch take care of the one line only trouble?
Re: tutelage needed
by jweed (Chaplain) on Jan 01, 2004 at 02:15 UTC

    Okay, a couple of things:

    1 - Do you expect your file to be one long line? If not, you need to "slurp" the file, rather than doing what you're doing now (reading only the first line). Try $big_string = do {local $/; <TEXTFILE>};.

    2 - Your first substitution statement does not require the outside capturing parens. It's just noisy.

    3 - I wouldn't sort the array before doing the frequency count, as it just takes time for little gain.

    4 - Finally, in response to your last question about how to actually do the count, I have a few suggestions. A useful idiom is, for each word, to say $hash{$word}++. It creates an entry if the word has not been seen before, and increments it if it has. Use a for loop or a map statement to construct the hash. In the end, use a sort with a routine which sorts by the entries in the hash and (optionally) afterwards ascii-betically by the actual words.
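A minimal sketch of point 4 (sample words invented):

```perl
use strict;
use warnings;

my %hash;
$hash{$_}++ for qw(pear plum pear kiwi plum pear);   # the $hash{$word}++ idiom

# by count, descending; ties broken ascii-betically by the word itself
my @sorted = sort { $hash{$b} <=> $hash{$a} or $a cmp $b } keys %hash;
```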


    Hope that helped!



    Who is Kayser Söze?
    Code is (almost) always untested.
      1- I thought the s modifier treated the string as one line. Am I reading the meaning of that wrong?

      2 - oh, cool...thanks. That's one of those cases where I start writing a regex, and tweak it repeatedly until it works...but then since, by some miracle, it does work I am reluctant to tweak it further :)

      3- yea - I wrote that line to see if I could, knowing I might need it a little later.

      4- I've seen that form before, but I'm gonna try to figure out how to implement it. I have a map statement example here in one of my books that kinda is making sense to me. I'll give it a try.

      thanks!
        I don't see an s modifier anywhere. Am I missing something?
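(For what it's worth, the /s modifier only changes how `.` behaves inside a regex; it has nothing to do with how many lines <TEXTFILE> reads. A made-up illustration:)

```perl
use strict;
use warnings;

my $text = "line one\nline two";

# without /s, . refuses to match the newline, so this match fails
my ($without) = $text =~ /one(.*)two/;
# with /s, . matches the newline as well
my ($with)    = $text =~ /one(.*)two/s;
```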


        Who is Kayser Söze?
        Code is (almost) always untested.
Re: tutelage needed
by toma (Vicar) on Jan 01, 2004 at 17:54 UTC
    To find programs that already do what you want, search for a 'concordance' generator. I have a few of these lying around, since I assigned this as homework. None are production quality, but I can post one if you would like.

    In order to remove word endings, find the stem of each word. This is called 'stemming'.

    There are modules for each of these on CPAN: Lingua::Stem and WordNet::QueryData, although the WordNet module is overkill for what you need.

    It should work perfectly the first time! - toma
Re: tutelage needed
by injunjoel (Priest) on Jan 01, 2004 at 23:23 UTC
    Greetings all,
    Many good comments here already; I thought I might give this one a shot. Here is the methodology I would try.
    1. Create a hash keyed by each of the words in your file; the values will be a count of how many times each word (key) appears.
    2. Test that you successfully open your file.
    3. Once opened read the lines of the file one at a time with a while(<FILEHANDLE>){ #logic } loop.
    4. Lowercase all the characters in the line.
    5. With each line, replace all the non-word characters with a single space (in case someone did not add a space after a period or between commas); this could also be where you deal with your apostrophes.
    6. Split the line based on word boundaries (\b, I think, is the regex metacharacter).
    7. Go through the split list word by word. If a word is longer than four characters and is already defined in the hash, ++ the hash element keyed by that word; otherwise add the key to the hash and initialize its value to one.
    8. Once all the lines are done, sort the hash based on the values. The sort keys question thread is a good discussion of how you can do that.
    9. Print the top ten.
    10. Marvel at the power of perl.
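The ten steps above might be sketched like this (the input lines and filename handling are invented for illustration, and whitespace splitting is used after step 5 rather than \b):

```perl
use strict;
use warnings;

# stand-in for reading lines from an opened file (steps 2-3)
my @lines = ("The quick brown fox's tail.\n",
             "Quick foxes jump over lazy brown dogs, quick!\n");

my %count;                                  # step 1: word => number of occurrences
for my $line (@lines) {
    $line = lc $line;                       # step 4: lowercase everything
    $line =~ s/[^a-z']+/ /g;                # step 5: non-word chars -> space (keeps ')
    for my $word (split ' ', $line) {       # step 6: break the line into words
        $count{$word}++ if length($word) > 4;   # step 7: count the long words
    }
}

# steps 8-9: sort by count and print the top ten
my @top = (sort { $count{$b} <=> $count{$a} } keys %count)[0 .. 9];
print "$_: $count{$_}\n" for grep { defined } @top;
```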
      Awesome stuff, and much help and idea fodder. I will try some of them out as soon as I can. I followed the sort keys link just now and made use of some info there. Thanks!
