 
PerlMonks  

tutelage needed

by ctp (Beadle)
on Jan 01, 2004 at 01:23 UTC [id://318060]

ctp has asked for the wisdom of the Perl Monks concerning the following question:

UPDATE 01/03/04 - Just got home from Disneyland and added some code for whoever is still tuned into this thread.
So far I have:
#!/usr/bin/perl
# midterm part 1
use warnings;

open (TEXTFILE, 'text.txt') or die ("Cannot open file : $!");
$big_string = <TEXTFILE>;
$big_string = lc ($big_string);
$big_string =~ s/\(|\)/ /g;
@words = split (/[,.\s+]/, $big_string);
foreach (@words) {
    push @gt_three_char_words, $_ if /[a-zA-Z]{4,}|[a-zA-Z]{3,}'/;
}
foreach (@gt_three_char_words) {
    $hash{$_}++;
}
foreach (%hash) {
    print $_, "\n";
}
As you can see I haven't implemented the slurp yet, but I did get nnn'n words to work, and I now have a hash with word keys and count values. I went for the foreach instead of the map for now. I'll play with map (and grep) later. It may be all downhill from here, so many thanks to everyone who helped...I learned a ton!
Quoted from the original post:
The problem at hand is to write a script which will read a text file, and list the most common words found, >4 characters, and print out the top ten, each with their number of occurrences, sorted by frequency.

Replies are listed 'Best First'.
Re: tutelage needed
by Zaxo (Archbishop) on Jan 01, 2004 at 02:18 UTC

    Are you familiar with Perl hashes? They are very helpful for questions like this. With them, a string indexes a scalar chunk of data. That can be applied to a wordcount by just incrementing the value keyed by each word you see. After that, sort can be told to pick out the keys with the highest values, and grep to filter out short ones (or else don't add them to the hash in the first place).
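A minimal sketch of that approach (the word list and variable names are invented for illustration):

```perl
use strict;
use warnings;

# toy word list standing in for the words split out of the file
my @words = qw(apple tree apple cat tree apple of);

my %count;
$count{$_}++ for @words;            # increment the value keyed by each word

# grep filters out the short words, sort picks the highest counts first
my @top = sort { $count{$b} <=> $count{$a} }
          grep { length($_) > 3 } keys %count;
```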

    Your split doesn't do exactly what you want; it will split "want, it" into three, with a zero-length 'word' from between the comma and the space. You may want to replace the space in your character class with \s, add more punctuation, and allow it to repeat with the + quantifier.
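That fix might look like this (made-up string; the extended punctuation list is just one possible choice):

```perl
use strict;
use warnings;

my $s = "want, it";

# original-style class: one separator character at a time
my @bad  = split /[,.\s+]/, $s;      # ("want", "", "it") -- empty 'word' in the middle
# \s for whitespace, more punctuation, and + so a run of separators splits once
my @good = split /[,.;:!?\s]+/, $s;  # ("want", "it")
```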

    Check out the length function.

    What is unexpected about the behavior of apostrophes?

    After Compline,
    Zaxo

      Are you familiar with Perl hashes?

      getting there

      That can be applied to a wordcount by just incrementing the value keyed by each word you see

      I planned on needing to use a hash here, but it was the creation and filling of that hash that had me stuck...actually it was just the filling of it, they're pretty easy to create :)

      Your split doesn't do exactly what you want

      ah...good suggestions, I'll go try them out

      Check out the length function.

      I did earlier in my attempts, but I didn't get very usable results. It told me every word was 1 byte long regardless of actual length. Probably not using it right.

      What is unexpected about the behavior of apostrophes?

      well, I consider a word like dog's, or cat's a 4 letter word, but my script doesn't seem to.

        well, I consider a word like dog's, or cat's a 4 letter word, but my script doesn't seem to

        Nope, it sure doesn't. Consider your code that picks out the 4 letter words:

        push @gt_three_char_words, $_ if /[a-zA-Z]{4,}/;

        What types of characters "count" when they're counted this way?
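(To spell out the hint: only letters match [a-zA-Z], so the apostrophe in dog's breaks the run after three letters. A made-up comparison:)

```perl
use strict;
use warnings;

my @words = ("dog's", "cat's", "tree");

# four consecutive letters required; the apostrophe interrupts the run
my @letters_only = grep { /[a-zA-Z]{4,}/ } @words;     # just ("tree")

# letting the apostrophe count as a word character picks up the others too
my @with_apostrophe = grep { /[a-zA-Z']{4,}/ } @words;
```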

        Gary Blackburn
        Trained Killer

Re: tutelage needed
by pg (Canon) on Jan 01, 2004 at 01:31 UTC

    lc() does not modify the string you pass in; what you should do is assign what lc() returns back to $big_string.

    If you add "use warnings" at the beginning of your script, you will see:

    Useless use of lc in void context at foo line bar.

    Update:

    Just to add one piece of sample code:

    use strict;
    use warnings;

    my $a = "AbCdEfG";
    $a = lc($a);
    print $a;

    Update 2:

    How about "use strict", and declare your variables.

    So you expect your file to be a one-liner, don't you? You only read one line from it.

      hmmmm...I had tried it that way, but it didn't work. May have been a compound error. I'll go back and try again.

      Also will turn warnings back on...thanks.

      UPDATE - two folks have mentioned the one line thing, but when I ran the script against a couple pages of text I got a list of a couple hundred words of 4 or more characters. Doesn't that s switch take care of the one line only trouble?
Re: tutelage needed
by jweed (Chaplain) on Jan 01, 2004 at 02:15 UTC

    Okay, a couple of things:

    1 - Do you expect your file to be one long line? If not, you need to "slurp" the file, rather than doing what you're doing now (reading only the first line). Try $big_string = do {local $/; <TEXTFILE>};.

    2 - Your first substitution statement does not require the outside capturing parens. It's just noisy.

    3 - I wouldn't sort the array before doing the frequency count, as it just takes time for little gain.

    4 - Finally, in response to your last question about how to actually do the count, I have a few suggestions. A useful idiom is, for each word, to say $hash{$word}++. It creates an entry if the word has not been seen before, and increments it if it has. Use a for loop or a map statement to construct the hash. In the end, use a sort with a routine which sorts by the entries in the hash and (optionally) afterwards ascii-betically by the actual words.
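A minimal sketch of point 4 (sample words invented):

```perl
use strict;
use warnings;

my %hash;
$hash{$_}++ for qw(pear plum pear kiwi plum pear);   # the $hash{$word}++ idiom

# by count, descending; ties broken ascii-betically by the word itself
my @sorted = sort { $hash{$b} <=> $hash{$a} or $a cmp $b } keys %hash;
```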


    Hope that helped!



    Who is Kayser Söze?
    Code is (almost) always untested.
      1- I thought the s modifier treated the string as one line. Am I reading the meaning of that wrong?

      2 - oh, cool...thanks. That's one of those cases where I start writing a regex, and tweak it repeatedly until it works...but then since, by some miracle, it does work I am reluctant to tweak it further :)

      3- yea - I wrote that line to see if I could, knowing I might need it a little later.

      4- I've seen that form before, but I'm gonna try to figure out how to implement it. I have a map statement example here in one of my books that kinda is making sense to me. I'll give it a try.

      thanks!
        I don't see an s modifier anywhere. Am I missing something?
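(For what it's worth, the /s modifier only changes how `.` behaves inside a regex; it has nothing to do with how many lines <TEXTFILE> reads. A made-up illustration:)

```perl
use strict;
use warnings;

my $text = "line one\nline two";

# without /s, . refuses to match the newline, so this match fails
my ($without) = $text =~ /one(.*)two/;
# with /s, . matches the newline as well
my ($with)    = $text =~ /one(.*)two/s;
```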


        Who is Kayser Söze?
        Code is (almost) always untested.
Re: tutelage needed
by toma (Vicar) on Jan 01, 2004 at 17:54 UTC
    To find programs that already do what you want, search for a 'concordance' generator. I have a few of these lying around, since I assigned this as homework. None are production quality, but I can post one if you would like.

    In order to remove word endings, find the stem of each word. This is called 'stemming'.

    There are modules for each of these on CPAN: Lingua::Stem and WordNet::QueryData, although the WordNet module is overkill for what you need.

    It should work perfectly the first time! - toma
Re: tutelage needed
by injunjoel (Priest) on Jan 01, 2004 at 23:23 UTC
    Greetings all,
    Many good comments here already; I thought I might give this one a shot. Here is the methodology I would try.
    1. Create a hash keyed by each of the words in your file; the values will be a count of how many times each word (key) appears.
    2. Test that you successfully open your file.
    3. Once opened read the lines of the file one at a time with a while(<FILEHANDLE>){ #logic } loop.
    4. Lowercase all the characters in the line.
    5. With each line, replace all the non-word characters with a single space (in case someone did not add a space after a period or between commas); this could also be where you deal with your apostrophes.
    6. Split the line based on word boundaries (\b, I think, is the regex metacharacter).
    7. Go through the split list word by word. If a word is longer than four characters and is already defined in the hash, ++ the hash element keyed by that word; otherwise add the key to the hash and initialize its value to one.
    8. Once all the lines are done, sort the hash based on the values. The sort keys question thread is a good discussion of how you can do that.
    9. Print the top ten.
    10. Marvel at the power of perl.
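The ten steps above might be sketched like this (the input lines and filename handling are invented for illustration, and whitespace splitting is used after step 5 rather than \b):

```perl
use strict;
use warnings;

# stand-in for reading lines from an opened file (steps 2-3)
my @lines = ("The quick brown fox's tail.\n",
             "Quick foxes jump over lazy brown dogs, quick!\n");

my %count;                                  # step 1: word => number of occurrences
for my $line (@lines) {
    $line = lc $line;                       # step 4: lowercase everything
    $line =~ s/[^a-z']+/ /g;                # step 5: non-word chars -> space (keeps ')
    for my $word (split ' ', $line) {       # step 6: break the line into words
        $count{$word}++ if length($word) > 4;   # step 7: count the long words
    }
}

# steps 8-9: sort by count and print the top ten
my @top = (sort { $count{$b} <=> $count{$a} } keys %count)[0 .. 9];
print "$_: $count{$_}\n" for grep { defined } @top;
```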
      Awesome stuff, and much help and idea fodder. I will try some of them out as soon as I can. I followed the sort keys link just now and made use of some info there. Thanks!
