Estimating Vocabulary

by YuckFoo (Abbot)
on Mar 27, 2002 at 02:59 UTC ( #154571=CUFP )

The little woman was concerned about the size of our youngster's vocabulary. I told her 'Relax, he knows thousands of words'. But how many does he really know? Of course a Perl program can help estimate.

This program prints a random sample of the dictionary. You count how many words of the sample are known and multiply that count by the multiplier shown to estimate the size of the vocabulary. I found it easiest to redirect the output to a file, use vi to delete all the unknown words, and count what's left over.

On four runs of 64 word samples, I gave my boy credit for knowing 19, 18, 19 and 17. 46239 words in my dictionary gave me a multiplier of 722.5. 18 * 722.5 is about 13000 words.
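The arithmetic above can be checked with a few lines of Perl. This is only a sketch using the figures quoted in the post; averaging the four runs (rather than using a round 18) is an assumption made here, and the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Figures quoted in the post above.
my $dict_size   = 46239;             # words in the dictionary
my $sample_size = 64;                # words printed per run
my @known       = (19, 18, 19, 17);  # words known in each of the four runs

# Multiplier: how many dictionary words each sample word stands for.
my $mult = $dict_size / $sample_size;

# Assumption: average the runs rather than picking a single round figure.
my $sum = 0;
$sum += $_ for @known;
my $avg = $sum / @known;

printf "multiplier: %.1f\n", $mult;
printf "average known per sample: %.2f\n", $avg;
printf "estimated vocabulary: %.0f words\n", $avg * $mult;
```

Averaging the runs gives 18.25 known words per sample, which nudges the estimate slightly above the round figure of 13000.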

Now I think I'll go give wifey the test, see if she's satisfied with her score. :)

The code is nothing spectacular, anybody here could write it, and I'm sure many could make it a one-liner, but I think it's a CUFP nonetheless.


#!/usr/bin/perl

use strict;

my $DICT = '/usr/dict/words';

my ($num) = @ARGV;
my ($i, $total, $mult, @words);

$num ||= 100;

if (!open(IN, $DICT)) {
    print "\n$! - $DICT\n\n";
    exit;
}

$total = @words = <IN>;
$mult = $total / $num;

for $i (1..$num) {
    print splice(@words, rand(@words), 1);
}

print "$total words in dictionary.\n";
print "Multiply number of words known in this list by $mult.\n";

Re: Estimating Vocabulary
by belg4mit (Prior) on Mar 27, 2002 at 03:36 UTC
    Well, I suppose that depends on your definition of "word". am, are, is, was - are these each words? Also, IIRC the English language is purported to have a lexicon on the order of 320,000 words*. The average American vocabulary has been in steady decline since the early twentieth century, at which point I believe it was on the order of several thousand words*. A few things to consider:
  • dictionaries may contain archaic forms
  • does your dictionary contain proper nouns? do you care?
  • the content of the language is not evenly distributed across the lexicon, e.g. a single word (sans modifiers) for "love" and a plethora for shades of blue.
  • * I shall attempt to find evidence to support this. An enlightening thread, but then again it is usenet... Apparently this is a pretty hotly contested topic.

    perl -pe "s/\b;([st])/'\1/mg"

      Good points all, belg4mit.

      * If the sample is large enough, the correct percentage of archaic words will be in the sample; it'll work itself out.

      * I had already removed proper nouns, that is, any words containing an uppercase letter. I should have noted that, but again I'm not sure it matters with a large enough sample.

      * I'm not sure how words should really be counted, still looking for a reference myself. For my purpose, I am considering run, runs, ran, running as unique words.
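Dropping the capitalized entries mentioned in the list above might look like the sketch below. The sub name drop_capitalized and the tiny word list are made up for illustration; the post does not show how the filtering was actually done.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep only entries with no uppercase ASCII letter, which drops proper
# nouns (and acronyms) from a word list.
sub drop_capitalized {
    my @words = @_;
    return grep { !/[A-Z]/ } @words;
}

# Small made-up list standing in for the real dictionary file.
my @sample   = ("apple\n", "Boston\n", "run\n", "NASA\n", "zebra\n");
my @filtered = drop_capitalized(@sample);
print @filtered;
```

The same grep could be applied to the lines read from $DICT before the multiplier is computed, so that the total only counts the words that can actually be sampled.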

      I'm just looking for a ballpark number. It seems like a good ballpark to me that if the boy consistently knows 20-25% of the words in the sample, he should know 20-25% of the words in $DICT.
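How rough the ballpark is can be sketched with the usual standard error of a sample proportion. This assumes the sample is a simple random draw and ignores the finite size of the dictionary; the figures are the 18-of-64 result and the 46239-word dictionary from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Figures from the original post: 18 of 64 sampled words known,
# 46239 words in the dictionary.
my ($known, $n, $dict_size) = (18, 64, 46239);

my $p  = $known / $n;                # sample proportion known
my $se = sqrt($p * (1 - $p) / $n);   # standard error of that proportion

printf "proportion known: %.3f +/- %.3f\n", $p, $se;
printf "estimate: %.0f words, give or take about %.0f\n",
    $p * $dict_size, $se * $dict_size;
```

With a 64-word sample the give-or-take comes out at roughly 2,600 words, so a larger sample, or pooling several runs, tightens the estimate.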

      If anyone has pointers to real vocabulary development numbers and counting methods, I'd like to get 'em.


Re: Estimating Vocabulary
by belg4mit (Prior) on Mar 27, 2002 at 07:21 UTC
    Well, here's an alternative. It is a complete waste of cycles, as its running time scales linearly with the number of words returned (each word costs a full pass over the file); OTOH it does not need to hold the whole dictionary in memory. As is, it can also return duplicates, yada yada yada.
    my(@lines, $line);
    open(FILE, shift) || die;
    until( scalar @lines == $ARGV[0] ){
        seek(FILE, 0, $. = 0);
        rand($.) < 1 && ($line = $_) while <FILE>;
        push(@lines, $line);
    }
    print @lines, "wc -l could have told you this is $. words\n";
    It's based on "How do I select a random line from a file?" in perlfaq5. I'd be interested in seeing if anybody else has a better means of extending this algorithm to report multiple entries.
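One way to extend the one-line trick to several lines is reservoir sampling with a reservoir of k slots. The sketch below is not from the thread; the sub name sample_lines and the command-line handling are assumptions for illustration. It makes a single pass over the file and cannot return the same line twice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reservoir sampling: pick $k lines uniformly at random, without
# duplicates, in a single pass over the filehandle $fh.
sub sample_lines {
    my ($fh, $k) = @_;
    my @reservoir;
    my $seen = 0;
    while (my $line = <$fh>) {
        $seen++;
        if (@reservoir < $k) {
            push @reservoir, $line;              # fill the first $k slots
        }
        else {
            my $j = int rand $seen;              # 0 .. $seen-1
            $reservoir[$j] = $line if $j < $k;   # keep with probability $k/$seen
        }
    }
    return ($seen, @reservoir);
}

# Hypothetical usage: perl sample.pl /usr/dict/words 64
if (@ARGV) {
    my ($file, $k) = @ARGV;
    $k ||= 64;
    open(my $fh, '<', $file) or die "$file: $!\n";
    my ($total, @picked) = sample_lines($fh, $k);
    close $fh;
    print @picked;
    print "$total lines read\n";
}
```

Memory use depends only on k, not on the size of the dictionary, and the line count comes for free from the same pass.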

    perl -pe "s/\b;([st])/'\1/mg"

      my(@lines, $line);
      open(FILE, shift) || die;
      1 while <FILE>;
      $line=$.;
      seek(FILE, 0, $. = 0);
      rand($line-$.) < $ARGV[0]-@lines && push(@lines,$_) while <FILE>;
      print @lines, "wc -l could have told you this is $. words\n";

        WAS: That does not appear to work, I ask for one line and get 13-18 lines... It is also heavily weighted towards the Zs

        perl -pe "s/\b;([st])/'\1/mg"
