literati cheat / finding words from scrambled letters

sulfericacid has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: literati cheat / finding words from scrambled letters by blokhead (Monsignor) on Nov 20, 2005 at 22:34 UTC
My concern is this. With 7 or more characters, the possible letter combinations are in the thousands.. and then I have to check each combination to verify that it's a word. I'd like to do this using CGI but I fear I won't be able to get it fast enough where it won't time out. You're thinking about this in the wrong direction. If you want to preprocess your dictionary like this (start with a word and then check whether any of its permutations are words), it will take a very long time. In fact, you will be checking the dictionary for every possible combination of letters. Since there are far fewer words in the dictionary than possible combinations of letters, you should look in the other direction: for each word in the dictionary, normalize it (to its sorted form), and then store it in a hash keyed on the normalized version. `my @words = do { open my $fh, "<", "/usr/dict/words" or die; <$fh>; }; chomp @words; sub normalize { join "", sort split //, $_[0]; } my %dict; push @{ $dict{normalize($_)} }, $_ for @words;` [download] This will give you a structure %dict that looks like this: `abd => ["bad", "dab"], aepr => ["pear", "reap", "rape"] ...` [download] Then if you want to find the words that are permutations of $foo, you look at the arrayref $dict{normalize($foo)}. You should also save this processed dictionary file somewhere, so you don't have to do all this work twice. Is map the best way of going about this? I am not too familiar with map and this ability. And I really don't understand the logic of how to use the anagrams or why it would be faster. Map is just a looping construct like foreach, while, etc.. It can't do anything magical that cannot be done with another type of programming construct. You can write both inefficient and efficient code with map. It will not write your algorithms for you. If you don't understand it, don't use it. blokhead	[reply] [d/l] [select]
Re^2: literati cheat / finding words from scrambled letters by swampyankee (Parson) on Nov 21, 2005 at 15:45 UTC
You've been reading Jon Bentley's Programming Pearls, haven't you? emc	[reply]
Re: literati cheat / finding words from scrambled letters by gryphon (Abbot) on Nov 20, 2005 at 22:50 UTC
Greetings sulfericacid, After only a brief bit of thought, two options come to mind: the "top down" and "bottom up" methods (thus named because I can't think of a better name right now). The "top down" approached would be to write a script that: Accepts the up-to-7-character as input Rearranges the letters to create a list of all possible combinations Compares each combination to a dictionary file Where there's a match, print it That may not be the fastest script to run, but it would work, and it would be fairly easy to code and very easy to setup (just a single script and a dictionary file). However, the "bottom up" approach would probably be faster to run. In this approach, you do all the work up front: Write a "setup" script to read a dictionary file For each word in the file, sort the letters and store the results in a hash of arrays where the key is the sorted letters and the value is an arrayref that you push the original word on to `# Kind of like this... push( @{ $data{ join('', sort split(//, $word)) } }, $word );` [download] Save the monsterous hash somewhere; you could stick it in a simple database, XML file, CSV, or whatever; maybe a BerkleyDB would be a good idea Then write "search" script that accepts the up-to-7-character as input Sort the letters and seek that key Now just print out the array's contents `# Kind of like this... print join("\n", @{ $data{ join('', sort split(//, $input)) } }), "\n" +;` [download] The second option would be more difficult to implement, and it would require some setup time to run, but the "production" version would be faster (I'm guessing) than the "top down" method for two reasons: During a run, the first option has to create and search against every possible combination of letters. For larger inputs (those approaching 7), this could be slow. The second option only has to sort the input and lookup the value; thus it's very fast. I hope this helps out. gryphon Whitepages.com Development Manager (DSMS) code('Perl') \|\| die;	[reply] [d/l] [select]
Re: literati cheat / finding words from scrambled letters by GrandFather (Saint) on Nov 20, 2005 at 23:08 UTC
This may get you started: `use strict; #Load the dictionary my @words; my %dict; push @words, split ' ', lc while <DATA>; map {push @{$dict{join '', sort split '', lc}}, lc } @words; # "Generate" a letter set my $letters = join '', sort split '', $words[rand @words]; print "$letters\n"; # Find all the matching words print join ', ', @{$dict{$letters}};` [download] Read more... Sample dictionary (992 Bytes) DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re: literati cheat / finding words from scrambled letters by TedPride (Priest) on Nov 21, 2005 at 04:36 UTC
What you want is a linear search, one that checks each word against the set of letters only once, without requiring that you store either the entire dictionary or the entire set of letter combinations in memory at once. The following isn't set up to use strict (I never did find out how to use globals without setting off strict), but it does perform as required: use Benchmark; $letters = 'chunzcii'; $lhash{$_}++ for split //, $letters; $lcount = length($letters); $t0 = new Benchmark; open ($handle, 'dictionary.dat'); while (<$handle>) { chomp; $w++ if scrabble($_); $c++; } close ($handle); $t1 = new Benchmark; $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n"; print "$w matches from $c words"; sub scrabble { return 0 if length($_[0]) > $lcount; my %wlhash; $wlhash{$_}++ for split //, $_[0]; for (keys %wlhash) { return 0 if $lhash{$_} < $wlhash{$_} && ($nf += $wlhash{$_} - $lhash{$_}) > $blanks; } return 1; } [download] Using this, I ran through a 200,000-word file in around a second, using almost no memory in the process. And I'm sure someone imaginative will be able to significantly improve on the efficiency of the function and file reads. EDIT: Actually, Benchmark says I used 3.93 seconds (124 matches out of 201252 words).Then I tried a different algorithm, which took 2.32 seconds: use Benchmark; $letters = 'chunzcii'; $sorted = join '', sort split //, $letters; $lcount = length($letters); $t0 = new Benchmark; open ($handle, 'dictionary.dat'); while (<$handle>) { chomp; $w++ if scrabble($_); $c++; } close ($handle); $t1 = new Benchmark; $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n"; print "$w matches from $c words"; sub scrabble { return 0 if length($_[0]) > $lcount; $p = 0; for (sort split //, $_[0]) { return 0 if !($p = index($sorted, $_, $p) + 1); } return 1; } [download] Incidently, loading all 200,000+ words into an array and then cycling through them increased the time to 3.08 seconds, so not only is this more wasteful of memory, but it seems to be less efficient as well. EDIT: Oh wait, you need support for blanks. Guess the array method is out, since hashing is much more suited for this. 4.27 seconds for 1240 matches from 201252 words, with one blank: use Benchmark; $letters = 'chuncii_'; $lcount = length($letters); while ($letters =~ /[^a-z]/) { $letters =~ s/[^a-z]//; $blanks++; } $lhash{$_}++ for split //, $letters; $t0 = new Benchmark; open ($handle, 'dictionary.dat'); while (<$handle>) { chomp; $w++ if scrabble($_); $c++; } close ($handle); $t1 = new Benchmark; $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n"; print "$w matches from $c words"; sub scrabble { return 0 if length($_[0]) > $lcount; my %wlhash; $wlhash{$_}++ for split //, $_[0]; $nf = 0; for (keys %wlhash) { return 0 if $lhash{$_} < $wlhash{$_} && ($nf += $wlhash{$_} - $lhash{$_}) > $blanks; } return 1; } [download] And 4.53 seconds for 10230 matches from 201252 words, with two blanks. And 4.98 seconds for 65472 matches from 201252 words, with four blanks. And 5.07 seconds for 135718 matches from 201252 words, with all eight letters blanks. EDIT: And sulfericacid couldn't figure out how to make it print the matches, so here's yet one more version. Note that this version requires storing all matches in memory at once, since I'm having them sorted by length and then alphabetically. Now I should probably contact Yahoo and tell them you're cheating... $letters = 'chuncii_'; $lcount = length($letters); while ($letters =~ /[^a-z]/) { $letters =~ s/[^a-z]//; $blanks++; } $lhash{$_}++ for split //, $letters; open ($handle, 'dictionary2.dat'); while (<$handle>) { chomp; push @matches, $_ if scrabble($_); } close ($handle); print join "\n", sort { length($b) <=> length($a) \|\| $a cmp $b } @matc +hes; sub scrabble { return 0 if length($_[0]) > $lcount; my %wlhash; $wlhash{$_}++ for split //, $_[0]; $nf = 0; for (keys %wlhash) { return 0 if $lhash{$_} < $wlhash{$_} && ($nf += $wlhash{$_} - $lhash{$_}) > $blanks; } return 1; } [download] Returned (from a small dictionary sub-set): `zucchini acacia couch lunch kick zinc cut did hen hit ice kin nip sun ugh pi a` [download] EDIT: I also have a use strict / warnings version up on my scratchpad now. Though I can't guarantee it will be there forever.	[reply] [d/l] [select]
Re: literati cheat / finding words from scrambled letters by PerlingTheUK (Hermit) on Nov 20, 2005 at 22:49 UTC
In general blochead has probably coded the basics. I understand the actual request is a complete script execution each time reading the dictionary instead of accessing an existing hash, due to server timeout restrictions. If you still cannot afford the cpu time for reading a complete dictionary file, you might want to generate a lookup table to only access the relevant lines in a sorted dictionary file. So if someone looks for "pear*" you only read the lines containing e. g. "ea" anagrams and do not bother reading the whole file. If this is still insuffuciently quick for e. g. two letter lookups, you migh log anagrams that cannot be executed and based on this information extend the lookups for those results returning too many letters. Cheers, PerlingTheUK	[reply]
Re: literati cheat / finding words from scrambled letters by inman (Curate) on Nov 21, 2005 at 18:19 UTC
This uses a slightly different technique that doesn't normalise anything. The original word is read of the command line and its letters are used to create a character class based regex. The words in the dictionary files are processed one by one and matched against the regex. This acts as a filter to exclude any word that is not made up from letters in the original word with an optional 'extra letter'. Words that pass the first filter are frequency checked to validate them for the overuse of letters before they are output. The frequency checking allows a repeat of one of the original letters for the wildcard. `use strict; use warnings; my $data = shift @ARGV; my $regex = qr /^([$data])([^$data]?)([$data])$/; my %letterfrequency; $letterfrequency{$_}++ foreach split //, $data; OUTER: while (chomp(my $word = <>)) { next unless $word =~ /$regex/i; my %frequency; my $repeat = $2 ? 0 : 1; foreach (split //, $1.$3) { if (++$frequency{$_} > $letterfrequency{$_}) { next OUTER unless $repeat --; } } print "$word\n"; }` [download] This worked 'pretty quickly' (TM) against a 200,000 line dictionary.	[reply] [d/l]
Re: literati cheat / finding words from scrambled letters by Roy Johnson (Monsignor) on Nov 21, 2005 at 18:52 UTC
First: if you are given 7 letters, do you want all words of less-than-or-equal-to 7 letters, or just the 7 letter words? I'm guessing all. The first thing to do is to canonicalize your dictionary by alphabetizing the words. For speed reasons, it would make sense to exclude any words longer than 7 letters, and to output the canonicalized dictionary into multiple files, based on word length. Each line would have the canonicalized version and then all the words that it represents. The next thing to do is to generate all the subsets (the powerset) with two or more members, of the set of characters you've been given. Blanks have to be handled specially: given blank-and-six-letters, you'd generate the powerset of a-plus-those-six-letters, b-plus-those-six-letters, etc. For each member of your powerset, search the appropriate file. Of course, there's nothing Perl-specific about any of this. Caution: Contents may have been coded under pressure.	[reply]


Pathologically Eclectic Rubbish Lister
	PerlMonks