http://qs321.pair.com?node_id=143364


in reply to Add-A-Gram Performance

A word challenge! I love word challenges! :)

First, I'll second grep's suggestion to indent your code properly. The lack of indenting discouraged me from understanding your code, and therefore from providing in-depth comments on your code. :(

The most important optimization is to realize that anagrams only need to be tested once, because it's the set of letters that matters, not their order. For example, starting with 'topsy', if you drop the 'y' the next word could be 'post', or 'stop', or 'tops', etc., but you only need to try one of those words.

I implemented this by putting the words in a hash where the key is the sorted letters in a word. For each set of letters, I keep only the first word found, just to save memory. I also push each set of letters onto an array, grouped by length.

Once that's done, I can start looking for add-a-grams, beginning with the longest words. The first step is to set up a stack of attempted solutions. Then, in a loop, I pop an attempt off the stack and try to branch the last set of letters. Testing each branch is as simple and as fast as looking for the set of letters in the hash. Testing the few possible branches and looking for a matching word is much, much faster than testing all the words and looking for a matching branch. (Okay, so maybe that's the most important optimization. :)

For each match, push an attempt onto the stack. I stuck in an additional optimization here; if none of the branches work, I remove the original set of letters from the stack. That way I won't waste time trying the same dead end later.

Once an attempt gets down to three letters, I print the solution. I continue finding solutions at the same length. However, I don't try the shorter lengths after that, since the object is to find the longest possible solution.

My script ran in a few minutes and found 6640 add-a-grams with a word of 16 letters. I'm sure the code could be cleaned up, but I should be getting to bed.

#!/usr/local/bin/perl -w use strict; $| = 1; use Getopt::Std; use vars qw($opt_w); getopts('w:') or die "usage: words.pl [-w <wordlist>] "; my $wordlist = $opt_w || 'wordlist'; open(DICT, $wordlist) or # open word list die "Unable to open $wordlist: $!\n"; my(%words, @letters); while (<DICT>) { chomp; my $letters = join '', sort split //; if (not exists $words{$letters}) { $words{$letters} = [$_, 1]; push @{ $letters[length] }, $letters; } else { # just for fun, keep track of the word count for each set of l +etters $words{$letters}[1]++; } } my(@solutions, @stack, $found); my $i = $#letters; while ($i > 3 and !$found) { print "Trying length $i.\n"; # try each word at this length for my $letters (@{ $letters[$i] }) { @stack = [ $letters ]; while (@stack) { # advance the next attempt my $test = pop @stack; if (length $test->[-1] == 3) { # found a solution! # note it, print it, go back to look for more $found = 1; print_solution($test); next; } # look for branches from the last set of letters my $continue; for my $branch (branch($test->[-1])) { if (exists $words{$branch}) { # found a branch! # note it and push it on the stack $continue = 1; push @stack, [ @$test, $branch ]; } } if (!$continue) { # no branches found; prune the dead end delete $words{ $test->[-1] }; } } } # all done with this length; go one shorter $i--; } # given a set of letters, return all possible branches # (set minus one letter) sub branch { my($letters) = @_; # get all possible branches my @branches = map { my $x = $letters; substr($x, $_, 1) = ''; $x +} 0 .. length($letters) - 1; # remove duplicate branches my %uniq; @uniq{@branches} = (); return sort keys %uniq; } sub print_solution { my($solution) = @_; for my $letters (@$solution) { print "$words{$letters}[0] ($words{$letters}[1]) "; } print "\n"; } __END__

Replies are listed 'Best First'.
Re: Re: Add-A-Gram Performance
by smgfc (Monk) on Feb 05, 2002 at 06:30 UTC
    Thanks alot. I took most of grep's issues to heart, but extreme incompitance prevented me from realizing that when you pass values to a subroutine they are passed by reference so passing a reference to $l and %wordspl would not really slow anything down. It will take a little time for me to really parse your code, but thanks alot. I did post a better formatted and documented version of my program as a response to grep, if you want to have a look at it. Thanks again!