Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

OT: How to find anagrams?

by Cody Pendant (Prior)
on Apr 27, 2004 at 23:21 UTC ( [id://348668]=perlquestion: print w/replies, xml ) Need Help??

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I don't think I'd have trouble creating a perl script which found one-word anagrams for one-word input -- it seems fairly simple to find that CLEANSE is an anagram of SCALENE.

But how do you find anagrams when you're allowed multiple words in the outcome? Finding, that is, that CLEANSE is an anagram of "SEE CLAN", "SCAN EEL", "LES CANE", "SANE CEL" and so on and so on?

I know there are online services which do that kind of thing, but, I'm not much good on theory, isn't this one of those problems which turns into an n2, n3, n4 type problem, with n as a big number to start with?



($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
=~y~b-v~a-z~s; print

Replies are listed 'Best First'.
Re: OT: How to find anagrams?
by hv (Prior) on Apr 28, 2004 at 00:35 UTC

    One interesting approach I've considered, but never implemented, is to assign a prime number to each letter, and then hash the target word to the product of the values of its letters. The hash function for any word then divides the target product if and only if the letters of the word are a partial anagram - tricksy things like repeating letters are taken care of automatically. Then you just search for combinations of words whose hashes multiply together to give the target value; and if you calculate the hashes over the whole dictionary in advance and store in sorted order I think this could give pretty good performance.

    For the standard 26 letter alphabet you'll need primes 2 .. 101, and you're going to need BigInts which will slow things down a bit (but not too badly if you use it with a fast maths package such as Math::Pari or Math::GMP). You can reduce the size of the numbers a bit further by assigning the lowest primes to the most frequently occurring letters.

    Hugo

Re: OT: How to find anagrams?
by dragonchild (Archbishop) on Apr 27, 2004 at 23:31 UTC
    The brute force algorithm has O( n! ). However, one can take advantage of a property of anagrams which is that all anagrams have the same letters in them and can be compared for equality when you sort the letters.

    Now, you're asking "How on earth can I do that with multiple words??" Well, you take advantage of another factor which is that anagrams have to have the same number of letters. So, if you want to find all words which can be made up of the letters in another word, you can restrict your search space to words whose letters add up to the right number.

    So, in your example, you have a 7-letter word whose sorted letters are ACEELNS.

    1. Take all 7 letter words in the dictionary and sort them. Then, compare each one to the sorted string.
    2. Take all 6 letter words and 1 letter words. Do the same with each pairing.
    3. Take all 5 letter words and 2 letter words. Do the same.
    4. Take all 5 letter words and doubles of 1 letter words. Do the same.

    Keep doing that for every combination numbers that add up to 7.

    Now, this doesn't actually reduce the O() factor of the problem. Buuutt ... it does bring the actual runtime of the problem down to where modern computers can do it in reasonable time.

    There are further optimizations to the algorithm, like taking advantage of letter frequencies and the like, but those are left as an exercise for the reader.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

Re: OT: How to find anagrams?
by davido (Cardinal) on Apr 28, 2004 at 00:20 UTC
    If you're permitted to make use of a prefab codebase you can turn to CPAN. I found words. It is a part of the Bundle::PPT distribution (also see ppt-0.12).

    The brief description for words is "find words which can be made from a string of letters."

    The code for 'words' is all pure Perl, and having looked it over, it doesn't seem terribly complex. You might at least use it as a startingpoint.


    Dave

Re: OT: How to find anagrams?
by eric256 (Parson) on Apr 27, 2004 at 23:50 UTC

    You could further restrict the search to only words that contain those and only those letters. Thats a fairly quick regex with a character class and you've significantly reduced your word sets. MySQL (if you have your database in that) will even let you use regex in a query although i don't know if that will be faster or slower than doing them on your end.

    I had a program once that took a set of rules and found all words that matched them, it could normaly be done in a single SQL statment but adding a temporary table saved a lot of time and shoehorning. :)

    What is your source of words?


    ___________
    Eric Hodges

      You could generate an SQL statement (on other DBs) such as...

      SELECT word FROM words WHERE length(word) = N AND word like '%A%A%' (if there are 2 'A' in the orig) AND word like '%B%' (if there is 1 'B' in the orig) ... ...
      to get all of your candidates.

      Of course, this assumes that your word list is in a DB.

Re: OT: How to find anagrams?
by pizza_milkshake (Monk) on Apr 28, 2004 at 02:58 UTC
    not the prettiest code i've ever written, but it seems to do the trick.
    #!perl -wl use strict; $_ = shift @ARGV || "CLEANSE"; our (%kw, %words); my ($match, @matches); my $future = {"" => ""}; sub sortword { return join("", sort pop=~/./g) } sub dissect { my %h; map $h{$_}++, split(//,pop); return \%h } sub scrub { # make sure a word isn't longer than our # kw or has chars kw doesn't my $w = shift @_; return 0 if length($w) > $kw{"len"}; my $d = dissect($w); map { return 0 if !defined($kw{"d"}->{$_}) || $kw{"d"}->{$_} < $d->{$_} } keys %$d; return 1; } sub process { # run one iteration through %words my ($cand, @match, %future) = @_; for my $w (keys %words) { for my $c (keys %$cand) { my $len = length($cand->{$c}) + $words{$w}; next if $len > $kw{"len"}; my $nk = "$c:$w"; $future{$nk} = sortword($cand->{$c} . $w); if ($len == $kw{"len"}) { push @match, $nk if $future{$nk} eq $kw{"kw"}; delete $future{$nk}; } } } return (\@match, \%future); } %kw = ( "word" => $_ ,"len" => length($_) ,"kw" => sortword($_) ,"d" => dissect($_) ); map { chomp; $words{$_} = length() if scrub($_); } <DATA>; while (keys %$future) { ($match, $future) = process($future); push @matches, @$match; } print "anagrams for " . $kw{"word"} . ":"; map print, @matches; __DATA__ SEE SCALENE CLAN EEL BOB PIZZA SCAN MONKS PERL

    perl -e'$_="nwdd\x7F^n\x7Flm{{llql0}qs\x14";s/./chr(ord$&^30)/ge;print'

Re: OT: How to find anagrams?
by bl0rf (Pilgrim) on Apr 28, 2004 at 02:20 UTC
    isn't this one of those problems which turns into an n2, n3, n4 type problem, with n as a big number to start with?
    :-D, most problems end up that way!

    I think dragonchild has a good algorithm, I would only add that, before looping you should make a pass over your dictionary and copy to a new file/memory-location all words that are composed only of the letters ACELNS. This way you will be able to conduct consequent searches on this small subset of the whole dictionary as opposed to the whole thing. This will optimize your O() a lot ( btw O means the speed at which each operation is made, don't think about it a lot if you're not familiar with it... )

    My site
    Too cheap to make a signature
Re: OT: How to find anagrams?
by Willard B. Trophy (Hermit) on Apr 28, 2004 at 02:03 UTC
    I'd cheat, and use a command-line anagram generator written in C. The fastest one I've seen is the unfortunately named an, which makes it almost impossible to search for.

    --
    bowling trophy thieves, die!

Re: OT: How to find anagrams?
by Cody Pendant (Prior) on Apr 28, 2004 at 05:59 UTC

    Thanks for your contributions, very interesting all of them.



    ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
    =~y~b-v~a-z~s; print
Re: OT: How to find anagrams?
by talexb (Chancellor) on Apr 28, 2004 at 18:59 UTC

    Check out Permute for a useful module and a bunch of relevant links.

    You could also run an analysis on how often one letter follows another in /usr/share/dict/words and see if you can use that to skip checking certain combinations .. for example, my words file contains no words with the pattern 'lsn' .. which means you could skip any permutation that contains that pattern.

    Alex / talexb / Toronto

    Life is short: get busy!

Re: OT: How to find anagrams?
by Plankton (Vicar) on Apr 28, 2004 at 19:30 UTC
    It shouldn't be to hard to adapt Jumble solver CGI to handle multiple words. Just insert a space between each letter, permute, and test the spelling of each word in the string. I don't know if the performance would be all that great, but gmax had a fast version of this here Re: Jumble solver

    Plankton: 1% Evil, 99% Hot Gas.
      Heres a code snippet which could help you a bit ...
      didnt use it for a long time, but AFAIR it works and returns all possible permutations for the given string.
      sub DoPermuteArray { my @array = @{ $_[0] }; my @permuted; eval (&PermuteArray(['1','1','1','1','-','--'],[],\@permuted)); return @permuted; } sub PermuteArray { # taken from the perl FAQ my @items = @{ $_[0] }; my @perms = @{ $_[1] }; my $permuted = $_[2]; my @return; unless (@items) { push (@$permuted,[@perms]); } else { my(@newitems,@newperms,$i); foreach $i (0 .. $#items) { @newitems = @items; @newperms = @perms; unshift(@newperms, splice(@newitems, $i, 1)); &PermuteArray ([@newitems], [@newperms]); } } }

      Rgds.
      Gnork

      cat /dev/world | perl -e "(/(^.*? \?) 42\!/) && (print $1))"
      errors->(c)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://348668]
Front-paged by pbeckingham
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (7)
As of 2024-03-28 21:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found