Re: Re: Re: Perl's pearls

It seems like the main improvement/optimization would be not looping twice through the list of all words. Move *all* processing into the main loop:

my (%word, %gram);

while (<>) {
    chomp;

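#     $_ = lc $_;                            # uncomment to fold case first
    /[^a-z]/ and next;                       # skip words containing anything but a-z
    my $sig = pack "C*", sort unpack "C*", $_;   # signature: the word's bytes in one canonical order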

    if (exists $word{$sig}) {             

        if (exists $gram{$sig}) {
            next if $gram{$sig} =~ /\b$_\b/; 
            $gram{$sig} .= " $_";            # rare
        }
        else {
            next if $word{$sig} eq $_;
            $gram{$sig} = "$word{$sig} $_";  # rare
        }
    }
    else {
        $word{$sig} = $_;                    # mostly
    }
}

print join "\n", (sort values %gram), '';    # just output short list

Only the first word of an anagram set is in both lists.
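
To make that concrete, here is a minimal sketch of the same one-pass idea wrapped in subs so it can be run on a tiny list. The names signature and find_anagrams are my own, not from the script above, and the signature is built with split/sort/join rather than pack/unpack; it is a different string, but it should group words the same way.

```perl
use strict;
use warnings;

# Hypothetical helper: the word's letters in sorted order,
# so every anagram of a word maps to the same key.
sub signature {
    my ($word) = @_;
    return join '', sort split //, $word;
}

# Hypothetical wrapper around the one-pass loop.
# Returns (\%word, \%gram): first word seen per signature, and the anagram sets.
sub find_anagrams {
    my @words = @_;
    my (%word, %gram);
    for my $w (@words) {
        next if $w =~ /[^a-z]/;              # same filter as the script
        my $sig = signature($w);
        if (exists $word{$sig}) {
            if (exists $gram{$sig}) {
                next if $gram{$sig} =~ /\b$w\b/;   # duplicate word, already listed
                $gram{$sig} .= " $w";
            }
            else {
                next if $word{$sig} eq $w;         # duplicate of the first word
                $gram{$sig} = "$word{$sig} $w";
            }
        }
        else {
            $word{$sig} = $w;                # first word with this signature
        }
    }
    return (\%word, \%gram);
}

my ($word, $gram) = find_anagrams(qw(canoe ocean opus soup perl canoe));
print "$_\n" for sort values %$gram;               # anagram sets
print "first seen: $_\n" for sort values %$word;   # one entry per signature
```

With that input, %gram holds "canoe ocean" and "opus soup", while %word holds canoe, opus and perl: "ocean" and "soup" only ever appear in %gram, and "perl" never gets there at all.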
Here are some more finds, mostly from the short OED:

ablest bleats stable tables
adroitly dilatory idolatry
angered derange enraged grandee grenade
ascertain cartesian sectarian
asleep elapse please
aspirant partisan
attentive tentative
auctioned cautioned education
canoe ocean
comedian demoniac
compile polemic
covert vector
danger gander garden
deist diets edits idest sited tides
emits items metis mites smite times
emitter termite
lapse leaps pales peals pleas
nastily saintly
obscurantist subtractions
observe obverse verbose
opt pot top
opts post pots spot stop tops
opus soup
oy yo
petrography typographer
peripatetic precipitate
present repents serpent
presume supreme
resin rinse risen siren
salivated validates
slitting stilting tiltings titlings tlingits
views wives
vowels wolves
woodlark workload

Re: Re: Re: Re: Perl's pearls by gmax (Abbot) on Jan 02, 2002 at 20:55 UTC
Brilliant! On my computer, your script is 13% faster than mine, using my 100_000-word list. With the one that you suggested (thanks, BTW), which is more than double the size, the gain is 23%! It means that your solution is more scalable and thus better suited to this kind of task. Like every "eureka" solution, your improvement looks quite simple, now that I see it! :-) Thanks.

