Re: Finding dictionary words in a string.

contains the words "apple", "cherry" and "blossom".

Actually, there is much more ...

#!/usr/bin/perl -w
use strict;

my $search_word = shift
   or die "search word required\n";

my @words = ();
my $count = 0;
open WORDS, "/usr/share/dict/words"
     or die "can't open words file\n";
while (<WORDS>) {
    chomp;
    printf "%3d %s\n", ++$count, $_ if $search_word =~ /$_/i;
}
close WORDS;
[download]

And here is the result:

$ perl wsearch.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
  1 Al
  2 apple
  3 as
  4 ask
  5 blossom
  6 cherry
  7 err
  8 he
  9 her
 10 Herr
 11 Io
 12 Los
 13 loss
 14 mow
 15 of
 16 so
 17 we

I know it's a brute force, but it takes less than half second to give this result.

Consider that Perl RegEx engine is using a Boyer-Moore search algorithm, which is probably the fastest one available for text matching.

Comment on Re: Finding dictionary words in a string. Download Code

Replies are listed 'Best First'.

Re: Re: Finding dictionary words in a string.
by toma (Vicar) on Mar 14, 2004 at 05:58 UTC

I couldn't resist speeding it up a bit. A commonly used trick in search engines is to shift everything to upper case.

Avoiding the i flag in the regular expression roughly doubles the speed, on my machine at least.

I also used Storable to create a slightly faster-loading dictionary.

#!/usr/bin/perl -w
use strict;
use Storable;

my $search_word = uc shift
   or die "search word required\n";

my @words = ();
my @dict;
my $rdict;
my $count = 0;
if (not -e 'words.sto') {
  open WORDS, '/usr/share/dict/words'
    or die "can't open words file\n";
  while (<WORDS>) {
    chomp;
    push @dict, uc $_;
  }
  close WORDS;
  $rdict= \@dict;
  store $rdict, 'words.sto';
}
else {
  $rdict= retrieve('words.sto');
}

for (@$rdict) {
  printf "%3d %s\n", ++$count, $_ if $search_word =~ /$_/;
}
[download]

It should work perfectly the first time! - toma

[reply]
[d/l]
[select]

Re: Re: Re: Finding dictionary words in a string.

by cchampion (Curate) on Mar 14, 2004 at 08:02 UTC

I can't figure out how you got that speed improvement. The time you save by removing the /i flag is lost when you build a large array in memory (45,000 words in my machine).

Besides, do you really think an all-uppercase output is better?

$ time perl wsearch.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
  1 Al
  2 apple
  3 as
  4 ask
  5 blossom
  6 cherry
  7 err
  8 he
  9 her
 10 Herr
 11 Io
 12 Los
 13 loss
 14 mow
 15 of
 16 so
 17 we
0.30user 0.00system 0:00.30elapsed 99%CPU 

$ time perl wsearch2.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
  1 AL
  2 APPLE
  3 AS
  4 ASK
  5 BLOSSOM
  6 CHERRY
  7 ERR
  8 HE
  9 HER
 10 HERR
 11 IO
 12 LOS
 13 LOSS
 14 MOW
 15 OF
 16 SO
 17 WE
0.48user 0.00system 0:00.48elapsed 99%CPU

Of course, I ran both scripts at least 4 times, to make sure that the storable file was created and that both scripts were reading their input from the disk cache.

[reply]

Re: Re: Re: Re: Finding dictionary words in a string.

by Anonymous Monk on Mar 15, 2004 at 04:04 UTC

Code Smarter

#!/usr/bin/perl
use strict;
use warnings;
my $sw = lc shift
   or die "search word required\n";

my $count = 0;
open WORDS, "/usr/share/dict/words"
     or die "can't open words file\n";
while (<WORDS>) {
    chomp;
    printf "%3d %s\n", ++$count, $_ if index($sw,lc $_) > -1;
}
close WORDS;
[download]

[reply]
[d/l]

Re: Re: Re: Finding dictionary words in a string.

by toma (Vicar) on Mar 14, 2004 at 18:28 UTC

Even on a single machine, the results depend on the operating system kernel, and most importantly, with the build of perl. My vendor perl is an unusual build, and my own build is more tuned to my machine.

The kernel I have been running, 2.4.22-6.ll.rh80, is an experimental low-latency kernel used for real-time applications. The file system is ext3.

Timing, seconds Kernel 2.4.22-6.ll.rh80 Kernel 2.4.18-14

Vendor perl, loop 2.938 2.569

My perl, loop 1.957 1.677

Vendor perl, Storable, no /i 1.267 0.966

My perl, Storable, no /i 0.687 0.567

If you prefer lower case to make the output look better, just use the lc function instead of the uc function.

I think the lesson is that performance improvements are often difficult to generalize, so you have to try it on your own system and see what works.

UPDATE: The Code Smarter suggestion is very good. For the Storable code, timings for the vendor perl on 2.4.22-6.ll.rh80 the timings are down to about 0.21 seconds and with my perl they are at about 0.13 seconds - too fast for my primitive benchmark.

It should work perfectly the first time! - toma

[reply]
[d/l]
[select]


Perl: the Markov chain saw
	PerlMonks