Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Re: Finding dictionary words in a string.

by Caron (Friar)
on Mar 13, 2004 at 06:58 UTC ( [id://336339]=note: print w/replies, xml ) Need Help??


in reply to Finding dictionary words in a string.

contains the words "apple", "cherry" and "blossom".

Actually, there is much more ...

#!/usr/bin/perl -w use strict; my $search_word = shift or die "search word required\n"; my @words = (); my $count = 0; open WORDS, "/usr/share/dict/words" or die "can't open words file\n"; while (<WORDS>) { chomp; printf "%3d %s\n", ++$count, $_ if $search_word =~ /$_/i; } close WORDS;

And here is the result:

$ perl wsearch.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
  1 Al
  2 apple
  3 as
  4 ask
  5 blossom
  6 cherry
  7 err
  8 he
  9 her
 10 Herr
 11 Io
 12 Los
 13 loss
 14 mow
 15 of
 16 so
 17 we

I know it's a brute force, but it takes less than half second to give this result.

Consider that Perl RegEx engine is using a Boyer-Moore search algorithm, which is probably the fastest one available for text matching.

Replies are listed 'Best First'.
Re: Re: Finding dictionary words in a string.
by toma (Vicar) on Mar 14, 2004 at 05:58 UTC
    This is a great little program!

    I couldn't resist speeding it up a bit. A commonly used trick in search engines is to shift everything to upper case.

    Avoiding the i flag in the regular expression roughly doubles the speed, on my machine at least.

    I also used Storable to create a slightly faster-loading dictionary.

    #!/usr/bin/perl -w use strict; use Storable; my $search_word = uc shift or die "search word required\n"; my @words = (); my @dict; my $rdict; my $count = 0; if (not -e 'words.sto') { open WORDS, '/usr/share/dict/words' or die "can't open words file\n"; while (<WORDS>) { chomp; push @dict, uc $_; } close WORDS; $rdict= \@dict; store $rdict, 'words.sto'; } else { $rdict= retrieve('words.sto'); } for (@$rdict) { printf "%3d %s\n", ++$count, $_ if $search_word =~ /$_/; }
    It should work perfectly the first time! - toma

      I can't figure out how you got that speed improvement. The time you save by removing the /i flag is lost when you build a large array in memory (45,000 words in my machine).

      Besides, do you really think an all-uppercase output is better?

      $ time perl wsearch.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
        1 Al
        2 apple
        3 as
        4 ask
        5 blossom
        6 cherry
        7 err
        8 he
        9 her
       10 Herr
       11 Io
       12 Los
       13 loss
       14 mow
       15 of
       16 so
       17 we
      0.30user 0.00system 0:00.30elapsed 99%CPU 
      
      $ time perl wsearch2.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
        1 AL
        2 APPLE
        3 AS
        4 ASK
        5 BLOSSOM
        6 CHERRY
        7 ERR
        8 HE
        9 HER
       10 HERR
       11 IO
       12 LOS
       13 LOSS
       14 MOW
       15 OF
       16 SO
       17 WE
      0.48user 0.00system 0:00.48elapsed 99%CPU 
      

      Of course, I ran both scripts at least 4 times, to make sure that the storable file was created and that both scripts were reading their input from the disk cache.

        Not going to chime in with regards to the array in memory, but with regards to the output difference, just store the word in uppercase, as well. Also, don't forget to Code Smarter. Use index, instead of matching with a regex. That should provide the same functionality with nice speed gains.
        #!/usr/bin/perl use strict; use warnings; my $sw = lc shift or die "search word required\n"; my $count = 0; open WORDS, "/usr/share/dict/words" or die "can't open words file\n"; while (<WORDS>) { chomp; printf "%3d %s\n", ++$count, $_ if index($sw,lc $_) > -1; } close WORDS;
      Here are the timings for my machine. With this type of program, the results will depend on the relative performance of the hard disk, the CPU, the memory, and the caches.

      Even on a single machine, the results depend on the operating system kernel, and most importantly, with the build of perl. My vendor perl is an unusual build, and my own build is more tuned to my machine.

      The kernel I have been running, 2.4.22-6.ll.rh80, is an experimental low-latency kernel used for real-time applications. The file system is ext3.

      Timing, seconds Kernel 2.4.22-6.ll.rh80 Kernel 2.4.18-14
      Vendor perl, loop 2.938 2.569
      My perl, loop 1.957 1.677
      Vendor perl, Storable, no /i 1.267 0.966
      My perl, Storable, no /i 0.687 0.567

      If you prefer lower case to make the output look better, just use the lc function instead of the uc function.

      I think the lesson is that performance improvements are often difficult to generalize, so you have to try it on your own system and see what works.

      UPDATE: The Code Smarter suggestion is very good. For the Storable code, timings for the vendor perl on 2.4.22-6.ll.rh80 the timings are down to about 0.21 seconds and with my perl they are at about 0.13 seconds - too fast for my primitive benchmark.

      It should work perfectly the first time! - toma

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://336339]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (6)
As of 2024-04-19 23:02 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found