Try this. It uses ~300MB to store and index 1 million (randomly generated) words (14MB on disk). It allows full regex searching (with some adaptations) and never seems to take more than 2 seconds:
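#! perl -slw
use strict;
use Time::HiRes qw[ time ];
use List::MoreUtils qw[ uniq ];

# Build a 4-byte bit string with one bit set per distinct letter (a-z) in the word.
sub idx{
    my $idx = chr(0) x 4;
    vec( $idx, $_, 1 ) = 1 for map ord()-ord('a'), uniq sort split'', $_[ 0 ];
    $idx;
}

# Group the dictionary words by their letter-set key.
my %idx;
open DICT, '<', 'junk.words' or die $!;
while( <DICT> ) {
    chomp;
    push @{ $idx{ idx( $_ ) } }, $_;
}
my @keys = keys %idx;
print scalar @keys;    # number of distinct letter sets

# Read one regex per input line and search for it.
while( <> ) {
    my $start = time;
    my @matches;
    my $n = 0;
    chomp;
    # Keep only the letters a-z from the regex and index them like a word.
    ( my $pat = $_ ) =~ tr[a-z][]cd;
    $pat = idx( $pat );
    # Only run the regex on words whose letter set contains every letter in the pattern.
    for my $idx ( grep+(($_ & $pat) eq $pat), @keys ) {
        for my $poss ( @{ $idx{ $idx } } ) {
            $poss =~ $_ and $matches[ $n++ ] = $poss;
        }
    }
    printf "Found $n matches in %.2f seconds; Display? ", time() - $start;
    if( <> =~ /y/i ) {
        print for @matches;
    }
}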
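The trick that keeps the search fast is the letter-set prefilter: a word can only match if it contains every letter that appears in the pattern, so most words are rejected with one string AND before the regex engine runs. Here is a minimal standalone sketch of just that step. The names letter_mask and may_match and the four sample words are made up for illustration; they are not part of the script above.

```perl
use strict;
use warnings;

# Build a 4-byte bit string with bit N set when the word contains the
# letter chr(ord('a') + N). Anything outside a-z is ignored.
sub letter_mask {
    my ($word) = @_;
    my $mask = chr(0) x 4;
    vec( $mask, ord($_) - ord('a'), 1 ) = 1 for grep { /[a-z]/ } split //, $word;
    return $mask;
}

# True when every bit set in the pattern mask is also set in the word mask,
# i.e. the pattern's letters are a subset of the word's letters.
sub may_match {
    my ( $word_mask, $pat_mask ) = @_;
    return ( $word_mask & $pat_mask ) eq $pat_mask;
}

my @words    = qw( atzzz pizza fornywxhqt banana );   # hypothetical sample data
my $pat_mask = letter_mask('azzz');                   # letters left over from ^a.*zzz$

# Stage 1: cheap bit test. Stage 2: the real regex, on the survivors only.
my @candidates = grep { may_match( letter_mask($_), $pat_mask ) } @words;
my @matches    = grep { /^a.*zzz$/ } @candidates;

print "candidates: @candidates";
print "\n";
print "matches: @matches";
print "\n";
```

Note that both operands of & must be strings for Perl to do a bytewise AND; that is why the masks are built with vec rather than stored as integers.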
A few examples:
c:\test>742277.pl
857720
z$
Found 38464 matches in 1.86 seconds; Display? n
zz$
Found 1481 matches in 1.80 seconds; Display? n
zzz$
Found 55 matches in 1.78 seconds; Display? n
^a.*zzz$
Found 3 matches in 1.00 seconds; Display? y
afyjhcukywpbzzz
azhmwxjxncbaozzz
atzzz
[aeiou]{6]
Found 0 matches in 0.39 seconds; Display? n
[aeiou]{6}
Found 99 matches in 0.43 seconds; Display? n
[aeiou]{7}
Found 23 matches in 0.45 seconds; Display? n
[aeiou]{8}
Found 2 matches in 0.43 seconds; Display? y
ouaueeie
acxftoeeoeuiofoj
^[aeiou]+$
Found 3 matches in 0.44 seconds; Display? y
ieouea
iuaoea
ouaueeie
^for
Found 54 matches in 0.67 seconds; Display? n
^for.*ness
Found 0 matches in 0.39 seconds; Display?
^for.*n
Found 14 matches in 0.50 seconds; Display? y
foromnfikqfarwgedn
fornsdlluobiqdmacjl
forhkzmalfewhaohknrl
fornfxdprljcckkh
fortgntqbpbnmmtpk
forkqvimulibcfxwyjnce
formslskcoazusn
fornywxhqt
forndbzjfm
forfnmhhvdcntt
forxhbcimsggnhhmbiqze
foruhvpekgtnialyifyi
forcnmamdsx
forcvxnb
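One caveat that the transcript seems to show: the tr[a-z][]cd step treats every letter in the pattern as required, including letters inside a character class. Every hit listed for ^[aeiou]+$ and [aeiou]{8} contains all five vowels, which suggests that words made of only some of the vowels were filtered out before the regex ran. A possible workaround, sketched here under my own assumptions, is to collect only the letters that any match must contain and to give up (require nothing) when that is hard to tell. required_letters is a hypothetical helper, not part of the script above, and it is deliberately conservative rather than a full regex parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return only the letters every match must contain.
# When in doubt it drops a letter, which costs speed but never loses matches.
sub required_letters {
    my ($re) = @_;
    return '' if $re =~ /[|(]/;        # alternation or groups: require nothing
    $re =~ s/\\.//g;                   # escapes such as \d or \w are not literal letters
    $re =~ s/\[\^?\]?[^\]]*\]//g;      # a class matches one of several letters
    $re =~ s/[a-z](?=[?*{])//g;        # a letter followed by ?, * or {..} may be optional
    $re =~ tr/a-z//cd;                 # keep whatever a-z letters are left
    return $re;
}

for my $re ( '^[aeiou]+$', '^for.*n', '^a.*zzz$', 'colou?r', '\d+z' ) {
    printf "%-12s requires '%s'\n", $re, required_letters($re);
}
```

The result would replace the tr[a-z][]cd line's output before it is passed to idx(). Patterns with no required letters then fall through to a full scan, which is slower but complete.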