Re^4: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Oh really? Code and Timings)

by demerphq (Chancellor)
on Nov 17, 2004 at 21:42 UTC ( [id://408604] )


in reply to Re^3: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)
in thread Fuzzy Searching: Optimizing Algorithm Selection

Well BrowserUk, I'm afraid the numbers just don't agree with you at all.

I put together a hybrid Trie/DFA solution (the code is in the readmore at the bottom of this node). It shows quite clearly that the Trie/DFA will win in many cases against your XOR approach. Even though my trie implementation is pure Perl (with an extremely inefficient DFA construction routine), it outperforms your algorithm in every circumstance I tested it in _except_ the 25/1000/1000 tests that you were using, and even then, when I make the string being searched just ten times larger, XOR loses outright. Also, any possible benefit that may accrue from OS-level file caching goes to the XOR algorithm, because the Trie algorithm most often finished first and I put it first in the test code. I didn't have sufficient tuits to redo the implementation as a hybrid Perl/Inline::C version, which would only have increased the difference between the routines.

Basically, the performance of the Trie/DFA is nonlinear for construction and linear for the actual scan. This means that once the Trie is constructed, the search time for any given piece of data will theoretically be some constant times the length of the string, and in practice we see pretty much that, with some modest variation probably due to memory-management costs. Alternate, more compact representations could be chosen to make this more efficient. Since the constructed Trie can be serialized for reuse, the construction cost could in many situations be essentially ignored. For this reason I separated the time actually running the state machine from the time spent constructing it.
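To illustrate the reuse point, here is a sketch only (not something the benchmark below does; the filename and the commented-out calls are illustrative): the finished structure could be frozen with Storable once construction is done and simply reloaded in later runs.

use strict;
use warnings;
use Storable qw(nstore retrieve);

my (%trie, %hash);
my @Chars = qw(A C T G);
# ...build as in the code below:
#   make_variants($_, 2, \@Chars, \%hash, \%trie) for @words;
#   traverse(\%trie, \@Chars);

nstore(\%trie, 'fuzzy-trie.sto');        # pay the construction cost once
my $dfa = retrieve('fuzzy-trie.sto');    # later runs just reload the structure
# my ($count, @matches) = search($dfa, $sequence);   # and scan as usual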

I will admit that there are two subtle differences between the results from the two approaches. Yours will show multiple hits for a single offset if they exist, whereas mine will only show one. I don't consider this to make your solution superior in any way, as it is a fairly trivial issue to resolve: one could simply maintain a hash mapping each key in the trie to the other keys that are variants of it (see the sketch below). Similarly, this particular implementation doesn't necessarily handle strings of different sizes as one might expect. As this is meant to be a proof of concept, and the implementation you posted is hardcoded to handle 25-char words and as such suffers the same problem, I didn't see that as a show stopper.
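A minimal sketch of that fix (the key list is illustrative, and only 1-mismatch variants are generated to keep it short): record, for every fuzzy variant, all of the original keys that can produce it, and expand a single trie hit into that list afterwards.

use strict;
use warnings;

my @Chars         = qw(A C T G);
my @original_keys = qw(ACGTACGTAC ACGTACGTAG);   # illustrative keys

my %origins;                                     # fuzzy word => { original keys }
for my $key (@original_keys) {
    $origins{$key}{$key} = 1;
    my @c = split //, $key;
    for my $i (0 .. $#c) {
        for my $alt (@Chars) {
            next if $alt eq $c[$i];
            local $c[$i] = $alt;                 # same localized-substitution trick as make_variants()
            $origins{ join '', @c }{$key} = 1;
        }
    }
}

# When the trie reports a hit on $hit at offset $off, report every key it stands for:
my ($hit, $off) = ('ACGTACGTAT', 123);
print "$off: ", join(', ', sort keys %{ $origins{$hit} }), "\n";
# prints both ACGTACGTAC and ACGTACGTAG -- each is 1 mismatch away from the hit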

I should say, lastly, that calling this a DFA may not be entirely correct. Possibly a better term would be FSA (finite state automaton). The reason is that normally neither NFA nor DFA regex engines will match overlapping words. I.e. if the string is "ABCABD" and we have 'BC' and 'CA' as accepting strings, a regex engine would match ONLY "BC". My implementation doesn't care about that and would match both. Regardless, there are any number of ways to build a state machine for the purpose at hand, and mine is probably one of the least efficient.

The following table shows the various test scenarios that I used (all times are in seconds). The last column shows how much longer the XOR solution took than the Trie solution. Where that number is negative, the XOR-based solution beats the Trie-based solution. You'll notice that only one scenario is in this category: when the strings are 1000 chars and the words are 25 chars, a ratio of 1:40. However, once we use a ratio of 1:100 or so, this reverses itself quite dramatically.

The bottom line here is that if you have the resources, long term you are going to be better off with a technique like this than a technique like yours. For something like searching vast quantities of DNA strings I imagine a technique like this would do quite well. OTOH, this is a trade-off, and I can imagine scenarios where your approach would be better. But for hardcore pattern matching on modern machines with lots of RAM, I'd be looking at a DFA. Notice that by the time we start talking about 100k strings being searched for 10 words, the trie runs in 26 minutes less. And this is pure Perl and, in the opinion of the author, a pig. :-)

Word Size  String Size  Word Count        Xor       Trie       Scan  Construct  Xor - Trie
       10         1000           1     3.8472     2.1892     2.0378     0.1514      1.6579
       10         1000           2     5.4613     2.4221     2.0971     0.3250      3.0392
       10         1000           5    10.4136     3.1230     2.4561     0.6669      7.2906
       10         1000          10    18.8692     3.8036     2.5979     1.2058     15.0656
       10        10000           1    36.7216    19.7340    19.5802     0.1538     16.9876
       10        10000           2    53.5071    20.2742    19.9636     0.3106     33.2329
       10        10000           5   103.6078    23.0304    22.3537     0.6766     80.5774
       10        10000          10   182.2624    25.6677    24.4078     1.2599    156.5947
       10       100000           1   375.8999   198.2450   198.0712     0.1737    177.6549
       10       100000           2   544.4644   205.7155   205.3858     0.3297    338.7488
       10       100000           5  1027.3750   219.6740   218.9498     0.7242    807.7010
       10       100000          10  1829.9289   246.2426   244.8240     1.4185   1583.6864
       25         1000           1     3.8653     8.0640     2.0574     6.0066     -4.1987
       25         1000           2     5.6597    13.6144     2.1015    11.5129     -7.9548
       25         1000           5    11.8086    32.7390     2.3764    30.3625    -20.9304
       25         1000          10    19.6336    63.8928     2.6036    61.2892    -44.2592
       25        10000           1    38.3304    26.9714    19.8883     7.0832     11.3590
       25        10000           2    56.5971    32.9522    21.3281    11.6241     23.6449
       25        10000           5   108.0958    52.9627    23.1748    29.7879     55.1330
       25        10000          10   196.6709    80.1240    24.9569    55.1671    116.5469
       25       100000           1   393.9894   202.1637   196.5024     5.6614    191.8256
       25       100000           2   576.6479   218.8988   205.9747    12.9241    357.7492
       25       100000           5  1067.7821   254.2639   223.5472    30.7167    813.5182
       25       100000          10  1959.1590   315.1286   249.9723    65.1563   1644.0305
use strict;
use warnings;
use Time::HiRes qw(time);
use POSIX qw(strftime);
use Getopt::Long;
use vars qw/ $VERBOSE /;
$VERBOSE = 0;

exit(main(@ARGV));

BEGIN {
    # all of this gunk is for timing and logging stuff
    my @time    = (time);
    my @comment = ('Start');

    sub mark_time {
        push @time,    time;
        push @comment, join "", @_;
        print @_, sprintf " <%9.4f secs>\n", $time[-1] - $time[-2]
            if $VERBOSE;
        return $time[-1];
    }

    sub clear_time {
        push @time,    time;
        push @comment, 'Finish';
        @comment = map { s/[\t\r\n]/ /g; $_ } @comment;
        printf STDERR "%%#%% %s %s %s\n",
            strftime("%H:%M:%S", localtime($time[$_])),
            $_ ? sprintf " (%9.4f : %9.4f)", $time[$_]-$time[$_-1], $time[$_]-$time[0]
               : '###',
            ($comment[$_] || '')
            for 0 .. $#time;
        my $total = $time[-1] - $time[0];
        print "Total Elapsed Time: " . $total . " secs\n";
        @time    = (time);
        @comment = (shift || 'Start');
        return $total;
    }

    sub elapsed {
        my $time = shift;
        $time ||= $time[-1];
        print @_, sprintf " [%9.4f secs]\n", time - $time;
        return time;
    }

    END { clear_time(); }
}

sub search {
    my ($node, $string, $idx, $maximum) = @_;
    $idx = 0 unless defined $idx;
    my @match;
    my $matches;
    $maximum ||= 50;
    #print "\nsearch($node,$string,$idx)\n" if $VERBOSE<0;
    for ( ; $idx < length($string); $idx++ ) {
        my $char = substr($string, $idx, 1);
        #print "$idx : $char : $node->{id} : ", ($node->{value} ? $node->{value} : "''"), "\n"
        #    if $VERBOSE<0;
        if ( $node->{value} ) {
            push @match, [ $idx - length($node->{string}), ${$node->{value}}, $node->{fuzz}, $node ]
                if $matches++ < $maximum;
        }
        $node = $node->{$char} || $node->{-$char};
        die $char, $idx unless $node;
    }
    if ( $node->{value} ) {
        push @match, [ $idx - length($node->{string}), ${$node->{value}}, $node->{fuzz}, $node ]
            if $matches++ < $maximum;
    }
    return $matches, @match;
}

BEGIN {
    my $IDX = 1;
    sub get_idx(){ "$IDX" }

    # this builds the trie
    sub hash_trie_store {
        my ($node, $string, $value, $fuzz) = @_;
        $node->{id} ||= $IDX++;
        my $dot = 0;
        foreach my $char (split //, $string) {
            unless ($node->{$char}) {
                $node->{$char} = { id => $IDX++ };
            }
            $node = $node->{$char};
        }
        $node->{value}  = \$_[2];
        $node->{fuzz}   = $fuzz if defined $fuzz;
        $node->{string} = $string;
        #print "Store: $string:$value:$fuzz\n";
        $node->{id}
    }
}

# this makes fuzzy words from a string and then inserts them into the trie.
sub make_variants {
    my ($string, $depth, $chars, $hash, $trie, $rdepth, $rstring) = @_;
    my $ltrs = ref $chars ? $chars : [ split //, $chars ];
    my @char = split //, $string;
    $trie   ||= {};
    $hash   ||= {};
    $rdepth ||= 0;
    my $count = 0;
    my @follow;
    $rstring ||= $string;
    if (!$rdepth) {
        $hash->{$string} = hash_trie_store($trie, $string, $rstring, $rdepth);
        $count++;
        #print "* $string, $rdepth\n";
    }
    foreach my $idx (0 .. $#char) {
        foreach my $alt (@$ltrs) {
            next if $alt eq $char[$idx];
            local $char[$idx] = $alt;
            my $str = join "", @char;
            if (!$hash->{$str}) {
                $hash->{$str} = hash_trie_store($trie, $str, $rstring, $rdepth + 1);
                $count++;
                push @follow, $str if $depth > 1;
                #print "$str, ", $rdepth + 1, "\n";
            }
        }
    }
    foreach my $str (@follow) {
        $count += make_variants($str, $depth - 1, $ltrs, $hash, $trie, $rdepth + 1, $rstring);
    }
    return $count;
}

# this converts a trie into a "dfa" capable of doing the overlapping matching
# note it doesn't handle strings of differing size correctly, so don't use it for that.
sub traverse {
    my ($root, $chars, $str, $hash) = @_;
    $hash ||= $root;
    $str = "" unless defined $str;
    my $ltrs = ref $chars ? $chars : [ split //, $chars ];
    foreach my $k (@$ltrs) {
        if ($hash->{$k}) {
            traverse($root, $chars, $str . $k, $hash->{$k});
        } else {
            if (length($str) > 1) {
                my @chars = (split(//, $str), $k);
                while (defined shift @chars) {
                    my $csr = $root;
                    for my $c (@chars) {
                        last unless $csr = $csr->{$c} || $csr->{-$c};
                    }
                    if ($csr) {
                        $hash->{-$k} = $csr;
                        last
                    }
                }
            }
            $hash->{-$k} ||= $root;
        }
    }
    return
}

# slightly modified version of BrowserUk's code; some changes necessary because
# we allow for differing length words and also because Perl 5.6.2 complained.
sub xor_fuzz {
    my ($length, $FUZZY, $fuzzfile, $stringfile, $maximum) = @_;
    use bytes;
    $FUZZY   ||= 2;
    $maximum ||= 50;
    mark_time "xor_fuzz loading '$fuzzfile'\n";
    open my $FUZ, '<', $fuzzfile or die "$fuzzfile : $!";
    my %fuz;
    while( <$FUZ> ) {
        chomp;
        $fuz{ $_ } = '';
    }
    close $FUZ;
    mark_time "Loaded ${ \scalar keys %fuz } $length-ers\n";
    open my $SEQ, '<', $stringfile or die "$stringfile : $!";
    my $totalLen   = 0;
    my $fuzzyComps = 0;
    while( my $seq = <$SEQ> ) {
        my @matches;
        my $matches = 0;
        chomp $seq;
        $totalLen += length $seq;
        for my $offset ( 0 .. length( $seq ) - $length ) {
            my $ssref = \substr( $seq, $offset, $length );
            #printf STDERR "\rProcessing sequence %5d offset %05d", $., $offset;
            for my $fuz ( keys %fuz ) {
                $fuzzyComps++;
                my $m = $length - (my $x = ( $fuz ^ $$ssref )) =~ tr[\0][\0];
                if( $m <= $FUZZY ) {
                    push @matches, [ $offset, $fuz, $m ] if $matches++ < $maximum;
                }
            }
        }
        mark_time sprintf "XOR #%04d: %04d > %s", $., $matches,
            @matches ? ": " . join(", ", map { sprintf "%06d:%s:%d", @{$_}[0..2] } @matches)
                     : "None.";
    }
    mark_time "\n\nProcessed $. sequences\nAverage length: ",
        $totalLen / $., "\n" . "Total fuzzy comparisons: ", $fuzzyComps, "\n";
    close $SEQ;
}

# utility function
sub write_file(&@) {
    my ($sub, $file, $count, $override) = @_;
    return if !$override and -e $file;
    open my $fh, ">", "$file.tmp" or die "Error writing to '$file.tmp':$!";
    print "Writing '$file'\n" if $VERBOSE;
    for (1 .. $count) {
        print $fh $sub->(), "\n" or die "Failed to print to '$file'\n";
    }
    close $fh or die "Failed to close '$file'\n";
    rename "$file", "$file.bak" or die "Backup Rename failed:$!"
        if -e $file;
    rename "$file.tmp", "$file" or die "Tmp Rename failed:$!";
    mark_time "File Complete: '$file'\n";
}

sub main {
    $|++;
    my $Fuzz        = 2;
    my $Words       = 1;
    my $Word_Size   = 25;
    my $Strings     = 1000;
    my $String_Size = 1000;
    my $OverWrite   = 0;
    GetOptions( 'fuzz=i'        => \$Fuzz,
                'words=i'       => \$Words,
                'strings=i'     => \$Strings,
                'word_size=i'   => \$Word_Size,
                'string_size=i' => \$String_Size,
                'over_write!'   => \$OverWrite,
                'verbose!'      => \$VERBOSE )
        or die "Bad Options\n";
    my @Chars = qw(A C T G);
    my $Word_File   = sprintf "Fuzz-Words-W%04d-S%04d-WC%04d-SC%04d.fuzz",
                          $Word_Size, $String_Size, $Words, $Strings;
    my $String_File = sprintf "Fuzz-Strings-W%04d-S%04d-WC%04d-SC%04d.fuzz",
                          $Word_Size, $String_Size, $Words, $Strings;
    rename $String_File, "$String_File.bak"
        if -e $String_File and !-e $Word_File;
    my @words;
    write_file {
        my $str = join "", map { $Chars[rand @Chars] } 1 .. $Word_Size;
        push @words, $str;
        $str;
    } $Word_File, $Words, $OverWrite;
    print "Getting strings\n" if $VERBOSE;
    write_file {
        my $str = $Chars[rand @Chars] x $String_Size;
        substr($str, $_ - 1, 1) = $Chars[rand @Chars] for 1 .. $String_Size;
        for (1 .. $Word_Size) {
            my $p = int rand($String_Size - $Word_Size);
            my $w = $words[rand @words];
            substr($str, $p, $Word_Size, $w);
        }
        $str;
    } $String_File, $Strings, $OverWrite;
    mark_time "Finished building strings\nWords: $Word_File\nStrings: $String_File\n";
    clear_time("Starting TRIE\n");
    my $construct_time;
    { # Trie search
        @words = do {
            open my $ifh, "<", $Word_File
                or die "Can't read Word File '$Word_File':$!";
            map { chomp; $_ } <$ifh>
        };
        my $time = mark_time "Got " . scalar(@words) . " to fuzzyify\n";
        print "Making fuzzy strings\n" if $VERBOSE;
        my (%trie, %hash);
        foreach my $word (@words) {
            print length($word) . ":$word:" if $VERBOSE;
            my $count = make_variants($word, $Fuzz, \@Chars, \%hash, \%trie);
            $time = elapsed($time, $count) if $VERBOSE;
        }
        @words = sort keys %hash;
        mark_time "Fuzzy Words:" . scalar(@words) . " [" . get_idx() . " nodes in tree]\n";
        #exit(0);
        print "Doing traverse...\n" if $VERBOSE;
        traverse(\%trie, \@Chars);
        mark_time "Finished Traverse";
        $construct_time = clear_time("Starting Trie Scan\n");
        open my $fh, "<", $String_File
            or die "Error reading stringfile '$String_File'\n";
        while (<$fh>) {
            chomp(my $string = $_);
            my ($count, @matches) = search(\%trie, $string);
            mark_time sprintf "TRIE #%04d: %04d > %s", $., $count,
                @matches ? ": " . join(", ", map { sprintf "%06d:%s:%d", @{$_}[0..2] } @matches)
                         : "None.";
        }
        mark_time "Trie Done\n";
    }
    my $scan_time = clear_time("Starting xor_fuzz search\n");
    { # Xor Search
        xor_fuzz($Word_Size, $Fuzz, $Word_File, $String_File);
    }
    my $xor_total = clear_time("xor_fuzz search done.\n");
    print STDERR "**** WordSize: $Word_Size StringSize: $String_Size WordCount: $Words StringCount: $Strings Xor: $xor_total Trie: "
        . ($scan_time + $construct_time) . " ($scan_time + $construct_time)\n";
    print "**** WordSize: $Word_Size StringSize: $String_Size WordCount: $Words StringCount: $Strings Xor: $xor_total Trie: "
        . ($scan_time + $construct_time) . " ($scan_time + $construct_time)\n";
    0
}

And here is how it is meant to be used. (Note that it writes a special log to STDERR, and it only prints out the first 50 matches for either solution, so as not to overflow memory with such redundant data.)

del *.fuzz
trie_xor.pl --word_size=10 --string_size=1000   --words=1  --over_write 2>fuzzlog-10-1000-01.fuzzlog
trie_xor.pl --word_size=10 --string_size=1000   --words=2  --over_write 2>fuzzlog-10-1000-02.fuzzlog
trie_xor.pl --word_size=10 --string_size=1000   --words=5  --over_write 2>fuzzlog-10-1000-05.fuzzlog
trie_xor.pl --word_size=10 --string_size=1000   --words=10 --over_write 2>fuzzlog-10-1000-10.fuzzlog
trie_xor.pl --word_size=10 --string_size=10000  --words=1  --over_write 2>fuzzlog-10-10000-01.fuzzlog
trie_xor.pl --word_size=10 --string_size=10000  --words=2  --over_write 2>fuzzlog-10-10000-02.fuzzlog
trie_xor.pl --word_size=10 --string_size=10000  --words=5  --over_write 2>fuzzlog-10-10000-05.fuzzlog
trie_xor.pl --word_size=10 --string_size=10000  --words=10 --over_write 2>fuzzlog-10-10000-10.fuzzlog
trie_xor.pl --word_size=10 --string_size=100000 --words=1  --over_write 2>fuzzlog-10-100000-01.fuzzlog
trie_xor.pl --word_size=10 --string_size=100000 --words=2  --over_write 2>fuzzlog-10-100000-02.fuzzlog
trie_xor.pl --word_size=10 --string_size=100000 --words=5  --over_write 2>fuzzlog-10-100000-05.fuzzlog
trie_xor.pl --word_size=10 --string_size=100000 --words=10 --over_write 2>fuzzlog-10-100000-10.fuzzlog
trie_xor.pl --word_size=25 --string_size=1000   --words=1  --over_write 2>fuzzlog-25-1000-01.fuzzlog
trie_xor.pl --word_size=25 --string_size=1000   --words=2  --over_write 2>fuzzlog-25-1000-02.fuzzlog
trie_xor.pl --word_size=25 --string_size=1000   --words=5  --over_write 2>fuzzlog-25-1000-05.fuzzlog
trie_xor.pl --word_size=25 --string_size=1000   --words=10 --over_write 2>fuzzlog-25-1000-10.fuzzlog
trie_xor.pl --word_size=25 --string_size=10000  --words=1  --over_write 2>fuzzlog-25-10000-01.fuzzlog
trie_xor.pl --word_size=25 --string_size=10000  --words=2  --over_write 2>fuzzlog-25-10000-02.fuzzlog
trie_xor.pl --word_size=25 --string_size=10000  --words=5  --over_write 2>fuzzlog-25-10000-05.fuzzlog
trie_xor.pl --word_size=25 --string_size=10000  --words=10 --over_write 2>fuzzlog-25-10000-10.fuzzlog
trie_xor.pl --word_size=25 --string_size=100000 --words=1  --over_write 2>fuzzlog-25-100000-01.fuzzlog
trie_xor.pl --word_size=25 --string_size=100000 --words=2  --over_write 2>fuzzlog-25-100000-02.fuzzlog
trie_xor.pl --word_size=25 --string_size=100000 --words=5  --over_write 2>fuzzlog-25-100000-05.fuzzlog
trie_xor.pl --word_size=25 --string_size=100000 --words=10 --over_write 2>fuzzlog-25-100000-10.fuzzlog

A footnote: you mentioned a number of things that a DFA/Trie supposedly cannot do. I suggest to you that you were mostly wrong. Adding metadata to states in a DFA is relatively easy to do. For instance, accepting states can have associated data (in fact, in the implementation above it is the very presence of that data that indicates a state is accepting).
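Concretely, in the implementation above an accepting state is just a node hashref that carries extra fields; the field names below are the ones the posted code uses, while the specific values are made up for illustration:

# Shape of an accepting node in the trie above. The presence of {value}
# is what marks the state as accepting; the rest is metadata that rides
# along with it and comes back from search().
my ($original_word, $fuzzy_word) = ('ACGTACGTAC', 'ACGTACGTAT');   # example pair
my $accepting_node = {
    id     => 42,               # node number assigned during construction
    value  => \$original_word,  # ref back to the exact key this fuzzy word came from
    string => $fuzzy_word,      # the (possibly mutated) word that ends at this state
    fuzz   => 1,                # how many substitutions away from the original key
    # ...plus one entry per outgoing character: A => {...}, C => {...}, etc.
};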

---
demerphq


Replies are listed 'Best First'.
Re^5: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Yes. Really!)
by BrowserUk (Patriarch) on Nov 18, 2004 at 08:24 UTC
    Well BrowserUk, I'm afraid the numbers just don't agree with you at all.

    No. Your figures don't agree. Your very selective figures.

    The OP presented the question in the form

    The basic problem:

    Matching a 25-letter string against a dictionary of about 30 thousand variable length sequences. I need to find all:

    1. exact
    2. fuzzy (1- and 2-letter mismatches)

    within the dictionary.

    Performance matters because I have a library of several hundred thousand 25-letter strings. And because I'll be repeating this analysis for all pairwise combinations of about 10 search and 20 target dictionaries.

    I've highlighted the (IMO, which is as good as any other as the OP hasn't returned to clarify matters) significant figures in that quote.

    Your table is very interesting. Particularly in what it doesn't show.

    The most important point is that your algorithm does not scale!

    The first thing I noticed was that the biggest number in the OP's presentation of the problem was the one you chose to vary through the most limited range.

    He specified, clearly, 100,000 25-char keys ("words" as you refer to them). You only varied this value through a range of 1, 2, 5 & 10. I wondered why?

    The following is the output from your unmodified program, apart from the addition of one extra line of diagnostics which I'll post at the bottom.

    [ 5:40:10.75] P:\test\demerphq>del *.fuzz  rem Ensure a clean plate

    [ 5:40:16.21] P:\test\demerphq>dir
     Volume in drive P has no label.
     Volume Serial Number is BCCA-B4CC

     Directory of P:\test\demerphq

    18/11/2004  05:40    <DIR>          .
    18/11/2004  05:40    <DIR>          ..
    18/11/2004  05:38             9,597 demerphq.pl
                   1 File(s)          9,597 bytes
                   2 Dir(s)  48,390,365,184 bytes free

    [ 5:40:21.90] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=1 2>&1 | find "****"
    **** WordSize: 25 StringSize: 1000 WordCount: 1 StringCount: 1000 Xor: 4.453125 Trie: 10.390625 (1.859375 + 8.53125)
    **** perl.exe  416  0  12,320 K

    [ 5:40:43.59] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=10 2>&1 | find "****"
    **** WordSize: 25 StringSize: 1000 WordCount: 10 StringCount: 1000 Xor: 21.0625 Trie: 95.796875 (2.328125 + 93.46875)
    **** perl.exe  3196  0  78,164 K

    [ 5:42:51.90] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=100 2>&1 | find "****"
    #### The above run self-terminated (crashed) because it was out of memory!!! ####

    [ 5:49:20.62] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=20 2>&1 | find "****"
    **** WordSize: 25 StringSize: 1000 WordCount: 20 StringCount: 1000 Xor: 41.453125 Trie: 195.71875 (2.5 + 193.21875)
    **** perl.exe  2708  0  149,812 K

    [ 5:53:46.56] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=30 2>&1 | find "****"
    **** WordSize: 25 StringSize: 1000 WordCount: 30 StringCount: 1000 Xor: 59.59375 Trie: 293.75 (2.59375 + 291.15625)
    **** perl.exe  3504  0  222,616 K

    [ 6:00:00.59] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=40 2>&1 | find "****"
    **** WordSize: 25 StringSize: 1000 WordCount: 40 StringCount: 1000 Xor: 79.265625 Trie: 404.96875 (2.734375 + 402.234375)
    **** perl.exe  3412  0  293,504 K

    [ 6:13:48.76] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=50 2>&1 | find "****"
    **** WordSize: 25 StringSize: 1000 WordCount: 50 StringCount: 1000 Xor: 97.546875 Trie: 494.3125 (2.796875 + 491.515625)
    **** perl.exe  3524  0  366,868 K

    As you can see, I filtered the output to show just the headline numbers, and the extra line of diagnostics I added. I've also hand wrapped the output for the usual reasons.

    The bottom line here is that if you have the resources, long term you are going to be better off with a technique like this than a technique like yours.

    You even predicted the problem, but dismissed it.

    If you look at the third run I attempted above, you'll see that I annotated it as running out of memory. I did a little math based on the memory consumption figures output using the MS tasklist.exe utility.

    1. 1 x 25-char key requires 12 MB.
    2. 10 x 25-char keys (words) requires 78 MB
    3. 20 x 25-char keys (words) requires 149 MB (delta for 10 words: 71 MB.)
    4. 30 x 25-char keys (words) requires 222 MB (delta for 10 words: 73 MB.)
    5. 40 x 25-char keys (words) requires 293 MB (delta for 10 words: 71 MB.)
    6. 50 x 25-char keys (words) requires 366 MB (delta for 10 words: 73 MB.)

    That's pretty consistent, and linear, at > 70 MB per 10 keys.

    So, calculating the RAM requirement to build the MegaTrie DFA that your algorithm depends on for its speed comes out to:

    100,000 x 25-char keys (words) requires 696 Gigabytes of RAM.
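    For reference, here is the arithmetic behind that figure as a back-of-envelope Perl check, fitted linearly to the tasklist numbers above (the exact constant depends on which runs you fit, so treat the result as approximate):

    my ($mem_1, $mem_50) = (12_320, 366_868);         # KB used for 1 key and 50 keys (tasklist)
    my $kb_per_key = ($mem_50 - $mem_1) / (50 - 1);   # ~7,236 KB per 25-char key
    my $total_kb   = $kb_per_key * 100_000;           # the OP's 100,000 keys
    printf "%.0f KB/key => about %.0f GB\n",
        $kb_per_key, $total_kb / (1024 * 1024);       # ~690 GB, the same ballpark as the figure above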

    As no $5000 system I am aware of is capable of addressing that volume of RAM, the question of whether the sequences are 1,000 chars or 10,000 chars long, and which would be the faster algorithm if that were the OP's requirement, simply doesn't arise--your algorithm doesn't get going, because you cannot build your MegaTrie/DFA!

    The Xor algorithm does handle "several hundred thousand keys" easily.

    Update: I forgot to post the one line I changed in your program:

    ## This:
    exit(main(@ARGV));

    ## became this:
    main(@ARGV);
    print "**** $_" for `tasklist /NH /FI "PID eq $$"`;
    exit(0);

    Type tasklist /? on your local XP system for an explanation of the parameters.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

      The Xor algorithm does handle "several hundred thousand keys" easily.

      No, the Xor algorithm does NOT handle several hundred thousand keys easily. The rate of growth of the run time for searching a string 100k long is too high. Look at the numbers: the algorithm gets slower the more keys there are involved. So slow that you could construct a Trie/DFA for a subset, run them in series, and still outperform your algorithm. I chose the numbers I chose because if your algorithm blows out at those numbers, it has already totally lost the game.

      Look at the numbers. For 10 x 25-char words searching 100k strings, my solution took 5.25 minutes. Your solution took half an hour. So you could run mine 6 times and yours still wouldn't have finished. Thus mine would do 60 words in 30 minutes while yours was still struggling with 10 words. Adding words appears to make your solution grow faster than mine.

      Now mine could be made even more efficient. As I said, the construction routine I'm using is far from optimal; I'm pretty sure it can be made to run in time linear in the number of keys. Not only that, but all of your memory calculations are like your other calculations: misleading. An optimized representation for this problem should require only 20 bytes per node (5 * length(long)). Consider that a single empty hash and the reference to it (of which there are hundreds of thousands in use in my prototype code) takes more memory than that, at least double, since an SV itself takes 28 bytes. So the scale of problems this solution can handle in production-quality code gets much, much larger than the toy implementation here.
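      To put the 20-bytes-per-node figure in context, here is one hypothetical way such a packed layout might look. This is a sketch, not the prototype above; the slot layout and helper names are invented for illustration. Each node is five 32-bit slots (one per A/C/G/T transition plus one accept/value slot), all stored in a single Perl string.

      use strict;
      use warnings;

      my %SLOT   = ( A => 0, C => 1, G => 2, T => 3 );   # transition slots
      my $ACCEPT = 4;                                    # accept/value slot

      my $nodes   = '';                                  # every node lives in this one string
      my $n_nodes = 0;

      sub new_node { $nodes .= pack('V5', (0) x 5); return $n_nodes++ }   # 20 bytes per node

      sub get_slot { unpack 'V', substr($nodes, ($_[0] * 5 + $_[1]) * 4, 4) }
      sub set_slot { substr($nodes, ($_[0] * 5 + $_[1]) * 4, 4) = pack 'V', $_[2] }

      my $root = new_node();

      sub insert {                       # store a word; the accept slot holds 1 + word number
          my ($word, $num) = @_;
          my $node = $root;
          for my $ch (split //, $word) {
              my $next = get_slot($node, $SLOT{$ch});
              if (!$next) {              # 0 means "no transition" (nothing ever points at the root)
                  $next = new_node();
                  set_slot($node, $SLOT{$ch}, $next);
              }
              $node = $next;
          }
          set_slot($node, $ACCEPT, $num + 1);
      }

      insert('ACGT', 0);
      printf "%d nodes, %d bytes\n", $n_nodes, length $nodes;   # 5 nodes, 100 bytes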

      Anyway, I made an assertion in my original node in this thread. I've now proved it with _real_ numbers and working code. You claimed it was nonsense. It has proved not to be. And all of your wriggling and trying to rewrite the situation to make your claims true doesn't work. I made an assertion and my assertion has proved true. Within the parameters where it is feasible to use a DFA/Trie, it will smoke your Xor solution for cases where the string being searched is fairly long. Your objections to that claim have proved incorrect, as has your math.

      You confidently calculated that it would take years for a solution like this to run, and you have been proved utterly wrong. Even if you went with the current code and ran it 100 times, it still would only be 525 minutes to search for 1000 words over 1000 x 100k strings. Looking at the run time between 5 words and 10 words over 1000 x 100k strings for your solution, we see a 1000-second increase. Since your operation will scale proportionally to the number of words and the length of the string, we can conclude that to process 1000 words on the same data set will take your solution about 3000 minutes. Which is 2500 minutes LONGER than mine. That's 41 hours longer. And you want to go this route with 100k words? Using the same logic, 100k words would take your solution 20 million minutes, that's ~38 years. Versus the approximately 36 days it would take mine. I know which one I would prefer to wait for.

      To make this picture even worse for your solution, I'd like to point out that in my earlier code I was effectively doing $Fuzzy=5 by accident and it was still outpacing your solution. That added tens of thousands of words to the search list. If it was possible to do that with a pure Perl solution, imagine what a String/Struct/Inline::C/Perl hybrid would do. Using a small fraction of the RAM and doing the search operations in C means that my solution just gets faster.

      All told, I think it's quite clear that I have proved the point in my original post: a state-machine-based scan will outperform anything posted in this thread. And it has. I know you will come back with more of the same here, but I'm no longer interested. You insulted me and you posted a bunch of nonsense math saying I'm all wrong. And now, when I prove quite conclusively that you were wrong, have you gone and updated any of that to point out it's wrong? Have you added any caveats anywhere? Isn't this why you were insulting and abusive to me in the CB? You objected to me asserting that my proposal was more efficient than yours, and to my not being clearer in the caveat I added at the bottom; what about your nodes? They are full of totally nonsensical equations and arguments supposedly proving how my approach just can't work, but hey, it outperforms yours. That's a fact that you can't get around no matter how hard you try.

      ---
      demerphq

Re^5: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Yes. Really, really!)
by BrowserUk (Patriarch) on Nov 27, 2004 at 13:24 UTC

    demerphq, the numbers in your table are meaningless.

    1. Your solution is broken.

      To show this, I ran your test harness (slightly modified to accept filenames from the command line and to disable the "only output the first 50 matches" code) on two small test files. The first was generated using this small script, which generates all the variations of a specified-length key; I used 10-char keys (406 variations) because your prototype code cannot handle 25-char keys (2701 variations).

      #! perl -slw
      use strict;

      our $LEN ||= 10;

      my $s = 'A' x $LEN;
      print $s;

      for my $p1 ( 0 .. $LEN -1 ) {
          for my $c1 ( qw[ C G T ] ) {
              for my $p2 ( $p1 .. $LEN -1 ) {
                  next if $p1 == $p2;
                  for my $c2 ( qw[ C G T ] ) {
                      my $t = $s;
                      substr $t, $p1, 1, $c1;
                      substr $t, $p2, 1, $c2;
                      print $t;
                  }
              }
          }
      }

      __END__
      P:\test\demerphq>keyGen -LEN=10 >test.words

      P:\test\demerphq>wc -l test.words
       406 test.words

      Here I've printed the first and last 10 to show the sequence:

      P:\test\demerphq>u:head test.words
      AAAAAAAAAA
      CCAAAAAAAA
      CGAAAAAAAA
      CTAAAAAAAA
      CACAAAAAAA
      CAGAAAAAAA
      CATAAAAAAA
      CAACAAAAAA
      CAAGAAAAAA
      CAATAAAAAA

      P:\test\demerphq>u:tail test.words
      AAAAAAATAT
      AAAAAAAACC
      AAAAAAAACG
      AAAAAAAACT
      AAAAAAAAGC
      AAAAAAAAGG
      AAAAAAAAGT
      AAAAAAAATC
      AAAAAAAATG
      AAAAAAAATT

      I used the following 11-char single sequence for the test:

      P:\test\demerphq>type test.seq
      AAAAAAAAAAA

      The thing to note is that every one of the 406 keys in test.words will match that single 11-char sequence. Twice, once at offset 0, and again at offset 1.

      Now the run.

      The explanation of why your code only finds two matches (and attributes them to exact matches) while mine finds those two plus 810 more?

      Simple. Your Trie implementation will only report the first match it finds at any given position. If you delete the first key in test.words, 'AAAAAAAAAA', then it will still report finding 'AAAAAAAAAA', because every other key in that file can be considered a fuzzy match for 'AAAAAAAAAA'.

      What this means is that your code is only doing a fraction of the work that mine is doing (and that is required). In the case of the 10-char keys above, your code is only doing 1/406th (0.25%) of the required work. By the time you get to 25-char keys, this drops to 1/2701th (0.04%).

      For the explanation of those numbers, you need only go back and read my analysis, and make some attempt to understand those "...totally nonsensical equations and arguments supposedly proving how my approach just cant work, ...".

      So, for your "...but hey it outperforms yours..." claims: if you only do a tiny fraction of what is required, sure, you'll do it quicker. Even that claim is bogus if you cannot build the data structure required to achieve your "performance".

    2. Update: Section removed. I screwed up. 'B' is not a part of the alphabet 'ACGT' under test.
    3. There are myriad other things wrong with both your algorithm and your implementation of it.
      • Your home-brew timing code incorporates its own runtime into the things under test.
      • Your logging code--in particular, the code that limits the number of matches output, for no good reason other than beautification--imposes a substantial overhead on both our algorithms.
        • By accumulating the matches in an array, instead of outputting them as they are found.
        • By spending time formatting (push @time, join, @_, sub sprintf ... join, map{ sprintf } @matches ) them into a single line, rather than just printing them!

        If you get around to performing your timings without that code in place you'll find that not only does it markedly alter the results, but that the impact of it, due to the differences in our algorithms (and nesting levels at which it is done), is disproportionate. I'll let you work out for yourself which algorithm is most affected.

      • Your prototype-vs-production, Perl-vs-C claims ("...imagine what a String/Struct/Inline::C/Perl hybrid would do. Using probably less than 1/100th of the ram..."--since silently modified) are also totally without basis.

        Even if you could achieve a 100:1 reduction in the size of the data structure required to hold your MegaTrie/DFA (a more realistic figure would be 20:1), you would still be (at 7GB) 3 1/2 times over budget for the 2GB/process RAM addressable by your average 32-bit processor.

    All in all--eminently worthy of my "Utterly bollocked" private /msg.


    I could have gone on, but then this would seem like a hatchet job. It isn't. I've expended the best part of this last week on this analysis, on top of the time I put into the analysis I had already done--the analysis that you so readily dismissed as "...totally nonsensical equations and arguments...".

    This is the 5th revision of this document. Each time I've attempted to remove the vitriol from it. Reading it back now I see I still probably haven't achieved that as well as people will expect. Sorry, but every time I read this and this, it gets me so hot under the collar that it just comes out. So be it, whatever is left will stand. If this amount of effort is rewarded with downvotes that's fine by me.

    Two final comments:

    1. It is a truism that "War always causes an arms race; technology always advances faster under war than under peace".

      And so it is. As a result of your pushing, and the penny dropping about an optimisation alluded to in the OP (Option #2), I have now improved the performance of my algorithm, especially on long sequences (strings), by almost an order of magnitude compared to the code I previously published. That still means that the OP's stated requirements will take a substantial amount of time, but then they are substantial requirements.

    2. I probably wouldn't have pursued that optimisation had you not so incensed me.

      Imagine what we could do if we worked together, instead of against each other? Of course, it seems to be human nature that we (man in general; and you and I specifically), cannot make this transition.

      I think that I see a way to combine the XOR method with a (series of small) Tries such that it would achieve an even faster solution whilst staying within the realms of reality as far as memory consumption is concerned. I've implemented a generic, pure Perl Trie that uses substantially less memory (though it is slightly slower than yours).

      If things pan out, you might hear more on that. In any case, the Trie implementation has some rather interesting other uses, so you might hear more about that anyway.


    Examine what is said, not who speaks.
    "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
    "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      Even if you could achieve a 100:1 reduction in the size of the data structure required to hold your MegaTrie/DFA (a more realistic figure would be 20:1), you would still be (at 7GB) 3 1/2 times over budget for the 2GB/process RAM addressable by your average 32-bit processor.
      So do it in multiple passes, with as many keywords as fit comfortably in memory. As you say, the problem has substantial requirements.

      And surely the bulk of the problem is to find which offsets match; identifying the particular strings that match at a given offset, if this is indeed necessary, is much easier and can be a separate step.
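      A minimal sketch of that multi-pass idea, reusing the subs from the code posted above (make_variants, traverse, search); the batch size and the file names are illustrative assumptions, not tuned values:

      use List::Util qw(min);

      my @Chars      = qw(A C T G);
      my $Fuzz       = 2;
      my $Batch_Size = 1_000;                       # however many keys fit comfortably in RAM

      my @keys      = load_lines('keys.fuzz');      # hypothetical input files
      my @sequences = load_lines('sequences.fuzz');

      for (my $i = 0; $i < @keys; $i += $Batch_Size) {
          my @batch = @keys[ $i .. min($i + $Batch_Size - 1, $#keys) ];

          my (%trie, %hash);                        # fresh trie for this pass
          make_variants($_, $Fuzz, \@Chars, \%hash, \%trie) for @batch;
          traverse(\%trie, \@Chars);                # add the overlap/failure transitions

          for my $seq (@sequences) {                # rescan every sequence on every pass
              my ($count, @matches) = search(\%trie, $seq);
              # ...record ($seq, @matches) against this batch of keys...
          }
      }

      sub load_lines {
          my ($file) = @_;
          open my $fh, '<', $file or die "$file: $!";
          chomp(my @lines = <$fh>);
          return @lines;
      }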

        Okay. Let's assume for the moment that we've successfully pared down the memory consumption of the Trie to 1/20th of the prototype's usage. That means that we can now successfully build a Trie from 1000 keys instead of 50. For the 100,000 keys you need 100 passes over the 30,000 sequences.

        The performance of the Trie comes from trading the time consumed building the Trie, against the speed of scanning the sequences; and amortising the build time across the sequences scanned.

        But now we have to build 100 trees. Thus we've raised that overhead by 2 orders of magnitude. On top of that, we now have to scan all the sequences 100 times; multiplying the scan time by 2 orders of magnitude.

        If you take a look at the numbers (in the program output), you'll see that building a Trie with a single 25-char word takes around 8.5 seconds, and with 50 x 25-char words, 491 seconds (9.84 sec/word). Based on those numbers*, you can expect building a Trie of 1000 x 25-char words to take at least 2 hrs 21 minutes (8.5 secs * 1000). Building 100 of those will take seven months! And that's before we start scanning the 30,000 sequences 100 times each.

        *It is possible that this build time might be reduced by moving to a C implementation. It's also possible that it would not. Most of the time taken building the Trie comes not from pointer chasing, or anything else that Perl is significantly slower than C for; most of it stems from memory allocation and management. Perl is actually very good at this. Indeed, the fastest Perl builds are those that use Perl's built-in malloc/free routines in preference to those in your local C library. Perl does this well--and reliably. Any savings that might come from implementing the memory management in C would come at the expense of reliability and at considerable development cost. Doing memory management in C is hard. Perl's routines have had many years of evolution. It would be the height of arrogance to think that you are going to do this better at your 1st or 2nd or even 5th attempt.

        identifying the particular strings that match at a given offset, if this is indeed necessary, is much easier and can be a separate step.

        Now you're faced with the task of comparing each of the matching substrings against each of the 100,000 keys and then determining if it is a 0, 1, or 2-mismatch for each of them.

        And how will you make the fuzzy-level determination? Fall back to the Xor method?

        The question of whether it is important to know what level of mis-match, each of the matched substrings matched with, is an open question; that only the OP can answer. I think it likely is, but I'm guessing.

        However, knowing which of the 100,000 keys the substring matched against is important. As I said before, I'm not a biochemist or a genetic engineer, but I find it extremely unlikely that knowing only that the 25 base-pairs at offset 3129 of sequence 29,001 matched (possibly exactly, possibly with one mismatch, possibly with two mismatches) against one of these 100,000 x 25-base-pair 'bits' would be of any great use in solving the original problem.

        This is the same conclusion I reached when I originally looked at using Tries for this problem in Re^3: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.). To quote myself:

        MegaTrie

        Now it is possible to combine the keys generated from more than one exact key, into a single Trie. At least in theory, this would allow the 100,000 exact keys + all their fuzzy matches to be combined into one MegaTrie. This approximates a DFA, by solving all 100,000 * 2700 (2-miss) "Does it match?" questions, in one go, for each 25-char substring. Reducing the problem to 976 * 30,000 = 29,280,000 lookups (here defined as a transition from Perl into C and back). When compared with the 954,528,000,000,000 otherwise required by the 2-miss scenario, this sounds very inviting.

        The problem is that all you then know is that one of the 100,000 exact matches, or one of their 270,000,000 possible fuzzy matches, did or did not match each of the 29,280,000 substrings. So, you get to know something (that the 25-char substring at position P of sequence S matched something)--very quickly--but not what (which of the 100,000), nor why (exact, 1-miss or 2-miss).


        Examine what is said, not who speaks.
        "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
        "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

      You haven't proved anything here. (Except that you don't understand how my solution works: the list you built meant my code was trying to match all _4_ digit fuzzy strings, and not the _2_ digits we were originally discussing, which is presumably where you get the misconception that it won't handle 25-digit keys.) The behaviour you have posted here is exactly as I predicted, and it's correct. As I stated in my earlier post, and as ysth pointed out as well, it's trivial to determine what other keys also match if a given key matches. So you haven't proved anything here. I could have written the code so that instead of outputting only the literal string matched it would output a precalculated list of all the possible variants. Which actually brings me to yet another mathematical problem with your code: your list omits the 30 words that have only 1 digit different.
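      For reference, the counts being argued about in this sub-thread fall straight out of the 4-letter alphabet; a quick check:

      # Over {A,C,G,T}, a k-char key has 3*k one-mismatch variants and
      # 9*C(k,2) exact-two-mismatch variants.
      sub choose2 { my $k = shift; $k * ($k - 1) / 2 }

      for my $k (10, 25) {
          my $one = 3 * $k;               # 30 for k=10: the words the posted list omits
          my $two = 9 * choose2($k);      # 405 for k=10, 2700 for k=25
          printf "k=%2d: 1-mismatch=%d  2-mismatch=%d  original+2-mismatch=%d  full 0..2 ball=%d\n",
              $k, $one, $two, 1 + $two, 1 + $one + $two;
          # k=10: original+2-mismatch = 406 (the keyGen file), full ball = 436
          # k=25: original+2-mismatch = 2701,                  full ball = 2776
      }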

      Point in fact: you say my idea doesn't work, so I posted working code. Then you say it's broken, when in fact it does exactly what it was advertised to do. Face it, BrowserUk, you are and were wrong. Your XOR approach is slow, so slow as to be useless for nontrivial searches. The state machine is fast, and while it has an up-front overhead, it can be reused over and over. No matter what you do, you aren't going to get around that. Sorry.

      I suggest you drop this; all you are doing is embarrassing yourself now.

      ---
      demerphq

        ...the list you built meant my code was trying to match all _4_ digit fuzzy strings, and not the _2_ digits we were originally discussing...
        01234567890
        AAAAAAAAAA
        AAAAAAAAAAA
        ==========      Offset 0 -- 0 mismatches.

        01234567890
         AAAAAAAAAA
        AAAAAAAAAAA
         ==========     Offset 1 -- 0 mismatches.

        01234567890
        CCAAAAAAAA
        AAAAAAAAAAA
        xx========      Offset 0 -- 2 mismatches.

        12345678901
         CCAAAAAAAA
        AAAAAAAAAAA
         xx========     Offset 1 -- 2 mismatches.

        ## 806 other 2-mismatch matches your code fails to find

        12345678901
        AAAAAAAATT
        AAAAAAAAAAA
        ========xx      Offset 0 -- 2 mismatches.

        12345678901
         AAAAAAAATT
        AAAAAAAAAAA
         ========xx     Offset 1 -- 2 mismatches.

        QED. 2 not 4.

        Perhaps you should try this

        Update: And don't you realise that the list of 4-character mismatches would include all of the 2-char mismatches? And the 1-char mismatches? And the 3-char mismatches? And your code found what? Just TWO exact matches... pull the other one.


        Examine what is said, not who speaks.
        "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
        "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
