comment on

If you XOR two strings together,

print join'',unpack 'C*',   
  'AGGCGGAAGCACCCAACAGCAACAG' 
^ 'ACACGCAAAAACCGAAGAGAAAGCG';

0460040062000400400200420
[download]

The resultant string will be the null character (chr( 0 )) in each position where the two strings match and some other char where they did not.

If you count the non-null bytes, it gives you a score for the fuzziness of the match.

print $_ = grep $_,unpack 'C*', 
  'AGGCGGAAGCACCCAACAGCAACAG' 
^ 'ACACGCAAAAACCGAAGAGAAAGCG';

10
[download]

To avoid having to process the large file of sequences multiple times, it is better to process all of the 25-ers against each sequence in turn. By using substr to step down the sequence, you can perform the XOR against each of the 25-ers for each position of the sequence.

Putting that all together, I got:

#! perl -slw
use strict;
use bytes;
$| = 1;

our $FUZZY ||= 10;

open FUZ, '< :raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
my %fuz;
$fuz{ $_ } = '' while chomp( $_ = <FUZ> );
close FUZ;

print "Loaded ${ \scalar keys %fuz } 25-ers";

open SEQ, '< :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!";

while( my $seq = <SEQ> ) {
    chomp $seq;

    for my $offset ( 0 .. length( $seq ) - 25 ) { 
        printf "\rProcessing sequence %5d offset %05d", $., $offset;

        for my $fuz ( keys %fuz ) {
            my $m =  grep $_, unpack 'C*', $fuz ^ substr( $seq, $offse
+t, 25 );

            if( $m <= $FUZZY ) {
                
                ## This stores the lineno/offset/fuzziness where each 
+25-er
                ##  matched in a compact form for further process; sor
+ting etc.
                $fuz{ $fuz } .= pack 'nnn', $., $offset, $m;
                
                ## Or just print out the data to a file.
#                print "\nMatched '$fuz' -v- '", 
#                    substr( $seq, $offset, 25 ),
#                "' in line: $. @ $offset with fuzziness: ", $m;
            }
        }
    }
}

close SEQ;
[download]

This process is currently running 500,000 25-ers against a file 30,000 x ave. 1000 characters (500 - 1500) sequences (all randomly generated), at the approximate rate of 3.5 seconds per offset, whilst recording the lineno/offset/fuzziness for all matches where the fuzziness is 10 or less. The advantage may be that this process will give you an accurate fuzziness factor for every 25-er matched against every position in a single pass.

I don't have any feel for how this compares to using the regex engine to perform the matching, but I think it may run quicker.

If you could indicate how many 100s of 1000s of 25's you are playing with, plus the average length of the 30,000 sequences, it would be possible to calculate an approximate time for the process that you could then compare with other approaches?

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

In reply to Re: Fuzzy Searching: Optimizing Algorithm Selection by BrowserUk
in thread Fuzzy Searching: Optimizing Algorithm Selection by Itatsumaki

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Perl Monk, Perl Meditation
	PerlMonks