If you XOR two strings together,
print join'',unpack 'C*',
'AGGCGGAAGCACCCAACAGCAACAG'
^ 'ACACGCAAAAACCGAAGAGAAAGCG';
0460040062000400400200420
The resultant string will be the null character (chr( 0 )) in each position where the two strings match and some other char where they did not.
If you count the non-null bytes, it gives you a score for the fuzziness of the match.
print $_ = grep $_,unpack 'C*',
'AGGCGGAAGCACCCAACAGCAACAG'
^ 'ACACGCAAAAACCGAAGAGAAAGCG';
10
To avoid having to process the large file of sequences multiple times, it is better to process all of the 25-ers against each sequence in turn. By using substr to step down the sequence, you can perform the XOR against each of the 25-ers for each position of the sequence.
Putting that all together, I got:
#! perl -slw
use strict;
use bytes;
$| = 1;
our $FUZZY ||= 10;
open FUZ, '< :raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
my %fuz;
$fuz{ $_ } = '' while chomp( $_ = <FUZ> );
close FUZ;
print "Loaded ${ \scalar keys %fuz } 25-ers";
open SEQ, '< :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!";
while( my $seq = <SEQ> ) {
chomp $seq;
for my $offset ( 0 .. length( $seq ) - 25 ) {
printf "\rProcessing sequence %5d offset %05d", $., $offset;
for my $fuz ( keys %fuz ) {
my $m = grep $_, unpack 'C*', $fuz ^ substr( $seq, $offse
+t, 25 );
if( $m <= $FUZZY ) {
## This stores the lineno/offset/fuzziness where each
+25-er
## matched in a compact form for further process; sor
+ting etc.
$fuz{ $fuz } .= pack 'nnn', $., $offset, $m;
## Or just print out the data to a file.
# print "\nMatched '$fuz' -v- '",
# substr( $seq, $offset, 25 ),
# "' in line: $. @ $offset with fuzziness: ", $m;
}
}
}
}
close SEQ;
This process is currently running 500,000 25-ers against a file 30,000 x ave. 1000 characters (500 - 1500) sequences (all randomly generated), at the approximate rate of 3.5 seconds per offset, whilst recording the lineno/offset/fuzziness for all matches where the fuzziness is 10 or less. The advantage may be that this process will give you an accurate fuzziness factor for every 25-er matched against every position in a single pass.
I don't have any feel for how this compares to using the regex engine to perform the matching, but I think it may run quicker.
If you could indicate how many 100s of 1000s of 25's you are playing with, plus the average length of the 30,000 sequences, it would be possible to calculate an approximate time for the process that you could then compare with other approaches?
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.