note
roboticus
<p>[shamshersingh]:</p>
<p>Here's what I tried:</p>
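<c>#!/usr/bin/perl
use strict;
use warnings;

# Map each masked key (one position replaced by '*') to the strings producing it
my %H;
open my $FH, '<', 'DNA_strings.dat' or die $!;
while (<$FH>) {
    s/\s+$//;    # strip trailing whitespace, including the newline
    for my $i (0 .. length($_)-1) {
        my $k = $_;
        substr($k,$i,1) = '*';
        push @{$H{$k}}, $_;
    }
}
close $FH;

# Strings that share a key differ in at most the masked position
for my $k (sort keys %H) {
    if (@{$H{$k}} > 1) {    # report keys shared by two or more strings
        print "$k: ", join(", ", @{$H{$k}}), "\n";
    }
}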
</c>
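<p>The idea behind the script can be sketched on an in-memory list: replace each position in turn with '*', and any two strings that land under the same masked key can differ only at that position. The helper names <c>masked_keys</c> and <c>near_matches</c> below are hypothetical, chosen for illustration, and are not part of the script above.</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every variant of $s with exactly one
# position replaced by '*'.
sub masked_keys {
    my ($s) = @_;
    return map { my $k = $s; substr($k, $_, 1) = '*'; $k } 0 .. length($s) - 1;
}

# Hypothetical helper: group strings by masked key and keep only the
# keys shared by two or more strings.
sub near_matches {
    my (@strings) = @_;
    my %bucket;
    for my $s (@strings) {
        push @{ $bucket{$_} }, $s for masked_keys($s);
    }
    for my $k (keys %bucket) {
        delete $bucket{$k} if @{ $bucket{$k} } < 2;
    }
    return \%bucket;
}

# A few strings taken from the sample file
my $found = near_matches(qw(CTGAG CTGAA GCCAT GGCAT GTTTG));
print "$_: @{ $found->{$_} }\n" for sort keys %$found;
# prints:
# CTGA*: CTGAG CTGAA
# G*CAT: GCCAT GGCAT
```

<p>Each string of length L contributes L hash keys, so the work grows linearly with the number of strings rather than with the number of pairs, at the cost of holding L keys per string in memory.</p>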
<p>A quick experiment with a short file:</p>
<c>
$ cat DNA_strings.dat
CTGAG
CGAGT
ACGCT
TATAC
CTGAA
GGAGC
ATACA
AAAAA
ACAAA
AGAAA
AATAA
AAAGA
ACCAA
AGCAC
CCACG
GCCAT
AGCAA
GGCAT
GTTTG
$ perl DNA_cmp.pl
A*AAA: AAAAA, ACAAA, AGAAA
A*CAA: ACCAA, AGCAA
AA*AA: AAAAA, AATAA
AAA*A: AAAAA, AAAGA
AC*AA: ACAAA, ACCAA
AG*AA: AGAAA, AGCAA
AGCA*: AGCAC, AGCAA
CTGA*: CTGAG, CTGAA
G*CAT: GCCAT, GGCAT
</c>
<p>It seems pretty fast, too: it took less than a minute to scan 100,000 20-character strings, though it didn't find any matches in that random data. (gen_random_strings.pl is on my [pad://].)</p>
<c>
$ perl gen_random_strings.pl 100000 20 20 ACGT >DNA_strings.dat
$ time perl DNA_cmp.pl
real 0m47.659s
user 0m46.395s
sys 0m1.252s
</c>
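<p>A back-of-the-envelope check suggests that an empty result is about what random data should give. The sketch below assumes the generator produces independent, uniformly random strings of exactly 20 characters over ACGT (an assumption about gen_random_strings.pl, not something confirmed above); <c>expected_near_pairs</c> is a hypothetical helper.</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Expected number of pairs that differ in exactly one position, assuming
# $n independent, uniformly random strings of fixed length $len over an
# alphabet of $alpha symbols (an assumption about the generator's output).
sub expected_near_pairs {
    my ($n, $len, $alpha) = @_;
    my $pairs  = $n * ($n - 1) / 2;                     # number of string pairs
    my $p_pair = $len * ($alpha - 1) / $alpha**$len;    # P(exactly one mismatch)
    return $pairs * $p_pair;
}

printf "%.2f\n", expected_near_pairs(100_000, 20, 4);   # prints 0.27
```

<p>Under those assumptions the expected count is well below one pair, so a run that reports nothing is not surprising.</p>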
<p><b>Update:</b> I tried to make this node a reply to the OP, but for some reason, it kept failing on me, and when I'd refresh, it would show my node replacing [BrowserUk]'s node. It was odd looking...</p>
<p>...[roboticus]</p>
<p><i>When your only tool is a hammer, all problems look like your thumb.</i></p>
937241
937243