Re^2: FInding the longest match from an initial match between two files

Seq1: TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT
Seq2: ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA

The above are line 1 and line 2 from your sequence sample. The first copy shows, in red and blue, the two matches found by the regex.

In the second, identical copy you can see (in red) a match that is one character longer than the longest match in the first copy (also in red).

My question is why the regex made two captures here instead of the optimal match shown in the second copy (10 characters instead of 9).

The code which accidentally found this was:

my $xor = $file1contents ^ $file2contents;

my $max = 0;
my $max_str;
my $pos;
while ($xor =~ /(\0+)/g) {
    my $len = length $1;
    if ($len > $max) {
        $max = $len;
        $max_str = substr $file1contents, $-[0], $len;
        $pos = $-[0];
    }
    #print "matched $-[0] ", substr $file1contents, $-[0], $+[0] - $-[0];
}

print "at pos $pos max string is $max_str";
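
One thing worth noting about the XOR approach: it only compares bytes that sit at the same offset in both strings, so it finds a common run only when the run happens to line up. The following is a minimal sketch of that behaviour, not code from the thread; the helper name `longest_aligned_run` and the short test strings are made up for illustration, and equal-length inputs are assumed.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): longest run of positions at
# which two equal-length strings hold the same byte.
sub longest_aligned_run {
    my ($first, $second) = @_;
    my $xor = $first ^ $second;    # "\0" wherever the two bytes are equal
    my ($max, $pos, $str) = (0, undef, '');
    while ($xor =~ /(\0+)/g) {
        my $len = length $1;
        next unless $len > $max;
        ($max, $pos, $str) = ($len, $-[0], substr($first, $-[0], $len));
    }
    return ($pos, $str);
}

# Same substring at the same offset: found.
my ($pos, $str) = longest_aligned_run('xxACTACT', 'yyACTACT');
print "aligned: at pos $pos max string is $str\n";

# Same substring shifted by two: no equal bytes line up.
($pos, $str) = longest_aligned_run('xxACTACT', 'ACTACTyy');
print defined $pos
    ? "shifted: at pos $pos max string is $str\n"
    : "shifted: no aligned run found\n";
```

In the two sample sequences the 10-character run happens to start at offset 29 in both lines, which would explain why the XOR loop picks it up.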


Re^3: FInding the longest match from an initial match between two files
by tybalt89 (Monsignor) on Nov 09, 2016 at 18:51 UTC

It's a question of whether overlapping matches are wanted or not. The code I posted in Re: FInding the longest match from an initial match between two files deliberately did not look for overlapping matches.
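
To make the distinction concrete, here is a small sketch (the helper names and the test string are illustrative assumptions, not taken from the earlier post). A plain `/g` match consumes the text it matched, so the next attempt starts after it; wrapping the pattern in a lookahead consumes nothing, so every starting position gets tried.

```perl
use strict;
use warnings;

# Hypothetical helpers (not from the thread). Each returns the start
# positions at which the literal $needle is found in $text.
sub consuming_positions {
    my ($text, $needle) = @_;
    my @pos;
    push @pos, $-[0] while $text =~ /\Q$needle\E/g;      # match is consumed
    return @pos;
}

sub overlapping_positions {
    my ($text, $needle) = @_;
    my @pos;
    push @pos, $-[0] while $text =~ /(?=\Q$needle\E)/g;  # zero-width match
    return @pos;
}

my $text = 'ACTACTACT';
print "consuming:   @{[ consuming_positions($text, 'ACTACT') ]}\n";
print "overlapping: @{[ overlapping_positions($text, 'ACTACT') ]}\n";
```

With this input the consuming version reports only position 0, while the lookahead version also reports the overlapping occurrence at position 3.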

If overlapping matches are wanted, the regex could be changed to the following:

#!/usr/bin/perl -l

use strict;
use warnings;

my $k = 5;

my $file1contents = 'TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT';
my $file2contents = 'ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA';

$_ = "$file1contents\n$file2contents";

print "at position $-[0] is match $1" while /(?= (.{$k,}) .* \n .* \1 )/gx;

And the output from this change is:

at position 8 is match AAAAC
at position 27 is match ACTACTACT
at position 28 is match CTACTACT
at position 29 is match TACTACTACT
at position 30 is match ACTACTACT
at position 31 is match CTACTACT
at position 32 is match TACTACTACT
at position 33 is match ACTACTACT
at position 34 is match CTACTACT
at position 35 is match TACTACT
at position 36 is match ACTACT
at position 37 is match CTACT
at position 39 is match ACTTCAA
at position 40 is match CTTCAA
at position 41 is match TTCAA
at position 44 is match AAAAC

which shows the longer match you found (in fact, two of them, partially overlapping).
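
If only the longest match is wanted, the overlapping matches can be filtered while they are collected. This is a rough sketch under a few assumptions: the helper name `longest_common` is made up, the sequences are assumed to contain no newlines, and ties are all kept rather than choosing one.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns [position, substring]
# pairs for the longest substring(s) of $first, at least $k characters
# long, that also occur somewhere in $second.
sub longest_common {
    my ($first, $second, $k) = @_;
    my $joined = "$first\n$second";
    my $max = 0;
    my @best;
    # Same lookahead regex as above: the greedy capture yields the longest
    # match that starts at each position of the first line.
    while ($joined =~ /(?= (.{$k,}) .* \n .* \1 )/gx) {
        my $len = length $1;
        if ($len > $max) {
            $max  = $len;
            @best = ([ $-[0], $1 ]);
        }
        elsif ($len == $max) {
            push @best, [ $-[0], $1 ];
        }
    }
    return @best;
}

my @best = longest_common(
    'TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT',
    'ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA',
    5,
);
print "at position $_->[0] is match $_->[1]\n" for @best;
```

Going by the output listed above, this should report the two 10-character matches at positions 29 and 32.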

It all depends on what the output is going to be used for, I suppose. One of the reasons I posted the code was to prompt discussion about the problem.
