Re^2: FInding the longest match from an initial match between two files

by Cristoforo (Curate)
on Nov 09, 2016 at 17:46 UTC


in reply to Re: FInding the longest match from an initial match between two files
in thread FInding the longest match from an initial match between two files

Seq1: TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT
Seq2: ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA

Seq1: TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT
Seq2: ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA

The above are line 1 and line 2 from your sequence sample. The first pair shows, in red and blue, the 2 matches found by the regex.

In the second, identical pair you can see (in red) a match that is 1 character longer than the longest match in the first pair (also in red).

My question is why the regex made 2 captures here instead of finding the optimal match shown in the second pair (10 chars instead of 9).

The code that accidentally found this was:

my $xor = $file1contents ^ $file2contents;
my $max = 0;
my $max_str;
my $pos;
while ($xor =~ /(\0+)/g) {
    my $len = length $1;
    if ($len > $max) {
        $max = $len;
        $max_str = substr $file1contents, $-[0], $len;
        $pos = $-[0];
    }
    #print "matched $-[0] ", substr $file1contents, $-[0], $+[0] - $-[0];
}
print "at pos $pos max string is $max_str";

Re^3: FInding the longest match from an initial match between two files
by tybalt89 (Monsignor) on Nov 09, 2016 at 18:51 UTC

    It's a question of whether overlapping matches are wanted or not. The code I posted in Re: FInding the longest match from an initial match between two files deliberately did not look for overlapping matches.
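    To make the difference concrete, here is a minimal sketch on a toy string (the string and variable names are made up for illustration, not taken from the earlier code). A consuming pattern under /g resumes after the end of each match, so a second occurrence that overlaps the first is skipped; wrapping the pattern in a lookahead makes each match zero-length, so the regex engine retries at every position.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy input (hypothetical): 'ACTACT' occurs at offsets 0 and 3,
# and the two occurrences overlap by three characters.
my $text = 'ACTACTACT';

my ( @consuming, @overlapping );

# Consuming match: after matching at offset 0 the search resumes at
# offset 6, so the overlapping occurrence at offset 3 is never seen.
push @consuming, $-[0] while $text =~ /ACTACT/g;

# Lookahead match: the match itself is zero-length, so the search
# advances one character at a time and sees both occurrences.
push @overlapping, $-[0] while $text =~ /(?=ACTACT)/g;

print "consuming:   @consuming\n";      # prints "consuming:   0"
print "overlapping: @overlapping\n";    # prints "overlapping: 0 3"
```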

    If overlapping matches are wanted, the regex could be changed to the following:

    #!/usr/bin/perl -l
    use strict;
    use warnings;

    my $k = 5;
    my $file1contents = 'TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT';
    my $file2contents = 'ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA';

    $_ = "$file1contents\n$file2contents";
    print "at position $-[0] is match $1" while /(?= (.{$k,}) .* \n .* \1 )/gx;

    And the output from this change is:

    at position 8 is match AAAAC
    at position 27 is match ACTACTACT
    at position 28 is match CTACTACT
    at position 29 is match TACTACTACT
    at position 30 is match ACTACTACT
    at position 31 is match CTACTACT
    at position 32 is match TACTACTACT
    at position 33 is match ACTACTACT
    at position 34 is match CTACTACT
    at position 35 is match TACTACT
    at position 36 is match ACTACT
    at position 37 is match CTACT
    at position 39 is match ACTTCAA
    at position 40 is match CTTCAA
    at position 41 is match TTCAA
    at position 44 is match AAAAC

    which shows the longer match you found (in fact, two of them, partially overlapping).
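    If only the longest overlapping match is of interest, one possible sketch is to keep a running maximum while iterating over the lookahead matches. The names $joined, $best_len and @best are assumptions for this illustration and do not appear in the code posted earlier; ties are all kept rather than just the first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $k = 5;    # minimum match length, as in the code above
my $file1contents = 'TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT';
my $file2contents = 'ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA';

# Join the two sequences with a newline so a backreference can look
# for a piece of the first line anywhere in the second line.
my $joined = "$file1contents\n$file2contents";

my $best_len = 0;
my @best;     # [ offset in first sequence, matched text ] for each longest match

while ( $joined =~ /(?= (.{$k,}) .* \n .* \1 )/gx ) {
    my $len = length $1;
    if ( $len > $best_len ) {
        # Strictly longer: discard earlier candidates.
        $best_len = $len;
        @best     = ( [ $-[0], $1 ] );
    }
    elsif ( $len == $best_len ) {
        # Same length: keep it as a tie.
        push @best, [ $-[0], $1 ];
    }
}

print "at position $_->[0] is longest match $_->[1]\n" for @best;
```

    Going by the output listed above, this should report TACTACTACT at positions 29 and 32.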

    It all depends on what the output is going to be used for, I suppose. One of the reasons I posted the code was to prompt discussion about the problem.
