rogerd has asked for the wisdom of the Perl Monks concerning the following question:
Hi, I have a problem formatting some output. I have a reference DNA sequence, that is a string of letters, for example:
atgtagctagctagctaacgagcgctagctagctagtgatgactgat
Then, I have several substrings that match against that reference sequence, and what I want is print the aligned sequences, for example:
ref atgtagctagctagctaacgagcgctagctagctagtgatg
substr agctagctagctaac
I have already stored the position where the substring starts matching, and the length of the substring, so, if it where just one substring, I could do something like this:
$reference_string = 'gctagctgatgctagcagcagcatgtagctagctgacga'
$substring = 'aatgctagctagc'
$output_line = qw{ } x length($reference_string);
substr $output_line, $start_position, $length, $substring;
print $reference_string, "\n", $output_line;
The problem is that I have many substrings, that sometimes overlap between eachother. The resulting output should look something like this:
ref agctagctagctagcatgctagctagctgatcgatgctagctagctgactgacgacg
out1 atctagcat agctagcgatcga gactgacagc
out2 tagctagctgctagc
out3 agtcgatcgatgctagc
So I thougth the following rules (something like pseudocode):
create one blank line of output
foreach substring
take the first blank line
if there is no overlap
substr the blank line
if there is overlap
create a new blank line
substr the new blank line
But then, I can find an overlap in that second blank line too, so I would have to repeat the process, checking if there is an overlap in the second blank line, etc. I thought of writing a recursive subroutine, checking each time if there is an overlap in the first blank line, and continuing deeper until it finds there is no overlap, or it creates a new one.
Do you think is it a good strategy? Can I make it in a more clear way? I found this way somewhat cumbersome, and I couldn't manage to solve it already. Thank you very much in advance for your help
Roger
Re: formatting output question (use of recursive subroutine?)
by Limbic~Region (Chancellor) on Jul 29, 2008 at 14:36 UTC
|
rogerd,
If there is no need to put more than one match on a line, the solution would look more like
my $dna = 'gctagctgatgctagcagcagcatgtagctagctgacga';
for my $strand (@fragment) {
my $pos = index($dna, $strand);
next if $pos == -1;
my $pad = ' ' x $pos;
print $pad, $strand, "\n";
}
If you really need to put more than one match per line and don't care about efficiency (using the fewest number of total lines), you can use a bit mask to determine if "this" strand needs to go on to the next line. If you need help with that, speak up.
| [reply] [d/l] |
|
I need to to put more than one match per line, using the fewest number of total lines. The preference should be to put the match in the first line if there is no overlap, then the second, etc.
I will read something about bit mask, I don't know what is it.
| [reply] |
|
rogerd,
If you need to put more than one match on a line using the fewest number of total lines, you have a variation on Bin Packing Problem which is NP hard. I can write you a solution, but I don't have time at the moment (you can search this site and CPAN for examples that you could adapt).
Update: Following paragraph added
The bit mask isn't going to help by itself. It was just an idea for determining overlap. Because you require the matches on as fewest lines as possible, you have to try all possible combinations. This is why the problem is NP hard. I am assuming that for a given DNA sequence, the number of strands to place is quite large. If this assumption is correct, your perfect solution is going to be slow so you might want to look at a heuristic approach instead.
| [reply] |
|
|
Re: formatting output question (use of recursive subroutine?)
by JavaFan (Canon) on Jul 29, 2008 at 14:18 UTC
|
Do you think is it a good strategy?
Does my opinion matter? You are using a greedy approach for an NP complete problem (knapsack). So you don't get an optimal result (where optimal is, I presume, the use of a minimal number of lines), but you get it pretty quickly. Whether that's good enough, I don't know. Only you, and the enduser (or client), can decide that.
Now, if you have thousands of fragments, with many fragments on each line, you might want to use a segment tree for each of the lines you create, to be able to determine faster whether a new fragment fits a line.
Alternatively, you chould sort the fragments on starting position. Then, do something like:
0. Create a new empty line, and make it the current one.
1. Find the first fragment in the list that fits on the line.
2. If there is no such fragment, goto 0.
3. Add the fragment to the current line, and remove it from the list.
4. If the list is empty, you're done.
5. Goto 1.
This is also greedy; it won't find the optimal packing, but it's reasonbly fast.
| [reply] [d/l] |
|
%details_of_match = (
$gene_model =>( ($position1, $orig_pos1, $sequence1)
, ($position2, $orig_pos2, $sequence2)
, ($position3, $orig_pos3, $sequence3)
);
)
| [reply] [d/l] |
Re: formatting output question (use of recursive subroutine?)
by Pancho (Pilgrim) on Jul 29, 2008 at 14:07 UTC
|
Roger,
I think an iteration algorithm would be more straight forward (and this is subjective). I would break the task in two parts, first iterating through all of your strings to match and having an out1, out2, etc for each string to match. I would use a subroutine to perform this task of creating the out-n lines. Then I would code another paragraph to collapse all of the out lines together in the least number possible output lines, once using iteration and an array to keep track of the output line 'depth' (i.e. output_line_1, output_line_2, etc)... there are dozens of ways to approach this though... Good luck,
| [reply] |
|
my @output_lines;
my $model_number=0;
foreach my $gene_name (keys %details_of_match) {
$output_lines[$model_number]= qw{ } x length($original_seq_string)
+;
for (my $i=0; $i< scalar( @{$details_of_match{$gene_name}} ); $i++
+ ) {
substr $output_lines[$model_number] # the new li
+ne
, $details_of_match{$gene_name}->[$i][0] # the
+position
, 21 # the length (constant)
, $details_of_match{$gene_name}->[$i][2] #the patt
+ern
;
}
$model_number++;
}
At this point, I have all the lines, each one with just one substring in the right position, stored in the array @output_lines. But then, I get stuck with the second part, because I don't imagine how to collapse all the lines into the least number of possible lines. | [reply] [d/l] |
Re: formatting output question (use of recursive subroutine?)
by BrowserUk (Patriarch) on Jul 30, 2008 at 00:54 UTC
|
It won't produce 'optimal' results, but for display purposes it will probably be 'good enough'. It also has the virtues of being simple and determanistic (fast).
Basically, sort the segment/length pairs ascending by segment, descending by length. Scan the sorted array backward putting the next value into the current group (line) if it doesn't overlap the previous entry. When you can fit no more, start a new group (line). Repeat until no more bits:
#! perl -slw
use strict;
use Data::Dump qw[ pp ];
## Generate some data
my $string = '1234567890' x 7;
my @bits = map[ int rand length $string, 2 + int rand 10 ], 1 .. 20;
## Sort ascending offset/descending length
@bits = sort{
$a->[ 0 ] <=> $b->[ 0 ]
||
$b->[ 1 ] <=> $a->[ 1 ]
} @bits;
## build groups
my @groups;
## Till done
while( @bits ) {
## start a new group with the last bit
push @groups, [ pop @bits ];
## Scan the rest of the bits (backward so we can splice without ba
+d kharma)
for my $bit ( reverse 0 .. $#bits ) {
## if it'll fit in this line
if( $bits[ $bit ][ 0 ]+ $bits[ $bit ][ 1 ] < $groups[-1][-1][0
+] ) {
## add it
push @{ $groups[-1] }, splice @bits, $bit, 1;
}
}
}
## display the results
print $string;
for my $group ( @groups ){
my $line = ' ' x 70;
for my $bit ( @{ $group } ) {
substr $line, $bit->[ 0 ], $bit->[ 1 ],
substr $string, $bit->[ 0 ], $bit->[ 1 ];
}
print $line;
}
__END__
c:\test>700817.pl
1234567890123456789012345678901234567890123456789012345678901234567890
23 89 4567890 567890123 9012 567890123 012 0
234567890 56789012345 901234567 45678
123456789 45678 3456789012 890123
90123456 3456789
234
90123456
c:\test>700817.pl
1234567890123456789012345678901234567890123456789012345678901234567890
2345 890123 6789012 5678 234567 012345 2345 90
5678901234 789012345 12345 67890123 01234 90
45678901 6789 234567890 34567
0123456789 9012345
c:\test>700817.pl
1234567890123456789012345678901234567890123456789012345678901234567890
1234567 34567 234567890 9012 0
345678901 5678901 6789012 89012 567890
23456789012 567890 234567890
4567890123 789012345 0123456789
5678901234
45678901234
89012345
56789012345
c:\test>700817.pl
1234567890123456789012345678901234567890123456789012345678901234567890
123456 90 56789012 56789 567 234567 1234 234567890
67 123456789 23456 345678901 01234
678901234 1234 890
890123456 89012
678901234 45678901
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
Thank you very much... I think that with some modifications to extract the offset and the length of the subsequences from my nested structure, and being able to sort them this will work very well. I am studying the code, it is not very easy for me yet to understand code written by others.
Thank you again, Roger
| [reply] |
|
it is not very easy for me yet to understand code written by others.
We all have that trouble :) Maybe this annotation will help. I'm always wary of doing this
because I tend to comment on the things I think are significant, rather than on those
that the reader will consider so. But Maybe it will help you.
Note: You could make @bits hold [ offset, $subsequence ]. You would then have to use
length( $bit[ n ][ 1 ] ) everywhere I've used $bit[ n ][ 1 ].
#! perl -slw
use strict;
## Generate some data
my $string = '1234567890' x 7;
## This just generates some random subsequences as offset/length pairs
## Array of Arrays (AoA): $bits[ n ] = [ offset, length]
my @bits = map [ int rand length $string, 2 + int rand 10 ], 1 .. 20;
## Sort ascending offset/descending length
@bits = sort{
$a->[ 0 ] <=> $b->[ 0 ] ## sort on the offsets of $a & $b
|| ## And if they are equal
$b->[ 1 ] <=> $a->[ 1 ] ## on the lengths (reversed}
} @bits;
## Result ex: @bits =( [2,5],[2,3],[3,1][4,4],[4,3] etc.
## @groups will become an AoAoA.
## Individual offset/length pairs are moved from @bits
## into @groups[n][m] Where
## n := line of output
## m := non-overlapping pair in line
## $groups[ n ] = [ [ offset1, length1], ...], [[,],[,]],...]
my @groups;
## As we're moving pairs from @bits to @groups[n]
## When @bits is empty, we know we're done.
while( @bits ) {
## start a new group with the last (rightmost shortest) item in @b
+its
## First time around the while loop, first line of output; 2nd tim
+e 2nd line
push @groups, [ pop @bits ];
## look at each of the remaining pairs in @bits
## scanning backwards because we are removing elements from @bits
## and if we went forward, removing element i would screw up the i
+ndexing
## for elements i+1 .. i+n.
for my $bit ( reverse 0 .. $#bits ) {
## if it'll fit in this line
## compare the last position (offset+length) for the current b
+it
## against the first position (offset) of the last element of
+the
## last group (line)
if( $bits[ $bit ][ 0 ]+ $bits[ $bit ][ 1 ] < $groups[-1][-1][0
+] ) {
## If it's is less (moving backward!), add it.
push @{ $groups[-1] }, splice @bits, $bit, 1;
}
}
## When the for loop ends, we've scanned all the bits and moved
## any that will fit in the current line without overlap.
## So now, if there are any left (while(@bits)) loop back to
## push another anon array onto @group add ing the now last elemen
+t
## of @bits as a starting point. Repeat the for loop.
}
## display the results
print $string;
for my $group ( @groups ){
my $line = ' ' x 70;
for my $bit ( @{ $group } ) {
substr $line, $bit->[ 0 ], $bit->[ 1 ],
substr $string, $bit->[ 0 ], $bit->[ 1 ];
}
print $line;
}
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
Re: formatting output question (use of recursive subroutine?)
by Perlbotics (Archbishop) on Jul 29, 2008 at 16:38 UTC
|
Hi, this comes close to your suggestion, but it tries to find a match in the lines created before.
Don't know if it fits your requirements (see comments). And it can surely be optimised...
| [reply] [d/l] |
Re: formatting output question (use of recursive subroutine?)
by FunkyMonk (Chancellor) on Aug 02, 2008 at 20:27 UTC
|
my $ref = 'agctagctagctagcatgctagctagctgatcgatgctagctagctgactgacgacg';
my @subseq_to_find = qw{
gctagctag
tagcatgctagctag
gctagctgac
tag
};
my @output;
for my $subseq ( @subseq_to_find ) {
my $p = -1;
MATCH: while (1) {
$p = index $ref, $subseq, $p + 1;
last if $p < 0;
my $len = length $subseq;
for ( @output ) {
my $substr = \ substr $_, $p, $len;
$$substr = $subseq and next MATCH
unless $$substr =~ /\S/;
}
push @output, " " x length $ref;
substr $output[-1], $p, $len, $subseq;
}
}
print "$_\n" for $ref, @output;
Output:
agctagctagctagcatgctagctagctgatcgatgctagctagctgactgacgacg
gctagctag tag gctagctag gctagctag
gctagctag tag tag gctagctgac
tag tag tagcatgctagctag tag tag
Update: next SUBSEQ should be next MATCH. Added a smaller pattern to test the fix. I guess I'm still bored.
| [reply] [d/l] [select] |
|
Thank you FunkyMonk! Very helpful boreness! At this time I am working with the code that BrowserUK suggested me.
I had to change some algorithms of my previous code to use that algorithm, so I am working on that.
But I will take a look at yours also, it seems interesting. You use at least one function that I didn't know: index.
| [reply] |
Re: formatting output question (use of recursive subroutine?)
by FunkyMonk (Chancellor) on Jul 29, 2008 at 15:15 UTC
|
ref agctagctagctagcatgctagctagctgatcgatgctagctagctgactgacgacg
out1 atctagcat agctagcgatcga gactgacagc
out2 tagctagctgctagc
out3 agtcgatcgatgctagc
How do you define a match? I wouldn't say any of those strings matched anywhere.
| [reply] [d/l] |
|
I think in this context (DNA sequence matching) allowances are made for a small number of insertions, deletions and changes.
| [reply] |
|
Yes, that example doesn't match at all. I just write some atcg letters to put as an example. The matching was done previously, and now I am interested in formatting the results I had previusly.
The important thing here is that each substring should vertically match with the reference string, in the position that I have previously stored. Each of the substrings is one result: in this example, the line out1 have 3 different substrings, and out2 and out3 have one each (they are in different lines because they are overlapped).
| [reply] |
|
|