Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

formatting output question (use of recursive subroutine?)

by rogerd (Sexton)
on Jul 29, 2008 at 13:47 UTC ( [id://700817]=perlquestion: print w/replies, xml ) Need Help??

rogerd has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a problem formatting some output. I have a reference DNA sequence, that is a string of letters, for example:

atgtagctagctagctaacgagcgctagctagctagtgatgactgat

Then, I have several substrings that match against that reference sequence, and what I want is print the aligned sequences, for example:

ref atgtagctagctagctaacgagcgctagctagctagtgatg substr agctagctagctaac

I have already stored the position where the substring starts matching, and the length of the substring, so, if it where just one substring, I could do something like this:

$reference_string = 'gctagctgatgctagcagcagcatgtagctagctgacga' $substring = 'aatgctagctagc' $output_line = qw{ } x length($reference_string); substr $output_line, $start_position, $length, $substring; print $reference_string, "\n", $output_line;

The problem is that I have many substrings, that sometimes overlap between eachother. The resulting output should look something like this:

ref agctagctagctagcatgctagctagctgatcgatgctagctagctgactgacgacg out1 atctagcat agctagcgatcga gactgacagc out2 tagctagctgctagc out3 agtcgatcgatgctagc

So I thougth the following rules (something like pseudocode):

create one blank line of output foreach substring take the first blank line if there is no overlap substr the blank line if there is overlap create a new blank line substr the new blank line

But then, I can find an overlap in that second blank line too, so I would have to repeat the process, checking if there is an overlap in the second blank line, etc. I thought of writing a recursive subroutine, checking each time if there is an overlap in the first blank line, and continuing deeper until it finds there is no overlap, or it creates a new one.

Do you think is it a good strategy? Can I make it in a more clear way? I found this way somewhat cumbersome, and I couldn't manage to solve it already. Thank you very much in advance for your help

Roger

Replies are listed 'Best First'.
Re: formatting output question (use of recursive subroutine?)
by Limbic~Region (Chancellor) on Jul 29, 2008 at 14:36 UTC
    rogerd,
    If there is no need to put more than one match on a line, the solution would look more like
    my $dna = 'gctagctgatgctagcagcagcatgtagctagctgacga'; for my $strand (@fragment) { my $pos = index($dna, $strand); next if $pos == -1; my $pad = ' ' x $pos; print $pad, $strand, "\n"; }
    If you really need to put more than one match per line and don't care about efficiency (using the fewest number of total lines), you can use a bit mask to determine if "this" strand needs to go on to the next line. If you need help with that, speak up.

    Cheers - L~R

      I need to to put more than one match per line, using the fewest number of total lines. The preference should be to put the match in the first line if there is no overlap, then the second, etc.

      I will read something about bit mask, I don't know what is it.

        rogerd,
        If you need to put more than one match on a line using the fewest number of total lines, you have a variation on Bin Packing Problem which is NP hard. I can write you a solution, but I don't have time at the moment (you can search this site and CPAN for examples that you could adapt).

        Update: Following paragraph added
        The bit mask isn't going to help by itself. It was just an idea for determining overlap. Because you require the matches on as fewest lines as possible, you have to try all possible combinations. This is why the problem is NP hard. I am assuming that for a given DNA sequence, the number of strands to place is quite large. If this assumption is correct, your perfect solution is going to be slow so you might want to look at a heuristic approach instead.

        Cheers - L~R

Re: formatting output question (use of recursive subroutine?)
by JavaFan (Canon) on Jul 29, 2008 at 14:18 UTC
    Do you think is it a good strategy?

    Does my opinion matter? You are using a greedy approach for an NP complete problem (knapsack). So you don't get an optimal result (where optimal is, I presume, the use of a minimal number of lines), but you get it pretty quickly. Whether that's good enough, I don't know. Only you, and the enduser (or client), can decide that.

    Now, if you have thousands of fragments, with many fragments on each line, you might want to use a segment tree for each of the lines you create, to be able to determine faster whether a new fragment fits a line.

    Alternatively, you chould sort the fragments on starting position. Then, do something like:

    0. Create a new empty line, and make it the current one. 1. Find the first fragment in the list that fits on the line. 2. If there is no such fragment, goto 0. 3. Add the fragment to the current line, and remove it from the list. 4. If the list is empty, you're done. 5. Goto 1.
    This is also greedy; it won't find the optimal packing, but it's reasonbly fast.

      I liked this algorithm. My problem now is sorting the fragments based on the start position because that info is stored somewhat deep in a hash;

      %details_of_match = ( $gene_model =>( ($position1, $orig_pos1, $sequence1) , ($position2, $orig_pos2, $sequence2) , ($position3, $orig_pos3, $sequence3) ); )
Re: formatting output question (use of recursive subroutine?)
by Pancho (Pilgrim) on Jul 29, 2008 at 14:07 UTC

    Roger,

    I think an iteration algorithm would be more straight forward (and this is subjective). I would break the task in two parts, first iterating through all of your strings to match and having an out1, out2, etc for each string to match. I would use a subroutine to perform this task of creating the out-n lines.

    Then I would code another paragraph to collapse all of the out lines together in the least number possible output lines, once using iteration and an array to keep track of the output line 'depth' (i.e. output_line_1, output_line_2, etc)... there are dozens of ways to approach this though...

    Good luck,

    Pancho

      I have thought about this approach. I have already written the first part (the data is stored in the hash %details of match):

      my @output_lines; my $model_number=0; foreach my $gene_name (keys %details_of_match) { $output_lines[$model_number]= qw{ } x length($original_seq_string) +; for (my $i=0; $i< scalar( @{$details_of_match{$gene_name}} ); $i++ + ) { substr $output_lines[$model_number] # the new li +ne , $details_of_match{$gene_name}->[$i][0] # the +position , 21 # the length (constant) , $details_of_match{$gene_name}->[$i][2] #the patt +ern ; } $model_number++; }

      At this point, I have all the lines, each one with just one substring in the right position, stored in the array @output_lines. But then, I get stuck with the second part, because I don't imagine how to collapse all the lines into the least number of possible lines.

Re: formatting output question (use of recursive subroutine?)
by BrowserUk (Patriarch) on Jul 30, 2008 at 00:54 UTC

    It won't produce 'optimal' results, but for display purposes it will probably be 'good enough'. It also has the virtues of being simple and determanistic (fast).

    Basically, sort the segment/length pairs ascending by segment, descending by length. Scan the sorted array backward putting the next value into the current group (line) if it doesn't overlap the previous entry. When you can fit no more, start a new group (line). Repeat until no more bits:

    #! perl -slw use strict; use Data::Dump qw[ pp ]; ## Generate some data my $string = '1234567890' x 7; my @bits = map[ int rand length $string, 2 + int rand 10 ], 1 .. 20; ## Sort ascending offset/descending length @bits = sort{ $a->[ 0 ] <=> $b->[ 0 ] || $b->[ 1 ] <=> $a->[ 1 ] } @bits; ## build groups my @groups; ## Till done while( @bits ) { ## start a new group with the last bit push @groups, [ pop @bits ]; ## Scan the rest of the bits (backward so we can splice without ba +d kharma) for my $bit ( reverse 0 .. $#bits ) { ## if it'll fit in this line if( $bits[ $bit ][ 0 ]+ $bits[ $bit ][ 1 ] < $groups[-1][-1][0 +] ) { ## add it push @{ $groups[-1] }, splice @bits, $bit, 1; } } } ## display the results print $string; for my $group ( @groups ){ my $line = ' ' x 70; for my $bit ( @{ $group } ) { substr $line, $bit->[ 0 ], $bit->[ 1 ], substr $string, $bit->[ 0 ], $bit->[ 1 ]; } print $line; } __END__ c:\test>700817.pl 1234567890123456789012345678901234567890123456789012345678901234567890 23 89 4567890 567890123 9012 567890123 012 0 234567890 56789012345 901234567 45678 123456789 45678 3456789012 890123 90123456 3456789 234 90123456 c:\test>700817.pl 1234567890123456789012345678901234567890123456789012345678901234567890 2345 890123 6789012 5678 234567 012345 2345 90 5678901234 789012345 12345 67890123 01234 90 45678901 6789 234567890 34567 0123456789 9012345 c:\test>700817.pl 1234567890123456789012345678901234567890123456789012345678901234567890 1234567 34567 234567890 9012 0 345678901 5678901 6789012 89012 567890 23456789012 567890 234567890 4567890123 789012345 0123456789 5678901234 45678901234 89012345 56789012345 c:\test>700817.pl 1234567890123456789012345678901234567890123456789012345678901234567890 123456 90 56789012 56789 567 234567 1234 234567890 67 123456789 23456 345678901 01234 678901234 1234 890 890123456 89012 678901234 45678901

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Thank you very much... I think that with some modifications to extract the offset and the length of the subsequences from my nested structure, and being able to sort them this will work very well. I am studying the code, it is not very easy for me yet to understand code written by others.

      Thank you again, Roger

        it is not very easy for me yet to understand code written by others.

        We all have that trouble :) Maybe this annotation will help. I'm always wary of doing this because I tend to comment on the things I think are significant, rather than on those that the reader will consider so. But Maybe it will help you.

        Note: You could make @bits hold [ offset, $subsequence ]. You would then have to use length( $bit[ n ][ 1 ] ) everywhere I've used $bit[ n ][ 1 ].

        #! perl -slw use strict; ## Generate some data my $string = '1234567890' x 7; ## This just generates some random subsequences as offset/length pairs ## Array of Arrays (AoA): $bits[ n ] = [ offset, length] my @bits = map [ int rand length $string, 2 + int rand 10 ], 1 .. 20; ## Sort ascending offset/descending length @bits = sort{ $a->[ 0 ] <=> $b->[ 0 ] ## sort on the offsets of $a & $b || ## And if they are equal $b->[ 1 ] <=> $a->[ 1 ] ## on the lengths (reversed} } @bits; ## Result ex: @bits =( [2,5],[2,3],[3,1][4,4],[4,3] etc. ## @groups will become an AoAoA. ## Individual offset/length pairs are moved from @bits ## into @groups[n][m] Where ## n := line of output ## m := non-overlapping pair in line ## $groups[ n ] = [ [ offset1, length1], ...], [[,],[,]],...] my @groups; ## As we're moving pairs from @bits to @groups[n] ## When @bits is empty, we know we're done. while( @bits ) { ## start a new group with the last (rightmost shortest) item in @b +its ## First time around the while loop, first line of output; 2nd tim +e 2nd line push @groups, [ pop @bits ]; ## look at each of the remaining pairs in @bits ## scanning backwards because we are removing elements from @bits ## and if we went forward, removing element i would screw up the i +ndexing ## for elements i+1 .. i+n. for my $bit ( reverse 0 .. $#bits ) { ## if it'll fit in this line ## compare the last position (offset+length) for the current b +it ## against the first position (offset) of the last element of +the ## last group (line) if( $bits[ $bit ][ 0 ]+ $bits[ $bit ][ 1 ] < $groups[-1][-1][0 +] ) { ## If it's is less (moving backward!), add it. push @{ $groups[-1] }, splice @bits, $bit, 1; } } ## When the for loop ends, we've scanned all the bits and moved ## any that will fit in the current line without overlap. ## So now, if there are any left (while(@bits)) loop back to ## push another anon array onto @group add ing the now last elemen +t ## of @bits as a starting point. Repeat the for loop. } ## display the results print $string; for my $group ( @groups ){ my $line = ' ' x 70; for my $bit ( @{ $group } ) { substr $line, $bit->[ 0 ], $bit->[ 1 ], substr $string, $bit->[ 0 ], $bit->[ 1 ]; } print $line; }

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: formatting output question (use of recursive subroutine?)
by Perlbotics (Archbishop) on Jul 29, 2008 at 16:38 UTC
    Hi, this comes close to your suggestion, but it tries to find a match in the lines created before. Don't know if it fits your requirements (see comments). And it can surely be optimised...
Re: formatting output question (use of recursive subroutine?)
by FunkyMonk (Chancellor) on Aug 02, 2008 at 20:27 UTC
    I was bored :)
    my $ref = 'agctagctagctagcatgctagctagctgatcgatgctagctagctgactgacgacg'; my @subseq_to_find = qw{ gctagctag tagcatgctagctag gctagctgac tag }; my @output; for my $subseq ( @subseq_to_find ) { my $p = -1; MATCH: while (1) { $p = index $ref, $subseq, $p + 1; last if $p < 0; my $len = length $subseq; for ( @output ) { my $substr = \ substr $_, $p, $len; $$substr = $subseq and next MATCH unless $$substr =~ /\S/; } push @output, " " x length $ref; substr $output[-1], $p, $len, $subseq; } } print "$_\n" for $ref, @output;

    Output:

    agctagctagctagcatgctagctagctgatcgatgctagctagctgactgacgacg gctagctag tag gctagctag gctagctag gctagctag tag tag gctagctgac tag tag tagcatgctagctag tag tag

    Update: next SUBSEQ should be next MATCH. Added a smaller pattern to test the fix. I guess I'm still bored.


    Unless I state otherwise, all my code runs with strict and warnings
      Thank you FunkyMonk! Very helpful boreness! At this time I am working with the code that BrowserUK suggested me. I had to change some algorithms of my previous code to use that algorithm, so I am working on that. But I will take a look at yours also, it seems interesting. You use at least one function that I didn't know: index.
Re: formatting output question (use of recursive subroutine?)
by FunkyMonk (Chancellor) on Jul 29, 2008 at 15:15 UTC
    ref agctagctagctagcatgctagctagctgatcgatgctagctagctgactgacgacg out1 atctagcat agctagcgatcga gactgacagc out2 tagctagctgctagc out3 agtcgatcgatgctagc
    How do you define a match? I wouldn't say any of those strings matched anywhere.

    Unless I state otherwise, all my code runs with strict and warnings
      I think in this context (DNA sequence matching) allowances are made for a small number of insertions, deletions and changes.

      Yes, that example doesn't match at all. I just write some atcg letters to put as an example. The matching was done previously, and now I am interested in formatting the results I had previusly.

      The important thing here is that each substring should vertically match with the reference string, in the position that I have previously stored. Each of the substrings is one result: in this example, the line out1 have 3 different substrings, and out2 and out3 have one each (they are in different lines because they are overlapped).

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://700817]
Approved by varian
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (4)
As of 2024-04-19 04:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found