PerlMonks  

Re^2: string manipulation with Regex

by FluffyBunny (Acolyte)
on Sep 01, 2010 at 20:19 UTC ( [id://858408]=note )


in reply to Re: string manipulation with Regex
in thread string manipulation with Regex

Dear CountZero and other senior Perlmonks who replied to my post: This is basically what I am trying to do:

You have a text file with a string like this:

CGAATTAATGGGAATTG

and you have a reference sequence:

CGAATTAAGGAATTG

Note that the input has two inserted letters, TG.

So the alignment program reports this CIGAR string:

8M2I7M

To show this visually, I manually aligned the two strings (the blank is not in the original data):

CGAATTAATGGGAATTG

CGAATTAA  GGAATTG

Using this, I would like each letter of the input to keep the position it has in the reference. That is why I compare the input string to the reference string: so that both have the same letter positions (the inserted letters are not needed and can be dropped).
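To make the bookkeeping concrete, here is a minimal sketch of how a CIGAR string such as 8M2I7M accounts for letters in the input and in the reference. It assumes only the M, I and D operations discussed in this thread; `parse_cigar` and `cigar_lengths` are names made up for this illustration, not part of any library.

```perl
use strict;
use warnings;

# Split a CIGAR string into [length, op] pairs; handles only M, I and D.
sub parse_cigar {
    my ($cigar) = @_;
    my @ops;
    while ($cigar =~ /(\d+)([MID])/g) {
        push @ops, [$1, $2];
    }
    return @ops;
}

# Return how many letters the CIGAR spans in the input and in the reference.
# M counts for both, I only for the input, D only for the reference.
sub cigar_lengths {
    my ($cigar) = @_;
    my ($input_len, $ref_len) = (0, 0);
    for my $op (parse_cigar($cigar)) {
        my ($len, $type) = @$op;
        $input_len += $len if $type eq 'M' || $type eq 'I';
        $ref_len   += $len if $type eq 'M' || $type eq 'D';
    }
    return ($input_len, $ref_len);
}

my ($in, $ref) = cigar_lengths('8M2I7M');
print "input=$in reference=$ref\n";    # prints "input=17 reference=15"
```

The two totals match the example strings above: 17 letters in the input, 15 in the reference.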

For deletions I have the exact opposite situation, so let's flip my first example. In this case the CIGAR is 8M2D7M.

CGAATTAA  GGAATTG <--- my input for 2nd example (the gap is again intentionally made)

CGAATTAATGGGAATTG <--- my output for 2nd example

You see, I am missing two letters and I need placeholders to keep the input's letter positions the same as in the reference, so I want to fill in two X's:

CGAATTAAXXGGAATTG <--- same length, other letter positions will be the same

I also posted a link to my original post about a different problem (already fixed); you can find my input files there. Thank you for your help.

Replies are listed 'Best First'.
Re^3: string manipulation with Regex
by Marshall (Canon) on Sep 01, 2010 at 22:48 UTC
    Update: I missed the part about multiple I and D sections, so I adjusted the loop to handle that. I now also see that CountZero posted code earlier in the thread, above the level I replied to. His code looks fine to me. What I did is very similar, except that I used substr() instead of print.

    I think this does what you want. Basically in the CIGAR, an insertion becomes a deletion and vice-versa. So I use the edit instructions in the CIGAR in an inverse sense.

    The total field lengths in the CIGAR (viewed in inverted sense) may be less than the number of characters in the input, so I think this means truncate the output to whatever that total is.

    Whether some final adjustment is needed after applying the inverse of all the edit commands, either truncating or perhaps appending more X's, is unclear to me. It is just a matter of knowing what is required, which is why I kept a running tally of the total length.

    #!/usr/bin/perl -w
    use strict;

    while (<DATA>)
    {
        next if /^\s*$/;   #skip blank lines
        my ($input, $CIGAR) = split;
        my $ref = $input;  #working copy of $input

        my (@edit_cmd) = $CIGAR =~ m/\d+\w/g;
        my $curr_pos = 0;
        my $total_len =0;
        foreach my $cmd (@edit_cmd)
        {
            if (my ($M) = $cmd =~ m/(\d+)M/)
            {
                $curr_pos += $M;
                $total_len+= $M;
            }
            elsif (my ($I) = $CIGAR =~ m/(\d+)I/)
            {
                substr($ref,$curr_pos,$I,'');   #delete $I characters
                $total_len -= $I
            }
            elsif (my ($D) = $CIGAR =~ m/(\d+)D/)
            {
                substr($ref,$curr_pos,0,"X" x $D);  #insert $D X's
                $total_len += $D;
                $curr_pos += $D;
            }
        }
        $ref = substr($ref,0,$total_len); #truncate ?????
        print "INPUT = $input CIGAR = $CIGAR\n";
        print "REF = $ref\n\n";
    }

    =prints
    INPUT = CGAATTAATGGGAATTG CIGAR = 8M2I7M
    REF = CGAATTAAGGAAT

    INPUT = CGAATTAATGGGAATTG CIGAR = 2M2I2M3D10M
    REF = CGTTTGGGAA

    INPUT = CGAATTAATGGGA CIGAR = 8M2D7M
    REF = CGAATTAAXXTGGGA

    =cut

    __DATA__
    CGAATTAATGGGAATTG 8M2I7M
    CGAATTAATGGGAATTG 2M2I2M3D10M
    CGAATTAATGGGA 8M2D7M
      Thank you very much for your assistance! It makes sense to me. Now I have to work on multiple D and I combinations. You are all awesome Perlmonks! =D
        Take a look: multiple D and multiple I are already implemented.

      That almost worked for my own tests. I switched $CIGAR to $cmd in the elsif statements, and it then worked for any combination of I's, D's, and M's.

      Thanks for your help! =)
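For anyone landing here later, a minimal sketch of the whole transformation, with each operation tested against the current command rather than the full CIGAR string. This is not Marshall's code: it builds a new string instead of editing in place with substr(), and it skips the final truncation step. `project_onto_reference` is a hypothetical name, and only M, I and D are handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Project an input read onto reference coordinates using its CIGAR string.
# Only the M, I and D operations discussed in this thread are handled.
sub project_onto_reference {
    my ($input, $cigar) = @_;
    my $out = '';
    my $pos = 0;    # current offset into $input
    while ($cigar =~ /(\d+)([MID])/g) {
        my ($len, $op) = ($1, $2);
        if ($op eq 'M') {          # match: copy the letters through
            $out .= substr($input, $pos, $len);
            $pos += $len;
        }
        elsif ($op eq 'I') {       # insertion: skip the extra input letters
            $pos += $len;
        }
        else {                     # deletion: pad with placeholder X's
            $out .= 'X' x $len;
        }
    }
    return $out;
}

# The two examples from the original question:
print project_onto_reference('CGAATTAATGGGAATTG', '8M2I7M'), "\n";  # CGAATTAAGGAATTG
print project_onto_reference('CGAATTAAGGAATTG',   '8M2D7M'), "\n";  # CGAATTAAXXGGAATTG
```

Because every operation is read from the current token, mixed strings such as 2M2I2M3D10M go through the same loop without special handling. The sketch assumes the input is at least as long as the CIGAR's M and I lengths add up to.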

Re^3: string manipulation with Regex
by CountZero (Bishop) on Sep 01, 2010 at 21:45 UTC
    Did you actually try my program with the sample inputs you mention above?

    If so, you will have seen that the output is exactly as you expect! So what is still your problem?

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      I just wanted to explain more clearly. The code itself is working. Thank you for your help =)
