Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

how to process very large tab limited pileup file

by gudluck (Novice)
on Oct 01, 2010 at 19:34 UTC ( [id://863010]=perlquestion: print w/replies, xml ) Need Help??

gudluck has asked for the wisdom of the Perl Monks concerning the following question:

I need to process very large (>10GB) tab delimited Pileup files.

I am reading the file and if it satisfies some criteria iam concatinating colum4 letters to a string. I have few issues to address. 1. Some times a line before indel (* in 3rd(t2) column) have multiple extra strings representing an insert (+1A or +4caaa) or deletion (-4gtct or -4tctt or -13GGCGCGCGTGCGC ) strings in the read colum (in column 9; t8) how to get rid of them. I tried awk to preprocess the file to remove them by brute force  awk '/$9/gsub("[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVN]+", "", $9)' but suppose if the nucleotide immediately after the above string (-4gtct) is not a (.,) and if it is (atgcATGC) then it will also be removed. I need to specifically look for +/- pattern followed by a number(digits) and the deleting that many number of letters following it (look for -4 then remove -4 and gtct)

2. when ever there is an indel line (* in 3rd(t2) column) I need to check colum 4 (t3) and see if it is (*/* or */+ or */- or +/+ or +/- or -/- or -/+ ) insertion (+) or deletion (-) on the numerator side (ie +/- means i need to treat it as insert(+) only and if it is */- or */+ then i need to check if denominator (- or +) is represented by more than 5% of reads (i.e column12/column8) then for deletion (-) convert that many nucleotides line following it to dash - in column 4 (t3). and if it is an insertion (+) then add those addition nucleotides to the string.

Please help with suggetions

representative sample Pileup file (it is continous from 1 to n(column2) (here lines are discontious as I pasted few representative lines here and there)

#t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 2L 1 G A 48 48 26 7 AAaaAAA ACCCD?@ 2L 2 A A 48 0 26 7 ..,,... AACC@BC 2L 3 T K 18 18 26 7 .Ggg... ACCCCCC 2L 4 C C 57 0 37 16 .,..,,.,a.,,,,,^F, CBCCBBC@3CCBBD=? 2L 5 C C 54 0 37 17 .,..,,.,a.,,,,,,^F, CDCCDACB9CCA@B?@D 2L 6 c C 78 0 37 17 .,..,,.,,.,,,,,,, CBCCBAC>9CBB@BA?A 2L 7 c C 45 0 37 19 .,..,,.,,.+1A,,,,t,,^F,^Ft CDCCDAC>;C +BCBD*?DAA 2L 7 * */* 55 0 37 19 * A 18 1 0 0 0 2L 8 a A 162 0 21 45 ,-4gtct,-4gtct,,....,,,,,,.,.,,,,.,,,, +,,,,...,.,.,,,,.,,, CCCCA<;=/CCACDBCBCCCCDCCCCBBCCBDCCCCCDCDDCCAB 2L 8 * */* 173 0 21 45 * -gtct 43 2 0 0 0 2L 9 g G 87 0 37 20 .,..,,.,,.,,,,,,,,,^F, CDCCCAC?BCBBCD +?<DBBB 2L 10 a A 87 0 37 20 .,..,,.,,.,,,,,,,,,, CBCDABC>BCACBB +:AC?4C 2L 11 g G 90 0 37 21 .,..,,.,,.,,,,,,,,,,^F. CDCCCAC@@CDDCC +<:DBABC 2L 12 c C 117 0 14 31 ...,,*.,,-2gt.,,.,,.,..........,,.. ;B +6CCBBCBCACDCCCB@DCDCC8CCCABCC 2L 12 * */-gt 21 21 14 31 * -gt 30 1 0 0 0 2L 13 g G 90 0 37 21 .,..,,.,,.,,,,,,,,,,. CDCCCAC?:CDB5? +<>BB>BC 2L 14 t T 45 0 37 6 ..,,,.-1T CCCCCC 2L 14 * -t/-t 56 59 37 6 -t * 1 5 0 0 0 2L 15 t T 93 0 37 22 .,..,,.,,.,,,G,,,,,,.^F. CDCDDDC<@C +BB5B@ABC<ACC 2L 16 G G 178 0 36 50 .$,$,$,$,,,,,.,,,,..,.,,...,,,,.,,..., +....,...,,,$,,,., BCCCCCCCCCCCCBDBCBCCB>8CCCCBCCBDBC8>ACCBB6CC?CCC6C 2L 17 A A 59 0 36 45 ,$,$,,,C,,,c..,.,,C..,,,,.,,...,....,. +.C,,,,,$., CCCCCBCCCBB8CACC>@;CCCCDCCCDAC-5ABCAA2CCCC?1A 2L 18 w T 54 54 37 9 tTTTTttTT CCDCCDCCC 2L 18 * A/+A 65 279 37 9 A * 8 1 0 0 0 2L 19 g G 54 0 37 9 ....,,... >>@@CC>BB 2L 20 c C 57 0 37 10 ....,,...^F, CACCCC?CBB 2L 21 t T 57 0 37 10 ....,,..., BCDCCCCCB? 2L 650 t T 48 0 34 7 ..,.,,+4caaa,+4caaa C?CCA=? 2L 650 * */* 9 0 34 7 * +CAAA 5 2 0 0 0 2L 654 A A 48 0 34 7 .$.$,.,+1g,, DBC?CCA 2L 654 * */* 19 0 34 7 * +G 6 1 0 0 0 2L 2332 g G 60 0 14 33 .,...,,.-13GGCGCGCGTGCGC.,,A,,A,,A +,..........aa.. DCBBBBDBCCCCBCCCDCBDCBCCCACCBABCC 2L 2332 * */* 61 0 14 33 * -ggcgcgcgtgcgc 32 1 0 0 + 0 2L 3334 a A 163 0 15 49 ..$,..,,,t,.T,,,..,,,,T,-7attattt, +,-7attattt,,,,,,....,,......,.,,.. BBCA>BCCCC:CCCC>ACCCCBCCCCBCCCC +CCDCCCCCCCCBCBCDCC 2L 3334 * */-attattt 27 27 15 49 * -attattt 47 2 0 + 0 0 2L 3928 c C 32 0 0 11 ,,-4tctt,,.-4TCTT...-4TCTT.^!.^!, + CCC8CCCCBCA 2L 3928 * */-tctt 157 157 0 11 * -tctt 8 3 0 0 0

Replies are listed 'Best First'.
Re: how to process very large tab limited pileup file
by lima1 (Curate) on Oct 01, 2010 at 21:13 UTC
    What have you tried so far, apart from this awk command? It's quite difficult to understand what you are trying to do...

    Show some code and explain where it fails and you will get help. If you don't know that already, you can and should use Text::CSV to parse the file.

      My failed attempt:
      my $countref = ($t[8] =~ tr/.,//); my $ratioo = $countref/$t[7] ; if ($countref/$t[7] >0.95) { # print "$t[1]\t$t[2]\t$t[3]: $ratioo \n"; $seq .=$t[2]; } elsif ($countref/$t[7] <=0.95){ $t[8] =~ s/[.,]/$t[2]/g; my $counta = ($t[8] =~ tr/aA//); my $countt = ($t[8] =~ tr/tT//); my $countg = ($t[8] =~ tr/gG//); my $countc = ($t[8] =~ tr/cC//); my ($ratioa, $ratiot, $ratiog, $ratioc, $rat); my @rat = (); $ratioa = $counta/$t[7]; $ratiot = $countt/$t[7]; $ratiog = $countg/$t[7]; $ratioc = $countc/$t[7]; print " $t[1] : A $ratioa T $ratiot G $ratiog C $ratioc "; my $rrat = \@rat; push (@rat, "$ratioa", "$ratiot", "$ratiog", "$ratioc"); my $countnut = 0; my $i; my @ncode; my $ccode; for $i(0..$#rat){ if ( $rrat->[$i] >= 0.05){ $i++; push (@ncode, $i); $countnut ++; $i--; } $ccode = join ("", @ncode); if ($countnut >= 3) { # print "N\n"; $seq .= "N"; } #print " code:$ccode count $countnut\n"; else { my %hash; %hash = {1, 'A', 2, 'T', 3, 'G', 4, 'C', 12, 'W', 21, 'W', 13, 'R', 31, 'R', 14, 'M', 41, 'M', 23, 'K', 32, 'K', 24, 'Y', 42, 'Y', }; print "cc", $hash{$ccode},"\n" ; } } }
        Like the other monks, I don't have a clue what your goals really are. (Try using shorter sentences when explaining things.) I took your code and data as posted, and put them together into a single file like this:
        #!/usr/bin/perl while (<DATA>) { next if (/^#/ or /^\s*$/); chomp; @t = split /\t/; # your code goes here } __DATA__ # your data goes here
        (update: I initially forgot to mention "split")

        I had to edit the data to make it tab-delimited -- maybe the tabs were converted to spaces because of the way you posted it (or the way I downloaded it), but you may need to check and make sure whether your file really has the right number of tabs per line.

        Anyway, your code ran and produced output with no errors. So what's the problem? If you were expecting some different sort of output, you'll need to explain how the actual output differs from the desired output.

        I took the liberty of fixing the indentation in the code -- this really helps for making the code readable, and a good editor should make it easy to use proper indentation (e.g. emacs, vi, ...)

        I also did some refactoring, to remove unnecessary variables, make the output more informative, and use loops where possible.

        I couldn't tell what you were trying to do with your "$seq" variable... that (and the output format) is the only place where my version does something different from yours -- but I'm just guessing about what you really want.

Re: how to process very large tab limited pileup file
by johngg (Canon) on Oct 02, 2010 at 12:32 UTC

    I would like to be able to help you but, as lima1 says, your description of the problem is difficult to understand. I'm afraid the code you posted in response to lima1's reply hasn't shed much light on the problem because you do not show how you are reading your file and we can only make assumptions regarding how the @t array is intialised. Also, what is the purpose of the %hash associative array (nice meaningful name, BTW :-)?

    It would help us to help you if you could provide the output you would expect to get from the data you show and if you could break down your description in paras. 1 & 2 into smaller steps it might help us understand better. I find the +/- etc. malarkey particularly unclear.

    Cheers,

    JohnGG

Re: how to process very large tab limited pileup file
by umasuresh (Hermit) on Oct 02, 2010 at 13:59 UTC
    Can you post the desired output format? This might help in better understanding the problem.
      Thanks a lot "graff" your inputs were quite useful. I could be able to write the code. (suggestions for the improvements are welcome) This code take the inputs: 1. raw pileup file and the 2. consensus fasta sequence(here it is named as "input.fasta"). and modifies the consensus fasta sequence based on the nucleotide frequences (if above 5%) at each position (from the read column (column 9). It also includes deletions if they above 5% in indel line (i.e. lines with * in the column 3)
      >input.fasta GATCCccagagcgttGAwgcttAgacTCCgagc
      Find the complete code below: usage: perl modify.pileup.pl pileup2markedfasta input.pileup
      #!/usr/bin/perl use strict; use warnings; &usage if (@ARGV < 1); my $command = shift(@ARGV); my %func = (pileup2markedfasta=>\&pileup2fq); die("Unknown command \"$command\".\n") if (!defined($func{$command})); &{$func{$command}}; exit(0); # pileup2markedfasta sub pileup2fq { die(qq/ Usage: modify.pileup.pl pileup2marked.fasta [options] <in.cns-pileup +> /) if (@ARGV == 0 && -t STDIN); my %hash = ( 1 => 'A', 2 => 'T', 3 => 'G', 4 => 'C', 12 => 'W', 21 => 'W', 13 => 'R', 31 => 'R', 14 => 'M', 41 => 'M', 23 => 'K', 32 => 'K', 24 => 'Y', 42 => 'Y', 34 => 'S', 43 => 'S' ); my $seq; my $ConsFileName = "input.fasta" ; # Input single chromosome consensus + fasta file open(INFILE, "$ConsFileName"); my $header ; while(my $line = <INFILE>) { if ($line =~ m/^\>/){ $header = $line; # Fasta header is removed; otherwise counted } else{ chomp $line; # Remove the "\n" $seq .= $line; } } close(INFILE); print "$header"; print "$seq\n"; my @gps; my $dsize ; my $rseq = ''; my $report = ''; my @t; while (<>) { @t = split; if ($t[7] < 7){substr($seq, ($t[1]-1), 1, "N");} # minimum number ( +7) of reads elsif ($t[2] eq '*') { # checking if it is indel in the raw pile +up if (($t[3] =~ /\*\/\+/)||($t[3] =~ /^\+/)){ #print "next $t[3] \n"; next; } elsif ( ($t[3] =~ /\*\/\*/) && ($t[9] =~ /^\-/) && ($t[11]/$t[7] > +=0.05) ){ my $lenn =(length($t[9])-1); for (1..$lenn){ push (@gps, '-') } my $ndash .= '-' x ($lenn); # print "$t[1]: $lenn, @gps, dash** $ndash\n"; substr($seq, $t[1], $lenn, "$ndash"); } elsif ( ($t[3] =~ /\*\/\-/) && ($t[11]/$t[7] >=0.05) ){ my $lenn =(length($t[9])-1); for (1..$lenn){ push (@gps, '-') } my $ndash .= '-' x ($lenn); # print "$t[1] dash*- $ndash\n"; substr($seq, $t[1], $lenn, "$ndash"); } elsif (($t[3] =~ /^\-/) && ($t[10]/$t[7] >=0.05)){ my $lenn =(length($t[8])-1); for (1..$lenn){ push (@gps, '-') } my $ndash .= '-' x ($lenn); # print "$t[1] 'dash-' $ndash\n"; substr($seq, $t[1], $lenn, "$ndash"); } } else { $dsize = scalar(@gps); if ($dsize >=1){shift @gps; #print "$dsize: $t[1]skippped\n"; next; } else { my $countref = ($t[8] =~ tr/.,//); if ( $countref / $t[7] > 0.95 ){next;} elsif ($countref/$t[7] <=0.95){ $t[8] =~ s/[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVN]+//g; + # removes unwanted indels from read base lines $t[8] =~ s/[\@\'\&\!\^\$\*\)]+//g; # removes garbage symbo +ls $t[8] =~ s/[.,]/$t[2]/g; my %ratio; my @letters = qw/a t g c/; $report = " $t[1] : "; for my $p ( @letters ) { #`echo "$t[1]"`; my $count = eval "lc( '$t[8]' ) =~ tr/$p//"; $ratio{$p} = $count / $t[7]; # print "$p: $ratio{$p}\n"; $report .= sprintf( "%s %5.3f ", uc( $p ), $ratio{$p} ); } my $countnut = 0; my @ncode =(); for my $i (0 .. $#letters) { if ( $ratio{ $letters[$i] } >= 0.05 ) { push @ncode, $i + 1; #print "$t[1], @ncode, $letters[$i]\n"; $countnut++; } } my $ccode = join ("", @ncode); chomp $ccode; if ( ! exists( $hash{$ccode} )) { # print "not $t[1], $ccode\n"; substr($seq, ($t[1]-1), 1, "N"); } else { # print "yes $t[1], $ccode\n"; substr($seq, ($t[1]-1), 1, "$hash{$ccode}"); } } } # printf( "Data_line %5d : t1= %5s ; report= %s \n", $., $t[1], $r +eport ); } } # end of while print "$header"; # printing back the fasta header $seq= uc($seq) ; # print "$seq\n"; ; &p2q_post_process(\$seq) ; #print "$header"; # printing back the fasta header } # end of pileup ################## sub p2q_post_process { my ($seq) = @_; print &p2q_print_str($seq); } # sub p2q_print_str { my ($s) = @_; my $l = length($$s); for (my $i = 0; $i < $l; $i += 60) { print substr($$s, $i, 60), "\n"; } } # sub usage { die(qq/ Program: modify.pileup.pl Usage: modify.pileup.pl <command> [<arguments>]\n Command: pileup2markedfasta generate maked.fasta from `pileup -c r +aw pileup' \n/); }
      input.pileup
      2L 1 G A 48 48 26 7 AAaaAAA ACCCD?@ 2L 2 A A 48 0 26 7 ..,,... AACC@BC 2L 3 T K 18 18 26 7 .Ggg... ACCCCCC 2L 4 C C 57 0 37 16 .,..,,.,a.,,,,,^F, CBCC +BBC@3CCBBD=? 2L 5 C C 54 0 37 17 .,..,,.,a.,,,,,,^F, CDC +CDACB9CCA@B?@D 2L 6 c C 78 0 37 17 .,..,,.,,.,,,,,,, CBCCB +AC>9CBB@BA?A 2L 7 c C 45 0 37 19 .,..,,.,,.+1A,,,,t,,^F,^Ft + CDCCDAC>;CBCBD*?DAA 2L 7 * */* 55 0 37 19 * A 18 1 0 + 0 0 2L 8 a A 162 0 21 45 ,-4gtct,-4gtct,,....,,,,, +,.,.,,,,.,,,,,,,,...,.,.,,,,.,,, CCCCA<;=/CCACDBCBCCCCDCCCCBBCCBDC +CCCCDCDDCCAB 2L 8 * */* 173 0 21 45 * -gtct 43 3 + 0 0 0 2L 9 g G 87 0 37 20 .,..,,.,,.,,,,,,,,,^F, +CDCCCAC?BCBBCD?<DBBB 2L 10 a A 87 0 37 20 .,..,,.,,.,,,,,,,,,, C +BCDABC>BCACBB:AC?4C 2L 11 g G 90 0 37 21 .,..,,.,,.,,,,,,,,,,^F. + CDCCCAC@@CDDCC<:DBABC 2L 12 c C 117 0 14 31 ...,,*.,,-2gt.,,.,,.,... +.......,,.. ;B6CCBBCBCACDCCCB@DCDCC8CCCABCC 2L 12 * */-gt 21 21 14 31 * -gt 30 1 + 0 0 0 2L 13 g G 90 0 37 21 .,..,,.,,.,,,,,,,,,,. +CDCCCAC?:CDB5?<>BB>BC 2L 14 t T 45 0 37 6 ..,,,.-1T CCCCCC 2L 14 * -t/-t 56 59 37 6 -t * 1 5 +0 0 0 2L 15 t T 93 0 37 22 .,..,,.,,.,,,G,,,,,,.^F. + CDCDDDC<@CBB5B@ABC<ACC 2L 16 G G 178 0 36 50 .$,$,$,$,,,,,.,,,,..,.,, +...,,,,.,,...,....,...,,,$,,,., BCCCCCCCCCCCCBDBCBCCB>8CCCCBCCBDBC +8>ACCBB6CC?CCC6C 2L 17 A A 59 0 36 45 ,$,$,,,C,,,c..,.,,C..,,,, +.,,...,....,..C,,,,,$., CCCCCBCCCBB8CACC>@;CCCCDCCCDAC-5ABCAA2CCCC +?1A 2L 18 w T 54 54 37 9 tTTTTttTT CCDCCDCCC 2L 18 * A/+A 65 279 37 9 A * 8 1 0 + 0 0 2L 19 g G 54 0 37 9 ....,,... >>@@CC>BB 2L 20 c C 57 0 37 10 ....,,...^F, CACCCC?CB +B 2L 21 t T 57 0 37 10 ....,,..., BCDCCCCCB? 2L 22 t T 48 0 34 7 ..,.,,+4caaa,+4caaa C?C +CA=? 2L 22 * */* 9 0 34 7 * +CAAA 5 2 0 + 0 0 2L 23 A A 48 0 34 7 .$.$,.,+1g,, DBC?CCA 2L 23 * */* 19 0 34 7 * +G 6 1 0 + 0 0 2L 24 g G 60 0 14 33 .,...,,.-13GGCGCGCGTGCGC. +,,A,,A,,A,..........aa.. DCBBBBDBCCCCBCCCDCBDCBCCCACCBABCC 2L 24 * */* 61 0 14 33 * -ggcgcgcgtgcgc +32 1 0 0 0 2L 25 a A 163 0 15 49 ..$,..,,,t,.T,,,..,,,,T, +-7attattt,,-7attattt,,,,,,....,,......,.,,.. BBCA>BCCCC:CCCC>ACCCC +BCCCCBCCCCCCDCCCCCCCCBCBCDCC 2L 25 * */-attattt 27 27 15 49 * -attattt + 47 2 0 0 0 2L 26 c C 32 0 0 11 ,,-4tctt,,.-4TCTT...-4TCTT +.^!.^!, CCC8CCCCBCA 2L 26 * */-tctt 17 17 0 11 * -tctt 8 +3 0 0 0 2L 27 T K 18 18 26 7 .Ggg... ACCCCCC 2L 28 C C 57 0 37 16 .,..,,.,a.,,,,,^F, CBC +CBBC@3CCBBD=? 2L 29 C C 54 0 37 17 .,..,,.,a.,,,,,,^F, CD +CCDACB9CCA@B?@D 2L 30 g G 87 0 37 20 .,..,,.,,.,,,,,,,,,^F, + CDCCCAC?BCBBCD?<DBBB 2L 31 a A 87 0 37 20 .,..,,.,,.,,,,,,,,,, C +BCDABC>BCACBB:AC?4C 2L 32 g G 90 0 37 21 .,..,,.,,.,,,,,,,,,,^F. + CDCCCAC@@CDDCC<:DBABC 2L 33 c C 117 0 14 31 ...,,*.,,-2gt.,,.,,.,... +.......,,.. ;B6CCBBCBCACDCCCB@DCDCC8CCCABCC
      Output:
      >input.fasta GATCCccagagcgttGAwgcttAgacTCCgagc >input.fasta AAKMMCYA----GNTGMTGCTTARWC----AGC

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://863010]
Approved by moritz
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (8)
As of 2024-04-23 10:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found