how to process very large tab limited pileup file

gudluck has asked for the wisdom of the Perl Monks concerning the following question:

I need to process very large (>10GB) tab delimited Pileup files.

I am reading the file and if it satisfies some criteria iam concatinating colum4 letters to a string. I have few issues to address. 1. Some times a line before indel (* in 3rd(t2) column) have multiple extra strings representing an insert (+1A or +4caaa) or deletion (-4gtct or -4tctt or -13GGCGCGCGTGCGC ) strings in the read colum (in column 9; t8) how to get rid of them. I tried awk to preprocess the file to remove them by brute force awk '/$9/gsub("[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVN]+", "", $9)' but suppose if the nucleotide immediately after the above string (-4gtct) is not a (.,) and if it is (atgcATGC) then it will also be removed. I need to specifically look for +/- pattern followed by a number(digits) and the deleting that many number of letters following it (look for -4 then remove -4 and gtct)

2. when ever there is an indel line (* in 3rd(t2) column) I need to check colum 4 (t3) and see if it is (*/* or */+ or */- or +/+ or +/- or -/- or -/+ ) insertion (+) or deletion (-) on the numerator side (ie +/- means i need to treat it as insert(+) only and if it is */- or */+ then i need to check if denominator (- or +) is represented by more than 5% of reads (i.e column12/column8) then for deletion (-) convert that many nucleotides line following it to dash - in column 4 (t3). and if it is an insertion (+) then add those addition nucleotides to the string.

Please help with suggetions

representative sample Pileup file (it is continous from 1 to n(column2) (here lines are discontious as I pasted few representative lines here and there)

#t0 t1  t2  t3  t4  t5  t6  t7  t8  t9
2L  1   G   A   48  48  26  7   AAaaAAA ACCCD?@
2L  2   A   A   48  0   26  7   ..,,... AACC@BC
2L  3   T   K   18  18  26  7   .Ggg... ACCCCCC
2L  4   C   C   57  0   37  16  .,..,,.,a.,,,,,^F,  CBCCBBC@3CCBBD=?
2L  5   C   C   54  0   37  17  .,..,,.,a.,,,,,,^F, CDCCDACB9CCA@B?@D
2L  6   c   C   78  0   37  17  .,..,,.,,.,,,,,,,   CBCCBAC>9CBB@BA?A
2L  7   c   C   45  0   37  19  .,..,,.,,.+1A,,,,t,,^F,^Ft  CDCCDAC>;C
+BCBD*?DAA
2L  7   *   */* 55  0   37  19  *   A   18  1   0   0   0
2L  8   a   A   162 0   21  45  ,-4gtct,-4gtct,,....,,,,,,.,.,,,,.,,,,
+,,,,...,.,.,,,,.,,,   CCCCA<;=/CCACDBCBCCCCDCCCCBBCCBDCCCCCDCDDCCAB
2L  8   *   */* 173 0   21  45  *   -gtct   43  2   0   0   0
2L  9   g   G   87  0   37  20  .,..,,.,,.,,,,,,,,,^F,  CDCCCAC?BCBBCD
+?<DBBB
2L  10  a   A   87  0   37  20  .,..,,.,,.,,,,,,,,,,    CBCDABC>BCACBB
+:AC?4C
2L  11  g   G   90  0   37  21  .,..,,.,,.,,,,,,,,,,^F. CDCCCAC@@CDDCC
+<:DBABC
2L  12  c   C   117 0   14  31  ...,,*.,,-2gt.,,.,,.,..........,,.. ;B
+6CCBBCBCACDCCCB@DCDCC8CCCABCC
2L  12  *   */-gt   21  21  14  31  *   -gt 30  1   0   0   0
2L  13  g   G   90  0   37  21  .,..,,.,,.,,,,,,,,,,.   CDCCCAC?:CDB5?
+<>BB>BC
2L  14  t   T   45  0   37  6   ..,,,.-1T   CCCCCC
2L  14  *   -t/-t   56  59  37  6   -t  *   1   5   0   0   0
2L  15  t   T   93  0   37  22  .,..,,.,,.,,,G,,,,,,.^F.    CDCDDDC<@C
+BB5B@ABC<ACC
2L  16  G   G   178 0   36  50  .$,$,$,$,,,,,.,,,,..,.,,...,,,,.,,...,
+....,...,,,$,,,., BCCCCCCCCCCCCBDBCBCCB>8CCCCBCCBDBC8>ACCBB6CC?CCC6C
2L  17  A   A   59  0   36  45  ,$,$,,,C,,,c..,.,,C..,,,,.,,...,....,.
+.C,,,,,$.,    CCCCCBCCCBB8CACC>@;CCCCDCCCDAC-5ABCAA2CCCC?1A
2L  18  w   T   54  54  37  9   tTTTTttTT   CCDCCDCCC
2L  18  *   A/+A    65  279 37  9   A   *   8   1   0   0   0
2L  19  g   G   54  0   37  9   ....,,...   >>@@CC>BB
2L  20  c   C   57  0   37  10  ....,,...^F,    CACCCC?CBB
2L  21  t   T   57  0   37  10  ....,,...,  BCDCCCCCB?
2L  650 t   T   48  0   34  7   ..,.,,+4caaa,+4caaa C?CCA=?
2L  650 *   */* 9   0   34  7   *   +CAAA   5   2   0   0   0
2L  654 A   A   48  0   34  7   .$.$,.,+1g,,    DBC?CCA
2L  654 *   */* 19  0   34  7   *   +G  6   1   0   0   0
2L  2332    g   G   60  0   14  33  .,...,,.-13GGCGCGCGTGCGC.,,A,,A,,A
+,..........aa..   DCBBBBDBCCCCBCCCDCBDCBCCCACCBABCC
2L  2332    *   */* 61  0   14  33  *   -ggcgcgcgtgcgc  32  1   0   0 
+  0
2L  3334    a   A   163 0   15  49  ..$,..,,,t,.T,,,..,,,,T,-7attattt,
+,-7attattt,,,,,,....,,......,.,,..    BBCA>BCCCC:CCCC>ACCCCBCCCCBCCCC
+CCDCCCCCCCCBCBCDCC
2L  3334    *   */-attattt  27  27  15  49  *   -attattt    47  2   0 
+  0   0
2L  3928    c   C   32  0   0   11  ,,-4tctt,,.-4TCTT...-4TCTT.^!.^!, 
+  CCC8CCCCBCA
2L  3928    *   */-tctt 157 157 0   11  *   -tctt   8   3   0   0   0
[download]

Comment on how to process very large tab limited pileup file Select or Download Code

Replies are listed 'Best First'.
Re: how to process very large tab limited pileup file by lima1 (Curate) on Oct 01, 2010 at 21:13 UTC
What have you tried so far, apart from this awk command? It's quite difficult to understand what you are trying to do... Show some code and explain where it fails and you will get help. If you don't know that already, you can and should use Text::CSV to parse the file.	[reply]
Re^2: how to process very large tab limited pileup file by gudluck (Novice) on Oct 01, 2010 at 21:31 UTC
My failed attempt: my $countref = ($t[8] =~ tr/.,//); my $ratioo = $countref/$t[7] ; if ($countref/$t[7] >0.95) { # print "$t[1]\t$t[2]\t$t[3]: $ratioo \n"; $seq .=$t[2]; } elsif ($countref/$t[7] <=0.95){ $t[8] =~ s/[.,]/$t[2]/g; my $counta = ($t[8] =~ tr/aA//); my $countt = ($t[8] =~ tr/tT//); my $countg = ($t[8] =~ tr/gG//); my $countc = ($t[8] =~ tr/cC//); my ($ratioa, $ratiot, $ratiog, $ratioc, $rat); my @rat = (); $ratioa = $counta/$t[7]; $ratiot = $countt/$t[7]; $ratiog = $countg/$t[7]; $ratioc = $countc/$t[7]; print " $t[1] : A $ratioa T $ratiot G $ratiog C $ratioc "; my $rrat = \@rat; push (@rat, "$ratioa", "$ratiot", "$ratiog", "$ratioc"); my $countnut = 0; my $i; my @ncode; my $ccode; for $i(0..$#rat){ if ( $rrat->[$i] >= 0.05){ $i++; push (@ncode, $i); $countnut ++; $i--; } $ccode = join ("", @ncode); if ($countnut >= 3) { # print "N\n"; $seq .= "N"; } #print " code:$ccode count $countnut\n"; else { my %hash; %hash = {1, 'A', 2, 'T', 3, 'G', 4, 'C', 12, 'W', 21, 'W', 13, 'R', 31, 'R', 14, 'M', 41, 'M', 23, 'K', 32, 'K', 24, 'Y', 42, 'Y', }; print "cc", $hash{$ccode},"\n" ; } } } [download]	[reply] [d/l]
Re^3: how to process very large tab limited pileup file by graff (Chancellor) on Oct 02, 2010 at 17:33 UTC
Like the other monks, I don't have a clue what your goals really are. (Try using shorter sentences when explaining things.) I took your code and data as posted, and put them together into a single file like this: `#!/usr/bin/perl while (<DATA>) { next if (/^#/ or /^\s*$/); chomp; @t = split /\t/; # your code goes here } __DATA__ # your data goes here` [download] (update: I initially forgot to mention "split") I had to edit the data to make it tab-delimited -- maybe the tabs were converted to spaces because of the way you posted it (or the way I downloaded it), but you may need to check and make sure whether your file really has the right number of tabs per line. Anyway, your code ran and produced output with no errors. So what's the problem? If you were expecting some different sort of output, you'll need to explain how the actual output differs from the desired output. I took the liberty of fixing the indentation in the code -- this really helps for making the code readable, and a good editor should make it easy to use proper indentation (e.g. emacs, vi, ...) I also did some refactoring, to remove unnecessary variables, make the output more informative, and use loops where possible. I couldn't tell what you were trying to do with your "$seq" variable... that (and the output format) is the only place where my version does something different from yours -- but I'm just guessing about what you really want. Read more... (7 kB)	[reply] [d/l] [select]
Re: how to process very large tab limited pileup file by johngg (Canon) on Oct 02, 2010 at 12:32 UTC
I would like to be able to help you but, as lima1 says, your description of the problem is difficult to understand. I'm afraid the code you posted in response to lima1's reply hasn't shed much light on the problem because you do not show how you are reading your file and we can only make assumptions regarding how the `@t` array is intialised. Also, what is the purpose of the `%hash` associative array (nice meaningful name, BTW :-)? It would help us to help you if you could provide the output you would expect to get from the data you show and if you could break down your description in paras. 1 & 2 into smaller steps it might help us understand better. I find the +/- etc. malarkey particularly unclear. Cheers, JohnGG	[reply] [d/l] [select]
Re: how to process very large tab limited pileup file by umasuresh (Hermit) on Oct 02, 2010 at 13:59 UTC
Can you post the desired output format? This might help in better understanding the problem.	[reply]
Re^2: how to process very large tab limited pileup file by gudluck (Novice) on Oct 06, 2010 at 23:20 UTC
Thanks a lot "graff" your inputs were quite useful. I could be able to write the code. (suggestions for the improvements are welcome) This code take the inputs: 1. raw pileup file and the 2. consensus fasta sequence(here it is named as "input.fasta"). and modifies the consensus fasta sequence based on the nucleotide frequences (if above 5%) at each position (from the read column (column 9). It also includes deletions if they above 5% in indel line (i.e. lines with * in the column 3) `>input.fasta GATCCccagagcgttGAwgcttAgacTCCgagc` [download] Find the complete code below: usage: perl modify.pileup.pl pileup2markedfasta input.pileup #!/usr/bin/perl use strict; use warnings; &usage if (@ARGV < 1); my $command = shift(@ARGV); my %func = (pileup2markedfasta=>\&pileup2fq); die("Unknown command \"$command\".\n") if (!defined($func{$command})); &{$func{$command}}; exit(0); # pileup2markedfasta sub pileup2fq { die(qq/ Usage: modify.pileup.pl pileup2marked.fasta [options] <in.cns-pileup +> /) if (@ARGV == 0 && -t STDIN); my %hash = ( 1 => 'A', 2 => 'T', 3 => 'G', 4 => 'C', 12 => 'W', 21 => 'W', 13 => 'R', 31 => 'R', 14 => 'M', 41 => 'M', 23 => 'K', 32 => 'K', 24 => 'Y', 42 => 'Y', 34 => 'S', 43 => 'S' ); my $seq; my $ConsFileName = "input.fasta" ; # Input single chromosome consensus + fasta file open(INFILE, "$ConsFileName"); my $header ; while(my $line = <INFILE>) { if ($line =~ m/^\>/){ $header = $line; # Fasta header is removed; otherwise counted } else{ chomp $line; # Remove the "\n" $seq .= $line; } } close(INFILE); print "$header"; print "$seq\n"; my @gps; my $dsize ; my $rseq = ''; my $report = ''; my @t; while (<>) { @t = split; if ($t[7] < 7){substr($seq, ($t[1]-1), 1, "N");} # minimum number ( +7) of reads elsif ($t[2] eq '') { # checking if it is indel in the raw pile +up if (($t[3] =~ /\\/\+/)\|\|($t[3] =~ /^\+/)){ #print "next $t[3] \n"; next; } elsif ( ($t[3] =~ /\\/\/) && ($t[9] =~ /^\-/) && ($t[11]/$t[7] > +=0.05) ){ my $lenn =(length($t[9])-1); for (1..$lenn){ push (@gps, '-') } my $ndash .= '-' x ($lenn); # print "$t[1]: $lenn, @gps, dash** $ndash\n"; substr($seq, $t[1], $lenn, "$ndash"); } elsif ( ($t[3] =~ /\\/\-/) && ($t[11]/$t[7] >=0.05) ){ my $lenn =(length($t[9])-1); for (1..$lenn){ push (@gps, '-') } my $ndash .= '-' x ($lenn); # print "$t[1] dash- $ndash\n"; substr($seq, $t[1], $lenn, "$ndash"); } elsif (($t[3] =~ /^\-/) && ($t[10]/$t[7] >=0.05)){ my $lenn =(length($t[8])-1); for (1..$lenn){ push (@gps, '-') } my $ndash .= '-' x ($lenn); # print "$t[1] 'dash-' $ndash\n"; substr($seq, $t[1], $lenn, "$ndash"); } } else { $dsize = scalar(@gps); if ($dsize >=1){shift @gps; #print "$dsize: $t[1]skippped\n"; next; } else { my $countref = ($t[8] =~ tr/.,//); if ( $countref / $t[7] > 0.95 ){next;} elsif ($countref/$t[7] <=0.95){ $t[8] =~ s/[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVN]+//g; + # removes unwanted indels from read base lines $t[8] =~ s/[\@\'\&\!\^\$\\)]+//g; # removes garbage symbo +ls $t[8] =~ s/[.,]/$t[2]/g; my %ratio; my @letters = qw/a t g c/; $report = " $t[1] : "; for my $p ( @letters ) { #`echo "$t[1]"`; my $count = eval "lc( '$t[8]' ) =~ tr/$p//"; $ratio{$p} = $count / $t[7]; # print "$p: $ratio{$p}\n"; $report .= sprintf( "%s %5.3f ", uc( $p ), $ratio{$p} ); } my $countnut = 0; my @ncode =(); for my $i (0 .. $#letters) { if ( $ratio{ $letters[$i] } >= 0.05 ) { push @ncode, $i + 1; #print "$t[1], @ncode, $letters[$i]\n"; $countnut++; } } my $ccode = join ("", @ncode); chomp $ccode; if ( ! exists( $hash{$ccode} )) { # print "not $t[1], $ccode\n"; substr($seq, ($t[1]-1), 1, "N"); } else { # print "yes $t[1], $ccode\n"; substr($seq, ($t[1]-1), 1, "$hash{$ccode}"); } } } # printf( "Data_line %5d : t1= %5s ; report= %s \n", $., $t[1], $r +eport ); } } # end of while print "$header"; # printing back the fasta header $seq= uc($seq) ; # print "$seq\n"; ; &p2q_post_process(\$seq) ; #print "$header"; # printing back the fasta header } # end of pileup ################## sub p2q_post_process { my ($seq) = @_; print &p2q_print_str($seq); } # sub p2q_print_str { my ($s) = @_; my $l = length($$s); for (my $i = 0; $i < $l; $i += 60) { print substr($$s, $i, 60), "\n"; } } # sub usage { die(qq/ Program: modify.pileup.pl Usage: modify.pileup.pl <command> [<arguments>]\n Command: pileup2markedfasta generate maked.fasta from `pileup -c r +aw pileup' \n/); } [download] input.pileup 2L 1 G A 48 48 26 7 AAaaAAA ACCCD?@ 2L 2 A A 48 0 26 7 ..,,... AACC@BC 2L 3 T K 18 18 26 7 .Ggg... ACCCCCC 2L 4 C C 57 0 37 16 .,..,,.,a.,,,,,^F, CBCC +BBC@3CCBBD=? 2L 5 C C 54 0 37 17 .,..,,.,a.,,,,,,^F, CDC +CDACB9CCA@B?@D 2L 6 c C 78 0 37 17 .,..,,.,,.,,,,,,, CBCCB +AC>9CBB@BA?A 2L 7 c C 45 0 37 19 .,..,,.,,.+1A,,,,t,,^F,^Ft + CDCCDAC>;CBCBD?DAA 2L 7 * / 55 0 37 19 * A 18 1 0 + 0 0 2L 8 a A 162 0 21 45 ,-4gtct,-4gtct,,....,,,,, +,.,.,,,,.,,,,,,,,...,.,.,,,,.,,, CCCCA<;=/CCACDBCBCCCCDCCCCBBCCBDC +CCCCDCDDCCAB 2L 8 * / 173 0 21 45 * -gtct 43 3 + 0 0 0 2L 9 g G 87 0 37 20 .,..,,.,,.,,,,,,,,,^F, +CDCCCAC?BCBBCD?<DBBB 2L 10 a A 87 0 37 20 .,..,,.,,.,,,,,,,,,, C +BCDABC>BCACBB:AC?4C 2L 11 g G 90 0 37 21 .,..,,.,,.,,,,,,,,,,^F. + CDCCCAC@@CDDCC<:DBABC 2L 12 c C 117 0 14 31 ...,,.,,-2gt.,,.,,.,... +.......,,.. ;B6CCBBCBCACDCCCB@DCDCC8CCCABCC 2L 12 /-gt 21 21 14 31 -gt 30 1 + 0 0 0 2L 13 g G 90 0 37 21 .,..,,.,,.,,,,,,,,,,. +CDCCCAC?:CDB5?<>BB>BC 2L 14 t T 45 0 37 6 ..,,,.-1T CCCCCC 2L 14 * -t/-t 56 59 37 6 -t * 1 5 +0 0 0 2L 15 t T 93 0 37 22 .,..,,.,,.,,,G,,,,,,.^F. + CDCDDDC<@CBB5B@ABC<ACC 2L 16 G G 178 0 36 50 .$,$,$,$,,,,,.,,,,..,.,, +...,,,,.,,...,....,...,,,$,,,., BCCCCCCCCCCCCBDBCBCCB>8CCCCBCCBDBC +8>ACCBB6CC?CCC6C 2L 17 A A 59 0 36 45 ,$,$,,,C,,,c..,.,,C..,,,, +.,,...,....,..C,,,,,$., CCCCCBCCCBB8CACC>@;CCCCDCCCDAC-5ABCAA2CCCC +?1A 2L 18 w T 54 54 37 9 tTTTTttTT CCDCCDCCC 2L 18 * A/+A 65 279 37 9 A * 8 1 0 + 0 0 2L 19 g G 54 0 37 9 ....,,... >>@@CC>BB 2L 20 c C 57 0 37 10 ....,,...^F, CACCCC?CB +B 2L 21 t T 57 0 37 10 ....,,..., BCDCCCCCB? 2L 22 t T 48 0 34 7 ..,.,,+4caaa,+4caaa C?C +CA=? 2L 22 * / 9 0 34 7 * +CAAA 5 2 0 + 0 0 2L 23 A A 48 0 34 7 .$.$,.,+1g,, DBC?CCA 2L 23 * / 19 0 34 7 * +G 6 1 0 + 0 0 2L 24 g G 60 0 14 33 .,...,,.-13GGCGCGCGTGCGC. +,,A,,A,,A,..........aa.. DCBBBBDBCCCCBCCCDCBDCBCCCACCBABCC 2L 24 * / 61 0 14 33 * -ggcgcgcgtgcgc +32 1 0 0 0 2L 25 a A 163 0 15 49 ..$,..,,,t,.T,,,..,,,,T, +-7attattt,,-7attattt,,,,,,....,,......,.,,.. BBCA>BCCCC:CCCC>ACCCC +BCCCCBCCCCCCDCCCCCCCCBCBCDCC 2L 25 * /-attattt 27 27 15 49 -attattt + 47 2 0 0 0 2L 26 c C 32 0 0 11 ,,-4tctt,,.-4TCTT...-4TCTT +.^!.^!, CCC8CCCCBCA 2L 26 * /-tctt 17 17 0 11 -tctt 8 +3 0 0 0 2L 27 T K 18 18 26 7 .Ggg... ACCCCCC 2L 28 C C 57 0 37 16 .,..,,.,a.,,,,,^F, CBC +CBBC@3CCBBD=? 2L 29 C C 54 0 37 17 .,..,,.,a.,,,,,,^F, CD +CCDACB9CCA@B?@D 2L 30 g G 87 0 37 20 .,..,,.,,.,,,,,,,,,^F, + CDCCCAC?BCBBCD?<DBBB 2L 31 a A 87 0 37 20 .,..,,.,,.,,,,,,,,,, C +BCDABC>BCACBB:AC?4C 2L 32 g G 90 0 37 21 .,..,,.,,.,,,,,,,,,,^F. + CDCCCAC@@CDDCC<:DBABC 2L 33 c C 117 0 14 31 ...,,*.,,-2gt.,,.,,.,... +.......,,.. ;B6CCBBCBCACDCCCB@DCDCC8CCCABCC [download] Output: `>input.fasta GATCCccagagcgttGAwgcttAgacTCCgagc >input.fasta AAKMMCYA----GNTGMTGCTTARWC----AGC` [download]	[reply] [d/l] [select]


Your skill will accomplish what the force of many cannot
	PerlMonks