TJCooper has asked for the wisdom of the Perl Monks concerning the following question:
Given input data in the form of:
    c 8 336158 75 75M 74
    c 12 828707 74 74M 73
    w 10 528559 74 74M 0
    c 15 267766 74 74M 73
    c 12 828707 74 74M 73
    c 14 491797 74 74M 73
I am trying to tally records by column 1 (which has the header 'Strand'; its position can vary, hence the use of first from List::Util) together with columns 2 and 3, i.e. the two columns that follow it. The main chunk of code that accomplishes this is simply:
    my @headers = split("\t", <$IN>);
    my $index = first { $headers[$_] eq 'Strand' } 0..$#headers;
    while (<$IN>) {
        chomp $_;
        my @F = split("\t", $_);
        # First time this column 2 / column 3 pair is seen: zero both counters
        unless (exists $hits{$F[$index+1]}{$F[$index+2]}) {
            $hits{$F[$index+1]}{$F[$index+2]}{'w'} = 0;
            $hits{$F[$index+1]}{$F[$index+2]}{'c'} = 0;
        }
        $hits{$F[$index+1]}{$F[$index+2]}{$F[$index]}++;
    }

This is then printed in a simple manner to form files like these:
    1 4 1 0
    1 5 1 0
    1 31 1 0
    1 74 1 0
    1 89 1 0
    1 116 1 1
    1 118 1 0
    1 122 1 0
    1 126 0 1
    1 140 0 1
    1 141 0 1
    1 148 2 0
    1 158 0 1
    1 159 1 0
That is, columns 2 and 3, followed by the frequency counts of each for W and C.
This approach, however, requires rather a lot of memory: around 800MB for an input file of ~100MB.
Are there any clever tricks or alternative methods that I could use to reduce the memory requirements? I note that for any given column 2-column 3 combination, a key with a zeroed value is stored for each strand the first time the combination is encountered. This is done because the output file is required in the format shown above, where '0' is filled in. This may be increasing memory usage further when the zeros could instead be added afterward (perhaps during printing), but I'm not entirely sure whether that is the case or how I would do this.
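
For illustration, the idea of adding the zeros during printing could look something like the sketch below. It is only a sketch under assumptions: the sub names tally_hits and format_rows are made up for this example, the header names other than 'Strand' are hypothetical, columns 2 and 3 are assumed to be numeric (for sorting), and the defined-or operator `//` needs Perl 5.10 or later. Only the strand actually seen gets a key while reading; a missing 'w' or 'c' count is replaced by 0 when the row is built.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Tally strand counts per (column 2, column 3) pair without pre-zeroing.
# $in is a filehandle whose first line is the tab-separated header.
sub tally_hits {
    my ($in) = @_;
    my $header = <$in>;
    chomp $header;
    my @headers = split /\t/, $header;
    my $index = first { $headers[$_] eq 'Strand' } 0 .. $#headers;
    die "No 'Strand' column found\n" unless defined $index;

    my %hits;
    while (my $line = <$in>) {
        chomp $line;
        my @F = split /\t/, $line;
        # Autovivification creates the nested hashes; no zero placeholders.
        $hits{ $F[$index + 1] }{ $F[$index + 2] }{ $F[$index] }++;
    }
    return \%hits;
}

# Build tab-separated output rows, filling in 0 for a strand never seen.
sub format_rows {
    my ($hits) = @_;
    my @rows;
    for my $col2 (sort { $a <=> $b } keys %$hits) {
        for my $col3 (sort { $a <=> $b } keys %{ $hits->{$col2} }) {
            my $counts = $hits->{$col2}{$col3};
            push @rows, join("\t", $col2, $col3,
                             $counts->{w} // 0, $counts->{c} // 0);
        }
    }
    return @rows;
}

# Usage with an in-memory file; 'Col2' and 'Col3' are hypothetical headers.
my $sample = "Strand\tCol2\tCol3\nc\t8\t336158\nw\t10\t528559\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for format_rows(tally_hits($fh));
close $fh;
```

This stores one counter per position instead of two whenever only one strand occurs. How much memory that actually saves would need to be measured, since each column 2-column 3 pair still gets its own small hash.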