http://qs321.pair.com?node_id=1177241

TJCooper has asked for the wisdom of the Perl Monks concerning the following question:

Given input data in the form of:

c 8 336158 75 75M 74
c 12 828707 74 74M 73
w 10 528559 74 74M 0
c 15 267766 74 74M 73
c 12 828707 74 74M 73
c 14 491797 74 74M 73

I am trying to tally the instances of records based on column 1 (which has the header 'Strand'; its position can vary, hence the use of List::Util qw(first)) as well as columns 2 and 3. The main chunk of code that accomplishes this is simply:

my @headers = split("\t", <$IN>);
my $index = first { $headers[$_] eq 'Strand' } 0 .. $#headers;

while (<$IN>) {
    chomp $_;
    my @F = split("\t", $_);
    if (! exists $hits{$F[$index+1]}{$F[$index+2]}) {
        $hits{$F[$index+1]}{$F[$index+2]}{'w'} = 0;
        $hits{$F[$index+1]}{$F[$index+2]}{'c'} = 0;
    }
    $hits{$F[$index+1]}{$F[$index+2]}{$F[$index]}++;
}
This is then printed in a simple manner to form files like these:
1 4 1 0
1 5 1 0
1 31 1 0
1 74 1 0
1 89 1 0
1 116 1 1
1 118 1 0
1 122 1 0
1 126 0 1
1 140 0 1
1 141 0 1
1 148 2 0
1 158 0 1
1 159 1 0

Columns 2 and 3 are shown, along with the frequency counts of W and C for each combination.

This approach, however, requires rather a lot of memory - around 800MB for an input file of ~100MB.

Are there any clever tricks or alternative methods that I could use to reduce the memory requirements? I note that for any given column 2/column 3 combination, a key and a blank (zeroed) value is stored the first time it is encountered - this is done because the output file is required in the format shown above, with '0' filled in. This may be increasing memory usage further, since the zeros could instead be added afterward (perhaps during printing), but I'm not entirely sure how I would do this.

Re: Memory usage while tallying instances of lines in a .txt file
by choroba (Cardinal) on Dec 05, 2016 at 16:48 UTC
    Help us to help you! Next time, please post code and data we can easily test. See my code below for an example - it should run as-is.

    To print 0 instead of undef, you can use the defined-or operator, //. It needs Perl 5.10 or newer; on older Perls you have to be more verbose (defined $_ ? $_ : 0).

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use feature qw{ say };
    use List::Util qw{ first };

    my $IN = *DATA{IO};
    my %hits;
    my @headers = split ' ', <$IN>;
    my $index = first { $headers[$_] eq 'Strand' } 0 .. $#headers;

    while (<$IN>) {
        chomp;
        my @F = split ' ';
        $hits{ $F[ $index + 1 ] }{ $F[ $index + 2 ] }{ $F[$index] }++;
    }

    for my $key (keys %hits) {
        for my $inner_key (keys %{ $hits{$key} }) {
            say join "\t", $key, $inner_key,
                map $_ // 0, @{ $hits{$key}{$inner_key} }{qw{ c w }};
        }
    }

    __DATA__
    Strand
    c 8 336158 75 75M 74
    c 12 828707 74 74M 73
    w 10 528559 74 74M 0
    c 15 267766 74 74M 73
    c 12 828707 74 74M 73
    c 14 491797 74 74M 73

    Have you noticed I used $_ only in map and first? Both chomp and split work with it by default. If you feel the need to explicitly type the argument, name the variable.

      Thanks - this saves about 300-350MB of RAM! I was not at all aware of the // operator, so I decided not to rely on autovivification and opted for the approach shown in the OP.

        The following code produces identical results to choroba's code but uses less than 1/4 of the memory (180MB vs 795MB for my test dataset) and runs more quickly:

        #! perl -slw
        use strict;
        use List::Util qw[ first ];

        my @headers = split ' ', scalar <>;
        my $f = first { $headers[$_] eq 'Strand' } 0 .. $#headers;

        my( $cCounts, $wCounts, $n, %index ) = ( '', '', 0 );

        while( <> ) {
            chomp;
            my @F = split ' ';
            my $index = $index{ $F[ $f + 1 ] }{ $F[ $f + 2 ] } //= $n++;
            ++vec( $F[ $f ] eq 'w' ? $wCounts : $cCounts, $index, 8 );
        }

        while( my( $key, $subhash ) = each %index ) {
            while( my( $subkey, $index ) = each %{ $subhash } ) {
                print join "\t", $key, $subkey,
                    vec( $cCounts, $index, 8 ), vec( $wCounts, $index, 8 );
            }
        }

        __END__
        1177246.pl 1177246.dat > 1177246.out

        It assumes no count will be greater than 255 (the most an 8-bit vec slot can hold). If that's too small, change the three 8s to 16s for a modest increase in memory consumption.
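
        To illustrate that limit, here's a minimal standalone sketch (my addition, not part of the solution above) showing how an 8-bit vec slot silently wraps around while a 16-bit slot does not:

        use strict;
        use warnings;

        my $counts = '';
        vec( $counts, 0, 8 ) = 255;         # maximum value an 8-bit slot can hold
        ++vec( $counts, 0, 8 );
        print vec( $counts, 0, 8 ), "\n";   # prints 0 - the tally wrapped around

        vec( $counts, 1, 16 ) = 255;
        ++vec( $counts, 1, 16 );
        print vec( $counts, 1, 16 ), "\n";  # prints 256 - fits in a 16-bit slot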


Re: Memory usage while tallying instances of lines in a .txt file
by dave_the_m (Monsignor) on Dec 05, 2016 at 16:44 UTC
    How many lines is the input file?

    Columns 2 and 3: are they always integers? Do they have well-defined minimum and maximum possible values? Is their distribution sparse (e.g. is it possible col 3 might have values 336158 and 336159, but then nothing until 491797, or is it likely that most of the gaps in between will appear at some point)?

    Dave.

      The input files can contain millions of lines - with around 1-1.5m unique entries that will be tallied up. The entries are indeed sparse.
Re: Memory usage while tallying instances of lines in a .txt file
by kcott (Archbishop) on Dec 06, 2016 at 05:17 UTC

    G'day TJCooper,

    I see you already have answers regarding the main thrust of your question, i.e. "Memory usage". My response here touches on other aspects of your posted code.

    Using [$index], [$index+1] and [$index+2] does not make it clear what data you're accessing. This results in code that's more difficult to read and maintain, as well as more error-prone. Consider the improvement in clarity if those appeared as these alternatives:

    [$index]   -> [$index_of{Strand}]
    [$index+1] -> [$index_of{Type}]
    [$index+2] -> [$index_of{Pos}]

    In "Re^2: Memory usage while tallying instances of lines in a .txt file", you show two potential formats for your input data. In the first format, the wanted columns are in the order that you've hard-coded them; in the second, the hard-coded order stays the same but they're in different positions (because an additional column has been added before them). Given your input is variable, it could potentially take on other variances in the future; for instance, an additional column could be added between your wanted columns or the order of those columns could change.

    You can achieve the improvement in clarity indicated above, get rid of the need to load a module (i.e. List::Util) to handle a few dozen bytes of a ~100MB file, and protect yourself against future changes, with this line of code:

    @index_of{@headers} = 0 .. $#headers;

    See "perldata: Slices" if you're unfamiliar with that construct. Here's example code using your two current formats and two potential future ones:

    #!/usr/bin/env perl -l
    use strict;
    use warnings;

    my @test_headers = (
        [qw{Strand Type Pos Length Form Adjustment}],
        [qw{ID Strand Type Pos Length Form Adjustment}],
        [qw{Strand XXX Type Pos Length Form Adjustment}],
        [qw{Pos Type Length Strand Form Adjustment}],
    );

    for (@test_headers) {
        my @headers = @$_;
        my %index_of;
        @index_of{@headers} = 0 .. $#headers;
        print "Headers: @headers";
        print "Strand index: $index_of{Strand}";
        print "Type index: $index_of{Type}";
        print "Pos index: $index_of{Pos}";
    }

    Output:

    Headers: Strand Type Pos Length Form Adjustment
    Strand index: 0
    Type index: 1
    Pos index: 2
    Headers: ID Strand Type Pos Length Form Adjustment
    Strand index: 1
    Type index: 2
    Pos index: 3
    Headers: Strand XXX Type Pos Length Form Adjustment
    Strand index: 0
    Type index: 2
    Pos index: 3
    Headers: Pos Type Length Strand Form Adjustment
    Strand index: 3
    Type index: 1
    Pos index: 0

    Another potential improvement would be to read your input with Text::CSV (and, if you also have Text::CSV_XS installed, it will run more quickly). CSV stands for comma-separated values; however, by changing the "sep_char" attribute, the module works equally well for tab-, pipe-, or whatever-separated values. Whenever you need to deal with data in these types of formats, I'd recommend reaching for this module first and only attempting to roll your own custom solution as a last resort.
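
    As a rough sketch of that approach (assuming the OP's file name and headers; sep_char is the only attribute that needs changing for tab-separated data):

    use strict;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new({ sep_char => "\t", auto_diag => 1 });
    open my $fh, '<', 'file.txt' or die $!;

    # Map header names to column positions, as described above
    my $headers = $csv->getline($fh);
    my %index_of;
    @index_of{@$headers} = 0 .. $#$headers;

    my %hits;
    while (my $row = $csv->getline($fh)) {
        my ($strand, $type, $pos) = @$row[ @index_of{qw{Strand Type Pos}} ];
        $hits{$type}{$pos}{$strand}++;
    }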

    — Ken

Re: Memory usage while tallying instances of lines in a .txt file
by SuicideJunkie (Vicar) on Dec 05, 2016 at 16:47 UTC

    Change the while into while (my $line = <$IFH>), and change the corresponding references from $_ to $line.

    Also, there doesn't seem to be any need to test for the existence of the hits keys; simply let them autovivify. You can use ...{w}//0 and ...{c}//0 when printing, in case only one of the two values was ever incremented. That will simplify your main loop to three lines (see the sketch below).
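
    A minimal sketch of that simplification, assuming the OP's %hits, $index and input filehandle (the sorts are only there to give a stable output order):

    while (my $line = <$IN>) {
        chomp $line;
        my @F = split "\t", $line;
        $hits{ $F[$index+1] }{ $F[$index+2] }{ $F[$index] }++;   # autovivifies
    }

    for my $col2 (sort { $a <=> $b } keys %hits) {
        for my $col3 (sort { $a <=> $b } keys %{ $hits{$col2} }) {
            print join("\t", $col2, $col3,
                $hits{$col2}{$col3}{w} // 0,    # fill in the zeros at print time
                $hits{$col2}{$col3}{c} // 0), "\n";
        }
    }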

Re: Memory usage while tallying instances of lines in a .txt file
by Discipulus (Canon) on Dec 06, 2016 at 08:51 UTC
    Hello,

    You got quality answers; just a minor tip to add for my part. If using one-letter variable names is a bad thing, reusing the name of a Perl special variable is an even worse habit.

    In fact, @F is the field array used by Perl's -a switch. Even though it's lexically scoped in your example, I think it's a practice to avoid.
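
    For anyone unfamiliar with it, a one-liner such as perl -lane 'print $F[2]' file.txt expands to roughly the following (a sketch of the effect, not the exact internals):

    while (<>) {
        chomp;                    # from -l: strip the input record separator
        our @F = split ' ', $_;   # from -a: autosplit into the package array @F
        print $F[2], "\n";        # -l also appends the separator on print
    }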

    Pardon my pedantry while visiting this node about file speed!

    L*

    Update: fixed perlvar link, thanks to pryrt.

Re: Memory usage while tallying instances of lines in a .txt file
by stevieb (Canon) on Dec 05, 2016 at 16:42 UTC

    This is because you're reading the entire file into an array. To reduce the memory footprint, you're better off reading the file a line at a time.

    Can you please show us an example of the line you're running this on?

    my $index = first{$headers[$_] eq 'Strand'} 0..$#headers;

    We should be able to help you rewrite your code if we know where $index is being gleaned from.

      The intention is to grab $index from the header line of the .txt file (which appears only once, on line 1). It's nothing more than a set of tab-delimited headers:

      Strand    Type    Pos    Length    Form    Adjustment

      However it can sometimes take the form:

      ID   Strand    Type    Pos    Length    Form    Adjustment

        The following code does what you want, i.e. "Strand" can be at any position on the first line, and it removes the extreme memory overhead of reading the whole file in at once.

        use warnings;
        use strict;
        use Data::Dumper;
        use List::Util qw(first);

        my %hits;
        my $index;

        open my $fh, '<', 'file.txt' or die $!;

        while (<$fh>){
            chomp;
            my @F = split ' ';
            if (/Strand/){
                $index = first { $F[$_] eq 'Strand' } 0..$#F;
                next;
            }
            if (! exists $hits{$F[$index+1]}{$F[$index+2]}) {
                $hits{$F[$index+1]}{$F[$index+2]}{'w'} = 0;
                $hits{$F[$index+1]}{$F[$index+2]}{'c'} = 0;
            }
            $hits{$F[$index+1]}{$F[$index+2]}{$F[$index]}++;
        }

        print Dumper \%hits;

        Data used:

        Strand
        1 4 1 0
        1 5 1 0
        1 31 1 0
        1 74 1 0