http://qs321.pair.com?node_id=1177241

TJCooper has asked for the wisdom of the Perl Monks concerning the following question:

Given input data in the form of:

c 8 336158 75 75M 74
c 12 828707 74 74M 73
w 10 528559 74 74M 0
c 15 267766 74 74M 73
c 12 828707 74 74M 73
c 14 491797 74 74M 73

I am trying to tally the instances of records based on column 1 (which has the header 'Strand'; its position can vary, hence the use of List::Util qw(first)) as well as columns 2 and 3. The main chunk of code that accomplishes this is simply:

my @headers = split("\t", <$IN>);
my $index = first { $headers[$_] eq 'Strand' } 0 .. $#headers;

while (<$IN>) {
    chomp $_;
    my @F = split("\t", $_);
    if (! exists $hits{$F[$index+1]}{$F[$index+2]}) {
        $hits{$F[$index+1]}{$F[$index+2]}{'w'} = 0;
        $hits{$F[$index+1]}{$F[$index+2]}{'c'} = 0;
    }
    $hits{$F[$index+1]}{$F[$index+2]}{$F[$index]}++;
}
This is then printed in a simple manner to form files like these:
1 4 1 0
1 5 1 0
1 31 1 0
1 74 1 0
1 89 1 0
1 116 1 1
1 118 1 0
1 122 1 0
1 126 0 1
1 140 0 1
1 141 0 1
1 148 2 0
1 158 0 1
1 159 1 0

Columns 2 and 3 are shown, along with the frequency counts of W and C for each combination.

This approach, however, requires rather a lot of memory - around 800MB for an input file of ~100MB.

Are there any clever tricks or alternative methods that I could use to reduce the memory requirements? I note that for any given column 2/column 3 combination, a key and a blank (zeroed) value is stored the first time it is encountered - this is done because the output file is required in the format shown above, with '0' filled in. This may be increasing memory usage further, since the zeros could instead be added afterward (perhaps during printing), but I'm not entirely sure how I would do this.

Re: Memory usage while tallying instances of lines in a .txt file
by choroba (Cardinal) on Dec 05, 2016 at 16:48 UTC
    Help us to help you! Next time, please post code and data we can easily test. See my code below for an example - it should run as-is.

    To print 0 instead of undef, you can use the defined-or operator, //. It needs Perl 5.10 or newer; on older Perls you have to be more verbose (defined $_ ? $_ : 0).

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use feature qw{ say };
    use List::Util qw{ first };

    my $IN = *DATA{IO};
    my %hits;
    my @headers = split ' ', <$IN>;
    my $index = first { $headers[$_] eq 'Strand' } 0 .. $#headers;

    while (<$IN>) {
        chomp;
        my @F = split ' ';
        $hits{ $F[ $index + 1 ] }{ $F[ $index + 2 ] }{ $F[$index] }++;
    }

    for my $key (keys %hits) {
        for my $inner_key (keys %{ $hits{$key} }) {
            say join "\t", $key, $inner_key,
                map $_ // 0, @{ $hits{$key}{$inner_key} }{qw{ c w }};
        }
    }

    __DATA__
    Strand
    c 8 336158 75 75M 74
    c 12 828707 74 74M 73
    w 10 528559 74 74M 0
    c 15 267766 74 74M 73
    c 12 828707 74 74M 73
    c 14 491797 74 74M 73

    Have you noticed I used $_ only in map and first? Both chomp and split work with it by default. If you feel the need to explicitly type the argument, name the variable.

      Thanks - this saves about 300-350MB of RAM! I was not at all aware of the // operator, so I decided not to rely on autovivification and opted for the approach shown in the OP.

        The following code produces identical results to choroba's code but uses less than 1/4 of the memory (180MB vs 795MB for my test dataset) and runs more quickly:

        #! perl -slw
        use strict;
        use List::Util qw[ first ];

        my @headers = split ' ', scalar <>;
        my $f = first { $headers[$_] eq 'Strand' } 0 .. $#headers;

        my( $cCounts, $wCounts, $n, %index ) = ( '', '', 0 );

        while( <> ) {
            chomp;
            my @F = split ' ';
            my $index = $index{ $F[ $f + 1 ] }{ $F[ $f + 2 ] } //= $n++;
            ++vec( $F[ $f ] eq 'w' ? $wCounts : $cCounts, $index, 8 );
        }

        while( my( $key, $subhash ) = each %index ) {
            while( my( $subkey, $index ) = each %{ $subhash } ) {
                print join "\t", $key, $subkey,
                    vec( $cCounts, $index, 8 ), vec( $wCounts, $index, 8 );
            }
        }

        __END__
        1177246.pl 1177246.dat > 1177246.out

        It assumes no count will be greater than 255 (the most an 8-bit vec slot can hold). If that's too small, change the three 8s to 16s for a modest increase in memory consumption.
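
        To illustrate that limit, here's a minimal standalone sketch (my addition, not part of the solution above) showing how an 8-bit vec slot silently wraps around while a 16-bit slot does not:

        use strict;
        use warnings;

        my $counts = '';
        vec( $counts, 0, 8 ) = 255;         # maximum value an 8-bit slot can hold
        ++vec( $counts, 0, 8 );
        print vec( $counts, 0, 8 ), "\n";   # prints 0 - the tally wrapped around

        vec( $counts, 1, 16 ) = 255;
        ++vec( $counts, 1, 16 );
        print vec( $counts, 1, 16 ), "\n";  # prints 256 - fits in a 16-bit slot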


Re: Memory usage while tallying instances of lines in a .txt file
by dave_the_m (Monsignor) on Dec 05, 2016 at 16:44 UTC
    How many lines is the input file?

    Columns 2 and 3: are they always integers? Do they have well-defined minimum and maximum possible values? Is their distribution sparse (e.g. is it possible col 3 might have values 336158 and 336159, but then nothing until 491797, or is it likely that most of the gaps in between will appear at some point)?

    Dave.

      The input files can contain millions of lines - with around 1-1.5m unique entries that will be tallied up. The entries are indeed sparse.
Re: Memory usage while tallying instances of lines in a .txt file
by kcott (Archbishop) on Dec 06, 2016 at 05:17 UTC

    G'day TJCooper,

    I see you already have answers regarding the main thrust of your question, i.e. "Memory usage". My response here touches on other aspects of your posted code.

    Using [$index], [$index+1] and [$index+2] does not make it clear what data you're accessing. This results in code that's more difficult to read and maintain, as well as more error-prone. Consider the improvement in clarity if those appeared as these alternatives:

    [$index]   -> [$index_of{Strand}]
    [$index+1] -> [$index_of{Type}]
    [$index+2] -> [$index_of{Pos}]

    In "Re^2: Memory usage while tallying instances of lines in a .txt file", you show two potential formats for your input data. In the first format, the wanted columns are in the order that you've hard-coded them; in the second, the hard-coded order stays the same but they're in different positions (because an additional column has been added before them). Given your input is variable, it could potentially take on other variances in the future; for instance, an additional column could be added between your wanted columns or the order of those columns could change.

    You can achieve the improvement in clarity indicated above, get rid of the need to load a module (i.e. List::Util) to handle a few dozen bytes of a ~100MB file, and protect yourself against future changes, with this line of code:

    @index_of{@headers} = 0 .. $#headers;

    See "perldata: Slices" if you're unfamiliar with that construct. Here's example code using your two current formats and two potential future ones:

    #!/usr/bin/env perl -l
    use strict;
    use warnings;

    my @test_headers = (
        [qw{Strand Type Pos Length Form Adjustment}],
        [qw{ID Strand Type Pos Length Form Adjustment}],
        [qw{Strand XXX Type Pos Length Form Adjustment}],
        [qw{Pos Type Length Strand Form Adjustment}],
    );

    for (@test_headers) {
        my @headers = @$_;
        my %index_of;
        @index_of{@headers} = 0 .. $#headers;
        print "Headers: @headers";
        print "Strand index: $index_of{Strand}";
        print "Type index: $index_of{Type}";
        print "Pos index: $index_of{Pos}";
    }

    Output:

    Headers: Strand Type Pos Length Form Adjustment
    Strand index: 0
    Type index: 1
    Pos index: 2
    Headers: ID Strand Type Pos Length Form Adjustment
    Strand index: 1
    Type index: 2
    Pos index: 3
    Headers: Strand XXX Type Pos Length Form Adjustment
    Strand index: 0
    Type index: 2
    Pos index: 3
    Headers: Pos Type Length Strand Form Adjustment
    Strand index: 3
    Type index: 1
    Pos index: 0

    Another potential improvement would be to read your input with Text::CSV (and, if you also have Text::CSV_XS installed, it will run more quickly). CSV stands for comma-separated values; however, by changing the "sep_char" attribute, the module works equally well for tab-, pipe-, or whatever-separated values. Whenever you need to deal with data in these types of formats, I'd recommend reaching for this module first and only attempting to roll your own custom solution as a last resort.
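
    As a rough sketch of that approach (assuming the OP's file name and headers; sep_char is the only attribute that needs changing for tab-separated data):

    use strict;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new({ sep_char => "\t", auto_diag => 1 });
    open my $fh, '<', 'file.txt' or die $!;

    # Map header names to column positions, as described above
    my $headers = $csv->getline($fh);
    my %index_of;
    @index_of{@$headers} = 0 .. $#$headers;

    my %hits;
    while (my $row = $csv->getline($fh)) {
        my ($strand, $type, $pos) = @$row[ @index_of{qw{Strand Type Pos}} ];
        $hits{$type}{$pos}{$strand}++;
    }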

    — Ken

Re: Memory usage while tallying instances of lines in a .txt file
by SuicideJunkie (Vicar) on Dec 05, 2016 at 16:47 UTC

    Change the while into while (my $line = <$IFH>), and change the corresponding references from $_ to $line.

    Also, there doesn't seem to be any need to test for the existence of the hits keys; simply let them autovivify. You can use ...{w}//0 and ...{c}//0 when printing, in case only one of the two values was ever incremented. That will simplify your main loop to three lines (see the sketch below).
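
    A minimal sketch of that simplification, assuming the OP's %hits, $index and input filehandle (the sorts are only there to give a stable output order):

    while (my $line = <$IN>) {
        chomp $line;
        my @F = split "\t", $line;
        $hits{ $F[$index+1] }{ $F[$index+2] }{ $F[$index] }++;   # autovivifies
    }

    for my $col2 (sort { $a <=> $b } keys %hits) {
        for my $col3 (sort { $a <=> $b } keys %{ $hits{$col2} }) {
            print join("\t", $col2, $col3,
                $hits{$col2}{$col3}{w} // 0,    # fill in the zeros at print time
                $hits{$col2}{$col3}{c} // 0), "\n";
        }
    }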

Re: Memory usage while tallying instances of lines in a .txt file
by Discipulus (Canon) on Dec 06, 2016 at 08:51 UTC
    Hello,

    You got quality answers; just a minor tip to add for my part. If using one-letter variable names is a bad thing, reusing the name of a Perl special variable is an even worse habit.

    In fact, @F is the field array used by Perl's -a switch. Even though it's lexically scoped in your example, I think it's a practice to avoid.
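
    For anyone unfamiliar with it, a one-liner such as perl -lane 'print $F[2]' file.txt expands to roughly the following (a sketch of the effect, not the exact internals):

    while (<>) {
        chomp;                    # from -l: strip the input record separator
        our @F = split ' ', $_;   # from -a: autosplit into the package array @F
        print $F[2], "\n";        # -l also appends the separator on print
    }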

    Pardon my pedantry while visiting this node about file speed!

    L*

    Update: fixed perlvar link, thanks to pryrt.

Re: Memory usage while tallying instances of lines in a .txt file
by stevieb (Canon) on Dec 05, 2016 at 16:42 UTC

    This is because you're reading the entire file into an array. To reduce the memory footprint, you're better off reading the file a line at a time.

    Can you please show us an example of the line you're running this on?

    my $index = first{$headers[$_] eq 'Strand'} 0..$#headers;

    We should be able to help you rewrite your code if we know where $index is being gleaned from.

      The intention is to grab $index from the header line of the .txt file (which appears only once, on line 1). It's nothing more than a set of tab-delimited headers:

      Strand    Type    Pos    Length    Form    Adjustment

      However it can sometimes take the form:

      ID   Strand    Type    Pos    Length    Form    Adjustment

        The following code does what you want, i.e. "Strand" can be at any position on the first line, and it removes the extreme memory overhead of reading the whole file in at once.

        use warnings;
        use strict;
        use Data::Dumper;
        use List::Util qw(first);

        my %hits;
        my $index;

        open my $fh, '<', 'file.txt' or die $!;

        while (<$fh>){
            chomp;
            my @F = split ' ';
            if (/Strand/){
                $index = first { $F[$_] eq 'Strand' } 0..$#F;
                next;
            }
            if (! exists $hits{$F[$index+1]}{$F[$index+2]}) {
                $hits{$F[$index+1]}{$F[$index+2]}{'w'} = 0;
                $hits{$F[$index+1]}{$F[$index+2]}{'c'} = 0;
            }
            $hits{$F[$index+1]}{$F[$index+2]}{$F[$index]}++;
        }

        print Dumper \%hits;

        Data used:

        Strand
        1 4 1 0
        1 5 1 0
        1 31 1 0
        1 74 1 0