G'day TJCooper,
I see you already have answers regarding the main thrust of your question, i.e. "Memory usage".
My response here touches on other aspects of your posted code.
Using [$index], [$index+1] and [$index+2] does not make it clear what data you're accessing.
This makes the code harder to read and maintain, and more error-prone.
Consider the improvement in clarity if those appeared as these alternatives:
[$index] -> [$index_of{Strand}]
[$index+1] -> [$index_of{Type}]
[$index+2] -> [$index_of{Pos}]
In "Re^2: Memory usage while tallying instances of lines in a .txt file", you show two potential formats for your input data.
In the first format, the wanted columns are in the order that you've hard-coded them;
in the second, the order is unchanged but the columns occupy different positions
(because an additional column has been added before them).
Given that your input format varies, it could change in other ways in the future; for instance,
an additional column could be added between your wanted columns or the order of those columns could change.
You can achieve the improvement in clarity indicated above,
get rid of the need to load a module (i.e. List::Util)
to handle a few dozen bytes of an 800MB file,
and protect yourself against future changes,
with this line of code:
@index_of{@headers} = 0 .. $#headers;
See "perldata: Slices" if you're unfamiliar with that construct.
Here's example code using your two current formats and two potential future ones:
#!/usr/bin/env perl -l
use strict;
use warnings;
my @test_headers = (
    [qw{Strand Type Pos Length Form Adjustment}],
    [qw{ID Strand Type Pos Length Form Adjustment}],
    [qw{Strand XXX Type Pos Length Form Adjustment}],
    [qw{Pos Type Length Strand Form Adjustment}],
);
for (@test_headers) {
    my @headers = @$_;
    my %index_of;
    @index_of{@headers} = 0 .. $#headers;
    print "Headers: @headers";
    print "Strand index: $index_of{Strand}";
    print "Type index: $index_of{Type}";
    print "Pos index: $index_of{Pos}";
}
Output:
Headers: Strand Type Pos Length Form Adjustment
Strand index: 0
Type index: 1
Pos index: 2
Headers: ID Strand Type Pos Length Form Adjustment
Strand index: 1
Type index: 2
Pos index: 3
Headers: Strand XXX Type Pos Length Form Adjustment
Strand index: 0
Type index: 2
Pos index: 3
Headers: Pos Type Length Strand Form Adjustment
Strand index: 3
Type index: 1
Pos index: 0
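To show how %index_of might slot into the tallying itself, here's a minimal sketch. The sample records, the tab separator and the names @lines, @wanted and %tally are my assumptions for illustration; they aren't taken from your code.

```perl
use strict;
use warnings;

# Hypothetical sample data: a header line followed by tab-separated records.
my @lines = (
    "ID\tStrand\tType\tPos\tLength",
    "a1\t+\tgene\t100\t50",
    "a2\t-\tgene\t200\t75",
    "a3\t+\tgene\t100\t60",
);

# Map each header name to its column position.
my @headers = split /\t/, shift @lines;
my %index_of;
@index_of{@headers} = 0 .. $#headers;

# Look up the positions of the wanted columns once, by name (a hash slice).
my @wanted = @index_of{qw{Strand Type Pos}};

# Tally each distinct Strand/Type/Pos combination (an array slice per record).
my %tally;
for my $line (@lines) {
    my @fields = split /\t/, $line;
    ++$tally{ join "\t", @fields[@wanted] };
}

print "$_\t$tally{$_}\n" for sort keys %tally;
```

Because the positions are resolved from the header line, the loop body stays the same whichever of the four header layouts above the file uses.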
Another potential improvement would be to consider reading your input with Text::CSV (and, if you also have Text::CSV_XS installed, it will run more quickly).
The CSV stands for comma-separated values; however, by changing the "sep_char" attribute,
it works equally well for tab-, pipe-, whatever-separated values.
Whenever you need to deal with data in these types of formats,
I'd recommend reaching for this module first and only attempting to roll your own custom solution as a last resort.
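As a rough sketch of what that could look like for tab-separated input: the in-memory sample data and the variable names here are made up for the demonstration, and the "binary" and "auto_diag" attributes are just the settings I'd typically choose, not something your data necessarily needs.

```perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical tab-separated input, held in memory for the demonstration.
my $data = join '', map { "$_\n" }
    "ID\tStrand\tType\tPos",
    "a1\t+\tgene\t100",
    "a2\t-\tgene\t200";

open my $fh, '<', \$data or die "Can't open in-memory file: $!";

# sep_char switches the separator from the default comma to a tab.
my $csv = Text::CSV->new({ sep_char => "\t", binary => 1, auto_diag => 1 });

# Treat the first row as column names, so later rows come back as hashrefs.
$csv->column_names(@{ $csv->getline($fh) });

my @rows;
while (my $row = $csv->getline_hr($fh)) {
    push @rows, $row;
    # Fields are accessed by name, regardless of their position in the file.
    print join(' ', @{$row}{qw{Strand Type Pos}}), "\n";
}
close $fh;
```

With getline_hr, the column positions never appear in your code at all, which gives the same protection against reordered or added columns as %index_of.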