
I did the following:

perl -e 'open my $outfh, ">", "sample.txt"; while ($i++ < 50_000_000) {print $outfh "abcdefghijklmnopqrstuvwxyz0123456789\n";}'

On my laptop with an SSD, that took about fifteen seconds to run and produced a 50-million-line file (37 bytes per line, so roughly 1.85 GB). Then I did this:

perl -E 'open my $infh, "<", "sample.txt"; while(<$infh>) {$i++} say $i;'

And that took about eight seconds to run, which is roughly the floor for just reading a file of that size line by line. Your code does more than that: inside the while() {...} loop you're invoking the regex engine, doing a capture, and pushing onto two arrays. If, say, 50% of the lines in your file match, you'll be pushing 25 million captures into the arrays. Depending on the size of your captures, you could have one to several gigabytes stored in the arrays.
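
To put a rough number on that, here's a back-of-the-envelope sketch. The 50-byte per-scalar overhead is purely my assumption for illustration (the real figure depends on your perl build and version), and estimate_bytes is a made-up helper, not something from your code:

```perl
use strict;
use warnings;

# Rough estimate of the memory held by an array of captured strings.
# $overhead is an ASSUMED per-scalar bookkeeping cost in bytes; measure on
# your own perl if you need a real number.
sub estimate_bytes {
    my ($count, $avg_len, $overhead) = @_;
    return $count * ($avg_len + $overhead);
}

my $lines    = 50_000_000;
my $hit_rate = 0.5;                  # half the lines match
my $hits     = $lines * $hit_rate;   # 25 million captures

for my $avg_len (10, 40, 100) {
    my $bytes = estimate_bytes($hits, $avg_len, 50);
    printf "%3d-byte captures: about %.1f GB\n", $avg_len, $bytes / 1e9;
}
```

Even with short captures, the fixed cost per array element dominates, which is why the total climbs into gigabytes so quickly.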

If your run times for the code segment you showed are under 30-45 seconds, you're probably doing about as well as can be expected for a single process working through a file. If the time is over a couple of minutes, you're probably swamping memory and doing a lot of paging behind the scenes. If that's the case, consider writing entries to a couple of output files instead of pushing them onto the @good and @sample arrays. This adds IO overhead to the process, but it removes the memory pressure, which is probably generating even more IO overhead at a much lower layer.
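
A minimal sketch of that idea follows. The regex and the split_to_files name are placeholders of mine, since I don't know your exact pattern or file names; adjust them to match your data:

```perl
use strict;
use warnings;

# Stream matches to two output files instead of growing @good and @sample.
# The pattern below is a placeholder; substitute your own.
sub split_to_files {
    my ($in_path, $sample_path, $good_path) = @_;

    open my $in,     '<', $in_path     or die "Cannot read $in_path: $!";
    open my $sample, '>', $sample_path or die "Cannot write $sample_path: $!";
    open my $good,   '>', $good_path   or die "Cannot write $good_path: $!";
    my %out = (sample => $sample, good => $good);

    while (my $line = <$in>) {
        # Only the current line is held in memory at any time.
        if ($line =~ /^(sample|good)\s+(\S+)/) {
            print { $out{$1} } "$2\n";
        }
    }

    close $_ or die "Close failed: $!" for $in, $sample, $good;
    return;
}
```

You would call it as split_to_files('input.txt', 'sample.out', 'good.out'), then read each output file line by line in a later pass.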

Once the 'sample' and 'good' files are written, you can process them line by line to do whatever you would have done with the arrays. Another alternative: instead of pushing onto @sample and @good, do the processing that would later happen on those arrays just in time, as each line of the input file is read. For example:

my %dispatch = (
    sample => sub {
        my $capture = shift;
        # do something with $capture
    },
    good   => sub {
        my $capture = shift;
        # do something with $capture
    },
);

while(<FILE>) {
    if (/^(sample|good)\s+(\S+)/) {
        $dispatch{$1}->($2);
    }
}

As long as "do something with $capture" does not involve storing every capture in an array, this should pretty much wipe out the large memory footprint, because only the current line and whatever small state the handlers keep are in memory at once.
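
For instance, here's a sketch where the handlers keep only running totals. The %stats hash, the things it counts, and the process_handle name are all hypothetical, just to show the shape; your real per-capture work goes in their place:

```perl
use strict;
use warnings;

# Handlers that keep only small running totals, never the captures themselves.
my %stats = (sample => 0, good => 0, good_chars => 0);

my %dispatch = (
    sample => sub { $stats{sample}++ },
    good   => sub {
        my $capture = shift;
        $stats{good}++;
        $stats{good_chars} += length $capture;
    },
);

# Read from an already-open filehandle and dispatch each matching line.
sub process_handle {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        if ($line =~ /^(sample|good)\s+(\S+)/) {
            $dispatch{$1}->($2);
        }
    }
    return \%stats;
}
```

Memory use stays flat no matter how many lines match, since each capture is discarded as soon as its handler returns.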

Dave

In reply to Re: About text file parsing by davido
in thread About text file parsing by dideod.yang
