
Processing large files

by Dr Manhattan (Beadle)
on Aug 21, 2013 at 06:16 UTC ( [id://1050286] )

Dr Manhattan has asked for the wisdom of the Perl Monks concerning the following question:

Hi all

I have a text file that I want to extract some information from; however, the file is too large to read all at once. So at the moment I'm trying to read 3000 lines at a time, process and extract the info, print it, clear the memory, and then go on to the next 3000 lines.

This is the code I am currently trying out:

my @array;
my $counter = 0;

while (<Input>) {
    my $line = $_;
    chomp $line;
    push (@array, $line);

    if ($counter = 3000) {
        my @information;

        foreach my $element (@array) {
            # extract info from $element and push into @information
        }

        for my $x (@information) {
            print Output "$x\n";
        }

        $counter = 0;
        @information = ();
    }
}

However, when I try this, the output file just never stops growing, so I think I might be creating an endless loop somewhere. Any ideas/pointers?

Thanks in advance for any help

Replies are listed 'Best First'.
Re: Processing large files
by BrowserUk (Patriarch) on Aug 21, 2013 at 06:25 UTC
    if ($counter = 3000)

    If you had warnings enabled, Perl would tell you:

    if ($counter = 3000) { 1 };;
    Found = in conditional, should be == at ...

Re: Processing large files
by Athanasius (Archbishop) on Aug 21, 2013 at 06:36 UTC

    BrowserUk has identified the main problem. In addition:

    It looks like @array's memory is never cleared. Also, you don't need to clear @information explicitly: it is a lexical variable, so it will be re-initialised each time the if block is entered. So, change the @information = (); line to:

    @array = ();
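
    For example, the tail of the if block would then look something like this (a minimal sketch, with the == fix from BrowserUk's reply applied as well):

    if ($counter == 3000) {
        # ... extract info from @array and print it to Output ...
        $counter = 0;
        @array   = ();    # release the chunk that has just been processed
    }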

    Hope that helps,

    Athanasius <°(((>< contra mundum

Re: Processing large files
by mtmcc (Hermit) on Aug 21, 2013 at 12:17 UTC
    As well as the above points, you don't seem to increment $counter at any point either, so it won't reach 3000...
Re: Processing large files
by kcott (Archbishop) on Aug 21, 2013 at 09:19 UTC

    G'day Dr Manhattan,

    This would be an ideal situation in which to use the built-in module Tie::File. That won't suffer from memory issues due to the size of the input file and would allow you to eliminate the need for the while loop, chomp, push, if condition and $counter. Also, you don't appear to be storing data in @information for subsequent use so you can eliminate that variable and the for loop that processes it. Here's roughly what you'd need:

    use strict;
    use warnings;
    use autodie;
    use Tie::File;

    tie my @input_data, 'Tie::File', 'your_input_filename';
    open my $output_fh, '>', 'your_output_filename';

    for my $record (@input_data) {
        my $extracted_info = ...;    # extract info from $record here
        print $output_fh "$extracted_info\n";
    }

    untie @input_data;
    close $output_fh;

    -- Ken

Re: Processing large files
by derby (Abbot) on Aug 21, 2013 at 11:32 UTC

    Others have pointed out the real problem with your code but I would like to point out that with most IO architectures, you're probably not going to gain much performance by buffering your input this way -- the underlying library calls for read are probably already buffering. You may want to Benchmark your buffering approach with a standard line-by-line approach. If the differences are minimal, I would opt for the simpler code.
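
    A rough sketch of such a comparison (the file name and the extraction steps are placeholders, so treat this as an outline rather than a drop-in test):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $file = 'your_input_filename';    # placeholder

    cmpthese(5, {
        line_by_line => sub {
            open my $fh, '<', $file or die $!;
            while (my $line = <$fh>) {
                chomp $line;
                # extract info from $line here
            }
            close $fh;
        },
        chunked_3000 => sub {
            open my $fh, '<', $file or die $!;
            my @chunk;
            while (my $line = <$fh>) {
                chomp $line;
                push @chunk, $line;
                if (@chunk == 3000) {
                    # extract info from @chunk here
                    @chunk = ();
                }
            }
            close $fh;
        },
    });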

    -derby
Re: Processing large files
by Laurent_R (Canon) on Aug 21, 2013 at 11:39 UTC

    Why don't you simply read one line at a time, process it, print out what you need to output, and then go to the next line? FH iterators are great, and input buffering is done under the surface anyway (unless you take steps to prevent it).
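
    A minimal sketch of that line-by-line approach, using a lexical filehandle and the three-argument open (file names and the extraction step are placeholders):

    use strict;
    use warnings;

    open my $in_fh,  '<', 'your_input_filename'  or die "open: $!";
    open my $out_fh, '>', 'your_output_filename' or die "open: $!";

    while (my $line = <$in_fh>) {
        chomp $line;
        my $extracted_info = $line;    # extract info from $line here
        print $out_fh "$extracted_info\n";
    }

    close $in_fh;
    close $out_fh;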

Re: Processing large files
by Preceptor (Deacon) on Aug 21, 2013 at 19:06 UTC

    I can't actually understand why you're trying to read line by line and then batch-process every 3000 lines. Are the data in those 3000 lines in some way correlated? Otherwise, you're not really doing much good: a 'while' loop will do what you want without needing to buffer anything.

    while ( my $line = <Input> ) {
        # do stuff; print output
    }

    It does depend a little, though, on what some of your loops are doing. But a 'while'-based traversal of a file won't read the whole file all at once (unless you deliberately make it do that).

Re: Processing large files
by zork42 (Monk) on Aug 22, 2013 at 06:07 UTC
    When you've fixed the bugs mentioned above, you'll also need to squash this bug:

    Once the while loop exits, you need to process the remaining 1 to 2999 lines in @array.

    Whenever you process anything in chunks of more than one item, always remember to process the final partial chunk (if it exists).
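
    Putting the fixes from this thread together, the loop plus a final flush might look roughly like this (a sketch only; process_chunk is a hypothetical stand-in for the extract-and-print step):

    my @array;
    my $counter = 0;

    while (my $line = <Input>) {
        chomp $line;
        push @array, $line;

        if (++$counter == 3000) {
            process_chunk(\@array);    # hypothetical helper: extract info and print to Output
            $counter = 0;
            @array   = ();
        }
    }

    process_chunk(\@array) if @array;    # flush the final partial chunk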
