Re: segregating data from one file to many files

by roboticus (Chancellor)
on May 08, 2009 at 21:45 UTC ( [id://762953] )


in reply to segregating data from one file to many files

patric:

As you mention later, you *could* just reopen the input file inside your loop, but then you'd have to scan the whole file once per output file. You could also open all of your file handles at the beginning, or store the data in an array or hash and rescan it from memory. But I usually prefer to do it another way. Why? A big file can take a significant amount of time to scan repeatedly, storing it in memory can exceed your memory limits, and opening all the file handles up front requires you to know every possible file name at the start.
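For comparison, here's a rough, untested sketch of the read-it-all-into-memory variant (reusing the input.txt name and the S(\d+)GM pattern from your code); it only makes sense if the whole file fits comfortably in RAM:

#!/usr/bin/perl
use strict;
use warnings;

# Read the input once, grouping lines by key, then write each output file once.
open(my $in, '<', 'input.txt') or die "can not open input file: $!";
my %lines_for;
while (my $line = <$in>) {
    my (undef, undef, undef, $four, undef) = split("\t", $line);
    if ($four =~ m/S(\d+)GM/) {
        push @{ $lines_for{$1} }, $line;
    }
}
close $in;

for my $key (keys %lines_for) {
    open(my $out, '>', "output_$key.txt") or die $!;
    print $out @{ $lines_for{$key} };
    close $out;
}

Each output file is opened and written exactly once, so the file-handle count never grows, but all of the data sits in memory until the end.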

What I do is open the output files as I need them. Suppose you had a function get_file_handle that would always give you the correct file handle to output the line to. Then your main loop would simplify to the following (after trimming out some unused variables & such):

#!/usr/bin/perl
use strict;
use warnings;

open(FH, "input.txt") or die "can not open input file\n";
while (my $line = <FH>) {
    my (undef, undef, undef, $four, undef) = split("\t", $line);
    if ($four =~ m/S(\d+)GM/) {
        my $F = get_file_handle($1);
        print $F $line;
    }
}

So all we need is that function. It turns out to be surprisingly simple:

my %FHList;   # Holds file handles we've opened so far

sub get_file_handle {
    my $key = shift;
    if (!exists $FHList{$key}) {
        open $FHList{$key}, '>', "output_$key.txt" or die $!;
    }
    return $FHList{$key};
}

As you can see, we just store our file handles in a hash. If the key (00001, 00012, etc.) is a value the function has never seen before, it opens a new output file and tucks the handle away in the hash. Then it returns the file handle from the hash.

...roboticus

Re^2: segregating data from one file to many files
by patric (Acolyte) on May 09, 2009 at 18:37 UTC
    Thanks for your suggestion. As you said, my program takes a lot of time: it has to write 35,000 files, and there are 400,000 lines in the input text file. I tried the program you have given, but it throws the error "Too many files open". Why is that? Thank you once again.
      patric:

      Yowch! You're probably hitting an OS limit on the number of file handles you can have open at once. If reading the file into a hash (as described earlier) doesn't run out of RAM, that's the approach to use. Otherwise, you'll have to modify the get_file_handle function to close some of its file handles when it's about to run out. As a quick, off-the-cuff thing, it might¹ be good enough to simply close *all* the file handles when you reach some predetermined limit. Something like (untested):

      my $Max_FH = 1000;   # Maximum # of file handles you want open at once
      my %FHList;          # Holds file handles we've opened so far
      my %Opened;          # Keys whose output file already exists from this run

      sub get_file_handle {
          my $key = shift;
          if (!exists $FHList{$key}) {
              if ($Max_FH <= keys %FHList) {
                  close $FHList{$_} for keys %FHList;
                  %FHList = ();
              }
              # Reopen in append mode so lines written before a mass close survive
              my $mode = $Opened{$key} ? '>>' : '>';
              open $FHList{$key}, $mode, "output_$key.txt" or die $!;
              $Opened{$key} = 1;
          }
          return $FHList{$key};
      }

      ¹ Some workloads have a handful of commonly-used tags and a mess of onesies and twosies. If that's your case, this will occasionally close and reopen the commonly-used files, but it will clear out all the lesser-used ones. If the commonly-used values are common enough, the extra opens and closes amortize to a small amount of overhead. If your workload has an evenly-distributed set of keys, though, you'll need to make get_file_handle much smarter...

      ...roboticus
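A hypothetical, untested sketch of that "smarter" get_file_handle (the $Max_FH limit and the output_$key.txt naming are carried over from the snippet above): instead of closing everything, it closes only the least-recently-used handle when the limit is reached, and reopens files in append mode so earlier output survives.

my $Max_FH = 1000;   # Maximum number of file handles to keep open
my %FHList;          # key => open file handle
my %Opened;          # keys whose output file has already been created
my %LastUsed;        # key => sequence number of most recent use
my $Tick = 0;

sub get_file_handle {
    my $key = shift;
    if (!exists $FHList{$key}) {
        if (keys(%FHList) >= $Max_FH) {
            # Evict the handle that has gone unused the longest
            my ($oldest) = sort { $LastUsed{$a} <=> $LastUsed{$b} }
                           keys %FHList;
            close $FHList{$oldest};
            delete $FHList{$oldest};
            delete $LastUsed{$oldest};
        }
        # Append if we've written this file before, so earlier lines survive
        my $mode = $Opened{$key} ? '>>' : '>';
        open $FHList{$key}, $mode, "output_$key.txt" or die $!;
        $Opened{$key} = 1;
    }
    $LastUsed{$key} = $Tick++;
    return $FHList{$key};
}

The linear scan on each eviction is cheap with a limit of 1,000 handles; the point is simply that handles for frequently-seen keys tend to stay open while rarely-seen ones get recycled.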
