Re: segregating data from one file to many files

by roboticus (Chancellor)
on May 08, 2009 at 21:45 UTC ( [id://762953] )


in reply to segregating data from one file to many files

patric:

As you mention later, you *could* just reopen the input file inside your loop, but then you'd have to scan the whole file once per output file. You could also open all of your file handles at the beginning, or store the data in an array or hash and rescan it from memory. But I usually prefer to do it another way. Why? A big file can take a significant amount of time to scan repeatedly, storing it in memory can exceed your memory limits, and opening all the file handles up front requires you to know every possible file name at the start.
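For comparison, here's a rough, untested sketch of the read-it-all-into-memory variant (reusing the input.txt name and the S(\d+)GM pattern from your code); it only makes sense if the whole file fits comfortably in RAM:

#!/usr/bin/perl
use strict;
use warnings;

# Read the input once, grouping lines by key, then write each output file once.
open(my $in, '<', 'input.txt') or die "can not open input file: $!";
my %lines_for;
while (my $line = <$in>) {
    my (undef, undef, undef, $four, undef) = split("\t", $line);
    if ($four =~ m/S(\d+)GM/) {
        push @{ $lines_for{$1} }, $line;
    }
}
close $in;

for my $key (keys %lines_for) {
    open(my $out, '>', "output_$key.txt") or die $!;
    print $out @{ $lines_for{$key} };
    close $out;
}

Each output file is opened and written exactly once, so the file-handle count never grows, but all of the data sits in memory until the end.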

What I do is open the output files as I need them. Suppose you had a function get_file_handle that would always give you the correct file handle to output the line to. Then your main loop would simplify to the following (after trimming out some unused variables & such):

#!/usr/bin/perl
use strict;
use warnings;

open(FH, "input.txt") or die "can not open input file\n";
while (my $line = <FH>) {
    my (undef, undef, undef, $four, undef) = split("\t", $line);
    if ($four =~ m/S(\d+)GM/) {
        my $F = get_file_handle($1);
        print $F $line;
    }
}

So all we need is that function. It turns out to be surprisingly simple:

my %FHList;   # Holds file handles we've opened so far

sub get_file_handle {
    my $key = shift;
    if (!exists $FHList{$key}) {
        open $FHList{$key}, '>', "output_$key.txt" or die $!;
    }
    return $FHList{$key};
}

As you can see, we just store our file handles in a hash. If the key (00001, 00012, etc.) is a value the function has never seen before, it opens a new output file and tucks the handle away in the hash. Then it returns the file handle from the hash.

...roboticus

Re^2: segregating data from one file to many files
by patric (Acolyte) on May 09, 2009 at 18:37 UTC
    Thanks for your suggestion. As you said, my program takes a lot of time: it has to write 35,000 files, and there are 400,000 lines in the input text file. I tried the program you have given, but it throws the error "Too many files open". Why is that? Thank you once again.
      patric:

      Yowch! You're probably hitting an OS limit on the number of file handles you can have open at once. If reading the file into a hash (as described earlier) doesn't run out of RAM, that's the approach to use. Otherwise, you'll have to modify the get_file_handle function to close some of its file handles when it's about to run out. As a quick, off-the-cuff thing, it might¹ be good enough to simply close *all* the file handles when you reach some predetermined limit. Something like (untested):

      my $Max_FH = 1000;   # Maximum # of file handles you want open at once
      my %FHList;          # Holds file handles we've opened so far
      my %Opened;          # Keys whose output file already exists from this run

      sub get_file_handle {
          my $key = shift;
          if (!exists $FHList{$key}) {
              if ($Max_FH <= keys %FHList) {
                  close $FHList{$_} for keys %FHList;
                  %FHList = ();
              }
              # Reopen in append mode so lines written before a mass close survive
              my $mode = $Opened{$key} ? '>>' : '>';
              open $FHList{$key}, $mode, "output_$key.txt" or die $!;
              $Opened{$key} = 1;
          }
          return $FHList{$key};
      }

      ¹ Some workloads have a handful of commonly-used tags and a mess of onesies and twosies. If that's your case, this will occasionally close and reopen the commonly-used files, but it will clear out all the lesser-used ones. If the commonly-used values are common enough, the extra opens and closes amortize to a small amount of overhead. If your workload has an evenly-distributed set of keys, though, you'll need to make get_file_handle much smarter...

      ...roboticus
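A hypothetical, untested sketch of that "smarter" get_file_handle (the $Max_FH limit and the output_$key.txt naming are carried over from the snippet above): instead of closing everything, it closes only the least-recently-used handle when the limit is reached, and reopens files in append mode so earlier output survives.

my $Max_FH = 1000;   # Maximum number of file handles to keep open
my %FHList;          # key => open file handle
my %Opened;          # keys whose output file has already been created
my %LastUsed;        # key => sequence number of most recent use
my $Tick = 0;

sub get_file_handle {
    my $key = shift;
    if (!exists $FHList{$key}) {
        if (keys(%FHList) >= $Max_FH) {
            # Evict the handle that has gone unused the longest
            my ($oldest) = sort { $LastUsed{$a} <=> $LastUsed{$b} }
                           keys %FHList;
            close $FHList{$oldest};
            delete $FHList{$oldest};
            delete $LastUsed{$oldest};
        }
        # Append if we've written this file before, so earlier lines survive
        my $mode = $Opened{$key} ? '>>' : '>';
        open $FHList{$key}, $mode, "output_$key.txt" or die $!;
        $Opened{$key} = 1;
    }
    $LastUsed{$key} = $Tick++;
    return $FHList{$key};
}

The linear scan on each eviction is cheap with a limit of 1,000 handles; the point is simply that handles for frequently-seen keys tend to stay open while rarely-seen ones get recycled.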
