PerlMonks
segregating data from one file to many files

by patric (Acolyte)
on May 08, 2009 at 18:32 UTC ( #762908=perlquestion )

patric has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have an input file with 7 columns. I am concerned only with the 4th column because, based on that column, I have to route each and every line to a specific result file.
input file:

11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
13804 14685 - S00010GM002 sml_045 sp|YV02643 desc
15525 18026 - S00001GM001 sml_032 sp|V023334 desc
32763 34239 + S00002GM001 sml_028 sp|YV02376 desc
67929 68933 - S00003GM001 sml_025 sp|YV02346 desc
90562 91368 + S00012GM001 sml_025 sp|YV02376 desc
10209 10433 - S00012GM002 sml_046 sp|YV02355 desc
12522 12576 + S00013GM001 sml_027 sp|0235777 desc
13247 13349 - S00013GM002 sml_088 sp|YV02375 desc
The 4th column has the format "S", a 5-digit number, "GM", a 3-digit number. The five-digit number ranges from 1 to 35000, and the three-digit number from 1 to 999. My only concern is the five-digit number, because I have to create an output file named after it (excluding the preceding zeros) and write all the lines carrying that id to that output file. Before you get confused by my query, here is an example. In the sample input file above, the five-digit number after "S" ranges from 1 to 13 (written as 00001 to 00013; that's the format). So I have to generate 13 different output files and route the lines belonging to each id to its output file. Example:
sample output files:

filename: output_1.txt
15525 18026 - S00001GM001 sml_032 sp|V023334 desc

filename: output_2.txt
32763 34239 + S00002GM001 sml_028 sp|YV02376 desc

filename: output_3.txt
67929 68933 - S00003GM001 sml_025 sp|YV02346 desc

filename: output_4.txt
NO HITS

files 5 to 9 = same as above (like output 4, because there are no S00005 lines found, nor any up to S00009).

filename: output_10.txt
11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
13804 14685 - S00010GM002 sml_045 sp|YV02643 desc

files 11 and 12 = same as above (like output 10, because there are hits found with S00011 and S00012).
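The id format described above can be parsed with a short sketch (the sample id is taken from the data above; numeric context strips the leading zeros for the filename):

```perl
use strict;
use warnings;

# Parse a 4th-column id: "S", five digits, "GM", three digits.
my $four = 'S00013GM002';              # sample id from the input above
if ( $four =~ /^S(\d{5})GM(\d{3})$/ ) {
    my $file_no = $1 + 0;              # 00013 -> 13 (drops leading zeros)
    print "output_$file_no.txt\n";     # output_13.txt
}
```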
Hope the query is clear now. But my program gives me all empty files :( It writes only the first output file correctly; the rest are empty. Here is my program:
#!/usr/bin/perl
use strict;
use warnings;
open(FH,"input.txt") or die "can not open input file\n";
for(my $i=1;$i<=15;$i++){
    open(OUT,">output_$i.txt") or die "can not create new files\n";
    my $pattern=sprintf '%05s',$i;
    $pattern="S".$pattern."GM";
    my $c=0;my $search;
    while(my $line=<FH>){
        my($one,$two,$three,$four,$five,$six,$seven)=split("\t",$line);
        if($four=~m/(S\d+GM)/){
            $search=$1;
            $search=~s/\s+//g;
        }
        if($search=~m/$pattern/){
            print OUT "$line";
            $c++;
        }
    }
    if($c==0){
        print OUT "NO HITS\n";
    }
    $pattern=();$search=();
}
Where am I going wrong? Please help.

Replies are listed 'Best First'.
Re: segregating data from one file to many files
by ikegami (Pope) on May 08, 2009 at 18:40 UTC
    my $line=<FH> reads a line from the current position in the file. It doesn't start reading from the start of the file once you've reached the end. Reopen the file or seek to the beginning.
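A self-contained sketch of the rewind approach (the demo file name and contents here are made up for illustration):

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Build a tiny demo file so the example runs on its own.
open my $out, '>', 'demo.txt' or die $!;
print $out "first line\nsecond line\n";
close $out;

open my $fh, '<', 'demo.txt' or die $!;
1 while <$fh>;             # read to end-of-file, as the inner loop does
seek $fh, 0, SEEK_SET;     # rewind; without this, <$fh> only returns undef
my $again = <$fh>;         # "first line\n" once more
close $fh;
```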
      Thank you :) I opened the file inside the for loop. It works well now. Thanks once again :)
Re: segregating data from one file to many files
by roboticus (Chancellor) on May 08, 2009 at 21:45 UTC

    patric:

    As you mention later, you *could* just reopen the input file inside your loop, but then you have to scan through the whole file once per output file. Alternatively, you could open all your file handles at the beginning, or store the data in an array or hash and rescan it from memory. But I usually prefer another way. Why? A big file can take a significant amount of time to scan repeatedly; storing it in memory could exceed your memory limits; and opening the file handles up front requires knowing all possible file names in advance.

    What I do is open the output files as I need them. Suppose you had a function get_file_handle that would always give you the correct file handle to output the line to. Then your main loop would simplify to the following (after trimming out some unused variables & such):

    #!/usr/bin/perl
    use strict;
    use warnings;
    open(FH,"input.txt") or die "can not open input file\n";
    while (my $line=<FH>) {
        my (undef, undef, undef, $four, undef) = split("\t",$line);
        if ($four=~m/S(\d+)GM/){
            my $F = get_file_handle($1);
            print $F $line;
        }
    }

    So all we need is that function. It turns out to be surprisingly simple:

    my %FHList;   # Holds file handles we've opened so far
    sub get_file_handle {
        my $key = shift;
        if (!exists $FHList{$key}) {
            open $FHList{$key}, '>', "output_$key.txt" or die $!;
        }
        return $FHList{$key};
    }

    As you can see, we just store our file handles in a hash. If the key (00001, 00012, etc.) is a value the function has never seen before, it opens a new output file, and tucks it away in the hash. Then it returns a copy of the file handle from the hash.

    ...roboticus
      Thanks for your suggestion. As you said, my program takes a lot of time: it has to write 35000 files, and there are 400,000 lines in the input text file. I tried the program you have given, but it throws the error "Too many files open". Why is that? Thank you once again.
        patric:

        Yowch! You're probably hitting an OS limit on the number of file handles you can have open. If one of the methods that reads the file into a hash doesn't run out of RAM, you'll want to use one of those. Otherwise, you'll have to modify the get_file_handle function to close some of its file handles when it's about to run out. As a quick off-the-cuff thing, it might[1] be good enough to simply close *all* the file handles when you reach some predetermined limit. Something like (untested):

        my $Max_FH=1000;  # Maximum # file handles you want open
        my %FHList;       # Holds file handles we've opened so far
        my %Opened;       # Keys we've ever opened (so re-opens append)
        sub get_file_handle {
            my $key = shift;
            if (!exists $FHList{$key}) {
                if ($Max_FH <= keys %FHList) {
                    close $FHList{$_} for keys %FHList;
                    %FHList=();
                }
                # append on a re-open, so lines written before the close survive
                my $mode = $Opened{$key}++ ? '>>' : '>';
                open $FHList{$key}, $mode, "output_$key.txt" or die $!;
            }
            return $FHList{$key};
        }

        [1] Some workloads have a handful of commonly-used tags and a mess of onesie-twosies. If that's the case, this will occasionally close and reopen the commonly-used files, but it will clear out all the lesser-used ones. If the commonly-used values are common enough, the opens and closes amortize to a small amount of overhead. If your workload has an evenly-distributed set of keys, then you'll need to make get_file_handle much smarter...
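One possible "smarter" shape, sketched here and untested against a real workload: close only the least-recently-used handle when the limit is hit, and reopen previously-seen files in append mode so their earlier lines survive. All names ($Max_FH, %FH, %Seen, @order) are illustrative, and the linear @order bookkeeping is O(n) per call, so a real version might use a proper LRU structure:

```perl
use strict;
use warnings;

my $Max_FH = 1000;   # illustrative handle limit
my %FH;              # key => currently open handle
my %Seen;            # keys ever opened (re-opens must append)
my @order;           # keys, least recently used first

sub get_file_handle {
    my $key = shift;
    if ( !$FH{$key} ) {
        if ( keys(%FH) >= $Max_FH ) {
            my $old = shift @order;          # evict least recently used
            close $FH{$old};
            delete $FH{$old};
        }
        # first open truncates; later re-opens append, keeping earlier lines
        my $mode = $Seen{$key}++ ? '>>' : '>';
        open $FH{$key}, $mode, "output_$key.txt" or die $!;
    }
    @order = ( ( grep { $_ ne $key } @order ), $key );   # mark most recent
    return $FH{$key};
}
```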

        ...roboticus
Re: segregating data from one file to many files
by jwkrahn (Monsignor) on May 08, 2009 at 19:22 UTC

    You can just open all the output files first.   Something like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %files;
    for my $i ( 1 .. 15 ) {
        open my $OUT, '>', "output_$i.txt"
            or die "can not create 'output_$i.txt' $!\n";
        my $pattern = sprintf 'S%05dGM', $i;
        $files{ $pattern } = { fh => $OUT, count => 0 };
    }

    open my $FH, '<', 'input.txt' or die "can not open 'input.txt' $!\n";
    while ( my $line = <$FH> ) {
        my $four = ( split /\t/, $line )[ 3 ];
        my $key  = substr $four, 0, 8;
        if ( exists $files{ $key } ) {
            print { $files{ $key }{ fh } } $line;
            $files{ $key }{ count }++;
        }
    }
    close $FH;

    for my $key ( keys %files ) {
        unless ( $files{ $key }{ count } ) {
            print { $files{ $key }{ fh } } "NO HITS\n";
        }
    }
Re: segregating data from one file to many files
by dwm042 (Priest) on May 08, 2009 at 19:15 UTC
    I think the mistake in the code is that you're reading the input file multiple times without reopening (or rewinding) it, but beyond that, the program itself is inefficient because it attempts to scan the data many times.

    I'd rewrite the program to read the data once into a hash and then write many times, something like this:

    #! /usr/bin/perl
    use warnings;
    use strict;

    my %hash;
    while(<DATA>) {
        chomp;
        my @col = split " ", $_;
        next unless exists $col[3];
        next unless $col[3] =~ /^S\d{5}GM\d{3}$/;
        my $key = substr($col[3],1,5);
        push @{$hash{$key}} , [ @col ];
    }

    for ( sort keys %hash ) {
        my $i = $_;
        $i =~ s/^0+//g;
        my $file = "output_$i.txt";
        # open FILE, ">", "$file" or
        #     die("Cannot open file $file\n");
        print "FILE: $file\n";
        for my $col ( @{$hash{$_}} ) {
            print join (" ", @$col), "\n";
            # print FILE join (" ", @$col), "\n";
        }
        # close (FILE);
    }
    __DATA__
    11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
    13804 14685 - S00010GM002 sml_045 sp|YV02643 desc
    15525 18026 - S00001GM001 sml_032 sp|V023334 desc
    32763 34239 + S00002GM001 sml_028 sp|YV02376 desc
    67929 68933 - S00003GM001 sml_025 sp|YV02346 desc
    90562 91368 + S00012GM001 sml_025 sp|YV02376 desc
    10209 10433 - S00012GM002 sml_046 sp|YV02355 desc
    12522 12576 + S00013GM001 sml_027 sp|0235777 desc
    13247 13349 - S00013GM002 sml_088 sp|YV02375 desc
    Please note that the commented-out open, close, and print statements have not been tested.

    The results are:

    C:\Perl>perl onfour.pl
    FILE: output_1.txt
    15525 18026 - S00001GM001 sml_032 sp|V023334 desc
    FILE: output_2.txt
    32763 34239 + S00002GM001 sml_028 sp|YV02376 desc
    FILE: output_3.txt
    67929 68933 - S00003GM001 sml_025 sp|YV02346 desc
    FILE: output_10.txt
    11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
    13804 14685 - S00010GM002 sml_045 sp|YV02643 desc
    FILE: output_12.txt
    90562 91368 + S00012GM001 sml_025 sp|YV02376 desc
    10209 10433 - S00012GM002 sml_046 sp|YV02355 desc
    FILE: output_13.txt
    12522 12576 + S00013GM001 sml_027 sp|0235777 desc
    13247 13349 - S00013GM002 sml_088 sp|YV02375 desc
      Hi, thank you very much for the program. Since the script I wrote was very, very slow, I tried the one you have given. It works fine, but in the line
      print FILE join (" ", @$col), "\n";
      I tried replacing the space with a tab:
      print FILE join ("\t", @$col), "\n";
      It works fine, but I am sorry to say this so late: the last column "desc" is sometimes more than one word, like the sentence "desc not found". Moreover, in my original input text file the columns are separated by tabs. When I join @$col with \t, "desc not found" gets printed separated by tabs instead of a single space. Only the last column can contain sentence-like data; the rest of the columns are perfectly alright. How do I overcome this problem? Please help. Thank you once again :)
        The simplest way to overcome the problem you encounter is to count the number of fields in @$col and handle the last few terms separately.

        #! /usr/bin/perl
        use warnings;
        use strict;

        my %hash;
        while(<DATA>) {
            chomp;
            my @col = split " ", $_;
            next unless exists $col[3];
            next unless $col[3] =~ /^S\d{5}GM\d{3}$/;
            my $key = substr($col[3],1,5);
            push @{$hash{$key}} , [ @col ];
        }

        for ( sort keys %hash ) {
            my $i = $_;
            $i =~ s/^0+//g;
            my $file = "output_$i.txt";
            # open FILE, ">", "$file" or
            #     die("Cannot open file $file\n");
            print "FILE: $file\n";
            for my $col ( @{$hash{$_}} ) {
                my $col_count = scalar @$col - 1;
                if ( $col_count > 6 ) {
                    #
                    # use array slices to partition data.
                    #
                    my $end   = join " ",  @$col[6..$col_count];
                    my $begin = join "\t", @$col[0..5];
                    print $begin, "\t", $end, "\n";
                }
                else {
                    print join ("\t", @$col), "\n";
                }
                # replace prints with print FILE etc.
            }
            # close (FILE);
        }
        __DATA__
        11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
        13804 14685 - S00010GM002 sml_045 sp|YV02643 desc
        15525 18026 - S00001GM001 sml_032 sp|V023334 desc
        32763 34239 + S00002GM001 sml_028 sp|YV02376 desc
        67929 68933 - S00003GM001 sml_025 sp|YV02346 desc not found
        90562 91368 + S00012GM001 sml_025 sp|YV02376 desc not found
        10209 10433 - S00012GM002 sml_046 sp|YV02355 desc
        12522 12576 + S00013GM001 sml_027 sp|0235777 desc
        13247 13349 - S00013GM002 sml_088 sp|YV02375 desc
        The output is:

        C:\Code>perl onfour.pl
        FILE: output_1.txt
        15525 18026 - S00001GM001 sml_032 sp|V023334 desc
        FILE: output_2.txt
        32763 34239 + S00002GM001 sml_028 sp|YV02376 desc
        FILE: output_3.txt
        67929 68933 - S00003GM001 sml_025 sp|YV02346 desc not found
        FILE: output_10.txt
        11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
        13804 14685 - S00010GM002 sml_045 sp|YV02643 desc
        FILE: output_12.txt
        90562 91368 + S00012GM001 sml_025 sp|YV02376 desc not found
        10209 10433 - S00012GM002 sml_046 sp|YV02355 desc
        FILE: output_13.txt
        12522 12576 + S00013GM001 sml_027 sp|0235777 desc
        13247 13349 - S00013GM002 sml_088 sp|YV02375 desc
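An alternative to the slice approach above, sketched here rather than taken from the thread's code: since the real input file is tab-separated, splitting on /\t/ in the first place keeps the spaces inside the last field intact, so no repartitioning is needed (the sample line mimics the described input layout):

```perl
use strict;
use warnings;

# Split on the actual delimiter instead of generic whitespace.
my $line = "11880\t13417\t-\tS00010GM001\tsml_056\tsp|YV02233\tdesc not found";
my @col  = split /\t/, $line;   # 7 fields; "desc not found" stays whole
print $col[6], "\n";            # desc not found
```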

Node Type: perlquestion [id://762908]
Approved by ikegami