PerlMonks
segregating data from one file to many files

by patric (Acolyte)
on May 08, 2009 at 18:32 UTC ( #762908=perlquestion )

patric has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have an input file with 7 columns. I am concerned only with the 4th column because, based on that column, I have to route each and every line to a specific result file.
input file:

11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
13804 14685 - S00010GM002 sml_045 sp|YV02643 desc
15525 18026 - S00001GM001 sml_032 sp|V023334 desc
32763 34239 + S00002GM001 sml_028 sp|YV02376 desc
67929 68933 - S00003GM001 sml_025 sp|YV02346 desc
90562 91368 + S00012GM001 sml_025 sp|YV02376 desc
10209 10433 - S00012GM002 sml_046 sp|YV02355 desc
12522 12576 + S00013GM001 sml_027 sp|0235777 desc
13247 13349 - S00013GM002 sml_088 sp|YV02375 desc
The 4th column has the format "S", a 5-digit number, "GM", a 3-digit number. The five-digit number ranges from 1 to 35000, and the three-digit number from 1 to 999. My only concern is the five-digit number, because I have to create an output file named after it (excluding the preceding zeros) and write all the lines carrying that id to that output file. Before you get confused by my query, here is an example. In the sample input file above, the five-digit number after "S" ranges from 1 to 13 (written as 00001 to 00013; that's the format). So I have to generate 13 different output files and route the lines belonging to each id to its output file. Example:
sample output files:

filename: output_1.txt
15525 18026 - S00001GM001 sml_032 sp|V023334 desc

filename: output_2.txt
32763 34239 + S00002GM001 sml_028 sp|YV02376 desc

filename: output_3.txt
67929 68933 - S00003GM001 sml_025 sp|YV02346 desc

filename: output_4.txt
NO HITS

files 5 to 9 = same as above (like output 4, because there are no S00005 lines found, nor any up to S00009).

filename: output_10.txt
11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
13804 14685 - S00010GM002 sml_045 sp|YV02643 desc

files 11 and 12 = same as above (like output 10, because there are hits found with S00011 and S00012).
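The id format described above can be parsed with a short sketch (the sample id is taken from the data above; numeric context strips the leading zeros for the filename):

```perl
use strict;
use warnings;

# Parse a 4th-column id: "S", five digits, "GM", three digits.
my $four = 'S00013GM002';              # sample id from the input above
if ( $four =~ /^S(\d{5})GM(\d{3})$/ ) {
    my $file_no = $1 + 0;              # 00013 -> 13 (drops leading zeros)
    print "output_$file_no.txt\n";     # output_13.txt
}
```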
Hope the query is clear now. But my program gives me all empty files :( It writes only the first output file correctly; the rest are empty. Here is my program:
#!/usr/bin/perl
use strict;
use warnings;
open(FH,"input.txt") or die "can not open input file\n";
for(my $i=1;$i<=15;$i++){
    open(OUT,">output_$i.txt") or die "can not create new files\n";
    my $pattern=sprintf '%05s',$i;
    $pattern="S".$pattern."GM";
    my $c=0;my $search;
    while(my $line=<FH>){
        my($one,$two,$three,$four,$five,$six,$seven)=split("\t",$line);
        if($four=~m/(S\d+GM)/){
            $search=$1;
            $search=~s/\s+//g;
        }
        if($search=~m/$pattern/){
            print OUT "$line";
            $c++;
        }
    }
    if($c==0){
        print OUT "NO HITS\n";
    }
    $pattern=();$search=();
}
Where am I going wrong? Please help.

Replies are listed 'Best First'.
Re: segregating data from one file to many files
by ikegami (Pope) on May 08, 2009 at 18:40 UTC
    my $line=<FH> reads a line from the current position in the file. It doesn't start reading from the start of the file once you've reached the end. Reopen the file or seek to the beginning.
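A self-contained sketch of the rewind approach (the demo file name and contents here are made up for illustration):

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Build a tiny demo file so the example runs on its own.
open my $out, '>', 'demo.txt' or die $!;
print $out "first line\nsecond line\n";
close $out;

open my $fh, '<', 'demo.txt' or die $!;
1 while <$fh>;             # read to end-of-file, as the inner loop does
seek $fh, 0, SEEK_SET;     # rewind; without this, <$fh> only returns undef
my $again = <$fh>;         # "first line\n" once more
close $fh;
```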
      Thank you :) I opened the file inside the for loop. It works well now. Thanks once again :)
Re: segregating data from one file to many files
by roboticus (Chancellor) on May 08, 2009 at 21:45 UTC

    patric:

    As you mention later, you *could* just reopen the input file inside your loop, but then you have to scan through the whole file once per output file. Alternatively, you could open all your file handles at the beginning, or store the data in an array or hash and rescan it from memory. But I usually prefer another way. Why? A big file can take a significant amount of time to scan repeatedly; storing it in memory could exceed your memory limits; and opening the file handles up front requires knowing all possible file names in advance.

    What I do is open the output files as I need them. Suppose you had a function get_file_handle that would always give you the correct file handle to output the line to. Then your main loop would simplify to the following (after trimming out some unused variables & such):

    #!/usr/bin/perl
    use strict;
    use warnings;
    open(FH,"input.txt") or die "can not open input file\n";
    while (my $line=<FH>) {
        my (undef, undef, undef, $four, undef) = split("\t",$line);
        if ($four=~m/S(\d+)GM/){
            my $F = get_file_handle($1);
            print $F $line;
        }
    }

    So all we need is that function. It turns out to be surprisingly simple:

    my %FHList;   # Holds file handles we've opened so far
    sub get_file_handle {
        my $key = shift;
        if (!exists $FHList{$key}) {
            open $FHList{$key}, '>', "output_$key.txt" or die $!;
        }
        return $FHList{$key};
    }

    As you can see, we just store our file handles in a hash. If the key (00001, 00012, etc.) is a value the function has never seen before, it opens a new output file, and tucks it away in the hash. Then it returns a copy of the file handle from the hash.

    ...roboticus
      Thanks for your suggestion. As you said, my program takes a lot of time: it has to write 35000 files, and there are 400,000 lines in the input text file. I tried the program you have given, but it throws the error "Too many files open". Why is that? Thank you once again.
        patric:

        Yowch! You're probably hitting an OS limit on the number of file handles you can have open. If one of the methods that reads the file into a hash doesn't run out of RAM, you'll want to use one of those. Otherwise, you'll have to modify the get_file_handle function to close some of its file handles when it's about to run out. As a quick off-the-cuff thing, it might[1] be good enough to simply close *all* the file handles when you reach some predetermined limit. Something like (untested):

        my $Max_FH=1000;  # Maximum # file handles you want open
        my %FHList;       # Holds file handles we've opened so far
        my %Opened;       # Keys we've ever opened (so re-opens append)
        sub get_file_handle {
            my $key = shift;
            if (!exists $FHList{$key}) {
                if ($Max_FH <= keys %FHList) {
                    close $FHList{$_} for keys %FHList;
                    %FHList=();
                }
                # append on a re-open, so lines written before the close survive
                my $mode = $Opened{$key}++ ? '>>' : '>';
                open $FHList{$key}, $mode, "output_$key.txt" or die $!;
            }
            return $FHList{$key};
        }

        [1] Some workloads have a handful of commonly-used tags and a mess of onesie-twosies. If that's the case, this will occasionally close and reopen the commonly-used files, but it will clear out all the lesser-used ones. If the commonly-used values are common enough, the opens and closes amortize to a small amount of overhead. If your workload has an evenly-distributed set of keys, then you'll need to make get_file_handle much smarter...
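One possible "smarter" shape, sketched here and untested against a real workload: close only the least-recently-used handle when the limit is hit, and reopen previously-seen files in append mode so their earlier lines survive. All names ($Max_FH, %FH, %Seen, @order) are illustrative, and the linear @order bookkeeping is O(n) per call, so a real version might use a proper LRU structure:

```perl
use strict;
use warnings;

my $Max_FH = 1000;   # illustrative handle limit
my %FH;              # key => currently open handle
my %Seen;            # keys ever opened (re-opens must append)
my @order;           # keys, least recently used first

sub get_file_handle {
    my $key = shift;
    if ( !$FH{$key} ) {
        if ( keys(%FH) >= $Max_FH ) {
            my $old = shift @order;          # evict least recently used
            close $FH{$old};
            delete $FH{$old};
        }
        # first open truncates; later re-opens append, keeping earlier lines
        my $mode = $Seen{$key}++ ? '>>' : '>';
        open $FH{$key}, $mode, "output_$key.txt" or die $!;
    }
    @order = ( ( grep { $_ ne $key } @order ), $key );   # mark most recent
    return $FH{$key};
}
```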

        ...roboticus
Re: segregating data from one file to many files
by jwkrahn (Monsignor) on May 08, 2009 at 19:22 UTC

    You can just open all the output files first.   Something like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %files;
    for my $i ( 1 .. 15 ) {
        open my $OUT, '>', "output_$i.txt"
            or die "can not create 'output_$i.txt' $!\n";
        my $pattern = sprintf 'S%05dGM', $i;
        $files{ $pattern } = { fh => $OUT, count => 0 };
    }

    open my $FH, '<', 'input.txt' or die "can not open 'input.txt' $!\n";
    while ( my $line = <$FH> ) {
        my $four = ( split /\t/, $line )[ 3 ];
        my $key  = substr $four, 0, 8;
        if ( exists $files{ $key } ) {
            print { $files{ $key }{ fh } } $line;
            $files{ $key }{ count }++;
        }
    }
    close $FH;

    for my $key ( keys %files ) {
        unless ( $files{ $key }{ count } ) {
            print { $files{ $key }{ fh } } "NO HITS\n";
        }
    }
Re: segregating data from one file to many files
by dwm042 (Priest) on May 08, 2009 at 19:15 UTC
    I think the mistake in the code is that you're reading the input file multiple times without reopening (or rewinding) it, but beyond that, the program itself is inefficient because it attempts to scan the data many times.

    I'd rewrite the program to read the data once into a hash and then write many times, something like this:

    #! /usr/bin/perl
    use warnings;
    use strict;

    my %hash;
    while(<DATA>) {
        chomp;
        my @col = split " ", $_;
        next unless exists $col[3];
        next unless $col[3] =~ /^S\d{5}GM\d{3}$/;
        my $key = substr($col[3],1,5);
        push @{$hash{$key}} , [ @col ];
    }

    for ( sort keys %hash ) {
        my $i = $_;
        $i =~ s/^0+//g;
        my $file = "output_$i.txt";
        # open FILE, ">", "$file" or
        #     die("Cannot open file $file\n");
        print "FILE: $file\n";
        for my $col ( @{$hash{$_}} ) {
            print join (" ", @$col), "\n";
            # print FILE join (" ", @$col), "\n";
        }
        # close (FILE);
    }
    __DATA__
    11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
    13804 14685 - S00010GM002 sml_045 sp|YV02643 desc
    15525 18026 - S00001GM001 sml_032 sp|V023334 desc
    32763 34239 + S00002GM001 sml_028 sp|YV02376 desc
    67929 68933 - S00003GM001 sml_025 sp|YV02346 desc
    90562 91368 + S00012GM001 sml_025 sp|YV02376 desc
    10209 10433 - S00012GM002 sml_046 sp|YV02355 desc
    12522 12576 + S00013GM001 sml_027 sp|0235777 desc
    13247 13349 - S00013GM002 sml_088 sp|YV02375 desc
    Please note that the commented-out open, close, and print statements have not been tested.

    The results are:

    C:\Perl>perl onfour.pl
    FILE: output_1.txt
    15525 18026 - S00001GM001 sml_032 sp|V023334 desc
    FILE: output_2.txt
    32763 34239 + S00002GM001 sml_028 sp|YV02376 desc
    FILE: output_3.txt
    67929 68933 - S00003GM001 sml_025 sp|YV02346 desc
    FILE: output_10.txt
    11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
    13804 14685 - S00010GM002 sml_045 sp|YV02643 desc
    FILE: output_12.txt
    90562 91368 + S00012GM001 sml_025 sp|YV02376 desc
    10209 10433 - S00012GM002 sml_046 sp|YV02355 desc
    FILE: output_13.txt
    12522 12576 + S00013GM001 sml_027 sp|0235777 desc
    13247 13349 - S00013GM002 sml_088 sp|YV02375 desc
      Hi, thank you very much for the program. Since the script I wrote was very, very slow, I tried the one you have given. It works fine, but in the line
      print FILE join (" ", @$col), "\n";
      I tried replacing the space with a tab:
      print FILE join ("\t", @$col), "\n";
      It works fine, but I am sorry to say this so late: the last column "desc" is sometimes more than one word, like the sentence "desc not found". Moreover, in my original input text file the columns are separated by tabs. When I join @$col with \t, "desc not found" gets printed separated by tabs instead of a single space. Only the last column can contain sentence-like data; the rest of the columns are perfectly alright. How do I overcome this problem? Please help. Thank you once again :)
        The simplest way to overcome the problem you encounter is to count the number of fields in @$col and handle the last few terms separately.

        #! /usr/bin/perl
        use warnings;
        use strict;

        my %hash;
        while(<DATA>) {
            chomp;
            my @col = split " ", $_;
            next unless exists $col[3];
            next unless $col[3] =~ /^S\d{5}GM\d{3}$/;
            my $key = substr($col[3],1,5);
            push @{$hash{$key}} , [ @col ];
        }

        for ( sort keys %hash ) {
            my $i = $_;
            $i =~ s/^0+//g;
            my $file = "output_$i.txt";
            # open FILE, ">", "$file" or
            #     die("Cannot open file $file\n");
            print "FILE: $file\n";
            for my $col ( @{$hash{$_}} ) {
                my $col_count = scalar @$col - 1;
                if ( $col_count > 6 ) {
                    #
                    # use array slices to partition data.
                    #
                    my $end   = join " ",  @$col[6..$col_count];
                    my $begin = join "\t", @$col[0..5];
                    print $begin, "\t", $end, "\n";
                }
                else {
                    print join ("\t", @$col), "\n";
                }
                # replace prints with print FILE etc.
            }
            # close (FILE);
        }
        __DATA__
        11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
        13804 14685 - S00010GM002 sml_045 sp|YV02643 desc
        15525 18026 - S00001GM001 sml_032 sp|V023334 desc
        32763 34239 + S00002GM001 sml_028 sp|YV02376 desc
        67929 68933 - S00003GM001 sml_025 sp|YV02346 desc not found
        90562 91368 + S00012GM001 sml_025 sp|YV02376 desc not found
        10209 10433 - S00012GM002 sml_046 sp|YV02355 desc
        12522 12576 + S00013GM001 sml_027 sp|0235777 desc
        13247 13349 - S00013GM002 sml_088 sp|YV02375 desc
        The output is:

        C:\Code>perl onfour.pl
        FILE: output_1.txt
        15525 18026 - S00001GM001 sml_032 sp|V023334 desc
        FILE: output_2.txt
        32763 34239 + S00002GM001 sml_028 sp|YV02376 desc
        FILE: output_3.txt
        67929 68933 - S00003GM001 sml_025 sp|YV02346 desc not found
        FILE: output_10.txt
        11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
        13804 14685 - S00010GM002 sml_045 sp|YV02643 desc
        FILE: output_12.txt
        90562 91368 + S00012GM001 sml_025 sp|YV02376 desc not found
        10209 10433 - S00012GM002 sml_046 sp|YV02355 desc
        FILE: output_13.txt
        12522 12576 + S00013GM001 sml_027 sp|0235777 desc
        13247 13349 - S00013GM002 sml_088 sp|YV02375 desc
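An alternative to the slice approach above, sketched here rather than taken from the thread's code: since the real input file is tab-separated, splitting on /\t/ in the first place keeps the spaces inside the last field intact, so no repartitioning is needed (the sample line mimics the described input layout):

```perl
use strict;
use warnings;

# Split on the actual delimiter instead of generic whitespace.
my $line = "11880\t13417\t-\tS00010GM001\tsml_056\tsp|YV02233\tdesc not found";
my @col  = split /\t/, $line;   # 7 fields; "desc not found" stays whole
print $col[6], "\n";            # desc not found
```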

Node Type: perlquestion [id://762908]
Approved by ikegami