PerlMonks  

Re: segregating data from one file to many files

by dwm042 (Priest)
on May 08, 2009 at 19:15 UTC ( [id://762915]=note )


in reply to segregating data from one file to many files

I think the mistake in the code is that you read the input file several times without closing and reopening it (or rewinding it with seek), so every pass after the first starts at end of file and reads nothing. Beyond that, the program is inefficient because it tries to read the same data many times.

I'd rewrite the program to read the data once into a hash and then write many times, something like this:

#! /usr/bin/perl
use warnings;
use strict;

my %hash;
while(<DATA>) {
    chomp;
    my @col = split " ", $_;
    next unless exists $col[3];
    next unless $col[3] =~ /^S\d{5}GM\d{3}$/;
    my $key = substr($col[3],1,5);
    push @{$hash{$key}} , [ @col ];
}
for ( sort keys %hash ) {
    my $i = $_;
    $i =~ s/^0+//g;
    my $file = "output_$i.txt";
#   open FILE, ">", "$file" or
#       die("Cannot open file $file\n");
    print "FILE: $file\n";
    for my $col ( @{$hash{$_}} ) {
        print join (" ", @$col), "\n";
#       print FILE join (" ", @$col), "\n";
    }
#   close (FILE);
}
__DATA__
11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
13804 14685 - S00010GM002 sml_045 sp|YV02643 desc
15525 18026 - S00001GM001 sml_032 sp|V023334 desc
32763 34239 + S00002GM001 sml_028 sp|YV02376 desc
67929 68933 - S00003GM001 sml_025 sp|YV02346 desc
90562 91368 + S00012GM001 sml_025 sp|YV02376 desc
10209 10433 - S00012GM002 sml_046 sp|YV02355 desc
12522 12576 + S00013GM001 sml_027 sp|0235777 desc
13247 13349 - S00013GM002 sml_088 sp|YV02375 desc

Please note that the commented-out open, close, and print statements have not been tested.

The results are:

C:\Perl>perl onfour.pl
FILE: output_1.txt
15525 18026 - S00001GM001 sml_032 sp|V023334 desc
FILE: output_2.txt
32763 34239 + S00002GM001 sml_028 sp|YV02376 desc
FILE: output_3.txt
67929 68933 - S00003GM001 sml_025 sp|YV02346 desc
FILE: output_10.txt
11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
13804 14685 - S00010GM002 sml_045 sp|YV02643 desc
FILE: output_12.txt
90562 91368 + S00012GM001 sml_025 sp|YV02376 desc
10209 10433 - S00012GM002 sml_046 sp|YV02355 desc
FILE: output_13.txt
12522 12576 + S00013GM001 sml_027 sp|0235777 desc
13247 13349 - S00013GM002 sml_088 sp|YV02375 desc
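
For anyone who wants to turn on the file writing, here is a minimal sketch of how the commented-out open, print, and close calls might look with lexical filehandles. It is not the tested program above: the sub names group_rows and write_groups and the $dir parameter are made up for illustration, and the sample rows are just the first three lines of the data shown earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group rows by the five digits that follow the leading "S" in column 4.
# (group_rows is a hypothetical helper name, not part of the original post.)
sub group_rows {
    my @lines = @_;
    my %hash;
    for my $line (@lines) {
        chomp $line;
        my @col = split " ", $line;
        next unless defined $col[3];
        next unless $col[3] =~ /^S(\d{5})GM\d{3}$/;
        push @{ $hash{$1} }, [@col];
    }
    return \%hash;
}

# Write one output_N.txt per group into $dir and return the file names.
# ($dir is an assumed parameter; the original writes to the current directory.)
sub write_groups {
    my ($groups, $dir) = @_;
    my @written;
    for my $key (sort keys %$groups) {
        (my $i = $key) =~ s/^0+//;    # strip leading zeros, as in the original
        my $file = "$dir/output_$i.txt";
        open my $fh, '>', $file or die "Cannot open file $file: $!\n";
        print {$fh} join(" ", @$_), "\n" for @{ $groups->{$key} };
        close $fh or die "Cannot close file $file: $!\n";
        push @written, $file;
    }
    return @written;
}

my @sample = (
    "11880 13417 - S00010GM001 sml_056 sp|YV02233 desc",
    "13804 14685 - S00010GM002 sml_045 sp|YV02643 desc",
    "15525 18026 - S00001GM001 sml_032 sp|V023334 desc",
);
my @files = write_groups(group_rows(@sample), ".");
print "FILE: $_\n" for @files;
```

Run as is, this should create output_1.txt and output_10.txt in the current directory. The lexical filehandle ($fh) avoids the global FILE bareword, and including $! in the die message reports why an open failed.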

Replies are listed 'Best First'.
Re^2: segregating data from one file to many files
by patric (Acolyte) on May 09, 2009 at 18:55 UTC
    Hi, thank you very much for the program. Since the script I wrote was very, very slow, I tried the one you gave. It works fine, but in the line
    print FILE join (" ", @$col), "\n";
    I tried replacing the space with a tab:
    print FILE join ("\t", @$col), "\n";
    It works fine. But I am sorry to say this so late: the last column, "desc", is sometimes more than one word, so it reads like a sentence, e.g. "desc not found". Moreover, in my original input text file the columns are separated by tabs. In that case, when I join @$col with \t, the words of "desc not found" are printed separated by tabs instead of single spaces. Only the last column can hold sentence-like data; the rest of the columns are fine. How do I overcome this problem? Please help. Thank you once again :)
      The simplest way to overcome the problem you describe is to count the number of fields in @$col and handle the trailing ones separately: join the first six with tabs and everything after them with single spaces.

      #! /usr/bin/perl
      use warnings;
      use strict;

      my %hash;
      while(<DATA>) {
          chomp;
          my @col = split " ", $_;
          next unless exists $col[3];
          next unless $col[3] =~ /^S\d{5}GM\d{3}$/;
          my $key = substr($col[3],1,5);
          push @{$hash{$key}} , [ @col ];
      }
      for ( sort keys %hash ) {
          my $i = $_;
          $i =~ s/^0+//g;
          my $file = "output_$i.txt";
      #   open FILE, ">", "$file" or
      #       die("Cannot open file $file\n");
          print "FILE: $file\n";
          for my $col ( @{$hash{$_}} ) {
              my $col_count = scalar @$col - 1;
              if ( $col_count > 6 ) {
                  #
                  # use array slices to partition data.
                  #
                  my $end = join " ", @$col[6..$col_count];
                  my $begin = join "\t", @$col[0..5];
                  print $begin, "\t", $end, "\n";
              } else {
                  print join ("\t", @$col), "\n";
              }
              # replace prints with print FILE etc.
          }
      #   close (FILE);
      }
      __DATA__
      11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
      13804 14685 - S00010GM002 sml_045 sp|YV02643 desc
      15525 18026 - S00001GM001 sml_032 sp|V023334 desc
      32763 34239 + S00002GM001 sml_028 sp|YV02376 desc
      67929 68933 - S00003GM001 sml_025 sp|YV02346 desc not found
      90562 91368 + S00012GM001 sml_025 sp|YV02376 desc not found
      10209 10433 - S00012GM002 sml_046 sp|YV02355 desc
      12522 12576 + S00013GM001 sml_027 sp|0235777 desc
      13247 13349 - S00013GM002 sml_088 sp|YV02375 desc
      The output is:

      C:\Code>perl onfour.pl
      FILE: output_1.txt
      15525 18026 - S00001GM001 sml_032 sp|V023334 desc
      FILE: output_2.txt
      32763 34239 + S00002GM001 sml_028 sp|YV02376 desc
      FILE: output_3.txt
      67929 68933 - S00003GM001 sml_025 sp|YV02346 desc not found
      FILE: output_10.txt
      11880 13417 - S00010GM001 sml_056 sp|YV02233 desc
      13804 14685 - S00010GM002 sml_045 sp|YV02643 desc
      FILE: output_12.txt
      90562 91368 + S00012GM001 sml_025 sp|YV02376 desc not found
      10209 10433 - S00012GM002 sml_046 sp|YV02355 desc
      FILE: output_13.txt
      12522 12576 + S00013GM001 sml_027 sp|0235777 desc
      13247 13349 - S00013GM002 sml_088 sp|YV02375 desc
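
Since the original input is said to be tab-separated, another option might be to split on tabs in the first place, so that a multi-word last column never gets broken up. The sketch below shows that, plus a variant that caps a whitespace split at seven fields. The helper names split_tabbed and split_limited are invented here, and the seven-column layout is an assumption taken from the sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a tab-separated row; spaces inside a field are left untouched.
# (split_tabbed is a hypothetical helper name.)
sub split_tabbed {
    my ($line) = @_;
    chomp $line;
    return split /\t/, $line, -1;    # -1 keeps trailing empty fields
}

# Split a whitespace-separated row into at most $n fields; whatever
# follows the first $n-1 fields stays together in the last field.
# (split_limited is a hypothetical helper name.)
sub split_limited {
    my ($line, $n) = @_;
    chomp $line;
    return split " ", $line, $n;
}

# One sample row built with tabs between columns, as in the real input.
my $row = join("\t", qw(67929 68933 - S00003GM001 sml_025 sp|YV02346),
               "desc not found") . "\n";

my @col = split_tabbed($row);
print scalar(@col), " fields, last: '$col[-1]'\n";
print join("\t", @col), "\n";
```

With either helper, "desc not found" stays in a single element, so a plain join("\t", @col) writes the row back out with the last column intact and no special-casing of the trailing fields.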
