Splitting a large file into smaller files according to indexes

cryptoperl has asked for the wisdom of the Perl Monks concerning the following question:

I have a large file, around 1.4 GB. I am having trouble parsing it line by line and storing it in a single data structure (such as a hash of arrays). My file looks like this:

bos-mp96h:~ jvx$ head asgn.txt  
ra_uuid: a37bbde8-36ba-11e8-a697-00e081ea0e98
cms_uuid: 2d937c7e-36ba-11e8-91f1-00e081ea0e8e
mpd_uuid: 6edd7a68-36b0-11e8-a120-00e081ea0e5c
amLeader: 1
numAssignments = 20956857
mpg=1 mrule=140 reg=7989 score=10625 rank=0 perc=100 mp_demand=40
mpg=2 mrule=140 reg=7989 score=10625 rank=0 perc=100 mp_demand=40
mpg=3 mrule=150 reg=7989 score=0 rank=0 perc=100 mp_demand=20
mpg=4 mrule=150 reg=7989 score=10625 rank=0 perc=100 mp_demand=40

So what I am doing instead is writing a separate file for each index and then processing those. What I want is a different file for each distinct "mrule". For the above snippet, I would have two files, one for mrule 140 and one for mrule 150. My code is as follows:

#! /usr/bin/perl -w
use strict;
use warnings;
use Data::Dumper ; 

my $source = shift ;
my $lines_per_file = shift ; 

 open (my $FH, "<$source") or die "Could not open source file. $!";
 open (my $OUT, '>', '00000000.log') or die "Could not open destination file. $!";

my $i = 0;
my $index_last = 0 ;
my $index_current = 0;  

while(my $line = <$FH>) {
    next unless ($line =~ /mrule/) ; 
    if ($line =~ /mrule=([0-9]+)/){
        print $OUT $line; 
        $i++ ;

        if ($1 != $index_last){
                $index_current = $1 ; 
                close($OUT); 
                my $NEW = sprintf("%08d", $index_current);
                open($OUT, ">${NEW}.log") or die "Could not open destination file. $! " ; 
                } 
                $index_last = $index_current ; 
    }
}

close($FH);
close($OUT);

But when I run this, I get the following error:

bos-mp96h:~ jvx$ ./partition_file.pl asgn.txt  
Can't use string ("00000140") as a symbol ref while "strict refs" in use at ./partition_file.pl line 26, <$FH> line 6.


Replies are listed 'Best First'.
Re: Splitting a large file into smaller files according to indexes by choroba (Cardinal) on Apr 05, 2018 at 16:10 UTC
The first argument to open is a file handle or file handle reference. You provided $NEW, which is populated on the previous line by sprintf, which returns a string. Did you mean `open $OUT, '>', $NEW or die ...`?
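The distinction between the handle slot and the file name can be reproduced in isolation. The following is only a minimal sketch, not the poster's earlier script: the scratch directory, the variable names and the 'ignored.log' placeholder are assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Scratch directory and file name are illustrative only.
my $dir  = tempdir(CLEANUP => 1);
my $name = File::Spec->catfile($dir, sprintf("%08d.log", 140));

# Mistake: the string holding the file name sits in the handle slot.
# Under "strict refs" Perl dies instead of treating it as a symbol name.
my $error = '';
eval { open($name, '>', 'ignored.log'); 1 } or $error = $@;
print "Rejected: $error";

# Intended form: lexical handle first, then mode, then file name.
open(my $out, '>', $name) or die "Could not open '$name': $!";
print {$out} "mpg=1 mrule=140 reg=7989\n";
close($out) or die "Could not close '$name': $!";
print "Wrote ", -s $name, " bytes to $name\n";
```

The three-argument form also keeps the mode separate from the file name, so a name that happens to start with `>` or `<` cannot change how the file is opened.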
Re^2: Splitting a large file into smaller files according to indexes by cryptoperl (Novice) on Apr 05, 2018 at 17:27 UTC
The first argument to open is the file itself, "asgn.txt". Thank you for pointing this out, and yes, I meant `open $OUT, '>', $NEW or die ...`. The program now runs, but the first line of the first index is missing and I am getting an extra line that belongs to the next index. Like this:

bos-mp96h:~ jvx$ head -1 00000140.log
mpg=2 mrule=140 reg=7989 score=10625 rank=0 perc=100 mp_demand=40
bos-mp96h:~ jvx$ tail -1 00000140.log
mpg=1 mrule=3 reg=6346 score=10625 rank=0 perc=100 mp_demand=1

I might have to rework my logic, but any suggestions would be helpful :)
Re: Splitting a large file into smaller files according to indexes by bliako (Monsignor) on Apr 05, 2018 at 20:03 UTC
I have commented out some lines in your program and added some others, as follows:

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper ;

my $source = shift ;
my $lines_per_file = shift ;

open (my $FH, "<$source") or die "Could not open source file. $!";
# open (my $OUT, '>', '00000000.log') or die "Could not open destination file. $!";
my $OUT = undef;

my $i = 0;
#my $index_last = 0 ;
my $index_current = -1;

while(my $line = <$FH>) {
    next unless ($line =~ /mrule/) ;
    if ($line =~ /mrule=([0-9]+)/){
        if( $index_current != $1 ){
            $index_current = $1;
            if( defined($OUT) ){ close($OUT); }
            my $NEW = sprintf("%08d", $index_current);
            open($OUT, ">${NEW}.log") or die "Could not open destination file. $! " ;
        }
        print $OUT $line;
        $i++ ;

#        if ($1 != $index_last){
#                $index_current = $1 ;
#                close($OUT);
#                my $NEW = sprintf("%08d", $index_current);
#                open($OUT, ">${NEW}.log") or die "Could not open destination file. $! " ;
#                }
#                $index_last = $index_current ;
    }
}

close($FH);
#close($OUT);
if( defined($OUT) ){ close($OUT); }

The above looks for an `mrule=[0-9]+` pattern in the input. Once it finds one, it checks whether the current index is the same as the one in the line and, if not, closes the current filehandle and opens another one with the new name. After that, it prints to the currently open filehandle. Note that no filehandle is opened unless the pattern appears in the input. Tested with the minimal file you provided. bliako
Re^2: Splitting a large file into smaller files according to indexes by cryptoperl (Novice) on Apr 05, 2018 at 20:43 UTC
Works like a charm, thank you. How do I upvote this answer? Sorry, I am a newbie to PerlMonks :) Also, why do we initialize $index_current to -1?
Re^3: Splitting a large file into smaller files according to indexes by hippo (Bishop) on Apr 05, 2018 at 20:48 UTC
See the section "How do I vote?" in Voting/Experience System.
Re^4: Splitting a large file into smaller files according to indexes by cryptoperl (Novice) on Apr 05, 2018 at 21:50 UTC
Re^5: Splitting a large file into smaller files according to indexes by AnomalousMonk (Archbishop) on Apr 05, 2018 at 22:02 UTC
Re^4: Splitting a large file into smaller files according to indexes by cryptoperl (Novice) on Apr 05, 2018 at 21:49 UTC
Re^3: Splitting a large file into smaller files according to indexes by bliako (Monsignor) on Apr 05, 2018 at 23:32 UTC
It is initialised to -1 (or whatever other value your mrule number will never take) in order to force the opening of the file upon seeing the mrule pattern for the first time. Glad it worked (it's Perl after all), but I stress that it is untested by me for more complex cases. bliako
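If no value can be ruled out as an mrule number, the same "force the first open" effect can be had by leaving the tracking variable undefined. This is a sketch along those lines, not bliako's tested code: the sub name split_grouped, the scratch directory and the in-memory input are assumptions for illustration, and it relies on lines with the same mrule being adjacent.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Split lines from $in into "$dir/<8-digit mrule>.log", assuming all lines
# with the same mrule are adjacent in the input.
sub split_grouped {
    my ($in, $dir) = @_;
    my ($current, $out);    # both undef: no mrule seen, no file open yet
    my @created;
    while (my $line = <$in>) {
        next unless $line =~ /mrule=([0-9]+)/;
        my $mrule = $1;
        if (!defined $current || $mrule != $current) {
            if ($out) { close($out) or die "close failed: $!"; }
            my $path = File::Spec->catfile($dir, sprintf("%08d.log", $mrule));
            open($out, '>', $path) or die "Could not open '$path': $!";
            push @created, $path;
            $current = $mrule;
        }
        print {$out} $line;    # print only after the right file is open
    }
    if ($out) { close($out) or die "close failed: $!"; }
    return @created;
}

# Usage with the sample lines from the question, read from an in-memory handle.
my $sample = <<'END';
numAssignments = 20956857
mpg=1 mrule=140 reg=7989 score=10625 rank=0 perc=100 mp_demand=40
mpg=2 mrule=140 reg=7989 score=10625 rank=0 perc=100 mp_demand=40
mpg=3 mrule=150 reg=7989 score=0 rank=0 perc=100 mp_demand=20
mpg=4 mrule=150 reg=7989 score=10625 rank=0 perc=100 mp_demand=40
END
open(my $in, '<', \$sample) or die "in-memory open failed: $!";
my @files = split_grouped($in, tempdir(CLEANUP => 1));
close($in);
print "$_\n" for @files;
```

Because the switch to a new file happens before the print, the first line of each group lands in its own file, which is the ordering problem the original loop had.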
Re^4: Splitting a large file into smaller files according to indexes by bliako (Monsignor) on Apr 06, 2018 at 12:23 UTC
Re: Splitting a large file into smaller files according to indexes by shmem (Chancellor) on Apr 06, 2018 at 08:43 UTC
> Can't use string ("00000140") as a symbol ref while "strict refs" in use at ./partition_file.pl line 26, <$FH> line 6.

The code you posted doesn't produce this error message; it most likely stems from an earlier version of your program, in which you e.g. used $OUT throughout instead of $NEW. In that case, in line 26, you would be trying to use a string as a filehandle.

Are you sure that your "mrule" tokens are grouped? If, after mrule=150, some line containing mrule=140 is found further down in the source file, you overwrite the file 00000140.log with that line. To avoid that, open a file for every mrule token and store the open filehandles in a hash:

my %fh;
while ( my $line = <$FH> ) {
    if ( $line =~ /mrule=([0-9]+)/){
        my $mrule = sprintf("%08d.log", $1);
        if ( ! $fh{$mrule} ) {
            open my $fh, '>', $mrule or die "Can't write to '$mrule': $!\n";
            $fh{$mrule} = $fh;
        }
        print {$fh{$mrule}} $line;
    }
}
close $fh{$_} for keys %fh; # close all files

Note the extra braces in the print statement around `$fh{$mrule}`. These are necessary to disambiguate `$fh{$mrule}` as being a filehandle rather than part of the LIST for print.
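The brace rule for print can be tried without touching the disk. A small sketch, using in-memory buffers as stand-ins for the per-mrule logs (the buffer names and the shortened sample lines are illustrative assumptions):

```perl
use strict;
use warnings;

# Two in-memory "files" stand in for the per-mrule logs (illustrative only).
my %buffer = (140 => '', 150 => '');
my %fh;
for my $mrule (sort keys %buffer) {
    open(my $h, '>', \$buffer{$mrule}) or die "in-memory open failed: $!";
    $fh{$mrule} = $h;
}

# Block form: the braces mark the expression as the filehandle.
print {$fh{140}} "mpg=1 mrule=140\n";

# Equivalent: copy the handle into a plain scalar first.
my $out = $fh{150};
print $out "mpg=3 mrule=150\n";

# Without braces, `print $fh{150} "text"` is a syntax error, because only a
# bareword, a simple scalar variable, or a block may occupy the handle slot.

for my $h (values %fh) {
    close($h) or die "close failed: $!";
}
print "140: $buffer{140}150: $buffer{150}";
```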
Re^2: Splitting a large file into smaller files according to indexes by cryptoperl (Novice) on Apr 10, 2018 at 20:38 UTC
> Are you sure that your "mrule" tokens are grouped?

Yes, I am sure the "mrule" tokens are grouped. You would not see something like this:

mpg=1 mrule=140 reg=7989 score=10625 rank=0 perc=100 mp_demand=100
mpg=2 mrule=140 reg=7989 score=10625 rank=0 perc=100 mp_demand=40
mpg=3 mrule=150 reg=7989 score=0 rank=0 perc=100 mp_demand=20
mpg=4 mrule=150 reg=7989 score=10625 rank=0 perc=100 mp_demand=40
mpg=3 mrule=140 reg=7989 score=0 rank=0 perc=100 mp_demand=20

That is, we would not see lines with "mrule=140" both before and after a different mrule. I tried running your code, but it is probably hitting the open-file limit. After creating 252 files, I get this error:

bos-mp96h:~ jvx$ ./partition_file.pl asgn.txt
Can't write to '00000021.log': Too many open files
bos-mp96h:~ jvx$ ls -ltr | grep 'log' | wc -l
252
Re^3: Splitting a large file into smaller files according to indexes by shmem (Chancellor) on Apr 12, 2018 at 11:31 UTC
> After creating 252 files, I get an error as ...

Adding STDOUT, STDERR and STDIN to 252 gives 255... very tight limits you've got on your machine. What OS are you running?
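If the per-process limit cannot be raised and the mrule lines are not guaranteed to be grouped, one possible workaround is to cap the number of handles held open and reopen evicted files in append mode. This is a rough sketch, not something proposed in the thread: handle_for, close_all and $MAX_OPEN are hypothetical names, and eviction is simply oldest-opened-first. The input is the ungrouped example quoted above.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

my $MAX_OPEN = 1;             # tiny on purpose; in practice stay below the OS limit
my (%open, %seen, @order);    # path => handle, path => opened before?, open order

# Return a writable handle for $path, closing the oldest one if the pool is full.
sub handle_for {
    my ($path) = @_;
    return $open{$path} if $open{$path};
    if (@order >= $MAX_OPEN) {
        my $oldest = shift @order;
        my $old_fh = delete $open{$oldest};
        close($old_fh) or die "Can't close '$oldest': $!";
    }
    # Truncate on first use, append when a closed file is needed again.
    my $mode = $seen{$path}++ ? '>>' : '>';
    open(my $fh, $mode, $path) or die "Can't write to '$path': $!";
    push @order, $path;
    return $open{$path} = $fh;
}

# Close whatever is still open once the input is exhausted.
sub close_all {
    for my $path (splice @order) {
        my $fh = delete $open{$path};
        close($fh) or die "Can't close '$path': $!";
    }
}

# Usage: the ungrouped example from the thread, written to a scratch directory.
my $dir   = tempdir(CLEANUP => 1);
my @input = (
    "mpg=1 mrule=140 reg=7989 score=10625 rank=0 perc=100 mp_demand=100\n",
    "mpg=2 mrule=140 reg=7989 score=10625 rank=0 perc=100 mp_demand=40\n",
    "mpg=3 mrule=150 reg=7989 score=0 rank=0 perc=100 mp_demand=20\n",
    "mpg=4 mrule=150 reg=7989 score=10625 rank=0 perc=100 mp_demand=40\n",
    "mpg=3 mrule=140 reg=7989 score=0 rank=0 perc=100 mp_demand=20\n",
);
for my $line (@input) {
    next unless $line =~ /mrule=([0-9]+)/;
    my $path = File::Spec->catfile($dir, sprintf("%08d.log", $1));
    print { handle_for($path) } $line;
}
close_all();
```

On Unix-like shells, `ulimit -n` reports the current limit on open file descriptors. Since the poster's data is grouped, the single-handle approach already needs only one output file open at a time; a pool like this only matters for ungrouped input.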
Re^4: Splitting a large file into smaller files according to indexes by cryptoperl (Novice) on Apr 17, 2018 at 20:47 UTC
Re^5: Splitting a large file into smaller files according to indexes by shmem (Chancellor) on Apr 18, 2018 at 15:10 UTC
