http://qs321.pair.com?node_id=1212354

cryptoperl has asked for the wisdom of the Perl Monks concerning the following question:

I have a large file, around 1.4 GB. I am having trouble parsing it line by line and storing it in a single data structure (such as a hash of arrays). My file looks like this:

bos-mp96h:~ jvx$ head asgn.txt
ra_uuid: a37bbde8-36ba-11e8-a697-00e081ea0e98
cms_uuid: 2d937c7e-36ba-11e8-91f1-00e081ea0e8e
mpd_uuid: 6edd7a68-36b0-11e8-a120-00e081ea0e5c
amLeader: 1
numAssignments = 20956857
mpg=1 mrule=140 reg=7989 score=10625 rank=0 perc=100 mp_demand=40
mpg=2 mrule=140 reg=7989 score=10625 rank=0 perc=100 mp_demand=40
mpg=3 mrule=150 reg=7989 score=0 rank=0 perc=100 mp_demand=20
mpg=4 mrule=150 reg=7989 score=10625 rank=0 perc=100 mp_demand=40

So what I am doing instead is making a separate file for each index and then processing those files. What I want is a separate file for each distinct "mrule". For the above snippet, I would have two files, one for mrule 140 and one for mrule 150. My code for it is as follows:

#! /usr/bin/perl -w

use strict;
use warnings;

use Data::Dumper ;

my $source = shift ;
my $lines_per_file = shift ;

open (my $FH, "<$source") or die "Could not open source file. $!";
open (my $OUT, '>', '00000000.log') or die "Could not open destination file. $!";

my $i = 0;
my $index_last = 0 ;
my $index_current = 0;

while(my $line = <$FH>) {
    next unless ($line =~ /mrule/) ;
    if ($line =~ /mrule=([0-9]+)/){
        print $OUT $line;
        $i++ ;
        if ($1 != $index_last){
            $index_current = $1 ;
            close($OUT);
            my $NEW = sprintf("%08d", $index_current);
            open($OUT, ">${NEW}.log") or die "Could not open destination file. $! " ;
        }
        $index_last = $index_current ;
    }
}
close($FH);
close($OUT);

But when I run this, I get the following error:

bos-mp96h:~ jvx$ ./partition_file.pl asgn.txt
Can't use string ("00000140") as a symbol ref while "strict refs" in use at ./partition_file.pl line 26, <$FH> line 6.

Re: Splitting a large file into smaller files according to indexes
by choroba (Archbishop) on Apr 05, 2018 at 16:10 UTC
    The first argument to open is a filehandle or a filehandle reference. You provided $NEW, which is populated on the previous line by sprintf, which returns a string. Did you mean
    open $OUT, '>', $NEW or die ...
    ?
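
    The distinction can be sketched as follows. This is only an illustration, not code from the thread: open_log is a hypothetical helper, and the temporary directory is an assumption so that the example does not write into the current directory. The handle variable goes first and the file name string goes last; passing the string in the first position under "strict refs" is what leads to the "symbol ref" error quoted in the question.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: open "<dir>/<8-digit index>.log" for writing and
# return the lexical filehandle together with the file name.
sub open_log {
    my ($dir, $index) = @_;
    my $name = sprintf("%s/%08d.log", $dir, $index);   # a plain string
    # First argument: the handle variable. Third argument: the file name.
    open(my $out, '>', $name) or die "Could not open '$name': $!";
    return ($out, $name);
}

my $dir = tempdir(CLEANUP => 1);          # assumption: scratch directory
my ($out, $name) = open_log($dir, 140);
print {$out} "mpg=1 mrule=140\n";
close($out) or die "Could not close '$name': $!";
```

    The three-argument form also keeps the open mode separate from the file name, so a name that happens to start with ">" or "<" is not misread as a mode.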

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      The first argument to open is the file itself, "asgn.txt". Thank you for pointing this out, and yes, I meant this:

      open $OUT, '>', $NEW or die ...

      Now the program runs, but the first line of the first index is missing from its file, and each file gets an extra line that belongs to the next index. Like this:

      bos-mp96h:~ jvx$ head -1 00000140.log
      mpg=2 mrule=140 reg=7989 score=10625 rank=0 perc=100 mp_demand=40
      bos-mp96h:~ jvx$ tail -1 00000140.log
      mpg=1 mrule=3 reg=6346 score=10625 rank=0 perc=100 mp_demand=1

      I may have to go through my logic again, but any suggestions would be helpful :)
Re: Splitting a large file into smaller files according to indexes
by bliako (Prior) on Apr 05, 2018 at 20:03 UTC
    I have commented out some lines in your program and added some others, as follows:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Data::Dumper ;

    my $source = shift ;
    my $lines_per_file = shift ;

    open (my $FH, "<$source") or die "Could not open source file. $!";
    # open (my $OUT, '>', '00000000.log') or die "Could not open destination file. $!";
    my $OUT = undef;

    my $i = 0;
    #my $index_last = 0 ;
    my $index_current = -1;

    while(my $line = <$FH>) {
        next unless ($line =~ /mrule/) ;
        if ($line =~ /mrule=([0-9]+)/){
            if( $index_current != $1 ){
                $index_current = $1;
                if( defined($OUT) ){ close($OUT); }
                my $NEW = sprintf("%08d", $index_current);
                open($OUT, ">${NEW}.log") or die "Could not open destination file. $! " ;
            }
            print $OUT $line;
            $i++ ;
    #        if ($1 != $index_last){
    #            $index_current = $1 ;
    #            close($OUT);
    #            my $NEW = sprintf("%08d", $index_current);
    #            open($OUT, ">${NEW}.log") or die "Could not open destination file. $! " ;
    #        }
    #        $index_last = $index_current ;
        }
    }
    close($FH);
    #close($OUT);
    if( defined($OUT) ){ close($OUT); }

    The above looks for an mrule=[0-9]+ pattern in each input line. Once it finds one, it checks whether the current index is the same as the one in the line; if not, it closes the current filehandle and opens another one with the new name. After that, it prints the line to the currently open filehandle.

    Note that no filehandle is opened unless the pattern in the input appears.

    Tested with the minimal file you provided.

    bliako

      Works like a charm, thank you. How do I upvote this answer? Sorry, I am a newbie to perlmonks :)
      Also, why do we initialize $index_current as -1?

        It is initialised to -1 (or whatever other value your mrule number will never take) in order to force the opening of the file upon seeing the mrule pattern for the first time.
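
        As a variation on the same idea, here is a minimal sketch that uses undef instead of -1 as the "nothing seen yet" marker, so no assumption about which values mrule can take is needed. The helper group_by_mrule is hypothetical and collects lines in memory instead of writing files, purely to make the control flow easy to follow.

```perl
use strict;
use warnings;

# Hypothetical helper: split lines into runs of consecutive equal mrule
# values. Returns a list of [ $mrule, \@lines ] pairs in input order.
sub group_by_mrule {
    my @lines = @_;
    my @groups;
    my $current;                                # undef: no mrule seen yet
    for my $line (@lines) {
        next unless $line =~ /mrule=([0-9]+)/;
        if (!defined($current) || $current != $1) {
            $current = $1;
            push @groups, [ $current, [] ];     # analogous to opening a new file
        }
        push @{ $groups[-1][1] }, $line;        # analogous to print $OUT $line
    }
    return @groups;
}

my @groups = group_by_mrule(
    "numAssignments = 20956857\n",
    "mpg=1 mrule=140 reg=7989\n",
    "mpg=2 mrule=140 reg=7989\n",
    "mpg=3 mrule=150 reg=7989\n",
);
printf "%s: %d line(s)\n", $_->[0], scalar @{ $_->[1] } for @groups;
```

        The first matching line always starts a new group because $current is still undefined, which is the same effect the -1 has in the program above.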

        Glad it worked (it's Perl after all) but I stress that it is untested by me for more complex cases.

        bliako
Re: Splitting a large file into smaller files according to indexes
by shmem (Chancellor) on Apr 06, 2018 at 08:43 UTC
    Can't use string ("00000140") as a symbol ref while "strict refs" in use at ./partition_file.pl line 26, <$FH> line 6.

    The code you posted doesn't produce this error message; it most likely stems from an earlier version of your program, in which you e.g. used $OUT throughout instead of $NEW. In that case, in line 26, you would be trying to use a string as a filehandle.

    Are you sure that your "mrule" tokens are grouped? If after mrule=150 some line containing mrule=140 is found further down in the source file, you overwrite the file 00000140.log with that line.

    To avoid that, open a file for every mrule token and store the open filehandles in a hash:

    my %fh;
    while ( my $line = <$FH> ) {
        if ( $line =~ /mrule=([0-9]+)/){
            my $mrule = sprintf("%08d.log", $1);
            if ( ! $fh{$mrule} ) {
                open my $fh, '>', $mrule or die "Can't write to '$mrule': $!\n";
                $fh{$mrule} = $fh;
            }
            print {$fh{$mrule}} $line;
        }
    }
    close $fh{$_} for keys %fh; # close all files

    Note the extra braces in the print statement around $fh{$mrule}. These are necessary to disambiguate $fh{$mrule} as being a filehandle rather than part of the LIST for print.
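
    The block syntax can be tried in isolation with a small sketch like the one below. It is an illustration only: in-memory filehandles (open on a scalar reference) stand in for the .log files, and the %buffer hash is a hypothetical name introduced for the example.

```perl
use strict;
use warnings;

my %buffer = (140 => '', 150 => '');
my %fh;
for my $mrule (keys %buffer) {
    # In-memory filehandles stand in for the .log files here.
    open($fh{$mrule}, '>', \$buffer{$mrule})
        or die "Can't open in-memory handle: $!";
}

# The braces mark the hash element as the filehandle; what follows is the LIST.
print { $fh{140} } "mpg=1 mrule=140\n";
print { $fh{150} } "mpg=3 mrule=150\n";

close $fh{$_} for keys %fh;
```

    A plain scalar such as $OUT can be used without the braces, which is why the earlier programs in this thread get away with print $OUT $line.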

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

      > Are you sure that your "mrule" tokens are grouped ?

      Yes, I am sure the "mrule" tokens are grouped. You would not see something like this:

      mpg=1 mrule=140 reg=7989 score=10625 rank=0 perc=100 mp_demand=100
      mpg=2 mrule=140 reg=7989 score=10625 rank=0 perc=100 mp_demand=40
      mpg=3 mrule=150 reg=7989 score=0 rank=0 perc=100 mp_demand=20
      mpg=4 mrule=150 reg=7989 score=10625 rank=0 perc=100 mp_demand=40
      mpg=3 mrule=140 reg=7989 score=0 rank=0 perc=100 mp_demand=20

      That is, we would not see a line with "mrule=140" repeat before and after a different maprule.

      I tried running your code, but it is probably hitting the open-file limit. After creating 252 files, I get this error:

      bos-mp96h:~ jvx$ ./partition_file.pl asgn.txt
      Can't write to '00000021.log': Too many open files
      bos-mp96h:~ jvx$ ls -ltr | grep 'log' | wc -l
      252

        After creating 252 files, I get an error as ...

        Adding STDOUT, STDERR and STDIN to 252 gives 255... very tight limits you've got on your machine. What OS are you running?
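
        If raising the limit is not an option (in most shells the current value is shown by ulimit -n), one possible workaround is to keep a single output handle open at a time and to open files in append mode, so that a group which reappears later is not overwritten. The following is a minimal sketch under those assumptions; split_by_mrule is a hypothetical helper, and the in-memory input and temporary directory are there only to keep the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write each matching line of $in to
# "<dir>/<8-digit mrule>.log", keeping only one output handle open at a time.
sub split_by_mrule {
    my ($in, $dir) = @_;
    my ($out, $current);
    while (my $line = <$in>) {
        next unless $line =~ /mrule=([0-9]+)/;
        my $mrule = $1;
        if (!defined($current) || $mrule != $current) {
            close($out) if defined $out;
            my $name = sprintf("%s/%08d.log", $dir, $mrule);
            # '>>' appends, so a group that reappears later is not overwritten.
            open($out, '>>', $name) or die "Can't write to '$name': $!";
            $current = $mrule;
        }
        print {$out} $line;
    }
    close($out) if defined $out;
    return;
}

my $dir  = tempdir(CLEANUP => 1);               # assumption: scratch directory
my $data = "mrule=140 a\nmrule=150 b\nmrule=140 c\n";
open(my $in, '<', \$data) or die "Can't open in-memory input: $!";
split_by_mrule($in, $dir);
close($in);
```

        Because of the append mode, output files left over from a previous run would need to be removed first; otherwise new lines are added to the old contents.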

        perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'