PerlMonks  

Re: Complex file manipulation challenge

by atcroft (Abbot)
on Aug 13, 2019 at 19:20 UTC


in reply to Complex file manipulation challenge

My code may not be as elegant as others', and my approach, while attempting to follow the spirit of the guidelines, definitely does not follow the letter of them.

Knowing that I would generate N files, I would retrieve the ordered list from step 1. At that point, I would create an AoA (array of arrays) and push each file name into the appropriate sub-array. Given 12 files of ascending size and a target of 5 output files, for example, I would create the following:

@set = (
    [ 'file00.csv', 'file01.csv', 'file02.csv', ],
    [ 'file03.csv', 'file04.csv', 'file05.csv', ],
    [ 'file06.csv', 'file07.csv', ],
    [ 'file08.csv', 'file09.csv', ],
    [ 'file10.csv', 'file11.csv', ],
)

The partitioning would be accomplished by a loop similar to the following:

# @file holds the ordered list of file names from step 1.
my $n = 5;
my @set;
my $file_count;
my $partition_size;
my $remainder;

$file_count = scalar @file;    # 12
if ( $file_count >= $n ) {
    $partition_size = int( $file_count / $n );    # 2
    $remainder      = $file_count % $n;           # 2
}
else {
    # Fewer files than partitions: one file per partition.
    $partition_size = 1;
    $remainder      = 0;
}

my $i = 0;
while ( scalar @file ) {
    foreach my $j ( 1 .. $partition_size ) {
        my $fn = shift @file;
        push @{ $set[$i] }, $fn;
    }
    # The first $remainder partitions take one extra file each.
    if ( $i < $remainder ) {
        my $fn = shift @file;
        push @{ $set[$i] }, $fn;
    }
    $i++;
}
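
The arithmetic can also be checked in isolation. What follows is only a minimal sketch: partition_list is a hypothetical helper name, not something from the code in this node, and it uses splice rather than repeated shift, but it is meant to distribute the names the same way (the first count % n groups get one extra element, and fewer names than groups yields one name per group).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original code): split a list
# of names into at most $n consecutive groups. The first
# ( count % $n ) groups receive one extra element.
sub partition_list {
    my ( $n, @items ) = @_;
    die "n must be a positive integer\n" if $n < 1;

    my $count = scalar @items;
    my $size  = $count >= $n ? int( $count / $n ) : 1;
    my $extra = $count >= $n ? $count % $n        : 0;

    my @groups;
    while (@items) {
        # Groups created so far tell us which group index we are on.
        my $take = $size + ( @groups < $extra ? 1 : 0 );
        push @groups, [ splice @items, 0, $take ];
    }
    return @groups;
}

my @names = map { sprintf 'file%02d.csv', $_ } 0 .. 11;
my @sizes = map { scalar @{$_} } partition_list( 5, @names );
print "@sizes\n";    # 3 3 2 2 2
```

With 12 names and n = 5 this gives groups of 3, 3, 2, 2 and 2, matching the @set shown above.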

At this point, it would seem at first blush to be relatively easy to open each intended output file, loop through its list of input files using Text::CSV to read them row by row (skipping the first line of each), and write the rows to the output file through an IO::Compress::Gzip file handle with Text::CSV's print() method.

This avoids writing a temporary file, and it avoids having to add a marker to keep the lines of one input file from being split across subfiles.
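
As a rough illustration of streaming straight into the compressed output, here is a minimal sketch that uses core modules only. The helper names concat_without_headers and read_gz_lines are hypothetical, and the sketch deliberately swaps Text::CSV for plain line reads, so it assumes no quoted field contains a newline; Text::CSV's getline() is what handles that case properly.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw( tempdir );
use IO::Compress::Gzip qw( $GzipError );
use IO::Uncompress::Gunzip qw( $GunzipError );

# Hypothetical helper: append everything after the first line of each
# input file to one gzip stream, with no temporary file in between.
# Simplification: plain line reads, so embedded newlines in quoted
# fields are NOT handled (Text::CSV would be needed for that).
sub concat_without_headers {
    my ( $out_name, @inputs ) = @_;
    my $z = IO::Compress::Gzip->new($out_name)
        or die "IO::Compress::Gzip failed: $GzipError\n";
    for my $in (@inputs) {
        open my $fh, '<', $in or die "$in: $!";
        my $first = <$fh>;    # discard the first line
        while ( defined( my $line = <$fh> ) ) {
            $z->print($line);
        }
        close $fh;
    }
    $z->close;
    return;
}

# Hypothetical helper: read a gzip file back as a list of lines.
sub read_gz_lines {
    my ($name) = @_;
    my $gz = IO::Uncompress::Gunzip->new($name)
        or die "IO::Uncompress::Gunzip failed: $GunzipError\n";
    my @lines;
    while ( defined( my $line = $gz->getline ) ) {
        push @lines, $line;
    }
    $gz->close;
    return @lines;
}

# Demonstration with two throwaway files in a temporary directory.
my $dir = tempdir( CLEANUP => 1 );
my @inputs;
for my $i ( 1 .. 2 ) {
    my $name = File::Spec->catfile( $dir, "in$i.csv" );
    open my $fh, '>', $name or die "$name: $!";
    print {$fh} "id,value\n", "$i,a\n", "$i,b\n";
    close $fh;
    push @inputs, $name;
}
my $out = File::Spec->catfile( $dir, 'merged.csv.gz' );
concat_without_headers( $out, @inputs );
print read_gz_lines($out);    # four data rows, no "id,value" lines
```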

Thoughts?

Code implementing the above process:

#!/usr/bin/perl

use strict;
use warnings;

use Cwd;
use Data::Dumper;
use File::Spec;
use Getopt::Long;
use IO::Compress::Gzip qw( $GzipError );
use Text::CSV;

$Data::Dumper::Deepcopy = 1;
$Data::Dumper::Sortkeys = 1;

$| = 1;

srand();

my $output_files = 5;
my $outfile_name = $0 . q{.csv};
my $path         = q{./};

$outfile_name =~ s/\.pl.*$//g;

GetOptions(
    q{help} => sub {
        help(
            output_files => $output_files,
            outfile_name => $outfile_name,
            path         => $path,
        );
    },
    q{output_files:i} => \$output_files,
    q{outfile_name:s} => \$outfile_name,
    q{path:s}         => \$path,
);

my $start_dir = getcwd;

if ( !-d $path ) {
    die qq{Directory $path not found: $!\n};
}

my @file = get_files( path => $path, );

my @set = partition_files(
    files => \@file,
    n     => $output_files,
);

write_subfiles(
    set    => \@set,
    prefix => $outfile_name,
);

#
# Subroutines
#
sub help {
    my ( %param, ) = @_;

    # %s / %d in the text are filled in with the current defaults.
    printf <<"HELP_TEXT", $param{outfile_name}, $param{output_files}, $param{path};
Usage:
  $0
  $0 [--help]
  $0 [--output_files N] [--outfile_name str] [--path str]

Where:
  outfile_name str - Output filename prefix (naming will be
                     {prefix}-nn.csv.gz; default: %s).
  output_files N   - Divide data into at most N files (data in
                     the same input file will appear in the
                     same file; default: %d).
  path str         - Path to process (default: %s).

HELP_TEXT
    exit;
}

sub get_files {
    my ( %param, ) = @_;
    my @file = ();
    if ( !exists $param{path} ) {
        return @file;
    }
    opendir my $dir, $param{path} or die $!;
    while ( defined( my $fn = readdir($dir) ) ) {
        next if ( $fn =~ m/^\.{1,2}$/ );      # skip . and ..
        next unless ( $fn =~ m/\.csv$/i );

        # Keep the directory so that -s and open work
        # when --path is not the current directory.
        push @file, File::Spec->catfile( $param{path}, $fn );
    }
    closedir $dir;

    # Order by ascending file size.
    @file = sort { ( -s $a ) <=> ( -s $b ) } @file;
    return @file;
}

sub partition_files {
    my (%param) = @_;
    my @set;
    my $file_count;
    my $partition_size;
    my $remainder;
    my $n    = $param{n};
    my @file = @{ $param{files} };

    $file_count = scalar @file;    # 12
    if ( $file_count >= $n ) {
        $partition_size = int( $file_count / $n );    # 2
        $remainder      = $file_count % $n;           # 2
    }
    else {
        $partition_size = 1;
        $remainder      = 0;
    }

    my $i = 0;
    while ( scalar @file ) {
        foreach my $j ( 1 .. $partition_size ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        if ( $i < $remainder ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        $i++;
    }
    return @set;
}

sub write_subfiles {
    my (%param) = @_;
    my @set    = @{ $param{set} };
    my $prefix = $param{prefix};

    # Nothing to write; also avoids taking log(0) below.
    return if !@set;

    # Zero-padded index, wide enough for the number of sets.
    my $name_format =
        $prefix . q{-} . q{%0}
      . int( log( scalar @set ) / log(10) + 1 + 1 )
      . q{d} . q{.csv} . q{.gz};

    my $csv = Text::CSV->new(
        { binary => 1, auto_diag => 1, eol => $/, } );

    foreach my $i ( 0 .. $#set ) {
        my $fn = sprintf $name_format, $i;
        my $z  = IO::Compress::Gzip->new(
            $fn,
            -Level => IO::Compress::Gzip::Z_BEST_COMPRESSION(),
        ) or die qq{IO::Compress::Gzip failed: $GzipError\n};

        foreach my $ifn ( @{ $set[$i] } ) {
            my $flag = 1;
            open my $ifh, q{<:encoding(utf8)}, $ifn
              or die qq{$ifn: $!};
            while ( my $row = $csv->getline($ifh) ) {

                # Skip the first line of each input file.
                if ($flag) {
                    $flag--;
                    next;
                }
                $csv->print( $z, $row, );
            }
            close $ifh;
        }
        $z->close;
    }
    return;
}

2019-08-13: Edited for case of fewer files than requested partitions (will create only as many partitions as files exist).

2019-08-13: Added code implementing the described process.

2019-08-13: Reformatted added code using perltidy -l 60 -ple.
