http://qs321.pair.com?node_id=11104413


in reply to Complex file manipulation challenge

My code may not be as elegant as others', and my approach, while attempting to follow the spirit of the guidelines, definitely does not follow the letter of them.

Knowing that I would generate N files, I would retrieve the ordered list from step 1. I would then build an array of arrays (AoA), pushing each file name onto the sub-array for the output file it belongs to. Given 12 files of ascending size and a target of 5 output files, for example, I would create the following:

@set = (
    [ 'file00.csv', 'file01.csv', 'file02.csv', ],
    [ 'file03.csv', 'file04.csv', 'file05.csv', ],
    [ 'file06.csv', 'file07.csv', ],
    [ 'file08.csv', 'file09.csv', ],
    [ 'file10.csv', 'file11.csv', ],
);

The partitioning would be accomplished by a loop similar to the following:
my $n = 5;
my @set;
my $file_count;
my $partition_size;
my $remainder;

$file_count = scalar @file;    # 12
if ( $file_count >= $n ) {
    $partition_size = int( $file_count / $n );    # 2
    $remainder      = $file_count % $n;           # 2
}
else {
    $partition_size = 1;
    $remainder      = 0;
}

my $i = 0;
while ( scalar @file ) {
    foreach my $j ( 1 .. $partition_size ) {
        my $fn = shift @file;
        push @{ $set[$i] }, $fn;
    }
    if ( $i < $remainder ) {
        my $fn = shift @file;
        push @{ $set[$i] }, $fn;
    }
    $i++;
}
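
As a quick sanity check of the arithmetic in that loop, here is a minimal, self-contained sketch. The helper name partition_names() and the use of splice in place of repeated shift are illustrative assumptions, not part of the code above. It prints the group sizes for the 12-file example and shows that asking for more groups than there are files yields one group per file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a list of names into at most $n groups,
# keeping the original order. The first ($count % $n) groups each take
# one extra name.
sub partition_names {
    my ( $n, @file ) = @_;
    my @set;
    return @set if $n < 1 || !@file;

    my $count = scalar @file;
    my ( $size, $remainder ) =
        $count >= $n
      ? ( int( $count / $n ), $count % $n )
      : ( 1, 0 );    # fewer files than groups: one file per group

    my $i = 0;
    while (@file) {
        my $take = $size + ( $i < $remainder ? 1 : 0 );
        push @{ $set[$i] }, splice( @file, 0, $take );
        $i++;
    }
    return @set;
}

my @names = map { sprintf 'file%02d.csv', $_ } 0 .. 11;

my @five = partition_names( 5, @names );
print join( q{ }, map { scalar @{$_} } @five ), "\n";    # prints "3 3 2 2 2"

my @few = partition_names( 5, @names[ 0 .. 2 ] );
print scalar(@few), " groups\n";                          # prints "3 groups"
```

Because the input list is sorted by ascending size, the extra files go to the groups that hold the smallest files, which helps keep the output sizes from drifting too far apart.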

At this point, it seems (at first blush) relatively easy to open each intended output file, loop through its list of input files using Text::CSV to read them line by line (skipping the first line of each), and write the lines to the output file through an IO::Compress::Gzip file handle using Text::CSV's print() method.

Because every input file goes whole into exactly one output file, this avoids writing the temporary file, and no marker is needed to keep lines from one input file from being split across subfiles.
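
The core of that idea, for a single output file, might look like the following minimal sketch. The helper name merge_group(), the temporary test files, and the choice to read raw bytes rather than decode UTF-8 are all assumptions made for illustration. It assumes Text::CSV is installed; IO::Compress::Gzip, IO::Uncompress::Gunzip and File::Temp ship with Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use File::Temp qw( tempdir );
use IO::Compress::Gzip qw( $GzipError );
use IO::Uncompress::Gunzip qw( gunzip $GunzipError );
use Text::CSV;

# Hypothetical helper: append the data rows (header skipped) of every
# input CSV in @$inputs to one gzip-compressed output file.
sub merge_group {
    my ( $out_name, $inputs ) = @_;

    my $csv =
      Text::CSV->new( { binary => 1, auto_diag => 1, eol => "\n" } );
    my $z = IO::Compress::Gzip->new($out_name)
      or die "IO::Compress::Gzip failed: $GzipError\n";

    my $rows = 0;
    foreach my $in_name ( @{$inputs} ) {
        # Raw bytes in, raw bytes out; binary => 1 passes them through.
        open my $in, '<', $in_name or die "$in_name: $!";
        $csv->getline($in);    # read and discard the header row
        while ( my $row = $csv->getline($in) ) {
            $csv->print( $z, $row );
            $rows++;
        }
        close $in;
    }
    $z->close;
    return $rows;
}

# Usage: two tiny input files, merged into one compressed output.
my $dir = tempdir( CLEANUP => 1 );
my @inputs;
foreach my $i ( 0 .. 1 ) {
    my $name = "$dir/in$i.csv";
    open my $fh, '>', $name or die "$name: $!";
    print {$fh} "id,name\n", "$i,row$i\n";
    close $fh;
    push @inputs, $name;
}

my $out  = "$dir/merged-00.csv.gz";
my $rows = merge_group( $out, \@inputs );

gunzip $out => \my $text or die "gunzip failed: $GunzipError\n";
print "$rows data rows written\n", $text;
```

Note that the header is dropped from every input, so the merged output carries no header row at all.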

Thoughts?

Code implementing the above process:

#!/usr/bin/perl

use strict;
use warnings;

use Cwd;
use Data::Dumper;
use File::Spec;
use Getopt::Long;
use IO::Compress::Gzip qw( $GzipError );
use Text::CSV;

$Data::Dumper::Deepcopy = 1;
$Data::Dumper::Sortkeys = 1;

$| = 1;

srand();

my $output_files = 5;
my $outfile_name = $0 . q{.csv};
my $path         = q{./};

$outfile_name =~ s/\.pl.*$//g;

GetOptions(
    q{help} => sub {
        &help(
            output_files => $output_files,
            outfile_name => $outfile_name,
            path         => $path,
        );
    },
    q{output_files:i} => \$output_files,
    q{outfile_name:s} => \$outfile_name,
    q{path:s}         => \$path,
);

my $start_dir = getcwd;

if ( !-d $path ) {
    die qq{Directory $path not found: $!\n};
}

my @file = get_files( path => $path, );

my @set = partition_files(
    files => \@file,
    n     => $output_files,
);

write_subfiles(
    set    => \@set,
    prefix => $outfile_name,
);

#
# Subroutines
#
sub help {
    my ( %param, ) = @_;

    print sprintf <<HELP_TEXT, $param{outfile_name}, $param{output_files}, $param{path};
Usage:
  $0
  $0 [--help]
  $0 [--output_files N] [--outfile_name str] [--path str]

Where:
  outfile_name str - Output filename prefix (naming will be
                     {prefix}-nn.csv.gz; default: %s).
  output_files N   - Divide data into at most N files (data in the
                     same input file will appear in the same file;
                     default: %d).
  path str         - Path to process (default: %s).
HELP_TEXT

    exit;
}

sub get_files {
    my ( %param, ) = @_;
    my @file = ();

    if ( !exists $param{path} ) {
        return @file;
    }

    opendir my $dir, $param{path} or die qq{$param{path}: $!\n};
    while ( my $fn = readdir($dir) ) {
        next if ( $fn =~ m/^\.{1,2}$/ );       # skip . and ..
        next unless ( $fn =~ m/\.csv$/i );

        # Keep the directory part so that -s and open still work
        # when --path is not the current directory.
        push @file, File::Spec->catfile( $param{path}, $fn );
    }
    closedir $dir;

    # Smallest files first.
    @file = sort { -s $a <=> -s $b } @file;

    return @file;
}

sub partition_files {
    my (%param) = @_;
    my @set;
    my $file_count;
    my $partition_size;
    my $remainder;
    my $n    = $param{n};
    my @file = @{ $param{files} };

    # Guard against division by zero below.
    die qq{output_files must be at least 1\n} if $n < 1;

    $file_count = scalar @file;    # 12
    if ( $file_count >= $n ) {
        $partition_size = int( $file_count / $n );    # 2
        $remainder      = $file_count % $n;           # 2
    }
    else {
        $partition_size = 1;
        $remainder      = 0;
    }

    my $i = 0;
    while ( scalar @file ) {
        foreach my $j ( 1 .. $partition_size ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        if ( $i < $remainder ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        $i++;
    }

    return @set;
}

sub write_subfiles {
    my (%param) = @_;
    my @set    = @{ $param{set} };
    my $prefix = $param{prefix};

    # Nothing to write when no CSV files were found
    # (and log(0) below would die).
    return if !@set;

    my $name_format =
        $prefix . q{-} . q{%0}
      . int( log( scalar @set ) / log(10) + 1 + 1 )
      . q{d} . q{.csv} . q{.gz};

    my $csv = Text::CSV->new(
        { binary => 1, auto_diag => 1, eol => $/, } );

    foreach my $i ( 0 .. $#set ) {
        my $fn = sprintf $name_format, $i;
        my $z = IO::Compress::Gzip->new( $fn,
            -Level => IO::Compress::Gzip::Z_BEST_COMPRESSION, )
          or die qq{IO::Compress::Gzip failed: $GzipError\n};

        foreach my $ifn ( @{ $set[$i] } ) {
            my $flag = 1;    # true until the header row is skipped
            open my $ifh, q{<:encoding(utf8)}, $ifn
              or die qq{$ifn: $!};
            while ( my $row = $csv->getline($ifh) ) {
                if ($flag) {
                    $flag--;
                    next;
                }
                my $status = $csv->print( $z, $row, );
                $row = undef;
            }
            close $ifh;
        }
        $z->close;
    }
}

2019-08-13: Edited for case of fewer files than requested partitions (will create only as many partitions as files exist).

2019-08-13: Added code implementing the described process.

2019-08-13: Reformatted added code using perltidy -l 60 -ple.