
My code may not be as elegant as others', and my approach, while attempting to follow the spirit of the guidelines, definitely does not follow the letter of them.

Knowing that I would generate N files, I would retrieve the ordered list from step 1. At that point, I would create an AoA (array of arrays) and push each file name into the appropriate sub-array. Given 12 files of ascending size and a target of 5 output files, for example, I would create the following:

@set = (
    [ 'file00.csv', 'file01.csv', 'file02.csv', ],
    [ 'file03.csv', 'file04.csv', 'file05.csv', ],
    [ 'file06.csv', 'file07.csv', ],
    [ 'file08.csv', 'file09.csv', ],
    [ 'file10.csv', 'file11.csv', ],
);

The partitioning would be accomplished by a loop similar to the following:

my $n = 5;
my @set;
my $file_count;
my $partition_size;
my $remainder;

$file_count = scalar @file;    # 12
if ( $file_count >= $n ) {
    $partition_size = int( $file_count / $n );    # 2
    $remainder      = $file_count % $n;           # 2
}
else {
    $partition_size = 1;
    $remainder      = 0;
}

my $i = 0;
while ( scalar @file ) {
    foreach my $j ( 1 .. $partition_size ) {
        my $fn = shift @file;
        push @{ $set[$i] }, $fn;
    }
    if ( $i < $remainder ) {
        my $fn = shift @file;
        push @{ $set[$i] }, $fn;
    }
    $i++;
}
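
For comparison, here is a minimal, self-contained sketch of the same idea written as a function that uses splice instead of repeated shift. The name partition_list and its argument order are my own invention for illustration, not part of the script in this post. It caps the number of groups at the number of items, which matches the "fewer files than requested partitions" case.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): split an
# ordered list into at most $n contiguous groups. The first
# (count % $n) groups each receive one extra element.
sub partition_list {
    my ( $n, @items ) = @_;
    return () if $n < 1 || !@items;

    $n = @items if $n > @items;    # never more groups than items
    my $base  = int( @items / $n );
    my $extra = @items % $n;

    my @groups;
    for my $i ( 0 .. $n - 1 ) {
        my $size = $base + ( $i < $extra ? 1 : 0 );
        push @groups, [ splice @items, 0, $size ];
    }
    return @groups;
}

my @names  = map { sprintf 'file%02d.csv', $_ } 0 .. 11;
my @groups = partition_list( 5, @names );

# Group sizes for 12 files and 5 groups: 3 3 2 2 2
print join( q{ }, map { scalar @{$_} } @groups ), "\n";
```

Because the input is already sorted by size, the extra files land in the groups holding the smallest files.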

At this point it would seem, at first blush, a relatively easy thing to open each intended output file, loop through its list of input files using Text::CSV to read them row by row (skipping the first line of each), and write the rows to the output file using an IO::Compress::Gzip file handle and Text::CSV's print() method.

This avoids writing the temporary file, or having to add a marker to avoid splitting lines from an input file when writing the subfiles.

Thoughts?

Code implementing the above process:

#!/usr/bin/perl
use strict;
use warnings;

use Cwd;
use Data::Dumper;
use File::Spec;
use Getopt::Long;
use IO::Compress::Gzip qw( $GzipError :constants );
use Text::CSV;

$Data::Dumper::Deepcopy = 1;
$Data::Dumper::Sortkeys = 1;

$| = 1;

srand();

my $output_files = 5;
my $outfile_name = $0 . q{.csv};
my $path         = q{./};

# Default prefix is the script name without its .pl suffix.
$outfile_name =~ s/\.pl.*$//g;

GetOptions(
    q{help} => sub {
        help(
            output_files => $output_files,
            outfile_name => $outfile_name,
            path         => $path,
        );
    },
    q{output_files:i} => \$output_files,
    q{outfile_name:s} => \$outfile_name,
    q{path:s}         => \$path,
);

my $start_dir = getcwd;

if ( !-d $path ) {
    die qq{Directory $path not found: $!\n};
}

# --output_files with no value yields 0, which would cause a
# division by zero when partitioning.
if ( $output_files < 1 ) {
    die qq{output_files must be at least 1\n};
}

my @file = get_files( path => $path, );
my @set  = partition_files(
    files => \@file,
    n     => $output_files,
);
write_subfiles(
    set    => \@set,
    prefix => $outfile_name,
);

#
# Subroutines
#

# Print usage with the current defaults, then exit.
sub help {
    my ( %param, ) = @_;
    print sprintf <<HELP_TEXT, $param{outfile_name}, $param{output_files}, $param{path};
Usage:
  $0
  $0 [--help]
  $0 [--output_files N] [--outfile_name str] [--path str]

Where:
  outfile_name str - Output filename prefix (naming will be
                     {prefix}-nn.csv.gz; default: %s).
  output_files N   - Divide data into at most N files (data in
                     the same input file will appear in the
                     same file; default: %d).
  path str         - Path to process (default: %s).
HELP_TEXT
    exit;
}

# Return the .csv files under $param{path}, smallest first.
sub get_files {
    my ( %param, ) = @_;
    my @file = ();
    if ( !exists $param{path} ) {
        return @file;
    }
    opendir my $dir, $param{path} or die $!;
    while ( my $fn = readdir($dir) ) {
        next if ( $fn =~ m/^\.{1,2}$/ );
        next unless ( $fn =~ m/\.csv$/i );

        # Keep the directory part so that -s and open() still
        # work when --path is not the current directory.
        push @file,
          File::Spec->catfile( $param{path}, $fn );
    }
    closedir $dir;
    @file = sort { -s $a <=> -s $b } @file;
    return @file;
}

# Split the ordered file list into at most $param{n} groups.
sub partition_files {
    my (%param) = @_;
    my @set;
    my $file_count;
    my $partition_size;
    my $remainder;
    my $n    = $param{n};
    my @file = @{ $param{files} };

    $file_count = scalar @file;    # 12
    if ( $file_count >= $n ) {
        $partition_size = int( $file_count / $n );    # 2
        $remainder      = $file_count % $n;           # 2
    }
    else {
        $partition_size = 1;
        $remainder      = 0;
    }

    my $i = 0;
    while ( scalar @file ) {
        foreach my $j ( 1 .. $partition_size ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        if ( $i < $remainder ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        $i++;
    }

    return @set;
}

# Write one gzipped CSV per group, dropping each input
# file's first (header) line.
sub write_subfiles {
    my (%param) = @_;
    my @set    = @{ $param{set} };
    my $prefix = $param{prefix};

    # Nothing to write; log(0) below would also be fatal.
    return if !@set;

    my $name_format =
        $prefix . q{-} . q{%0}
      . int( log( scalar @set ) / log(10) + 1 + 1 )
      . q{d} . q{.csv} . q{.gz};

    my $csv = Text::CSV->new(
        { binary => 1, auto_diag => 1, eol => $/, } );

    foreach my $i ( 0 .. $#set ) {
        my $fn = sprintf $name_format, $i;
        my $z  = IO::Compress::Gzip->new( $fn,
            -Level => Z_BEST_COMPRESSION, )
          or die
          qq{IO::Compress::Gzip failed: $GzipError\n};

        foreach my $ifn ( @{ $set[$i] } ) {
            my $flag = 1;    # true until the header is skipped
            open my $ifh, q{<:encoding(utf8)}, $ifn
              or die qq{$ifn: $!};
            while ( my $row = $csv->getline($ifh) ) {
                if ($flag) {
                    $flag--;
                    next;
                }
                my $status = $csv->print( $z, $row, );
                $row = undef;
            }
            close $ifh;
        }
        $z->close;
    }
}
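
One detail worth a second look is the sprintf format built in write_subfiles: the field width comes from log( count ) / log(10), and as far as I know floating-point division like that can land just below a whole number for some powers of ten. The sketch below is one alternative, assuming the intent is "one digit more than the number of digits in the file count". The helper name name_format and its arguments are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the script above): build the
# sprintf format for output names. The numeric field is one
# digit wider than the number of digits in $count, computed
# from the string length instead of a logarithm.
sub name_format {
    my ( $prefix, $count ) = @_;
    my $width = length( int $count ) + 1;
    return sprintf '%s-%%0%dd.csv.gz', $prefix, $width;
}

my $format = name_format( 'out', 5 );
print sprintf( $format, 0 ), "\n";    # out-00.csv.gz
```

For 5 output files this gives the same two-digit width as the logarithm version.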

2019-08-13: Edited for case of fewer files than requested partitions (will create only as many partitions as files exist).

2019-08-13: Added code implementing the described process.

2019-08-13: Reformatted added code using perltidy -l 60 -ple.


In reply to Re: Complex file manipulation challenge by atcroft
in thread Complex file manipulation challenge by jdporter
