My code may not be as elegant as others', and my approach, while attempting to follow the spirit of the guidelines, would definitely not follow the letter of them.

Knowing that I would generate N files, I would retrieve the ordered list from step 1. At that point, I would create an AoA (array of arrays) into which I would push the appropriate file names. Given 12 files of ascending size and a target of 5 output files, for example, I would create the following:

@set = (
    [ 'file00.csv', 'file01.csv', 'file02.csv', ],
    [ 'file03.csv', 'file04.csv', 'file05.csv', ],
    [ 'file06.csv', 'file07.csv', ],
    [ 'file08.csv', 'file09.csv', ],
    [ 'file10.csv', 'file11.csv', ],
);

The partitioning would be accomplished by a loop similar to the following:
# Assumes @file already holds the file names, ordered by size.
my $n = 5;
my @set;
my $file_count;
my $partition_size;
my $remainder;

$file_count = scalar @file;    # 12
if ( $file_count >= $n ) {
    $partition_size = int( $file_count / $n );    # 2
    $remainder      = $file_count % $n;           # 2
}
else {
    $partition_size = 1;
    $remainder      = 0;
}

my $i = 0;
while ( scalar @file ) {
    foreach my $j ( 1 .. $partition_size ) {
        my $fn = shift @file;
        push @{ $set[$i] }, $fn;
    }
    if ( $i < $remainder ) {
        my $fn = shift @file;
        push @{ $set[$i] }, $fn;
    }
    $i++;
}
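
The split performed by the loop can also be stated arithmetically: the first `remainder` groups receive `partition_size + 1` files each, and the remaining groups receive `partition_size`. The following is only a minimal sketch of that calculation. The sub name `partition_sizes` is hypothetical (it is not part of the code in this post), and it assumes, as above, that no more groups are created than there are files.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original code): return how many files
# each group receives when $count files are split into at most $n groups.
sub partition_sizes {
    my ( $count, $n ) = @_;
    return () if $count <= 0 || $n <= 0;

    # Never create more groups than there are files.
    my $groups    = $count < $n ? $count : $n;
    my $base      = int( $count / $groups );
    my $remainder = $count % $groups;

    # The first $remainder groups take one extra file each.
    return map { $base + ( $_ < $remainder ? 1 : 0 ) } 0 .. $groups - 1;
}

my @sizes = partition_sizes( 12, 5 );
print join( q{, }, @sizes ), "\n";    # 3, 3, 2, 2, 2
```

For 12 files and 5 groups this gives the same group sizes as the example structure shown earlier.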

At this point, it would seem at first blush to be a relatively easy thing to open each intended output file, loop through its list of input files using Text::CSV to read them line by line (skipping the first line of each), and write the lines to the output file using an IO::Compress::Gzip file handle and Text::CSV's print() method.

This avoids writing the temporary file, or having to add a marker to avoid splitting lines from an input file when writing the subfiles.

Thoughts?

Code implementing the above process:

#!/usr/bin/perl

use strict;
use warnings;

use Cwd;
use Data::Dumper;
use Getopt::Long;
use IO::Compress::Gzip qw( $GzipError );
use Text::CSV;

$Data::Dumper::Deepcopy = 1;
$Data::Dumper::Sortkeys = 1;

$| = 1;

srand();

my $output_files = 5;
my $outfile_name = $0 . q{.csv};
my $path         = q{./};

$outfile_name =~ s/\.pl.*$//g;

GetOptions(
    q{help} => sub {
        &help(
            output_files => $output_files,
            outfile_name => $outfile_name,
            path         => $path,
        );
    },
    q{output_files:i} => \$output_files,
    q{outfile_name:s} => \$outfile_name,
    q{path:s}         => \$path,
);

my $start_dir = getcwd;

if ( !-d $path ) {
    die qq{Directory $path not found: $!\n};
}

my @file = get_files( path => $path, );

my @set = partition_files(
    files => \@file,
    n     => $output_files,
);

write_subfiles(
    set    => \@set,
    prefix => $outfile_name,
);

#
# Subroutines
#
sub help {
    my ( %param, ) = @_;

    print sprintf <<HELP_TEXT, $param{outfile_name}, $param{output_files}, $param{path};
Usage:
  $0
  $0 [--help]
  $0 [--output_files N] [--outfile_name str] [--path str]

Where:
  outfile_name str - Output filename prefix
    (naming will be {prefix}-nn.csv.gz; default: %s).
  output_files N - Divide data into at most N files
    (data in the same input file will appear in the
    same file; default: %d).
  path str - Path to process (default: %s).
HELP_TEXT
    exit;
}

sub get_files {
    my ( %param, ) = @_;
    my @file = ();

    if ( !exists $param{path} ) {
        return @file;
    }

    opendir my $dir, $param{path} or die $!;
    while ( my $fn = readdir($dir) ) {
        next if ( $fn =~ m/^\.{1,2}$/ );    # skip . and ..
        next unless ( $fn =~ m/\.csv$/i );

        # Keep the directory with the name so that -s and open()
        # also work when --path is not the current directory.
        push @file, qq{$param{path}/$fn};
    }
    closedir $dir;

    # Order by ascending file size.
    @file = sort { -s $a <=> -s $b } @file;
    return @file;
}

sub partition_files {
    my (%param) = @_;
    my @set;
    my $file_count;
    my $partition_size;
    my $remainder;
    my $n    = $param{n};
    my @file = @{ $param{files} };

    $file_count = scalar @file;    # 12
    if ( $file_count >= $n ) {
        $partition_size = int( $file_count / $n );    # 2
        $remainder      = $file_count % $n;           # 2
    }
    else {
        $partition_size = 1;
        $remainder      = 0;
    }

    my $i = 0;
    while ( scalar @file ) {
        foreach my $j ( 1 .. $partition_size ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        if ( $i < $remainder ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        $i++;
    }
    return @set;
}

sub write_subfiles {
    my (%param) = @_;
    my @set    = @{ $param{set} };
    my $prefix = $param{prefix};

    # Nothing to write if no input files were found
    # (log(0) below would otherwise die).
    return if !@set;

    my $name_format =
        $prefix . q{-} . q{%0}
      . int( log( scalar @set ) / log(10) + 1 + 1 )
      . q{d} . q{.csv} . q{.gz};

    my $csv = Text::CSV->new(
        { binary => 1, auto_diag => 1, eol => $/, } );

    foreach my $i ( 0 .. $#set ) {
        my $fn = sprintf $name_format, $i;
        my $z  = new IO::Compress::Gzip $fn,
          -Level => IO::Compress::Gzip::Z_BEST_COMPRESSION,
          or die qq{IO::Compress::Gzip failed: $GzipError\n};

        foreach my $ifn ( @{ $set[$i] } ) {
            my $flag = 1;
            open my $ifh, q{<:encoding(utf8)}, $ifn
              or die qq{$ifn: $!};
            while ( my $row = $csv->getline($ifh) ) {

                # Skip the first line of each input file.
                if ($flag) {
                    $flag--;
                    next;
                }
                my $status = $csv->print( $z, $row, );
                $row = undef;
            }
            close $ifh;
        }
        $z->close;
    }
}
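
One way to spot-check an output file is to read it back with IO::Uncompress::Gunzip, the core counterpart of IO::Compress::Gzip. The sketch below is not part of the script above: the sub name `count_gz_lines` and the file name `example-00.csv.gz` are hypothetical, and it counts physical lines, which matches the record count only if no CSV field contains an embedded newline.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw( $GzipError );
use IO::Uncompress::Gunzip qw( $GunzipError );

# Hypothetical helper: count the physical lines in a gzip-compressed file.
sub count_gz_lines {
    my ($fn) = @_;
    my $z = IO::Uncompress::Gunzip->new($fn)
        or die qq{IO::Uncompress::Gunzip failed: $GunzipError\n};
    my $count = 0;
    while ( defined( my $line = $z->getline() ) ) {
        $count++;
    }
    $z->close();
    return $count;
}

# Build a small sample file so the example runs on its own.
my $sample = q{example-00.csv.gz};    # hypothetical file name
my $out    = IO::Compress::Gzip->new($sample)
    or die qq{IO::Compress::Gzip failed: $GzipError\n};
$out->print(qq{1,alpha\n2,beta\n3,gamma\n});
$out->close();

print count_gz_lines($sample), "\n";    # 3
unlink $sample;
```

Because the script drops the first line of every input file, the line count of an output file should equal the combined line count of its input files minus the number of those files, under the same no-embedded-newline assumption.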

2019-08-13: Edited for case of fewer files than requested partitions (will create only as many partitions as files exist).

2019-08-13: Added code implementing the described process.

2019-08-13: Reformatted added code using perltidy -l 60 -ple.


In reply to Re: Complex file manipulation challenge by atcroft
in thread Complex file manipulation challenge by jdporter
