
My code may not be as elegant as others', and my approach, while attempting to follow the spirit of the guidelines, would definitely not follow their letter.

Knowing that I would generate N files, I would retrieve the ordered list from step 1. At that point, I would create an AoA (array of arrays) and push each file name into the appropriate sub-array. Given 12 files of ascending size and a target of 5 output files, for example, I would create the following:

    @set = (
                [ 'file00.csv', 'file01.csv', 'file02.csv', ],
                [ 'file03.csv', 'file04.csv', 'file05.csv', ],
                [ 'file06.csv', 'file07.csv', ],
                [ 'file08.csv', 'file09.csv', ],
                [ 'file10.csv', 'file11.csv', ],
            );

The partitioning would be accomplished by a loop similar to the following:

    # my $n              = 5;
    my @set;
    my $file_count;
    my $partition_size;
    my $remainder;

    $file_count     = scalar @file;                   # 12
    if ( $file_count >= $n ) {
        $partition_size = int( $file_count / $n );    # 2
        $remainder      = $file_count % $n;           # 2
    }
    else {
        $partition_size = 1;
        $remainder = 0;
    }
    my $i = 0;
    while ( scalar @file ) {
        foreach my $j ( 1 .. $partition_size ) {
            my $fn = shift @file;
            push @{$set[$i]}, $fn;
        }
        if ( $i < $remainder ) {
            my $fn = shift @file;
            push @{$set[$i]}, $fn;
        }
        $i++;
    }
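For comparison, here is a minimal sketch of the same partitioning arithmetic written with splice instead of repeated shift. The helper name partition_list and the generated file names are hypothetical (they are not part of the code in this post); the sketch assumes the list is already in the desired order and caps the group count at the number of items, so no empty groups are created.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post): split a list into at most $n
# contiguous groups whose sizes differ by no more than one.
sub partition_list {
    my ( $n, @item ) = @_;

    return if $n < 1 || !@item;
    $n = @item if $n > @item;    # never create empty groups

    my $base  = int( @item / $n );    # minimum group size
    my $extra = @item % $n;           # the first $extra groups get one more

    my @group;
    foreach my $i ( 0 .. $n - 1 ) {
        my $size = $base + ( $i < $extra ? 1 : 0 );
        push @group, [ splice @item, 0, $size ];
    }
    return @group;
}

# Illustrative input: 12 made-up names, 5 requested groups.
my @name  = map { sprintf 'file%02d.csv', $_ } 0 .. 11;
my @group = partition_list( 5, @name );
print scalar( @{$_} ), q{: }, join( q{ }, @{$_} ), qq{\n} for @group;
```

With 12 names and 5 groups, the base size is 2 and the remainder is 2, so the first two groups hold three names each and the remaining three hold two each, matching the layout shown earlier.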

At this point, it seems at first blush relatively easy to open each intended output file, loop through its list of input files using Text::CSV to read them line by line (skipping the first line of each), and write the rows to the output file using an IO::Compress::Gzip file handle and Text::CSV's print() method.

This avoids writing the temporary file, or having to add a marker to avoid splitting lines from an input file when writing the subfiles.
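As a rough illustration of the idea (not the approach described here in full), the sketch below streams two in-memory stand-ins for input files into a single gzip stream, dropping each header line, and then decompresses the result to check it. It deliberately uses plain line reads rather than Text::CSV, so it assumes no field contains an embedded newline; the sample data and variable names are made up for the example.

```perl
use strict;
use warnings;

use IO::Compress::Gzip qw( $GzipError );
use IO::Uncompress::Gunzip qw( gunzip $GunzipError );

# Hypothetical in-memory stand-ins for two input files.
my @input = (
    "id,name\n1,alpha\n2,beta\n",
    "id,name\n3,gamma\n",
);

# Compress straight into a scalar; no temporary file is written.
my $gz_data;
my $z = IO::Compress::Gzip->new( \$gz_data )
  or die qq{IO::Compress::Gzip failed: $GzipError\n};

foreach my $content (@input) {
    open my $ifh, q{<}, \$content or die qq{in-memory open: $!};
    my $header = <$ifh>;    # read and discard the header line
    while ( my $line = <$ifh> ) {
        $z->print($line);
    }
    close $ifh;
}
$z->close;

# Decompress again to confirm what was written.
my $plain;
gunzip( \$gz_data => \$plain )
  or die qq{gunzip failed: $GunzipError\n};
print $plain;
```

Because each input is consumed whole before the next one is opened, rows from one input always land in the same output stream, which is the property that makes the marker unnecessary.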

Thoughts?

Code implementing the above process:

#!/usr/bin/perl

use strict;
use warnings;

use Cwd;
use Data::Dumper;
use Getopt::Long;
use IO::Compress::Gzip qw( $GzipError );
use Text::CSV;

$Data::Dumper::Deepcopy = 1;
$Data::Dumper::Sortkeys = 1;

$| = 1;
srand();

my $output_files = 5;
my $outfile_name = $0 . q{.csv};
my $path         = q{./};

$outfile_name =~ s/\.pl.*$//g;

GetOptions(
    q{help} => sub {
        &help(
            output_files => $output_files,
            outfile_name => $outfile_name,
            path         => $path,
        );
    },
    q{output_files:i} => \$output_files,
    q{outfile_name:s} => \$outfile_name,
    q{path:s}         => \$path,
);

my $start_dir = getcwd;

if ( !-d $path ) {
    die qq{Directory $path not found: $!\n};
}

my @file = get_files( path => $path, );
my @set =
  partition_files( files => \@file, n => $output_files, );
write_subfiles( set => \@set, prefix => $outfile_name, );

#
# Subroutines
#
sub help {
    my ( %param, ) = @_;

    print sprintf
      <<HELP_TEXT, $param{outfile_name}, $param{output_files}, $param{path};

Usage:
        $0
        $0 [--help]
        $0 [--output_files N] [--outfile_name str] [--path str]

Where:
    outfile_name str       - Output filename prefix
                               (naming will be {prefix}-nn.csv.gz;
                               default: %s).
    output_files N         - Divide data into at most N files
                               (data in the same input file
                               will appear in the same file;
                               default: %d).
    path str               - Path to process
                               (default: %s).

HELP_TEXT
    exit;
}

sub get_files {
    my ( %param, ) = @_;

    my @file = ();

    if ( !exists $param{path} ) {
        return @file;
    }

    opendir my $dir, $param{path} or die $!;
    while ( my $fn = readdir($dir) ) {
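        next if ( $fn =~ m/^\.{1,2}$/ );    # skip . and ..
        next unless ( $fn =~ m/\.csv$/i );

        # Keep the directory with the name so that the -s size test
        # and the later open() work when --path is not the cwd.
        push @file, qq{$param{path}/$fn};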
    }
    closedir $dir;

    @file = sort { -s $a <=> -s $b } @file;

    return @file;
}

sub partition_files {
    my (%param) = @_;

    my @set;

    my $file_count;
    my $partition_size;
    my $remainder;

    my $n    = $param{n};
    my @file = @{ $param{files} };

    $file_count = scalar @file;    # 12
    if ( $file_count >= $n ) {
        $partition_size = int( $file_count / $n );    # 2
        $remainder      = $file_count % $n;           # 2
    }
    else {
        $partition_size = 1;
        $remainder      = 0;
    }
    my $i = 0;
    while ( scalar @file ) {

        foreach my $j ( 1 .. $partition_size ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        if ( $i < $remainder ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        $i++;
    }

    return @set;
}

sub write_subfiles {
    my (%param) = @_;

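    my @set    = @{ $param{set} };
    my $prefix = $param{prefix};

    # Nothing to write if no input files were found
    # (log(0) in the name format below would otherwise die).
    return if !@set;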

    my $name_format =
        $prefix . q{-} . q{%0}
      . int( log( scalar @set ) / log(10) + 1 + 1 ) . q{d}
      . q{.csv} . q{.gz};

    my $csv =
      Text::CSV->new(
        { binary => 1, auto_diag => 1, eol => $/, } );

    foreach my $i ( 0 .. $#set ) {
        my $fn = sprintf $name_format, $i;

        my $z = new IO::Compress::Gzip $fn,
          -Level => IO::Compress::Gzip::Z_BEST_COMPRESSION,
          or die
          qq{IO::Compress::Gzip failed: $GzipError\n};

        foreach my $ifn ( @{ $set[$i] } ) {
            my $flag = 1;
            open my $ifh, q{<:encoding(utf8)}, $ifn
              or die qq{$ifn: $!};
            while ( my $row = $csv->getline($ifh) ) {
                if ($flag) {
                    $flag--;
                    next;
                }
                my $status = $csv->print( $z, $row, );
                $row = undef;
            }
            close $ifh;
        }
        $z->close;
    }
}

2019-08-13: Edited for case of fewer files than requested partitions (will create only as many partitions as files exist).

2019-08-13: Added code implementing the described process.

2019-08-13: Reformatted added code using perltidy -l 60 -ple.

In reply to Re: Complex file manipulation challenge by atcroft
in thread Complex file manipulation challenge by jdporter
