My code may not be as elegant as others', and my approach, while attempting to follow the spirit of the guidelines, definitely does not follow the letter of them.
Knowing that I would generate N files, I would retrieve the ordered list from step 1. At that point, I would create an AoA (array of arrays) into which I would push the appropriate file names. Given 12 files of ascending size and a target of 5 output files, for example, I would create the following:

@set = (
    [ 'file00.csv', 'file01.csv', 'file02.csv', ],
    [ 'file03.csv', 'file04.csv', 'file05.csv', ],
    [ 'file06.csv', 'file07.csv', ],
    [ 'file08.csv', 'file09.csv', ],
    [ 'file10.csv', 'file11.csv', ],
);

The partitioning would be accomplished by a loop similar to the following:

# @file holds the size-ordered list of input file names.
my $n = 5;    # number of output files requested
my @set;
my $file_count;
my $partition_size;
my $remainder;

$file_count = scalar @file;    # 12
if ( $file_count >= $n ) {
    $partition_size = int( $file_count / $n );    # 2
    $remainder      = $file_count % $n;           # 2
}
else {
    # Fewer files than partitions: one file per partition.
    $partition_size = 1;
    $remainder      = 0;
}

my $i = 0;
while ( scalar @file ) {
    foreach my $j ( 1 .. $partition_size ) {
        my $fn = shift @file;
        push @{ $set[$i] }, $fn;
    }
    # The first $remainder partitions each take one extra file.
    if ( $i < $remainder ) {
        my $fn = shift @file;
        push @{ $set[$i] }, $fn;
    }
    $i++;
}

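As a quick check of the arithmetic above, here is a minimal, self-contained sketch of the same idea written with splice instead of repeated shift. The sub name partition_list and the generated file names are hypothetical, used only for illustration; it is not the code used elsewhere in this post.

```perl
use strict;
use warnings;

# Split a list of names into at most $n consecutive groups whose
# sizes differ by at most one; earlier groups get the extra items.
# (partition_list is a hypothetical helper, not part of the script.)
sub partition_list {
    my ( $n, @names ) = @_;
    my @groups;
    return @groups if $n < 1 || !@names;
    $n = @names if $n > @names;    # never create empty groups
    my $base  = int( @names / $n );
    my $extra = @names % $n;
    foreach my $i ( 0 .. $n - 1 ) {
        my $size = $base + ( $i < $extra ? 1 : 0 );
        push @groups, [ splice @names, 0, $size ];
    }
    return @groups;
}

# Twelve made-up file names, split five ways.
my @names  = map { sprintf 'file%02d.csv', $_ } 0 .. 11;
my @groups = partition_list( 5, @names );
print join( q{ }, map { scalar @{$_} } @groups ), "\n";    # 3 3 2 2 2
```

With 12 names and a target of 5, the group sizes come out as 3, 3, 2, 2, 2, matching the AoA shown earlier.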
At this point it would seem, at first blush, relatively easy to open each intended output file, loop through its list of input files using Text::CSV to read them row by row (skipping each file's first line), and write the rows to the output file using an IO::Compress::Gzip file handle and Text::CSV's print() method.
This avoids writing the temporary file, or having to add a marker to avoid splitting lines from an input file when writing the subfiles.
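To illustrate the streaming idea in isolation, here is a small sketch that uses only core modules and in-memory data. It is a simplification under stated assumptions: the two input strings are made up, the sub name concat_without_headers is hypothetical, and it reads plain lines rather than using Text::CSV, so it would mishandle fields containing embedded newlines.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw( $GzipError );
use IO::Uncompress::Gunzip qw( gunzip $GunzipError );

# Hypothetical in-memory inputs standing in for two CSV files.
my @inputs = (
    "id,name\n1,alpha\n2,beta\n",
    "id,name\n3,gamma\n",
);

# Append the data lines of each input to one gzip stream,
# discarding the first (header) line of every input.
sub concat_without_headers {
    my (@sources) = @_;
    my $compressed = q{};
    my $z = IO::Compress::Gzip->new( \$compressed )
        or die qq{IO::Compress::Gzip failed: $GzipError\n};
    foreach my $source (@sources) {
        open my $ifh, q{<}, \$source
            or die qq{Cannot open input: $!\n};
        my $header = <$ifh>;    # read and drop the header line
        while ( my $line = <$ifh> ) {
            $z->print($line);
        }
        close $ifh;
    }
    $z->close;
    return $compressed;
}

# Decompress again only to show what was written.
my $gz = concat_without_headers(@inputs);
gunzip( \$gz => \my $plain )
    or die qq{gunzip failed: $GunzipError\n};
print $plain;
```

Each input is read once and written straight to the compressed stream, so no intermediate file is needed. For real CSV data, the getline()/print() pairing from Text::CSV described above is the safer choice.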
Thoughts?
Code implementing the above process:

#!/usr/bin/perl

use strict;
use warnings;

use Cwd;
use Data::Dumper;
use File::Spec;
use Getopt::Long;
use IO::Compress::Gzip qw( $GzipError );
use Text::CSV;

$Data::Dumper::Deepcopy = 1;
$Data::Dumper::Sortkeys = 1;

$| = 1;

srand();

my $output_files = 5;
my $outfile_name = $0 . q{.csv};
my $path         = q{./};

# Default prefix: the script name with its .pl suffix removed.
$outfile_name =~ s/\.pl.*$//g;

GetOptions(
    q{help} => sub {
        help(
            output_files => $output_files,
            outfile_name => $outfile_name,
            path         => $path,
        );
    },
    q{output_files:i} => \$output_files,
    q{outfile_name:s} => \$outfile_name,
    q{path:s}         => \$path,
);

my $start_dir = getcwd;

if ( !-d $path ) {
    die qq{Directory $path not found: $!\n};
}

my @file = get_files( path => $path, );

my @set = partition_files(
    files => \@file,
    n     => $output_files,
);

write_subfiles(
    set    => \@set,
    prefix => $outfile_name,
);

#
# Subroutines
#

# Print usage text (with current defaults) and exit.
sub help {
    my ( %param, ) = @_;
    printf <<HELP_TEXT, $param{outfile_name}, $param{output_files}, $param{path};
Usage:
  $0
  $0 [--help]
  $0 [--output_files N] [--outfile_name str] [--path str]

Where:
  outfile_name str - Output filename prefix (naming will be
                     {prefix}-nn.csv.gz; default: %s).
  output_files N   - Divide data into at most N files (data
                     in the same input file will appear in
                     the same file; default: %d).
  path str         - Path to process (default: %s).

HELP_TEXT
    exit;
}

# Return the .csv files under the given path, smallest first.
sub get_files {
    my ( %param, ) = @_;
    my @file = ();
    if ( !exists $param{path} ) { return @file; }

    opendir my $dir, $param{path} or die $!;
    while ( my $fn = readdir($dir) ) {
        next if ( $fn =~ m/^\.{1,2}$/ );    # skip . and ..
        next unless ( $fn =~ m/\.csv$/i );

        # Keep the directory part so that -s and open also work
        # when --path is not the current directory.
        push @file, File::Spec->catfile( $param{path}, $fn );
    }
    closedir $dir;

    @file = sort { -s $a <=> -s $b } @file;
    return @file;
}

# Split the ordered file list into at most n consecutive sets.
sub partition_files {
    my (%param) = @_;
    my @set;
    my $file_count;
    my $partition_size;
    my $remainder;
    my $n    = $param{n};
    my @file = @{ $param{files} };

    $file_count = scalar @file;    # 12
    if ( $file_count >= $n ) {
        $partition_size = int( $file_count / $n );    # 2
        $remainder      = $file_count % $n;           # 2
    }
    else {
        $partition_size = 1;
        $remainder      = 0;
    }

    my $i = 0;
    while ( scalar @file ) {
        foreach my $j ( 1 .. $partition_size ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        if ( $i < $remainder ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        $i++;
    }
    return @set;
}

# Write one gzipped CSV per set, dropping each input's header row.
sub write_subfiles {
    my (%param) = @_;

    my @set    = @{ $param{set} };
    my $prefix = $param{prefix};

    return if !@set;    # nothing to write; also avoids log(0)

    # Zero-padded index, one digit wider than strictly needed.
    my $name_format =
        $prefix . q{-} . q{%0}
      . int( log( scalar @set ) / log(10) + 1 + 1 )
      . q{d} . q{.csv} . q{.gz};

    my $csv = Text::CSV->new(
        { binary => 1, auto_diag => 1, eol => $/, } );

    foreach my $i ( 0 .. $#set ) {
        my $fn = sprintf $name_format, $i;
        my $z  = IO::Compress::Gzip->new(
            $fn,
            -Level => IO::Compress::Gzip::Z_BEST_COMPRESSION,
        ) or die qq{IO::Compress::Gzip failed: $GzipError\n};

        foreach my $ifn ( @{ $set[$i] } ) {
            my $flag = 1;
            open my $ifh, q{<:encoding(utf8)}, $ifn
              or die qq{$ifn: $!};
            while ( my $row = $csv->getline($ifh) ) {
                if ($flag) {    # skip the header row
                    $flag--;
                    next;
                }
                my $status = $csv->print( $z, $row, );
                $row = undef;
            }
            close $ifh;
        }
        $z->close;
    }
}

2019-08-13: Edited for case of fewer files than requested partitions (will create only as many partitions as files exist).
2019-08-13: Added code implementing the described process.
2019-08-13: Reformatted added code using perltidy -l 60 -ple.
In reply to Re: Complex file manipulation challenge
by atcroft
in thread Complex file manipulation challenge
by jdporter