http://qs1969.pair.com?node_id=11104399

jdporter has asked for the wisdom of the Perl Monks concerning the following question:

One of the packages in Scaladex (Scala's equivalent of CPAN, I guess) states its motivation in the form of the following problem:

  1. List all .csv files in a directory by increasing order of file size
  2. Drop the first line of each file and concat the rest into a single output file
  3. Split the above output file into n smaller files without breaking up the lines in the input files
  4. gzip each of the smaller output files

How would you do this in Perl?

I reckon we are the only monastery ever to have a dungeon stuffed with 16,000 zombies.

Replies are listed 'Best First'.
Re: Complex file manipulation challenge
by haukex (Archbishop) on Aug 13, 2019 at 17:52 UTC
    #!/usr/bin/env perl
    use warnings;
    use strict;
    use Path::Class qw/dir/;
    use IO::Compress::Gzip qw/Z_BEST_COMPRESSION/;
    use Text::CSV;    # also install Text::CSV_XS for speed

    my $VERBOSE = 1;
    # Note: Currently using a fixed pattern for the output file names

    die "Usage: $0 DIR LINES\n" unless @ARGV==2;
    my $DIR = dir($ARGV[0]);
    die "Bad DIR '$DIR'" unless -d $DIR;
    my $LINES = $ARGV[1];
    $LINES =~ /\A(?!0)[0-9]+\z/ or die "Bad LINES '$LINES'\n";

    my @files = map { $$_[0] } sort { $$a[1]<=>$$b[1] } map { [$_,-s $_] }
        grep { !$_->is_dir && $_->basename=~/\.csv\z/i } $DIR->children;

    my $csv = Text::CSV->new({ binary=>1, auto_diag=>2, eol=>$/ });
    my ($ofcnt,$ofln,$ofh) = (0,0);
    for my $infile (@files) {
        print STDERR "Reading $infile\n" if $VERBOSE;
        my $ifh = $infile->openr;
        $csv->getline($ifh);  # drop first line
        while ( my $row = $csv->getline($ifh) ) {
            if (!defined $ofh) {
                my $outfile = $DIR->file("part-".(++$ofcnt).".csv.gz");
                if ( -e $outfile ) { warn "Warning: Overwriting $outfile\n" }
                else { print STDERR "Writing $outfile\n" if $VERBOSE }
                $ofh = IO::Compress::Gzip->new( $outfile->openw,
                    Level => Z_BEST_COMPRESSION )
                    or die "$outfile: $IO::Compress::Gzip::GzipError\n";
                $ofln = 0;
            }
            $csv->print($ofh, $row);
            if ( ++$ofln >= $LINES ) { $ofh->close; $ofh=undef }
        }
        $csv->eof or $csv->error_diag;
    }
    $ofh->close if defined $ofh;
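
    A hypothetical invocation (the script name is made up; DIR and LINES are the two required arguments) might look like

        perl csv-split.pl ./data 250000

    which would read every .csv file in ./data, smallest first, drop each header line, and write part-1.csv.gz, part-2.csv.gz, ... of at most 250000 data rows each into that same directory.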
Re: Complex file manipulation challenge
by Tux (Canon) on Aug 13, 2019 at 17:51 UTC

    Untested (assuming 2500 lines as cut-off)

    1. my @csv_files = map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [ $_, -s ] } glob "*.csv";
    2. my @csv = map { @{csv (in => $_, headers => "skip")} } @csv_files;
    3. use Text::CSV_XS qw( csv );
       use PerlIO::via::gzip;
       my $n = "0000";
       while (@csv) {
    4.     open my $fh, ">:via(gzip)", "new".$n++.".csv.gz";
           csv (in => [ splice @csv, 0, (@csv > 2500 ? 2500 : scalar @csv) ], out => $fh);
       }

    Enjoy, Have FUN! H.Merijn

      Nice! Note that the original problem statement includes this:

      Note: Your program should work when files are much bigger than memory in your JVM and must close all open resources correctly

        I did not read the original problem statement :)

        csv (in => $fh, out => undef, on_in => sub { ... }); supports streaming and does not store in memory (other than the current record). Rewriting my version to do that can be an exercise for the reader.

        In preparation I found that PerlIO::via::gzip *only* supports open my $fh, ">:via(gzip)", "file.gz"; and *not* open my $fh, ">", "file.gz"; binmode $fh, ":via(gzip)"; :( :(
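
        An untested sketch of that streaming variant (it writes with IO::Compress::Gzip rather than PerlIO::via::gzip because of the limitation just mentioned; @csv_files and the 2500-line cut-off are taken from the reply above, and the new%04d.csv.gz names are made up):

        use Text::CSV_XS qw( csv );
        use IO::Compress::Gzip qw( $GzipError );

        my $limit = 2500;
        my $ocsv  = Text::CSV_XS->new ({ binary => 1, eol => "\n" });
        my ($part, $count, $out) = (0, 0, undef);
        for my $file (@csv_files) {            # size-sorted list from step 1
            csv (in      => $file,
                 headers => "skip",            # drop the first line of each file
                 out     => undef,             # stream; keep only the current record
                 on_in   => sub {
                     my (undef, $row) = @_;
                     if (!defined $out) {
                         $out = IO::Compress::Gzip->new (sprintf ("new%04d.csv.gz", $part++))
                             or die $GzipError;
                         $count = 0;
                         }
                     $ocsv->print ($out, $row);
                     if (++$count >= $limit) { $out->close; $out = undef }
                     },
                 );
            }
        $out->close if defined $out;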


        Enjoy, Have FUN! H.Merijn
Re: Complex file manipulation challenge
by atcroft (Abbot) on Aug 13, 2019 at 19:20 UTC

    My code may not be as elegant as others', and my approach, while attempting to follow the spirit of the guidelines, definitely does not follow them to the letter.

    Knowing that I would generate N files, I would retrieve the ordered list from step 1. At that point, I would create an AoA into which I would push the appropriate file names. Given 12 files of ascending size and a target of 5 output files, for example, I would create the following:

    @set = (
        [ 'file00.csv', 'file01.csv', 'file02.csv', ],
        [ 'file03.csv', 'file04.csv', 'file05.csv', ],
        [ 'file06.csv', 'file07.csv', ],
        [ 'file08.csv', 'file09.csv', ],
        [ 'file10.csv', 'file11.csv', ],
    );
    The partitioning would be accomplished by a loop similar to the following:
    # my $n = 5;
    my @set;
    my $file_count;
    my $partition_size;
    my $remainder;
    $file_count = scalar @file;    # 12
    if ( $file_count >= $n ) {
        $partition_size = int( $file_count / $n );    # 2
        $remainder      = $file_count % $n;           # 2
    }
    else {
        $partition_size = 1;
        $remainder      = 0;
    }
    my $i = 0;
    while ( scalar @file ) {
        foreach my $j ( 1 .. $partition_size ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        if ( $i < $remainder ) {
            my $fn = shift @file;
            push @{ $set[$i] }, $fn;
        }
        $i++;
    }

    At this point, it would seem at first blush to be a relatively easy thing to open the intended output file, loop through its list of input files using Text::CSV to read them line by line (skipping the first line of each), and write the lines to the output file using an IO::Compress::Gzip file handle and Text::CSV's print() method.

    This avoids writing the temporary file, or having to add a marker to avoid splitting lines from an input file when writing the subfiles.
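
    A minimal, untested sketch of that loop (the write_partition helper and the part*.csv.gz names are hypothetical; @set is the AoA built above):

    use Text::CSV;
    use IO::Compress::Gzip qw( $GzipError );

    sub write_partition {
        my ( $outname, @infiles ) = @_;
        my $csv = Text::CSV->new( { binary => 1, auto_diag => 1, eol => "\n" } );
        my $z   = IO::Compress::Gzip->new($outname) or die $GzipError;
        for my $name (@infiles) {
            open my $in, '<', $name or die "$name: $!";
            $csv->getline($in);               # skip the first (header) line
            while ( my $row = $csv->getline($in) ) {
                $csv->print( $z, $row );      # write straight to the gzip handle
            }
            close $in;
        }
        $z->close;
    }

    # e.g. write_partition( "part" . ( $_ + 1 ) . ".csv.gz", @{ $set[$_] } ) for 0 .. $#set;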

    Thoughts?

    Code implementing the above process:

    2019-08-13: Edited for case of fewer files than requested partitions (will create only as many partitions as files exist).

    2019-08-13: Added code implementing the described process.

    2019-08-13: Reformatted added code using perltidy -l 60 -ple.

Re: Complex file manipulation challenge
by Marshall (Canon) on Aug 13, 2019 at 20:10 UTC
    Update: I got a negative vote for questioning the purpose of the OP's question. My bad. However, if the question is modified to reflect a more realistic scenario, there would be some interesting answers applicable in the real world.

    I guess this is a Golf question?
    I was working on a solution until I got to step 3 and realized that this requirement makes no sense.
    It is so silly that I can't imagine a real-world use for it!

    A more real-world scenario might be having to write a humongous amount of data to a multi-CD data set where no single file can span a CD boundary. I saw this sort of requirement back in the floppy-disk days: when loading, say, 20 diskettes in a data set, you want to keep going even if diskette #5 has a fatal error. At the end, say 19 diskettes loaded and one didn't; now you can get one replacement diskette and patch the system with that single diskette in a straightforward way.

    I have no idea of a practical use for this requirement.
    Here is where I stopped:
    BTW, I see no need to parse the .csv file. My gosh, I am unaware of any CSV file that is \n field delimited - what that would mean boggles my mind and would result in some confused display with a text editor.

    #!/usr/bin/perl
    use strict;
    use warnings;
    # node: 11104399

    use constant DIR     => ".";    # set these as needed...
    use constant N_FILES => 3;

    # step 1: List all .csv files in a directory by increasing order of file size
    my @files_by_size = sort { -s $a <=> -s $b } glob( DIR . "/*.csv" );
    print join( "\n", @files_by_size ), "\n";

    # step 2: Drop the first line of each file and concat the rest into a single output file
    open OUT, '>', "BigFile" or die "...blah..$!";
    foreach my $file (@files_by_size) {
        open my $in, '<', $file or die "unable to open $file $!";
        <$in>;                      # throw away first line of file
        print OUT while <$in>;
    }
    close OUT;    # $in already closed...

    # step 3: Split the above output file into "n" smaller files
    #         without breaking up the lines in the input files
    #
    # This is a strange requirement! A VERY strange requirement!
    # The obvious thing to do is to make n-1 "big" files and throw
    # what is leftover into the nth file (which will be very small).
    #
    # The tricky part here is to make sure that at least one line
    # winds up in the nth file. Now that I think about it...
    #
    # geez, if n==3 and outsize = total bytes,
    #   create file1 and write lines >= total_bytes/2 to it,
    #   write one line to file 2,
    #   write the rest of the lines to file 3.

    my $big_files = N_FILES - 1;

    # stopped at this point because this sounds like a Golf situation
    # with a very contrived setup and I'm not good at that.

    # step 4: this is easy
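
    For what it's worth, an untested sketch of one way to finish step 3 (reading the requirement as "never split a single line across two output parts"): split BigFile into N_FILES chunks of roughly equal byte size, switching to the next part only at a line boundary. The part*.txt output names are made up.

    my $total  = -s "BigFile";
    my $target = int( $total / N_FILES ) + 1;
    open my $big, '<', "BigFile" or die "BigFile: $!";
    my ( $part, $written, $out ) = ( 0, 0, undef );
    while ( my $line = <$big> ) {
        # start a new part once the current one has reached its share,
        # but never create more than N_FILES parts
        if ( !defined $out or ( $written >= $target and $part < N_FILES ) ) {
            close $out if defined $out;
            $part++;
            open $out, '>', "part$part.txt" or die "part$part.txt: $!";
            $written = 0;
        }
        print {$out} $line;
        $written += length $line;
    }
    close $out if defined $out;
    close $big;
    # step 4 would then just gzip part1.txt .. part$part.txt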

      I am unaware of any CSV file that is \n field delimited

      Neither am I, but I have handled CSV files with embedded newlines in quoted fields. Usually these are exported from a spreadsheet program.
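
      A tiny illustration with made-up data: the record below is a single CSV row, yet it spans two physical lines, so a naive line-by-line read would split it in the middle of a field.

      use Text::CSV_XS;
      my $csv  = Text::CSV_XS->new( { binary => 1 } );
      my $data = qq{1,"12 Main St\nApt 4",Anytown\n};
      open my $fh, '<', \$data or die $!;
      my $row = $csv->getline($fh);
      print scalar @$row, " fields\n";    # 3 fields; the second contains a newline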

        That is indeed a good point++!

        In Excel, there is some kind of formatting option to wrap a line onto another line depending upon the column width. There may be some kind of option to insert a GUI line break that doesn't appear in the CSV (maybe CTL-Enter)? Not sure that is possible.

        However, you are quite correct in that multiple lines within a column is something to be considered -- think about a single field for an address instead of multiple columns for each line of the address.

        All of the CSV files that I currently work with containing addresses are | delimited, have separate columns for each potential line of the address and disallow the | char within an address. So a bit of tunnel vision on my part! Sorry!

        You are quite correct to point out this possibility.

        BTW: I've seen CSV files with 512 or 1024 fields. These things can have humongous line lengths. Perl is very good at getting me the dozen or so fields that I care about.