hiptoss has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm trying to determine whether or not Perl can give me an equivalent to the command-line utility split when dealing with very large files (where "very large" means roughly 10-20 GB).

I know this has been asked before, and I've tried some of the suggested solutions, only to have Perl run out of memory. The latest example code I've used is below. Is there a better way to handle larger files, or should I simply stick with split?

In case it matters, I'm running this code on a machine with 16 GB of memory and testing with a 15 GB file.

#!/usr/bin/perl -w

use strict;
use warnings;

my $parts = shift; ### how many parts to split
my @file = @ARGV;  ### the files to split

foreach ( @file ) {

    ### how big should the new file be?
    my $size = (-s) / $parts;

    ### open the input file
    open my $in_fh, '<', $_ or warn "Cannot read $_: $!";
    binmode $in_fh;

    ### for all but the last part, read
    ### the amount of data, then write it to
    ### the appropriate output file.
    for my $part (1 .. $parts - 1) {

        ### read an output file worth of data
        read $in_fh, my $buffer, $size or warn "Read zero bytes from $_: $!";

        ### write the output file
        open my $fh, "> $_$part" or warn "Cannot write to $_$part: $!";
        print $fh $buffer;
    }

    # for the last part, read the rest of
    # the file. Buffer will shrink
    # to the actual bytes read.
    read $in_fh, my $buffer, -s or warn "Read zero bytes from $_: $!";
    open my $fh, "> $_$parts" or warn "Cannot write to $_$parts: $!";
    print $fh $buffer;
}
EDIT: After implementing SuicideJunkie's suggestion, I am still getting an out-of-memory error. Interestingly, however, all the part files seem to be created, and when I cat them back together, the reassembled file has the same md5 as the original:

(root@sw178) zs3 > du -sh lolz.dmg
16G     lolz.dmg
(root@sw178) zs3 > time ./z3-perl.pl lolz.dmg
 Out of memory!

real    11m10.656s
user    0m12.895s
sys     0m45.782s
(root@sw178) zs3 > ls -l lolz*[0-9]
-rw-r--r-- 1 root root 1073741824 Nov  9 13:30 lolz.dmg1
-rw-r--r-- 1 root root 1073741824 Nov  9 13:36 lolz.dmg10
-rw-r--r-- 1 root root 1073741824 Nov  9 13:37 lolz.dmg11
-rw-r--r-- 1 root root 1073741824 Nov  9 13:37 lolz.dmg12
-rw-r--r-- 1 root root 1073741824 Nov  9 13:38 lolz.dmg13
-rw-r--r-- 1 root root 1073741824 Nov  9 13:39 lolz.dmg14
-rw-r--r-- 1 root root 1073741824 Nov  9 13:40 lolz.dmg15
-rw-r--r-- 1 root root  434690997 Nov  9 13:40 lolz.dmg16
-rw-r--r-- 1 root root 1073741824 Nov  9 13:31 lolz.dmg2
-rw-r--r-- 1 root root 1073741824 Nov  9 13:31 lolz.dmg3
-rw-r--r-- 1 root root 1073741824 Nov  9 13:32 lolz.dmg4
-rw-r--r-- 1 root root 1073741824 Nov  9 13:32 lolz.dmg5
-rw-r--r-- 1 root root 1073741824 Nov  9 13:33 lolz.dmg6
-rw-r--r-- 1 root root 1073741824 Nov  9 13:34 lolz.dmg7
-rw-r--r-- 1 root root 1073741824 Nov  9 13:35 lolz.dmg8
-rw-r--r-- 1 root root 1073741824 Nov  9 13:35 lolz.dmg9
(root@sw178) zs3 > time for i in `seq 1 16`; do cat lolz.dmg$i >> newlolz.dmg; done

real    10m55.629s
user    0m4.047s
sys     0m42.704s

(root@sw178) zs3 > md5sum lolz.dmg newlolz.dmg
e9b776914d65da41730265371a84d279  lolz.dmg
e9b776914d65da41730265371a84d279  newlolz.dmg



Should I just ignore the error, or will this come back to bite me in some situation(s)? This is my new code:


#!/usr/bin/perl -w

use strict;
use warnings;

my @file = @ARGV;  ### the files to split
my $chunk = 1073741824; #1gb
my $buffer;

foreach ( @file ) {

    my $part = 1;  #- restart part numbering for each input file

    #- open the input file
    open my $in_fh, '<', $_ or warn "Cannot read $_: $!";
    binmode $in_fh;

    #- for all but the last part, read
    #- the amount of data, then output to file

    my $sizeRead = $chunk;
    while ($sizeRead == $chunk)
    {
      #- read an output file worth of data
      $sizeRead = read $in_fh, $buffer, $chunk;
      die "Error reading: $!\n" unless defined $sizeRead;

      #- write the output file
      open my $fh, "> $_$part" or warn "Cannot write to $_$part: $!";
      print $fh $buffer;

      #- increment counter for part#
      $part++;
    }

    #- for the last part, read the rest of
    #- the file.

    read $in_fh, my $buffer, -s or warn "Read zero bytes from $_: $!";
    open my $fh, "> $_$part" or warn "Cannot write to $_$part: $!";
    print $fh $buffer;
}

Re: Chunking very large files
by SuicideJunkie (Vicar) on Nov 09, 2011 at 19:22 UTC
    read $in_fh, my $buffer, -s or warn "Read zero bytes from $_: $!";

    That line seems a bit suspicious. Isn't that going to require a buffer the size of the whole file, which will then only be partly filled before you hit EOF?

    Why not make that for loop a while loop, and allow it to continue until it is done? Just keep reading $size-sized chunks until you hit EOF and read less than a full buffer's worth of data. There is no need to treat the last read differently.

    Also, wouldn't it make more sense to break the files up into fixed-size chunks, rather than a fixed number of chunks each? The reason to chunk them in the first place is so they fit in memory or on a USB stick or something, right?

      You're right in that the final version will have a fixed chunk size. This is simply an exercise to figure out if I can do it or not, and then I'll change the code accordingly.

      I am not very experienced with buffers, and the code I pasted is from an earlier answer on perlmonks to a very similar question that I asked, only they were dealing with smaller files. I'll see if I can figure out how to adjust the loop and use fixed chunk sizes.

      Thanks for your advice.

        Should be as simple as:

        my $sizeRead = $chunkSize;
        while ($sizeRead == $chunkSize) {
            $sizeRead = read $in_fh, $buffer, $chunkSize;
            die "Error reading: $!\n" unless defined $sizeRead;
            ...
        }

Re: Chunking very large files
by Anonymous Monk on Nov 10, 2011 at 05:27 UTC
    my $chunk = 1073741824; #1gb
    [...]
    #- read an output file worth of data
    $sizeRead = read $in_fh, $buffer, $chunk;

    I think that is highly suspect. You really should not be reading that much at once, but instead read 4K..128K chunks in a loop and write them to the destination file immediately.
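
    A minimal sketch of that approach, assuming an illustrative 64 KB read size and a 1 GB part size (neither number comes from the thread, and the part naming just mirrors the "$_$part" scheme above):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $read_size = 64 * 1024;            # read 64 KB at a time
    my $part_size = 1024 * 1024 * 1024;   # roll over to a new part file every 1 GB

    for my $file (@ARGV) {
        open my $in_fh, '<', $file or die "Cannot read $file: $!";
        binmode $in_fh;

        my $part    = 0;
        my $written = $part_size;         # forces the first part to be opened
        my $out_fh;

        my $got;
        while ($got = read $in_fh, my $buffer, $read_size) {
            if ($written >= $part_size) { # current part is full: start the next one
                $part++;
                open $out_fh, '>', "$file$part" or die "Cannot write to $file$part: $!";
                binmode $out_fh;
                $written = 0;
            }
            print {$out_fh} $buffer or die "Write to $file$part failed: $!";
            $written += $got;
        }
        die "Error reading $file: $!" unless defined $got;   # read() returns undef on error
    }

    Because only one small read buffer is ever held, memory use stays flat no matter how large the input file is; the output handle simply switches every $part_size bytes.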

Re: Chunking very large files
by Marshall (Canon) on Nov 11, 2011 at 11:43 UTC
    I also think that this last -s read may be causing some trouble. Do NOT ignore a memory error.

    Reading in 1 GB chunks is probably OK on your machine, although there is no performance benefit to reading more than 64 KB at a time.

    Simplify your code so that everything is handled in one case, perhaps like the pseudo code below.

    Just keep asking for the same-size chunk each time; if the file has less than that left, you will only get what remains. There is no need for a special case to handle the last chunk. If you want to distinguish undef from the zero-bytes case, do that after the while() loop.

    foreach (@file) {
        open input file or die...
        my $part = 1;
        my $sizeRead;
        while ($sizeRead = read $in_fh, $buffer, $chunk) {
            open out file for the current part number
            increment part number for next time, if there is such
            write the data... using $sizeRead (what was actually read)
        }
        if $sizeRead is undef, the last read didn't "work".
    }
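
    A rough rendering of that pseudo code into runnable Perl might look like this (my sketch, not Marshall's; the 1 GB $chunk is only an example, and his note about 64 KB reads still applies):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $chunk = 1073741824;   # example chunk size: one output part per read

    foreach my $file (@ARGV) {
        open my $in_fh, '<', $file or die "Cannot read $file: $!";
        binmode $in_fh;

        my $part = 1;
        my $sizeRead;
        while ($sizeRead = read $in_fh, my $buffer, $chunk) {
            # open the out file for the current part number
            open my $out_fh, '>', "$file$part" or die "Cannot write to $file$part: $!";
            binmode $out_fh;
            print {$out_fh} $buffer;      # $buffer holds exactly $sizeRead bytes
            close $out_fh or die "Error closing $file$part: $!";
            $part++;                      # increment for next time, if there is one
        }
        die "Error reading $file: $!" unless defined $sizeRead;   # undef means the last read failed
    }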