Hello, I'm trying to determine whether Perl can provide an equivalent to the command-line utility split when dealing with very large files (where "very large" is in the 10-20 GB neighborhood).

I know this has been asked before, and I've tried some of the solutions only to have Perl die with out-of-memory errors. The latest code I've tried is below. Are there better ways to handle files this large, or should I simply stick with split?

In case it matters, I'm running this code on a machine with 16 GB of memory, and testing with a 15 GB file.

#!/usr/bin/perl -w

use strict;
use warnings;

my $parts = shift; ### how many parts to split
my @file = @ARGV;  ### the files to split

foreach ( @file ) {

    ### how big should the new file be?
    my $size = (-s) / $parts;

    ### open the input file
    open my $in_fh, $_ or warn "Cannot read $_: $!";
    binmode $in_fh;

    ### for all but the last part, read
    ### the amount of data, then write it to
    ### the appropriate output file.
    for my $part (1 .. $parts - 1) {

        ### read an output file worth of data
        read $in_fh, my $buffer, $size or warn "Read zero bytes from $_: $!";

        ### write the output file
        open my $fh, "> $_$part" or warn "Cannot write to $_$part: $!";
        print $fh $buffer;
    }

    # for the last part, read the rest of
    # the file. Buffer will shrink
    # to the actual bytes read.
    read $in_fh, my $buffer, -s or warn "Read zero bytes from $_: $!";
    open my $fh, "> $_$parts" or warn "Cannot write to $_$parts: $!";
    print $fh $buffer;
}
EDIT: After implementing SuicideJunkie's suggestion, I am still receiving an out-of-memory error. Interestingly, however, the files all seem to be created, and when I cat them back together, the reassembled file has the same md5 as the original:

(root@sw178) zs3 > du -sh lolz.dmg
16G     lolz.dmg
(root@sw178) zs3 > time ./z3-perl.pl lolz.dmg
 Out of memory!

real    11m10.656s
user    0m12.895s
sys     0m45.782s
(root@sw178) zs3 > ls -l lolz*[0-9]
-rw-r--r-- 1 root root 1073741824 Nov  9 13:30 lolz.dmg1
-rw-r--r-- 1 root root 1073741824 Nov  9 13:36 lolz.dmg10
-rw-r--r-- 1 root root 1073741824 Nov  9 13:37 lolz.dmg11
-rw-r--r-- 1 root root 1073741824 Nov  9 13:37 lolz.dmg12
-rw-r--r-- 1 root root 1073741824 Nov  9 13:38 lolz.dmg13
-rw-r--r-- 1 root root 1073741824 Nov  9 13:39 lolz.dmg14
-rw-r--r-- 1 root root 1073741824 Nov  9 13:40 lolz.dmg15
-rw-r--r-- 1 root root  434690997 Nov  9 13:40 lolz.dmg16
-rw-r--r-- 1 root root 1073741824 Nov  9 13:31 lolz.dmg2
-rw-r--r-- 1 root root 1073741824 Nov  9 13:31 lolz.dmg3
-rw-r--r-- 1 root root 1073741824 Nov  9 13:32 lolz.dmg4
-rw-r--r-- 1 root root 1073741824 Nov  9 13:32 lolz.dmg5
-rw-r--r-- 1 root root 1073741824 Nov  9 13:33 lolz.dmg6
-rw-r--r-- 1 root root 1073741824 Nov  9 13:34 lolz.dmg7
-rw-r--r-- 1 root root 1073741824 Nov  9 13:35 lolz.dmg8
-rw-r--r-- 1 root root 1073741824 Nov  9 13:35 lolz.dmg9
(root@sw178) zs3 > time for i in `seq 1 16`; do cat lolz.dmg$i >> newlolz.dmg; done

real    10m55.629s
user    0m4.047s
sys     0m42.704s

(root@sw178) zs3 > md5sum lolz.dmg newlolz.dmg
e9b776914d65da41730265371a84d279  lolz.dmg
e9b776914d65da41730265371a84d279  newlolz.dmg

Should I just ignore the out-of-memory error, or will it come back to bite me in some situations? This is my new code:


#!/usr/bin/perl -w

use strict;
use warnings;

my $part = 1;
my @file = @ARGV;  ### the files to split
my $chunk = 1073741824; #1gb
my ($buffer, $size);

foreach ( @file ) {

    #- open the input file
    open my $in_fh, $_ or warn "Cannot read $_: $!";
    binmode $in_fh;

    #- for all but the last part, read
    #- the amount of data, then output to file

    my $sizeRead = $chunk;
    while ($sizeRead == $chunk)
    {
      #- read an output file worth of data
      $sizeRead = read $in_fh, $buffer, $chunk;
      die "Error reading: $!\n" unless defined $sizeRead;

      #- write the output file
      open my $fh, "> $_$part" or warn "Cannot write to $_$part: $!";
      print $fh $buffer;

      #- increment counter for part#
      $part++;
    }

    #- for the last part, read the rest of
    #- the file.

    read $in_fh, my $buffer, -s or warn "Read zero bytes from $_: $!";
    open my $fh, "> $_$part" or warn "Cannot write to $_$part: $!";
    print $fh $buffer;
}
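For what it's worth, here is a trimmed-down sketch of what I think the loop boils down to if I drop that final whole-file read entirely. My (unconfirmed) guess is that asking read for -s bytes is what blows up, since the buffer appears to get pre-sized to the requested length even though there is nothing left to read at that point. The 1 GB chunk size and the part-file naming follow the code above, and I have not yet run this variant against the full 15 GB file, so treat it as a sketch rather than a confirmed fix:

#!/usr/bin/perl

use strict;
use warnings;

my $chunk = 1073741824;    #- 1 GB per output part, same as above

foreach my $file (@ARGV) {

    #- open the input file
    open my $in_fh, '<', $file or die "Cannot read $file: $!";
    binmode $in_fh;

    my $part = 1;
    while (1) {
        #- read at most one chunk; read() returns 0 at end of file
        my $sizeRead = read $in_fh, my $buffer, $chunk;
        die "Error reading $file: $!\n" unless defined $sizeRead;
        last if $sizeRead == 0;

        #- write this chunk out as its own part file
        open my $out_fh, '>', "$file$part" or die "Cannot write to $file$part: $!";
        binmode $out_fh;
        print {$out_fh} $buffer;
        close $out_fh or die "Cannot close $file$part: $!";

        $part++;
    }

    close $in_fh;
}

Since the last successful read returns fewer than $chunk bytes, the short final part gets written by the same loop and there is no separate read of the remainder. If holding 1 GB in memory at a time is still more than I want, $chunk could presumably be made smaller; that would just change how many part files are produced.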
