Re: Can I split a 10GB file into 1 GB sizes using my repeating data pattern

The GNU utility split can split on line counts, which seems to be very close to what you want: just divide the target file size by the number of chars per line (+1 for the line feed, or +2 if the lines end with a carriage return as well) to get the line count. This only works if all lines have the same length. See the split man page, option -l.
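A rough sketch of that arithmetic, tried on a tiny generated file rather than the real 10GB one. The helper name lines_per_file, the demo file names, and the 10-byte line length are all made up for illustration; substitute your own byte count per line and a target of 1000000000 for 1 GB pieces.

```shell
# Lines per output file = target size in bytes / bytes per line
# (bytes per line must include the line terminator).
lines_per_file() {
    echo $(( $1 / $2 ))
}

# Small-scale demo: 10 lines of 9 chars + LF (10 bytes each), 50-byte pieces.
tmp=$(mktemp -d)
for i in 0 1 2 3 4 5 6 7 8 9; do
    printf '10%07d\n' "$i"
done > "$tmp/demo.dat"

n=$(lines_per_file 50 10)
split -l "$n" "$tmp/demo.dat" "$tmp/part_"

# Each resulting piece holds $n lines, i.e. 50 bytes.
wc -c "$tmp"/part_*
```

For the real file the call would look like split -l "$(lines_per_file 1000000000 BYTES_PER_LINE)" bigfile part_, with BYTES_PER_LINE and bigfile replaced by your actual values.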

Update: Sorry - missed the criterion that each first line must begin with 100.

Untested and needs tidying:

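use strict;
use warnings;

# Maximum bytes per output file (set to 1_000_000_000 for 1 GB pieces).
my $MAX_FILE_SIZE = 10_000_000_000;

# The input file name is taken from the command line.
my $infile = shift @ARGV or die "usage: $0 INPUT_FILE\n";
open my $INPUT, '<', $infile or die "Cannot open $infile: $!";

my $num = 0;
# Opens the next numbered output file (file_1, file_2, ...).
my $next_outfile = sub {
    my $name = 'file_' . (++$num);
    open my $OUT, '>', $name or die "Cannot open $name: $!";
    return $OUT;
};

my $OUTPUT;
my $curr_size;
# Writes one chunk, starting a new output file first if the chunk would not fit.
my $process_chunk = sub {
    my $chunk = shift;
    return unless length $chunk;    # nothing buffered yet
    if(not defined $curr_size
        or $curr_size + length($chunk) > $MAX_FILE_SIZE) {
        $OUTPUT = $next_outfile->();
        $curr_size = 0;
    }
    $curr_size += length($chunk);
    print $OUTPUT $chunk;
};

# A chunk is a line starting with 100 plus all lines up to the next such line.
my $chunk = '';
while(my $line = <$INPUT>) {
    if($line =~ /^100/) {
        $process_chunk->($chunk);
        $chunk = '';
    }
    $chunk .= $line;
}
$process_chunk->($chunk);    # flush the final chunk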

Reading guide: code is best understood by starting with the while loop at the bottom.


Replies are listed 'Best First'.

Re^2: Can I split a 10GB file into 1 GB sizes using my repeating data pattern
by Limbic~Region (Chancellor) on Jul 22, 2009 at 22:35 UTC

mzedeler,

"missed the criterion that each first line must begin with 100"

I think you also missed the criterion about splitting a 10GB file into 1 GB chunks. Also, in the OP's own words:

"I’m not a Perl guy so excuses my ignorance… I’m the database ETL guy. Not sure if perl is the right choice either."

I think your code refs and closures are cool, but I am not sure they will be of much benefit to the OP if they need to tweak anything. I don't want this to come off as an admonishment; I am just doubtful that the OP could take your code and run with it.

Cheers - L~R
