in reply to Can I split a 10GB file into 1 GB sizes using my repeating data pattern

The GNU utility split can split on line counts, which seems very close to what you want: divide the target chunk size by the number of bytes per line (the characters per line, +1 for the line feed, or +2 if the lines also end in a carriage return) to get the line count per chunk. See the split man page, option -l.
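
A rough sketch of that arithmetic in Perl, assuming every line has the same length. The 80-character line length, the file name bigfile.txt and the prefix part_ are made-up values for illustration, not taken from the original question:

```perl
use strict;
use warnings;

# Number of whole lines that fit into $target_bytes, given the characters
# per line (excluding the line ending) and the size of the line ending in
# bytes (1 for LF, 2 for CR+LF).
sub lines_per_chunk {
    my ($target_bytes, $chars_per_line, $eol_bytes) = @_;
    return int($target_bytes / ($chars_per_line + $eol_bytes));
}

# Hypothetical example: 80-character lines ending in LF, 1 GB chunks.
my $lines = lines_per_chunk(1_000_000_000, 80, 1);

# Print the split command that would use this line count.
print "split -l $lines bigfile.txt part_\n";
```

Note that this only controls chunk size; it does nothing to make each chunk start at a particular record.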

Update: Sorry - missed the criteria that each first line must begin with 100.

Untested and needs tidying:

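use strict;
use warnings;

my $MAX_FILE_SIZE = 1_000_000_000;    # 1 GB per output file
my $num = 0;

# Open the next numbered output file and return its handle.
my $next_outfile = sub {
    open my $OUT, '>', 'file_' . (++$num) or die $!;
    return $OUT;
};

my $OUTPUT;
my $curr_size;

# Write one group of lines, starting a new file if it would not fit.
my $process_chunk = sub {
    my $chunk = shift;
    if (not defined $curr_size or $curr_size + length($chunk) > $MAX_FILE_SIZE) {
        $OUTPUT = $next_outfile->();
        $curr_size = 0;
    }
    $curr_size += length($chunk);
    print $OUTPUT $chunk;
};

# Assumes the input file name is given as the first argument.
open my $INPUT, '<', $ARGV[0] or die $!;

my $chunk = '';
while (my $line = <$INPUT>) {
    # A line starting with 100 begins a new group: flush the previous one.
    if ($line =~ /^100/ and length $chunk) {
        $process_chunk->($chunk);
        $chunk = '';
    }
    $chunk .= $line;
}
$process_chunk->($chunk) if length $chunk;    # flush the final group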

Reading guide: code is best understood by starting with the while loop at the bottom.
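
To make the loop's job easier to see in isolation, here is a minimal sketch of just the grouping step, run on in-memory lines instead of a file. The function name group_records and the sample data are made up for illustration:

```perl
use strict;
use warnings;

# Group lines into records: each record starts at a line beginning with 100
# and runs up to (but not including) the next such line.
sub group_records {
    my @lines = @_;
    my @records;
    for my $line (@lines) {
        # Start a new record on a 100 line, or if none has been started yet.
        push @records, '' if $line =~ /^100/ || !@records;
        $records[-1] .= $line;
    }
    return @records;
}

# Made-up sample data; only the leading 100 matters here.
my @records = group_records("100 a\n", "200 b\n", "300 c\n", "100 d\n", "200 e\n");
print scalar(@records), " records\n";    # prints "2 records"
```

Each record is then written out whole, so a file boundary can only ever fall just before a line starting with 100.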


Re^2: Can I split a 10GB file into 1 GB sizes using my repeating data pattern
by Limbic~Region (Chancellor) on Jul 22, 2009 at 22:35 UTC
    mzedeler,
    missed the criteria that each first line must begin with 100

    I think you also missed the criteria about splitting a 10 GB file into 1 GB chunks. Also, in the OP's own words:

    "I’m not a Perl guy so excuses my ignorance… I’m the database ETL guy. Not sure if perl is the right choice either."

    I think your code refs and closures are cool but I am not sure they will be of much benefit to the OP if they need to tweak anything. I don't want this to come off as an admonishment, I am just doubtful that the OP could take your code and run with it.

    Cheers - L~R