Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I’m not a Perl guy, so excuse my ignorance… I’m the database ETL guy. Not sure if Perl is the right choice either.

I’m trying to be proactive and devise a plan B for an ETL process where I expect a file 10X larger than what I process daily, for a recast job. The ETL tool may handle it, but I just don’t know.

This file may need to be split, and we don’t want to lose related data. I assume it would be easier to do at the Unix scripting level rather than in the ETL tool, provided Perl has no file-size limitations.

The file will most likely be 10 GB, plus or minus a few GB; the exact size is unknown at this time.

The basic file format is as follows, with the first 3 characters of each line being the record type (100, 401, 404, 410, 411).

The file must be split into segments equal to a daily run, approximately 1 GB in size, and the split has to occur just before a 100 record, as all the rows that follow a 100 belong together.

1001104vvbvnbvd
4011104ghghghgh
404111kjdkfjkdf
404111kjdkfjkdf
404111kjdkfjkdf
404111kjdkfjkdf
4103445kkjkljlk
4103445kkjkljlk
4113445kkjkljlk
4043445kkjkljlk
10011ffgfgg1250
4011104fffhghgh
404111kjddfjkdf
404111kjdkrtrdf
etc... (skip ahead ~1 GB)
10011ffgfger250   <---- break here and start file 2 for the next 1 GB
40111034efhghgh
404111kjddfjkdf
404111kjdkrtrdf
Thanks in advance.

Re: Can I split a 10GB file into 1 GB sizes using my repeating data pattern
by SuicideJunkie (Vicar) on Jul 22, 2009 at 18:34 UTC

    As long as you don't try to read the whole file into memory at once, you will be fine.

    Just read a line at a time, and open a new output file whenever the current file has grown too big and the incoming line starts with "100". Something like this:

    my $sizeSoFar = 0;
    my $outputFileName = 'chunk000';   # magic string increment: chunk001, chunk002, ...
    open my $outputFileHandle, '>', $outputFileName
        or die "Cannot open $outputFileName for writing: $!\n";

    while (my $line = <$inputFileHandle>) {
        if ($sizeSoFar > 1e9 and $line =~ /^100/) {
            $outputFileName++;
            open $outputFileHandle, '>', $outputFileName
                or die "Cannot open $outputFileName for writing: $!\n";
            $sizeSoFar = 0;
        }
        print $outputFileHandle $line;
        $sizeSoFar += length($line);
    }

      while (defined $line = <$inputFileHandle>)

      Because defined is a named unary operator, it binds more tightly than =, so this parses as (defined $line) = <$inputFileHandle> and fails to compile. You need to either enclose the assignment in parentheses:

      while (defined( $line = <$inputFileHandle> ))

      Or, since Perl implicitly applies the defined test when the while condition is nothing but an assignment from readline, just omit it and Perl will do the right thing:

      while ($line = <$inputFileHandle>)

Re: Can I split a 10GB file into 1 GB sizes using my repeating data pattern
by Limbic~Region (Chancellor) on Jul 22, 2009 at 22:24 UTC
    Anonymous Monk,
    This is an incredibly trivial task in Perl. Here is some code (untested) that should get you started. I have intentionally left some things unoptimized, with comments, since you know better than I do what should actually happen.
    #!/usr/bin/perl
    use constant DAILY_RUN => 1024 * 1024 * 1000;

    use strict;
    use warnings;

    my $file = $ARGV[0] or die "Usage: $0 <input_file>";
    open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $!";

    my $cnt = 1;
    my $out_file = "$file.$cnt";
    # Will clobber an existing file by this name (fix if important)
    open(my $out_fh, '>', $out_file) or die "Unable to open '$out_file' for writing: $!";

    while (<$fh>) {
        if (-s $out_file > DAILY_RUN && /^100/) {
            ++$cnt;
            $out_file = "$file.$cnt";
            open($out_fh, '>', $out_file) or die "Unable to open '$out_file' for writing: $!";
        }
        print $out_fh $_;
    }

    Now, it looks like your lines are fixed length, so one optimization may be not to check how big the file is after every write, but to wait until you have written enough to be at least 1 GB and then set a flag to start paying attention to the start of a 100 record. Additionally, this code writes at least 1 GB and then starts a new file as soon as a 100 record is encountered - you may want to keep each chunk under 1 GB instead. Again, you are in a better position to address these than I am. Finally, it may be possible to process the record sets as a whole, rather than a line at a time, by setting $/ = "\n100"; That is an advanced technique that you can read about in perlvar. It complicates the code, but it is presumably more efficient (fewer disk reads and writes).
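    For what it is worth, an untested sketch of that record-separator approach might look like the following. The chomp bookkeeping re-attaches the "100" that the separator strips from the front of each following record set:

    #!/usr/bin/perl
    use constant DAILY_RUN => 1024 * 1024 * 1000;

    use strict;
    use warnings;

    my $file = $ARGV[0] or die "Usage: $0 <input_file>";
    open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $!";

    $/ = "\n100";    # each read now ends just past the newline before a 100 record

    my $cnt  = 1;
    my $size = 0;
    open(my $out_fh, '>', "$file.$cnt")
        or die "Unable to open '$file.$cnt' for writing: $!";

    my $carry = '';    # the "100" chomped off the end of the previous read
    while (my $set = <$fh>) {
        my $chomped = chomp $set;    # strips the trailing "\n100", if present
        $set = $carry . $set;
        $set .= "\n" if $chomped;    # restore the newline that chomp removed
        $carry = $chomped ? '100' : '';

        if ($size > DAILY_RUN) {     # new chunk, always on a record-set boundary
            ++$cnt;
            $size = 0;
            open($out_fh, '>', "$file.$cnt")
                or die "Unable to open '$file.$cnt' for writing: $!";
        }
        print $out_fh $set;
        $size += length $set;
    }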

    Cheers - L~R

      That worked perfectly, Limbic~Region. Thanks for your help... I hope I don't have to use it, as that would mean my initial ETL had crashed. This little exercise was interesting. I did a lot of research and had many interesting (unsuccessful) results from my own scripts. I'm going to try to adjust your script to see if I can add a header and footer.
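      In case it helps the next reader, here is one untested way that header/footer adjustment might look, grafted onto Limbic~Region's script above. $HEADER and $FOOTER are made-up placeholders for whatever the real header and footer rows turn out to be:

      #!/usr/bin/perl
      use constant DAILY_RUN => 1024 * 1024 * 1000;

      use strict;
      use warnings;

      my $HEADER = "HDR...\n";    # placeholder - substitute the real header row
      my $FOOTER = "TRL...\n";    # placeholder - substitute the real footer row

      my $file = $ARGV[0] or die "Usage: $0 <input_file>";
      open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $!";

      my $cnt = 1;
      my $out_file = "$file.$cnt";
      open(my $out_fh, '>', $out_file) or die "Unable to open '$out_file' for writing: $!";
      print $out_fh $HEADER;

      while (<$fh>) {
          if (-s $out_file > DAILY_RUN && /^100/) {
              print $out_fh $FOOTER;    # close out the finished chunk
              ++$cnt;
              $out_file = "$file.$cnt";
              open($out_fh, '>', $out_file) or die "Unable to open '$out_file' for writing: $!";
              print $out_fh $HEADER;
          }
          print $out_fh $_;
      }
      print $out_fh $FOOTER;            # the last chunk needs a footer too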
Re: Can I split a 10GB file into 1 GB sizes using my repeating data pattern
by mzedeler (Pilgrim) on Jul 22, 2009 at 18:35 UTC

    The GNU utility split can split on line counts, which seems to be very close to what you want - just divide the target size by the number of characters per line (+1 for the line feed, or +2 if carriage returns are used as well). For example, with the 15-character records shown above (16 bytes per line including the newline), a 1 GB chunk is roughly 62,500,000 lines. See the split man page, option -l.

    Update: Sorry - missed the criterion that the first line of each chunk must begin with 100.

    Untested and needs tidying:

    use strict;
    use warnings;

    my $MAX_FILE_SIZE = 10_000_000_000;

    open INPUT, '<', $ARGV[0] or die "Unable to open $ARGV[0]: $!";

    my $num = 0;
    my $next_outfile = sub {
        open my $OUT, '>', 'file_' . (++$num) or die $!;
        return $OUT;
    };

    my $OUTPUT;
    my $curr_size;
    my $process_chunk = sub {
        my $chunk = shift;
        if (not defined $curr_size
            or $curr_size + length($chunk) > $MAX_FILE_SIZE) {
            $OUTPUT    = $next_outfile->();
            $curr_size = 0;
        }
        $curr_size += length($chunk);
        print $OUTPUT $chunk;
    };

    my $chunk = '';
    while (my $line = <INPUT>) {
        if ($line =~ /^100/) {
            $process_chunk->($chunk) if length $chunk;
            $chunk = '';
        }
        $chunk .= $line;
    }
    $process_chunk->($chunk) if length $chunk;    # flush the final record set

    Reading guide: code is best understood by starting with the while loop at the bottom.

      mzedeler,
      missed the criterion that the first line of each chunk must begin with 100

      I think you also missed the criterion about splitting the 10 GB file into 1 GB chunks ($MAX_FILE_SIZE is set to 10 GB). Also, in the OP's own words:

      "I’m not a Perl guy so excuses my ignorance… I’m the database ETL guy. Not sure if perl is the right choice either."

      I think your code refs and closures are cool, but I am not sure they will be of much benefit to the OP if they need to tweak anything. I don't want this to come off as an admonishment; I am just doubtful that the OP could take your code and run with it.

      Cheers - L~R