JayDog has asked for the wisdom of the Perl Monks concerning the following question:

O ye with great PERLs of wisdom...

In the code below you will see that I am attempting to split a large data file (sometimes in excess of 12-13 GB) into smaller chunks (500 KB in the example). The data is a whole bunch of separate customer invoices, each of which always begins with an "11" at the 67th character of its first line.

I now want to be able to remove any invoices that are larger than, say, 25 MB and pull them out into their own file before breaking this large file into the smaller chunks as it does currently.

    my $chunksize = 500 * 1024; # 500Kb
    my $filenumber = 0;
    my $infile = "infile.dat";
    my $outsize = 0;
    my $eof = 0;

    open INFILE, $infile;
    open OUTFILE, ">outfile_".$filenumber.".dat";

    while(<INFILE>) {
        chomp;
        $outsize++;
        if( $outsize>$chunksize and /^.{67}11/ ) {
            close OUTFILE;
            $outsize = 0;
            $filenumber++;
            open (OUTFILE, ">outfile_".$filenumber.".dat")
                or die "Can't open outfile_".$filenumber.".dat";
        }
        print OUTFILE "$_\n";
        $outsize += length;
    }
    close INFILE;

Re: Removing Large Invoices from a Data File
by BrowserUk (Patriarch) on Jun 01, 2007 at 22:49 UTC

    One problem with your existing code is that $outsize++; counts lines, not bytes, and that line count is then compared against your chunk size in if( $outsize>$chunksize ..., which means your 'small' files are going to end up containing about 500,000 lines! If your lines average 67 bytes each, then each output file would be around 33MB.

    You need to change that to $outsize += length() + 1; (the + 1 accounts for the newline removed by chomp) to achieve your original aim.

    Update: I noticed the $outsize++; at the top, but not the $outsize += length at the bottom of the loop. Why two statements?

    Beyond that, to achieve your second aim, you will need to buffer the contents of each invoice and accumulate a second, per-invoice count of the bytes read, delaying the write until you have accumulated a complete invoice.

    Only at that point will you be able to determine whether to write the buffer out to the current composite file or to a separate file, depending on how big it is. Remember to subtract the accumulated per-invoice count from $outsize when you write individual files. Or only accumulate to $outsize when you have written to the composite output file.
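
The buffering idea described above might be sketched roughly as follows. This is not code from the thread: route_invoices() and its two callbacks are hypothetical names chosen for illustration, and the sketch assumes the invoice marker is the "11" at offset 67 used in the original regex. It holds only the current invoice in memory and decides where it goes once the whole invoice has been read.

```perl
use strict;
use warnings;

# Sketch only: route_invoices() and its callback interface are illustrative
# assumptions, not an API from this thread.
#   $in          - filehandle opened for reading
#   $max_invoice - size limit; length() equals bytes when no encoding layer is set
#   $on_small    - called with each invoice that fits within the limit
#   $on_large    - called with each invoice that exceeds the limit
sub route_invoices {
    my ($in, $max_invoice, $on_small, $on_large) = @_;
    my $buffer = '';    # all lines of the invoice currently being read

    my $flush = sub {
        return unless length $buffer;
        # The size is only known once the complete invoice is buffered.
        if (length($buffer) > $max_invoice) { $on_large->($buffer) }
        else                                { $on_small->($buffer) }
        $buffer = '';
    };

    while (my $line = <$in>) {
        # An "11" at offset 67 marks the first line of a new invoice, so the
        # previous invoice is complete. The length guard avoids a
        # "substr outside of string" warning on short lines.
        $flush->() if length($line) > 68 && substr($line, 67, 2) eq '11';
        $buffer .= $line;
    }
    $flush->();    # do not lose the final invoice
}
```

In use, the $on_small callback would be the place to print to the current composite file, add length($invoice) to $outsize, and open the next outfile_N.dat once $outsize has passed $chunksize; the $on_large callback would open a separate file for each oversized invoice. Because $outsize is only touched in $on_small, nothing needs to be subtracted afterwards.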

    It would also save time if you tested for the start-of-invoice condition using substr, eg.

    if( substr( $_, 67, 2 ) eq '11' ) { ...

    It gets expensive running up the regex engine on every line of files of this size.
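
To check the cost on your own data rather than take it on faith, a small comparison using the core Benchmark module could look like the sketch below. The sub names and sample lines are made up for illustration; the timings will depend on your Perl version and line lengths, so no figures are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Regex form of the start-of-invoice test, as in the original question.
sub starts_invoice_re {
    return $_[0] =~ /^.{67}11/ ? 1 : 0;
}

# substr form. The length guard keeps short lines from triggering a
# "substr outside of string" warning under 'use warnings'.
sub starts_invoice_substr {
    return length($_[0]) >= 69 && substr($_[0], 67, 2) eq '11' ? 1 : 0;
}

# Hypothetical sample lines: a marker line, a non-marker line, a short line.
my @lines = (
    (' ' x 67) . '11 first line of an invoice',
    (' ' x 67) . '22 some other record type',
    'short line',
);

# Both tests should agree on every sample line before timing them.
for my $line (@lines) {
    die "tests disagree on: $line\n"
        unless starts_invoice_re($line) == starts_invoice_substr($line);
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    regex  => sub { starts_invoice_re($_)     for @lines },
    substr => sub { starts_invoice_substr($_) for @lines },
});
```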


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Removing Large Invoices from a Data File
by GrandFather (Saint) on Jun 01, 2007 at 22:48 UTC

    Unless you want special handling for the large invoices, it looks like your code is already doing that. Consider the following mutation of your code for demonstration purposes:

    use warnings;
    use strict;

    my $chunksize = 50;
    my $filenumber = 0;
    my $outsize = $chunksize + 1;
    my $eof = 0;

    while (<DATA>) {
        if ($outsize > $chunksize and /^11/) {
            $outsize = 0;
            $filenumber++;
            print "*** New file number $filenumber\n";
        }
        print "$_";
        $outsize += length;
    }

    __DATA__
    11 The first invoice.
    We get this whole thing even though it exceeds the 50 character limit
    because the conditional code requires an invoice marker before it will
    start a new file.
    11 Second
    11 Third
    11 Fourth invoice
    11 Fifth invoice
    This one is longer, but not excessive
    11 Sixth: break before because limit hit in fifth
    11 seventh and final

    Prints:

    *** New file number 1
    11 The first invoice.
    We get this whole thing even though it exceeds the 50 character limit
    because the conditional code requires an invoice marker before it will
    start a new file.
    *** New file number 2
    11 Second
    11 Third
    11 Fourth invoice
    11 Fifth invoice
    This one is longer, but not excessive
    *** New file number 3
    11 Sixth: break before because limit hit in fifth
    11 seventh and final

    How would the output be different to achieve what you want?


    DWIM is Perl's answer to Gödel
Re: Removing Large Invoices from a Data File
by Limbic~Region (Chancellor) on Jun 02, 2007 at 03:30 UTC
    JayDog,
    If your goal is to place as many invoices into a single file up to a certain size limit and ones that exceed that size limit in their own file, then this is how I would do it (untested):

    Cheers - L~R

Re: Removing Large Invoices from a Data File
by Util (Priest) on Jun 02, 2007 at 03:48 UTC
    Here is my (loosely tested) solution. The two layers of buffering turned out to be close to BrowserUk's description. The use of a separate write_invoice sub makes the separation of the two buffers explicit.