Grundle has asked for the wisdom of the Perl Monks concerning the following question:

I have written a simple subroutine to break a large file up into smaller increments. Since the large file contains fixed-size records, the program loads the record blocksize for that file and calculates how many whole records can be written to each smaller file within an upper file-size limit. For some reason, as I parse, several files are written past the specified limit for my sub-files. I do not understand this behavior, especially since I am reading a set number of bytes into my buffer. Please see the following code:
my $increment = 1000000; #For this example $$config{blocksize} == 550
my $records = int($increment / $$config{blocksize});
my $bytecnt = $$config{blocksize} * $records;
#After calculations this prints out as 999900
print "bytecnt[$bytecnt]\n";

my $file = 0;
open PARENT, $$self{data}
    or die "Cannot open [$$self{data}] for incremental parsing\n";
while(1){
    $file++;
    my $data = "";
    if($file == 1){
        read(PARENT, $data, $$config{startblock});
        read(PARENT, $data, $bytecnt);
        open(FILE, ">tmp\\$$self{process}".sprintf("%02d", $file).".dat")
            or die "Cannot open tmp\\$$self{process}".sprintf("%02d", $file)
                 . ".dat for incremental writing\n";
        print FILE $data;
        close(FILE);
        next;
    }
    read(PARENT, $data, $bytecnt);
    open(FILE, ">tmp\\$$self{process}".sprintf("%02d", $file).".dat")
        or die "Cannot open tmp\\$$self{process}".sprintf("%02d", $file)
             . ".dat for incremental writing\n";
    print FILE $data;
    close(FILE);
    if(eof(PARENT)){ last; }
}
close(PARENT);
die;
After this routine runs, I do an ls -l on my directory and the following is printed out.
-rwxr-x---+ 1  999946 May 30 09:27 patient01.dat
-rwxr-x---+ 1 1000463 May 30 09:27 patient02.dat
-rwxr-x---+ 1  999940 May 30 09:27 patient03.dat
-rwxr-x---+ 1  999944 May 30 09:27 patient04.dat
-rwxr-x---+ 1  999931 May 30 09:27 patient05.dat
-rwxr-x---+ 1  999945 May 30 09:27 patient06.dat
-rwxr-x---+ 1 1000236 May 30 09:27 patient38.dat
-rwxr-x---+ 1 1000122 May 30 09:27 patient39.dat
-rwxr-x---+ 1 1000031 May 30 09:27 patient41.dat
There are more files in the ls output, but I am showing only the files that fall outside the behavior I expect. My confusion is: if my blocksize is constant, and my target file length is constant, then how is it possible to be writing variable file lengths? The only file that should differ in length is the last one...

Replies are listed 'Best First'.
Re: Unexpected File Results
by varian (Chaplain) on May 30, 2007 at 15:12 UTC
    I don't have your data files to reproduce the problem, but the code shows me two potential causes (operating-system dependent):
    1) you probably want to set binmode on the output files, so as to avoid translation of characters
    2) the read call always reserves the right to return fewer bytes than you requested. Therefore, check the returned 'actual bytes read' and see whether you need to make another call to read the remaining bytes of the record.
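    A minimal sketch of point 2: a loop that keeps calling read until the requested bytes have arrived or EOF is hit. read_fully is a hypothetical helper name; the handle and byte count come from the caller.

    sub read_fully {
        my ($fh, $want) = @_;
        my $buf = '';
        while (length($buf) < $want) {
            # The 4-argument form of read() appends at an offset into $buf.
            my $got = read($fh, $buf, $want - length($buf), length($buf));
            die "read failed: $!" unless defined $got;  # I/O error
            last if $got == 0;                          # end of file
        }
        return $buf;  # shorter than $want only when EOF was reached
    }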
      I can answer statement 2 right away: I threw a print statement into the loop to see the length of each "$data" buffer, and the lengths came out correctly.

      Suggestion 1 looks very promising. I suspect it will most likely solve my problem, since the files are binary.
      Yes, binmode was exactly what I needed, thanks!!!

        Make sure to binmode the input file as well, in case it contains cr-lf pairs which reading would convert to just lf.
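        A minimal sketch of the fix applied in both directions ($infile and $outfile are placeholder names); the :raw layer on open is equivalent to calling binmode on the handle:

        open my $in,  '<:raw', $infile  or die "Cannot open $infile: $!";
        open my $out, '>:raw', $outfile or die "Cannot open $outfile: $!";
        # ...or, for handles that are already open:
        binmode $in;
        binmode $out;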

Re: Unexpected File Results
by Old_Gray_Bear (Bishop) on May 30, 2007 at 15:16 UTC
    In the field of mathematics there is a very useful tool called "Proof By Contradiction". You assume the converse of what you want to prove and then demonstrate that this leads to a logical contradiction. If your chain of logic is impeccably correct, then your initial assumption must have been wrong; thus demonstrating that its converse (the statement you wanted to prove in the first place) is true. Often this is an easier approach than a flat-out straight-on proof of the correctness of your original theorem.

    There are a couple of possibilities that I see here:

    1. The assumption about a constant physical block-size of the data is in error.
    2. The 'constant block-size' refers to the number of records composing an entry ('every patient will have the following sixteen pieces of information'), but the lengths of the individual records can vary ('fields that do not have data will be entered as a single blank or zero').
    I'd bet that further discussion with the User/Designer of the input will be most instructive. ("Why, yes, we said that the block-size was constant. It isn't? Hum, you have uncovered a bug. Don't do anything more with the data until we can check this out.")

    ----
    I Go Back to Sleep, Now.

    OGB

      I appreciate your mathematical approach, but let me set your mind at ease. I have proven assumption 1 out using the handy tool called "dd", plus another fantastic mathematical convention: by taking the total file length and dividing it by the total number of records it contains, I was able to find the blocksize of each record.

      blocksize = filesize/total_num_records

      To further prove this, I know that each block starts with a customer name. So with the handy tool 'dd' I can move to an arbitrary record. If my blocksize is off, then I will not have the name starting the block.

      dd if=filename bs=550 count=1 skip=2000 | od -Ad -c

      The previous command moves me to block number 2000 and lets me see one instance of that block (od -c prints it as a character dump; -Ad gives decimal offsets). The name is at the correct location, so we have proven that possibility 1 is not the case.
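      For completeness, the same spot-check can be done from within Perl (a sketch using the same filename, blocksize, and record number as the dd command above):

      open my $fh, '<:raw', 'filename' or die "Cannot open: $!";
      seek $fh, 550 * 2000, 0 or die "seek failed: $!";  # jump to block 2000
      read $fh, my $rec, 550;                            # read one block
      print substr($rec, 0, 40), "\n";  # the customer name should lead the block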

      For statement number 2, let us refer to the file itself. Since it is a binary, database-style file, it is logically broken up into these blocks. In the database world these blocks are also called "rows". Although rows can have empty locations, the system has allocated those spaces beforehand, so they still exist in the binary file, even though they are filled with nulls (\0 in the hexdump).
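      Under those assumptions (a NUL-padded name field at the front of each 550-byte row; the exact layout is my guess, not the real schema), unpack's Z template would recover the name while dropping the padding:

      read $fh, my $record, 550;            # one fixed-size row
      my ($name) = unpack 'Z550', $record;  # bytes up to the first "\0"
      print "name: [$name]\n";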

      One more thing I would like to point out: I find it quite useful to approach the problem from a different angle. Most developers would never think to come at it from the standpoint you have suggested. Thank you for those thoughts.
Re: Unexpected File Results
by Util (Priest) on May 30, 2007 at 15:50 UTC

    Testing on non-Cygwin (ActiveState Win32), trying to replicate the possible text/binary condition. This (highly refactored) code, when reading and writing binary (raw), works perfectly. Changing the output mode back to text reproduces the behaviour described in the OP.

    use strict;
    use warnings;

    my $self   = { data => 'PM_618194_in.dat', process => 'patient' };
    my $config = { blocksize => 550, startblock => 3 };

    my $max_chunk_size = 1_000_000;
    my $chunk_size = int( $max_chunk_size / $config->{blocksize} )
                   * $config->{blocksize};

    open my $parent_fh, '<:raw', $self->{data}
        or die "Cannot open '$self->{data}' for incremental parsing: $!";

    my $bytes_read = read $parent_fh, my $junk, $config->{startblock};
    if ( $bytes_read != $config->{startblock} ) {
        warn "Tried to read $config->{startblock} bytes, "
           . "but got $bytes_read bytes!\n";
    }

    $/ = \$chunk_size; # Set <> for fixed blocksize reads.

    my $file_num;
    while ( <$parent_fh> ) {
        $file_num++;
        my $filename = sprintf 'tmp/%s%02d.dat', $self->{process}, $file_num;

        open my $out_fh, '>:raw', $filename
      # open my $out_fh, '>',     $filename
            or die "Cannot open '$filename' for incremental writing: $!";
        print $out_fh $_ or warn;
        close $out_fh    or warn;
    }
    close $parent_fh or warn;
    BTW, you can save yourself a good bit of code by factoring out the {startblock} read, and using $/, as I have above. Also, when your files are binary, it is *always* most correct to use binary mode, even when your OS (like Unix) does not care.
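    For anyone unfamiliar with the $/ trick: setting $/ to a reference to an integer makes <> return fixed-length chunks instead of lines (the final chunk may be shorter at EOF). A standalone sketch, with $path as a placeholder:

    local $/ = \550;                    # read in 550-byte chunks
    open my $fh, '<:raw', $path or die "Cannot open '$path': $!";
    while ( my $chunk = <$fh> ) {
        printf "got %d bytes\n", length $chunk;
    }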

Re: Unexpected File Results
by Util (Priest) on May 30, 2007 at 15:16 UTC

    Are you using Win32, perhaps cygWin with some of the files being handled in text mode instead of binary mode? Does doing binmode() on the filehandles make a difference?

    Unrelated possible bug: Are you trying to throw away $config->{startblock} blocks at the start of the input file? Because you are throwing away $config->{startblock} bytes!
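    If skipping whole blocks really is the intent, the fix might look like this (a sketch against the OP's own variables; which behaviour was meant is only a guess):

    # scale the discard read by the blocksize...
    read(PARENT, my $junk, $$config{startblock} * $$config{blocksize});
    # ...or skip the header without reading it at all:
    seek(PARENT, $$config{startblock} * $$config{blocksize}, 0);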

Re: Unexpected File Results
by leocharre (Priest) on May 30, 2007 at 14:56 UTC
    Mine works fine..
    #!/usr/bin/perl -w
    use strict;

    my $self = {};
    $self->{data} = './biglog.log';
    $self->{process} = 1;

    my $increment = 1000000; #For this example $$config{blocksize} == 550
    my $blocksize = 550;
    my $startblock = 1;

    my $records = int($increment / $blocksize);
    my $bytecnt = $blocksize * $records;
    #After calculations this prints out as 999900
    print "bytecnt[$bytecnt]\n";

    my $file = 0;
    open PARENT, $$self{data}
        or die "Cannot open [$$self{data}] for incremental parsing\n";
    while(1){
        $file++;
        my $data = "";
        if($file == 1){
            read(PARENT, $data, $startblock);
            read(PARENT, $data, $bytecnt);
            open(FILE, ">tmp\\$$self{process}".sprintf("%02d", $file).".dat")
                or die "Cannot open tmp\\$$self{process}".sprintf("%02d", $file)
                     . ".dat for incremental writing\n";
            print FILE $data;
            close(FILE);
            next;
        }
        read(PARENT, $data, $bytecnt);
        open(FILE, ">tmp\\$$self{process}".sprintf("%02d", $file).".dat")
            or die "Cannot open tmp\\$$self{process}".sprintf("%02d", $file)
                 . ".dat for incremental writing\n";
        print FILE $data;
        close(FILE);
        if(eof(PARENT)){ last; }
    }
    close(PARENT);
    die;
    I get
    -rw-r--r-- 1 loot poot 6056869 May 30 10:48 biglog.log
    -rw-r--r-- 1 loot poot    1425 May 30 10:51 incspl.pl
    -rw-r--r-- 1 loot poot  999900 May 30 10:52 tmp\101.dat
    -rw-r--r-- 1 loot poot  999900 May 30 10:52 tmp\102.dat
    -rw-r--r-- 1 loot poot  999900 May 30 10:52 tmp\103.dat
    -rw-r--r-- 1 loot poot  999900 May 30 10:52 tmp\104.dat
    -rw-r--r-- 1 loot poot  999900 May 30 10:52 tmp\105.dat
    -rw-r--r-- 1 loot poot  999900 May 30 10:52 tmp\106.dat
    -rw-r--r-- 1 loot poot   57468 May 30 10:52 tmp\107.dat
    
    Your problem must be the config??.. (Oh no.. could it be because you're using windoz with some funny filesystem.. ?!?!)
      Yes I am using "Windows", with a cygwin interface for my console. Do you think that is causing this strange behavior? Can it be that Windows can't handle block sizes correctly?

      I tried your code to see if I could generate a difference, but the results were still the same.