demerphq has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I have a number of tasks that require reading and processing run length encoded files (for lack of a better name). These files are composed of records, each containing a number of fields, with every field preceded by a single byte giving that field's length. (My code also needs to handle both gzipped and plain text files transparently.) After using several hand-coded methods for different situations, I decided my life would be less stressful if I wrote a general routine that does this quite fast (often these files are very, very large). My end solution is below.
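For concreteness, the format can be sketched with pack's counted-string template (the field values here are made up purely for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each field is preceded by a single byte holding its length.
# Two records of two fields each, with invented field values.
my $file_image = join '',
    map { pack 'C/a*', $_ } qw(alpha beta gamma delta);

# $file_image is "\x05alpha\x04beta\x05gamma\x05delta"
printf "%d bytes\n", length $file_image;   # 23 bytes
```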

I'm wondering if the monks out there can spot any mistakes in my code, and whether they can suggest improvements to how I did it. I feel sure that some guru out there could do this far more elegantly.

Oh yes, just to stave off any questions about why I didn't use unpack("C/a",$string): the answer is that I don't know whether the file has been terminated incorrectly (i.e. in the middle of a record), which AFAICT makes it unusable in this situation (because unpack doesn't say how long the field was supposed to be; it just returns whatever was there regardless). Also, I'm using a callback paradigm to handle the records extracted from the file, but instead of calling the callback once per record, the callback is passed as many records as possible at once (basically whatever was extracted from the buffer since the last time the callback was invoked). This is done to minimize subroutine call overhead, as some of these files have tens of millions of records in them.
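A minimal sketch of that failure mode (the buffer contents are invented): when the declared length overruns the available data, unpack 'C/a' silently returns the short field rather than signalling an error:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A truncated buffer: the second field claims 6 bytes ("python"?)
# but only 5 remain ("pytho").
my $buffer = "\x04perl\x06pytho";

my ($first) = unpack 'C/a', $buffer;
my ($short) = unpack 'C/a', substr( $buffer, 1 + length $first );

# No warning, no error -- the truncation goes undetected.
print "<$first> <$short>\n";   # <perl> <pytho>
```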

BTW, I haven't had much need to work with buffered file scenarios, so if I have made any obvious mistakes, or you can suggest improvements, then please do.

read_rle_file(FILE,NUMFIELDS,CALLBACK)

Reads an RLE FILE in either gzipped or plain text format, converting the items contained into records of NUMFIELDS fields each. When a chunk of records has been processed it calls CALLBACK with a reference to the array of records. In scalar context returns a Base64 SHA1 digest of the file; in list context returns the digest as well as the number of records extracted. Dies on error.

    use strict;
    use warnings;
    use IO::File;
    use IO::Zlib;
    use Digest::SHA1;

    our %Config = ( buffer_size => 65535 );
    our $Debug  = 0;

    # Read run length encoded file
    sub read_rle_file {
        my $filespec  = shift; # file to read
        my $numfields = shift; # number of fields to a record
        my $sub       = shift; # a CODE ref to call with a chunk of records

        my $sha1 = Digest::SHA1->new();
        my $IN_IO;

        print "Reading run length encoded file $filespec, with records of $numfields fields.\n"
            if $Debug;

        if ( $filespec =~ /\.gz/ ) {
            $IN_IO = IO::Zlib->new( $filespec, "rb" )
                or die "Cannot open compressed run length encoded file $filespec : $!\n";
        }
        else {
            $IN_IO = IO::File->new($filespec)
                or die "Cannot open run length encoded file $filespec : $!\n";
            binmode $IN_IO;
        }

        my $buffer  = ""; # The buffer we are using
        my $records = 0;  # Number of records we have read
        my $buffers = 0;  # The number of times we have refilled the buffer
        my $bytes   = 0;  # The number of bytes we have read so far

        # read until the file is empty
        while ( !$IN_IO->eof ) {
            my $read_buffer;
            my $bytesread = $IN_IO->read( $read_buffer, $Config{buffer_size} );
            die "Read error in read_rle_file($filespec,$numfields)\n"
                unless defined $bytesread;
            $bytes += $bytesread;
            $sha1->add($read_buffer);
            $buffer .= $read_buffer;

            my @records;
            # try to extract as many records as possible from the buffer
            BUFFER:
            while ($buffer) { # Keep extracting records as long as the buffer isn't empty
                my @record;
                while ( @record < $numfields ) { # Extract field by field until we have a complete record.
                    my $len = ord( substr( $buffer, 0, 1 ) );
                    # do we need to refill the buffer?
                    if ( $len + 1 > length $buffer ) {
                        # sigh, we do. Put what we've read so far back into the buffer
                        $buffer = join( "",
                            ( map { chr( length $_ ) . $_ } @record ), $buffer );
                        last BUFFER;
                    }
                    push @record, substr( $buffer, 1, $len );
                    substr( $buffer, 0, $len + 1, "" );
                }
                push @records, \@record;
            }
            # hand off to the callback the records we have extracted so far
            # we do this in chunks to save the callback overhead
            $sub->( \@records );
            $records += @records;
            print "After buffer " . ( $buffers++ ) . " read $records records from $bytes bytes.\n"
                if $Debug > 1;
        }
        die "Unprocessed data in buffer! read_rle_file($filespec,$numfields) failed!\n"
            if length $buffer;
        return wantarray ? ( $sha1->b64digest, $records ) : $sha1->b64digest;
    }

Yves / DeMerphq
---
Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)

Replies are listed 'Best First'.
Re: Reading a run length encoded file in a buffering scenario
by jmcnamara (Monsignor) on Aug 14, 2002 at 14:16 UTC

    I still prefer the unpack 'C/a' method, not least of all for speed. How about something like this:
    #!/usr/bin/perl -wl

    my $buffer = "\04perl\03awk\01C";

    # Try this incomplete string
    # my $buffer = "\04perl\03awk\01C\6pytho";

    while ((my $size = ord $buffer) < length $buffer) {
        print unpack 'C/a', $buffer;
        substr $buffer, 0, $size + 1, '';
    }

    print "<$buffer>";

    This will leave any incomplete items in the buffer.

    --
    John.

      This will leave any incomplete items in the buffer.

      not exactly. more like: if the last item is incomplete, it will remain in the buffer.

      there's an important distinction. if any item other than the last is incomplete, every item from the incomplete item to the last item (inclusive) will be corrupt. there may or may not be data remaining in the buffer as well.
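      a small sketch of that failure mode (buffer contents invented): drop one byte from a middle field, and the following length prefixes are misread:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $good = "\x04perl\x03awk\x01C";
my $bad  = "\x04perl\x03ak\x01C";   # one byte lost from "awk"

for my $buffer ( $good, $bad ) {
    my @fields;
    while ( ( my $len = ord $buffer ) < length $buffer ) {
        push @fields, unpack 'C/a', $buffer;
        substr $buffer, 0, $len + 1, '';
    }
    # in the bad case the second "field" swallows the next length
    # byte, and a bogus byte is left over in the buffer.
    printf "%d fields, %d byte(s) left over\n",
        scalar @fields, length $buffer;
}
```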

      ~Particle *accelerates*


        there's an important distinction. if any item other than the last is incomplete, every item from the incomplete item to the last item (inclusive) will be corrupt. there may or may not be data remaining in the buffer as well.

        I think that this comment serves to confuse rather than clarify.

        There is no important distinction to be made here. If the data is corrupt then all decoding schemes will fail.

        If the data isn't corrupt then having the incomplete item remain in the buffer is an advantage. It means that the program can add to the buffer until at least one record is read.

        --
        John.

      Well I used your code, with some modifications to produce:
      sub read_rle_file { # read run length encoded file
          my $filespec  = shift;
          my $numfields = shift;
          my $sub       = shift;

          my $sha1 = Digest::SHA1->new();
          my $IN_IO;

          print "Reading run length encoded file $filespec, with records of $numfields fields.\n"
              if $Debug;

          if ( $filespec =~ /\.gz/ ) {
              $IN_IO = IO::Zlib->new( $filespec, "rb" )
                  or die "Cannot open compressed run length encoded file $filespec : $!\n";
          }
          else {
              $IN_IO = IO::File->new($filespec)
                  or die "Cannot open run length encoded file $filespec : $!\n";
              binmode $IN_IO;
          }

          my $buffer  = ""; # The buffer we are using
          my $records = 0;  # Number of records we have read
          my $buffers = 0;  # The number of times we have refilled the buffer
          my $bytes   = 0;  # The number of bytes we have read so far
          my $record  = []; # the record currently being assembled

          # read until the file is empty
          while ( !$IN_IO->eof ) {
              my $read_buffer;
              my $bytesread = $IN_IO->read( $read_buffer, $Config{buffer_size} );
              die "Read error in read_rle_file($filespec,$numfields)\n"
                  unless defined $bytesread;
              $bytes += $bytesread;
              $sha1->add($read_buffer);
              $buffer .= $read_buffer;

              my @records;
              # try to extract as many records as possible from the buffer
              while ( ( my $len = ord($buffer) ) < length $buffer ) {
                  push @$record, unpack( "C/a", $buffer );
                  substr( $buffer, 0, $len + 1, "" );
                  if ( @$record == $numfields ) {
                      push @records, $record;
                      $record = [];
                  }
              }
              # hand off to the callback the records we have extracted so far
              # we do this in chunks to save the callback overhead
              $sub->( \@records );
              $records += @records;
              print "After buffer " . ( $buffers++ ) . " read $records records from $bytes bytes.\n"
                  if $Debug > 1;
          }
          die "Unprocessed data in buffer! read_rle_file($filespec,$numfields) failed!\n[@$record] $buffer\n"
              if @$record || length $buffer;
          return wantarray ? ( $sha1->b64digest, $records ) : $sha1->b64digest;
      }
      And Particle's point is valid, but luckily in my situation I'm not worried about a corrupt file so much as an improperly terminated one. Also note the games with $record to handle the case where the buffer empties before a full record is complete: instead of pushing the incomplete record back into the buffer, I now leave it in $record.

      Anyway, thanks for the feedback.

      Yves / DeMerphq
      ---
      Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)

Re: Reading a run length encoded file in a buffering scenario
by jmcnamara (Monsignor) on Aug 14, 2002 at 14:30 UTC

    I was also playing around with an s/// solution whereby a character would be matched followed by that number of characters. Something like this pseudocode:
    my $buffer = "\04perl\03awk\01C";
    my @a = $buffer =~ /(.)(.{ord \1})/g;
    I tried the following but the assignment of $1 or \1 didn't have an effect.
    use re 'eval';
    my $buffer = "\04perl\03awk\01C";
    my @a = $buffer =~ /(.)(?{$len = ord $1})(.{$len})/g;

    Anyone have any ideas about how this could be made to work?

    --
    John.

      A bit of playing around lead me to this:
      use re 'eval';
      my $buffer = "\04perl\03awk\01C";
      while ($buffer =~ /\G(.)((??{".{".ord($1)."}"}))/gs) {
          print ord($1), $2, "\n";
      }
      Which seems to work OK. But it's cryptic as all get out! (And it uses a feature that perlre says is "highly experimental".)
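      For readability, the same pattern can at least be spelled out under /x (same behaviour, just annotated; still a sketch relying on the experimental (??{ }) feature):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use re 'eval';

my $buffer = "\x04perl\x03awk\x01C";

while (
    $buffer =~ m{ \G
        (.)                                 # the length byte
        ( (??{ '.{' . ord($1) . '}' }) )    # exactly ord($1) more bytes
    }gsx
) {
    print ord($1), $2, "\n";
}
```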

      Yves / DeMerphq
      ---
      Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)