demerphq has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I have a number of tasks that require reading and processing run length encoded files (for lack of a better name). These files are composed of records, each containing a number of fields, with every field preceded by a single byte giving that field's length. (My code also needs to handle both gzipped and plain text files transparently.) After using several hand-coded methods for different situations, I decided my life would be less stressful if I wrote a general routine that does this quite fast (often these files are very, very large). My end solution is below.
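For concreteness, the format can be sketched with pack's counted-string template (the field values here are made up purely for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each field is preceded by a single byte holding its length.
# Two records of two fields each, with invented field values.
my $file_image = join '',
    map { pack 'C/a*', $_ } qw(alpha beta gamma delta);

# $file_image is "\x05alpha\x04beta\x05gamma\x05delta"
printf "%d bytes\n", length $file_image;   # 23 bytes
```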

I'm wondering if the monks out there can spot any mistakes in my code, and whether they can suggest improvements to how I did it. I feel sure that some guru out there could do this far more elegantly.

Oh yes, just to stave off any questions about why I didn't use unpack("C/a",$string): the answer is that I don't know whether the file has been terminated incorrectly (i.e. in the middle of a record), which AFAICT makes it unusable in this situation (because unpack doesn't say how long the field was supposed to be; it just returns whatever was there regardless). Also, I'm using a callback paradigm to handle the records extracted from the file, but instead of calling the callback once per record, the callback is passed as many records as possible at once (basically whatever was extracted from the buffer since the last time the callback was invoked). This is done to minimize subroutine call overhead, as some of these files have tens of millions of records in them.
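A minimal sketch of that failure mode (the buffer contents are invented): when the declared length overruns the available data, unpack 'C/a' silently returns the short field rather than signalling an error:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A truncated buffer: the second field claims 6 bytes ("python"?)
# but only 5 remain ("pytho").
my $buffer = "\x04perl\x06pytho";

my ($first) = unpack 'C/a', $buffer;
my ($short) = unpack 'C/a', substr( $buffer, 1 + length $first );

# No warning, no error -- the truncation goes undetected.
print "<$first> <$short>\n";   # <perl> <pytho>
```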

BTW, I haven't had much need to work with buffered file scenarios, so if I have made any obvious mistakes, or you can suggest improvements, then please do.

read_rle_file(FILE,NUMFIELDS,CALLBACK)

Reads an RLE FILE in either gzipped or plain text format, converting the items contained into records of NUMFIELDS fields each. When a chunk of records has been processed it calls CALLBACK with a reference to the array of records. In scalar context returns a Base64 SHA1 digest of the file; in list context returns the digest as well as the number of records extracted. Dies on error.

    use strict;
    use warnings;
    use IO::File;
    use IO::Zlib;
    use Digest::SHA1;

    our %Config = ( buffer_size => 65535 );
    our $Debug  = 0;

    # Read run length encoded file
    sub read_rle_file {
        my $filespec  = shift; # file to read
        my $numfields = shift; # number of fields to a record
        my $sub       = shift; # a CODE ref to call with a chunk of records

        my $sha1 = Digest::SHA1->new();
        my $IN_IO;

        print "Reading run length encoded file $filespec, with records of $numfields fields.\n"
            if $Debug;

        if ( $filespec =~ /\.gz/ ) {
            $IN_IO = IO::Zlib->new( $filespec, "rb" )
                or die "Cannot open compressed run length encoded file $filespec : $!\n";
        }
        else {
            $IN_IO = IO::File->new($filespec)
                or die "Cannot open run length encoded file $filespec : $!\n";
            binmode $IN_IO;
        }

        my $buffer  = ""; # The buffer we are using
        my $records = 0;  # Number of records we have read
        my $buffers = 0;  # The number of times we have refilled the buffer
        my $bytes   = 0;  # The number of bytes we have read so far

        # read until the file is empty
        while ( !$IN_IO->eof ) {
            my $read_buffer;
            my $bytesread = $IN_IO->read( $read_buffer, $Config{buffer_size} );
            die "Read error in read_rle_file($filespec,$numfields)\n"
                unless defined $bytesread;
            $bytes += $bytesread;
            $sha1->add($read_buffer);
            $buffer .= $read_buffer;

            my @records;
            # try to extract as many records as possible from the buffer
            BUFFER:
            while ($buffer) { # Keep extracting records as long as the buffer isn't empty
                my @record;
                while ( @record < $numfields ) { # Extract field by field until we have a complete record.
                    my $len = ord( substr( $buffer, 0, 1 ) );
                    # do we need to refill the buffer?
                    if ( $len + 1 > length $buffer ) {
                        # sigh, we do. Put what we've read so far back into the buffer
                        $buffer = join( "",
                            ( map { chr( length $_ ) . $_ } @record ), $buffer );
                        last BUFFER;
                    }
                    push @record, substr( $buffer, 1, $len );
                    substr( $buffer, 0, $len + 1, "" );
                }
                push @records, \@record;
            }
            # hand off to the callback the records we have extracted so far
            # we do this in chunks to save the callback overhead
            $sub->( \@records );
            $records += @records;
            print "After buffer " . ( $buffers++ ) . " read $records records from $bytes bytes.\n"
                if $Debug > 1;
        }
        die "Unprocessed data in buffer! read_rle_file($filespec,$numfields) failed!\n"
            if length $buffer;
        return wantarray ? ( $sha1->b64digest, $records ) : $sha1->b64digest;
    }

Yves / DeMerphq
---
Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)

Replies are listed 'Best First'.
Re: Reading a run length encoded file in a buffering scenario
by jmcnamara (Monsignor) on Aug 14, 2002 at 14:16 UTC

    I still prefer the unpack 'C/a' method, not least of all for speed. How about something like this:
    #!/usr/bin/perl -wl

    my $buffer = "\04perl\03awk\01C";

    # Try this incomplete string
    # my $buffer = "\04perl\03awk\01C\6pytho";

    while ((my $size = ord $buffer) < length $buffer) {
        print unpack 'C/a', $buffer;
        substr $buffer, 0, $size + 1, '';
    }

    print "<$buffer>";

    This will leave any incomplete items in the buffer.

    --
    John.

      This will leave any incomplete items in the buffer.

      not exactly. more like: if the last item is incomplete, it will remain in the buffer.

      there's an important distinction. if any item other than the last is incomplete, every item from the incomplete item to the last item (inclusive) will be corrupt. there may or may not be data remaining in the buffer as well.
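      a small sketch of that failure mode (buffer contents invented): drop one byte from a middle field, and the following length prefixes are misread:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $good = "\x04perl\x03awk\x01C";
my $bad  = "\x04perl\x03ak\x01C";   # one byte lost from "awk"

for my $buffer ( $good, $bad ) {
    my @fields;
    while ( ( my $len = ord $buffer ) < length $buffer ) {
        push @fields, unpack 'C/a', $buffer;
        substr $buffer, 0, $len + 1, '';
    }
    # in the bad case the second "field" swallows the next length
    # byte, and a bogus byte is left over in the buffer.
    printf "%d fields, %d byte(s) left over\n",
        scalar @fields, length $buffer;
}
```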

      ~Particle *accelerates*


        there's an important distinction. if any item other than the last is incomplete, every item from the incomplete item to the last item (inclusive) will be corrupt. there may or may not be data remaining in the buffer as well.

        I think that this comment serves to confuse rather than clarify.

        There is no important distinction to be made here. If the data is corrupt then all decoding schemes will fail.

        If the data isn't corrupt then having the incomplete item remain in the buffer is an advantage. It means that the program can add to the buffer until at least one record is read.

        --
        John.

      Well I used your code, with some modifications to produce:
      sub read_rle_file { # read run length encoded file
          my $filespec  = shift;
          my $numfields = shift;
          my $sub       = shift;

          my $sha1 = Digest::SHA1->new();
          my $IN_IO;

          print "Reading run length encoded file $filespec, with records of $numfields fields.\n"
              if $Debug;

          if ( $filespec =~ /\.gz/ ) {
              $IN_IO = IO::Zlib->new( $filespec, "rb" )
                  or die "Cannot open compressed run length encoded file $filespec : $!\n";
          }
          else {
              $IN_IO = IO::File->new($filespec)
                  or die "Cannot open run length encoded file $filespec : $!\n";
              binmode $IN_IO;
          }

          my $buffer  = ""; # The buffer we are using
          my $records = 0;  # Number of records we have read
          my $buffers = 0;  # The number of times we have refilled the buffer
          my $bytes   = 0;  # The number of bytes we have read so far
          my $record  = []; # the record currently being assembled

          # read until the file is empty
          while ( !$IN_IO->eof ) {
              my $read_buffer;
              my $bytesread = $IN_IO->read( $read_buffer, $Config{buffer_size} );
              die "Read error in read_rle_file($filespec,$numfields)\n"
                  unless defined $bytesread;
              $bytes += $bytesread;
              $sha1->add($read_buffer);
              $buffer .= $read_buffer;

              my @records;
              # try to extract as many records as possible from the buffer
              while ( ( my $len = ord($buffer) ) < length $buffer ) {
                  push @$record, unpack( "C/a", $buffer );
                  substr( $buffer, 0, $len + 1, "" );
                  if ( @$record == $numfields ) {
                      push @records, $record;
                      $record = [];
                  }
              }
              # hand off to the callback the records we have extracted so far
              # we do this in chunks to save the callback overhead
              $sub->( \@records );
              $records += @records;
              print "After buffer " . ( $buffers++ ) . " read $records records from $bytes bytes.\n"
                  if $Debug > 1;
          }
          die "Unprocessed data in buffer! read_rle_file($filespec,$numfields) failed!\n[@$record] $buffer\n"
              if @$record || length $buffer;
          return wantarray ? ( $sha1->b64digest, $records ) : $sha1->b64digest;
      }
      And Particle's point is valid, but luckily in my situation I'm not worried about a corrupt file so much as an improperly terminated one. Also note the games with $record to handle the case where the buffer empties before a full record is complete: instead of pushing the incomplete record back into the buffer, I now leave it in $record.

      Anyway, thanks for the feedback.

      Yves / DeMerphq
      ---
      Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)

Re: Reading a run length encoded file in a buffering scenario
by jmcnamara (Monsignor) on Aug 14, 2002 at 14:30 UTC

    I was also playing around with an s/// solution whereby a character would be matched followed by that number of characters. Something like this pseudocode:
    my $buffer = "\04perl\03awk\01C";
    my @a = $buffer =~ /(.)(.{ord \1})/g;
    I tried the following but the assignment of $1 or \1 didn't have an effect.
    use re 'eval';
    my $buffer = "\04perl\03awk\01C";
    my @a = $buffer =~ /(.)(?{$len = ord $1})(.{$len})/g;

    Anyone have any ideas about how this could be made to work?

    --
    John.

      A bit of playing around lead me to this:
      use re 'eval';
      my $buffer = "\04perl\03awk\01C";
      while ($buffer =~ /\G(.)((??{".{".ord($1)."}"}))/gs) {
          print ord($1), $2, "\n";
      }
      Which seems to work OK. But it's cryptic as all get out! (And it uses a feature that perlre says is "highly experimental".)
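      For readability, the same pattern can at least be spelled out under /x (same behaviour, just annotated; still a sketch relying on the experimental (??{ }) feature):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use re 'eval';

my $buffer = "\x04perl\x03awk\x01C";

while (
    $buffer =~ m{ \G
        (.)                                 # the length byte
        ( (??{ '.{' . ord($1) . '}' }) )    # exactly ord($1) more bytes
    }gsx
) {
    print ord($1), $2, "\n";
}
```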

      Yves / DeMerphq
      ---
      Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)