
Greetings,

I have a number of tasks that require reading and processing run length encoded files (for lack of a better name). These files are composed of records with varying numbers of fields, with each field preceded by a single byte that gives the field's length. (My code also needs to handle both gzipped and plain files transparently.) After using several hand-coded methods for different situations, I decided that my life would be less stressful if I could write a general routine that does this quite fast. (These files are often very, very large.) My end solution is below.
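To make the format concrete, here is a minimal sketch of how one record could be encoded. The helper name `encode_record` is hypothetical and not part of the code further down; it only assumes the layout described above (one length byte, then that many bytes of field data).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original code): encode one record
# as a run of length-prefixed fields, one length byte per field.
sub encode_record {
    my @fields = @_;
    for my $field (@fields) {
        # A single length byte cannot describe more than 255 bytes.
        die "Field longer than 255 bytes\n" if length($field) > 255;
    }
    return join '', map { chr(length $_) . $_ } @fields;
}

# Three fields, the middle one empty: 4 + 1 + 6 = 11 bytes in total.
my $bytes = encode_record('abc', '', 'hello');
printf "%d bytes: %s\n", length $bytes, unpack('H*', $bytes);
```

An empty field is legal in this layout: it is just a length byte of zero with no data after it.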

I'm wondering if the monks out there can spot any mistakes in my code, and whether they can suggest any improvements to how I did it. I feel sure that some guru out there could do this far more elegantly.

Oh yes, just to stave off any questions about why I didn't use unpack("C/a",$string): I don't know whether the file has been terminated incorrectly (i.e. in the middle of a record), which means that, AFAICT, unpack isn't useful in this situation (because it doesn't say how long the field was supposed to be; it just returns whatever was there regardless).

Also, I'm using a callback paradigm to handle the records extracted from the file. Instead of being called once per record, the callback is passed as many records as possible at once (basically whatever was extracted from the buffer since the last time it was invoked). This is done to minimize the subroutine call overhead, as some of the files have tens of millions of records in them.
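The unpack point can be seen with a small experiment. This is only an illustrative sketch with made-up sample data: it unpacks one complete field and one truncated field with the "C/a" template and compares the length the prefix byte promised with the length actually returned.

```perl
use strict;
use warnings;

# A complete field: the length byte says 5 and 5 bytes of data follow.
my $complete  = chr(5) . 'hello';
# A truncated field: the length byte says 5 but only 2 bytes are present.
my $truncated = chr(5) . 'he';

for my $data ($complete, $truncated) {
    my ($field) = unpack 'C/a', $data;        # length-prefixed string
    my $wanted  = ord substr($data, 0, 1);    # what the prefix promised
    printf "wanted %d bytes, got %d\n", $wanted, length $field;
}
```

In the truncated case unpack hands back the short string without complaint, so the caller would have to re-read the length byte anyway to notice the problem.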

BTW, I haven't had much need to work with buffered file reading, so if I have made any obvious mistakes or you can suggest improvements, then please do.

read_rle_file(FILE,NUMFIELDS,CALLBACK)

Reads an RLE FILE in either gzipped or plain format, converting the items it contains into records of NUMFIELDS fields. When a chunk of records has been processed, it calls CALLBACK with a reference to the array of records. In scalar context it returns a base64 SHA1 digest of the file; in list context it returns the digest as well as the number of records extracted. The digest is computed over the data as it is read, so for a gzipped file it covers the decompressed contents. Dies on error.

use strict;
use warnings;
use IO::File;
use IO::Zlib;
use Digest::SHA1;
our %Config=(buffer_size=>65535);
our $Debug=0;  # 0 = quiet, 1 = progress message, 2 = per-buffer statistics

#Read run length encoded file
sub read_rle_file {
    my $filespec =shift;  # file to read
    my $numfields=shift;  # number of fields to a record
    my $sub      =shift;  # a CODE ref to call with a chunk of records

    my $sha1=Digest::SHA1->new();
    my $IN_IO;
    print "Reading run length encoded file $filespec, with records of $numfields fields.\n"
        if $Debug;
    # only treat the file as gzipped when the name ends in .gz
    if ( $filespec =~ /\.gz$/ ) {
        $IN_IO = IO::Zlib->new( $filespec, "rb" )
            or die "Cannot open compressed run length encoded file $filespec : $!\n" ;
    } else {
        $IN_IO = IO::File->new($filespec)
            or die "Cannot open run length encoded file $filespec : $!\n" ;
        binmode $IN_IO;
    }

    my $buffer=""; # The buffer we are using
    my $records=0; # Number of records we have read
    my $buffers=0; # The number of times we have refilled the buffer
    my $bytes  =0; # The number of bytes we have read so far

    # read until the file is empty
    while (!$IN_IO->eof ) {

        my $read_buffer;
        my $bytesread = $IN_IO->read( $read_buffer, $Config{buffer_size} );
        # IO::File returns undef on error; a gzread-based handle can report -1
        die "Read error in read_rle_file($filespec,$numfields)\n"
            unless defined $bytesread and $bytesread >= 0;
        $bytes+=$bytesread;
        $sha1->add($read_buffer);
        $buffer.=$read_buffer;

        my @records;
        # try to extract as many records as possible from the buffer
        BUFFER:
        while (length $buffer) {
            # Keep extracting records as long as the buffer isn't empty.
            # (Test the length, not the truth value: a buffer of "0" is false.)
            my @record;
            while (@record<$numfields) {
                # Extract field by field until we have a complete record.
                my $len=ord(substr($buffer,0,1));
                # do we need to refill the buffer?
                if ($len+1>length $buffer) {
                    # sigh, we do. Put what we've read so far back into the buffer
                    $buffer=join("",(map{chr(length $_).$_}@record),$buffer);
                    last BUFFER;
                }
                push @record,substr($buffer,1,$len);
                substr($buffer,0,$len+1,"");
            }
            push @records,\@record;
        }
        # hand off to the callback the records we have extracted so far
        # we do this in chunks to save the callback overhead
        $sub->(\@records);
        $records+=@records;
        print "After buffer ".($buffers++)." read $records records from $bytes bytes.\n"
            if $Debug>1;
    }
    die "Unprocessed data in buffer! read_rle_file($filespec,$numfields) failed!\n"
        if length $buffer;
    return wantarray ? ($sha1->b64digest,$records) : $sha1->b64digest;
}
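
The extraction step can also be exercised on its own, without any file I/O. The following is a minimal sketch, not the routine above: the helper name `extract_records` is hypothetical, and instead of trimming the buffer field by field and re-encoding a partial record, it walks the buffer with an offset and trims once, after the last complete record.

```perl
use strict;
use warnings;

# Hypothetical helper: pull as many complete records as possible out of
# $$buf_ref, remove the consumed bytes, and leave any partial record behind.
sub extract_records {
    my ($buf_ref, $numfields) = @_;
    my @records;
    my $pos = 0;                       # start of the next unparsed record
    my $end = length $$buf_ref;

    RECORD:
    while ($pos < $end) {
        my @record;
        my $p = $pos;                  # scan position inside this record
        while (@record < $numfields) {
            last RECORD if $p >= $end;             # length byte missing
            my $len = ord substr($$buf_ref, $p, 1);
            last RECORD if $p + 1 + $len > $end;   # field data incomplete
            push @record, substr($$buf_ref, $p + 1, $len);
            $p += 1 + $len;
        }
        push @records, \@record;
        $pos = $p;                     # commit only complete records
    }
    substr($$buf_ref, 0, $pos, '');    # drop everything that was consumed
    return \@records;
}

# Two complete 2-field records followed by the start of a third.
my $buffer = join '', map { chr(length $_) . $_ } 'a', 'bc', '', 'def';
$buffer .= chr(2) . 'g';               # promises 2 bytes, delivers 1
my $records = extract_records(\$buffer, 2);
printf "%d records, %d bytes left over\n", scalar @$records, length $buffer;
```

Whatever is left in the buffer after the call is the incomplete tail, which the caller would prepend to the next chunk read from the file. If anything is still left at end of file, the file was truncated.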

Yves / DeMerphq
---
Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)

In reply to Reading a run length encoded file in a buffering scenario by demerphq
