I have a number of tasks that require reading and processing of run length encoded files (for the lack of a better name). These files are composed of records of various number of fields, with each field being preceded by a single byte that determines the fields length. (Also my code needs to handle both gziped and straight text files transparently.) After using several hand coded methods for different situations I decided that my life would be less stressy if I could write a general routine that would do this quite fast. (Often these files are very very large). My end solution is below.
Im wondering if the monks out there can spot any mistakes in my code, and also if they can suggest any improvements to how I did it. I feel sure that some guru out there could do this far more elegantly.
Oh yes, just to stave any questions about why I didnt use unpack("C/a",$string), the answer is that I dont i know if the file has been terminated incorrectly (ie in the middle of a record) which means that afaict it isn usefull in this situation (becuase pack doesnt say how long the field was supposed to be, just returns whatever was there regardless). Also I'm using a callback paradigm to handle the records extracted from the file. But instead of calling the callback once per record the callback is passed as many records as possible at once. (basically whatever was extracted from the buffer since the last time the callback was invoked). This is done to minimize the subroutine call overhead as some of the files have 10s of millions of records in them.
BTW, I havent that much need to work with buffering file scenarios, so if I have made any obvious mistakes or you can suggest improvements then please do.
read_rle_file(FILE,NUMFIELDS,CALLBACK)
Reads an RLE FILE in either gzipped or straight text format, coverting the items contained into records of NUMFIELDS fields. When a chunk of records have been processed it calls CALLBACK with a reference to the array of records. Returns in scalar context a b64 SHA1 Digest of the file, in list context returns the digest as well as the number of records extracted. Dies on error.
use strict; use warnings; use IO::File; use IO::Zlib; use Digest::SHA1; our %Config=(buffer_size=>65535); #Read run length encoded file sub read_rle_file { my $filespec =shift; # file to read my $numfields=shift; # number of fields to a record my $sub =shift; # a CODE ref to call with a chunk of records my $sha1=Digest::SHA1->new(); my $IN_IO; print "Reading run length encoded file $filespec, with records of +$numfields fields.\n" if $Debug; if ( $filespec =~ /\.gz/ ) { $IN_IO = IO::Zlib->new( $filespec, "rb" ) or die "Cannot open compressed run length encoded file $fi +lespec : \n" ; } else { $IN_IO = IO::File->new($filespec) or die "Cannot open run length encoded file $filespec : \n +" ; binmode $IN_IO; } my $buffer=""; # The buffer we are using my $records=0; # Number of record we have read my $buffers=0; # The number of times we have refilled the buffer my $bytes =0; # The number of bytes we have read so far # read until the file is empty while (!$IN_IO->eof ) { my $read_buffer; my $bytesread = $IN_IO->read( $read_buffer, $Config{buffer_siz +e} ); die "Read error in read_rle_file($filespec,$numfields)\n" unless defined $bytesread; $bytes+=$bytesread; $sha1->add($read_buffer); $buffer.=$read_buffer; my @records; # try to extract as many records as possible from the buffer BUFFER: while ($buffer) { # Keep extracting records as long as the buffe +r isnt empty my @record; while (@record<$numfields) { # Extract field by field until we have + a complete record. my $len=ord(substr($buffer,0,1)); # do we need to refill the buffer? if ($len+1>length $buffer) { # sigh, we do. Put what we've read sofar back into + the buffer $buffer=join("",(map{chr(length $_).$_}@record),$b +uffer); last BUFFER; } push @record,substr($buffer,1,$len); substr($buffer,0,$len+1,""); } push @records,\@record; } # hand off to the callback the records we have extracted so fa +r # we do this in chunks to save the callback overhead $sub->(\@records); $records+=@records; print "After buffer ".($buffers++)." read $records records fro +m $bytes bytes.\n" if $Debug>1; } die "Unprocessed data in buffer! read_rle_file($filespec,$numfield +s) failed!\n" if length $buffer; return wantarray ? ($sha1->b64digest,$records) : $sha1->b64digest; }
Yves / DeMerphq
---
Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |