Reading binary file performance

oneill has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I have implemented the following code to read a binary stream and am currently getting outperformed by some Java code which is 10x faster.

turning on binmode on the file increased speed, but I am slowly running out of options to optimise this program.

There are small optimisations that can be done in the code such as moving to an if-else block rather than a dispatch table. However, this hasn't really made much of an effect on the speed.

Wise monks is there anything in this program that is causing this to not run at its best?

UPDATE: I took a look over my program and found the OUTPUT_AUTOFLUSH was set to 1, I removed this and it increased speed by about 30%

# main
#
my $t0 = Benchmark->new();

my $FEED;
($go{file}) ?
    open $FEED, "<:perlio", $go{file} :
    open $FEED, "<&", *STDIN;

binmode $FEED;

my %vtmf_type = (
    0 => \&read_int,
    1 => \&read_byte,
    2 => \&var_length,
    3 => \&read_long,
    4 => sub { return 0 },
    5 => \&escape,
    6 => \&var_length,
    7 => \&read_price,
    __DEFAULT__ => sub { $log->warn("skipping message"); return ('', 2
+55) }
);

my $message_length_bin;
while (read $FEED, $message_length_bin, 2) {
    my $message_length = unpack 'n', $message_length_bin;

    my $tag_bin;
    while (read $FEED, $tag_bin, 2) {
        my $tag_field = unpack 'B16', $tag_bin;
        my $type = oct('0b' . substr $tag_field, 0, 3);
        my $tag = substr $tag_field, 3, 16;

        my ($value, $tag_length) = ($vtmf_type{$type} || $vtmf_type{__
+DEFAULT__})->($FEED);

        print oct('0b' . $tag) . '=' . $value . ',';

        $message_length -= 2 + $tag_length;

        last unless $message_length;
    }

    print "\n";
}

my $t1 = Benchmark->new();

if ($go{time}) {
    my $td = timediff($t1, $t0);
    $log->info("The code took: ", timestr($td));
}

exit 0;


# subroutines
#
sub read_int {
    my $handle = shift;

    read $FEED, my($num_bin), 4;

    my $value = hex unpack "H*", $num_bin;

    return ($value, 4);
}

sub read_long {
    my $handle = shift;

    read $handle, my($num_bin), 8;

    my ($hi, $lo) = map { hex $_ } (unpack 'A8A8', unpack  'H16', $num
+_bin);

    my $num = $hi * 2**32 + $lo;

    return ($num, 8);
}

sub read_price {
    my $handle = shift;

    read $handle, my ($num_bin), 8;

    my ($hi, $lo) = map { hex $_ } (unpack 'A8A8', unpack  'H16', $num
+_bin);

    my $num = $hi * 2**32 + $lo;

    $num /= 10**7;

    return ($num, 8);
}

sub read_byte {
    my $handle = shift;

    read $handle, my($num_bin), 1;

    my $val = unpack 'A', $num_bin;

    return ($val, 1);
}

sub escape {
    return ('', 0);
}

sub var_length {
    my $handle = shift;

    read $handle, my($length_bin), 1;
    my $length = hex unpack 'H2', $length_bin;

    read $handle, my($value_bin), $length;
    my $value = unpack "A${length}", $value_bin;

    return ($value, $length + 1);
}
[download]

Comment on Reading binary file performance Download Code

Replies are listed 'Best First'.
Re: Reading binary file performance by zentara (Cardinal) on Mar 24, 2014 at 17:01 UTC
Faster way to parse binary stream I'm not really a human, but I play one on earth. Old Perl Programmer Haiku ................... flash japh	[reply]
Re^2: Reading binary file performance by oneill (Initiate) on Mar 25, 2014 at 09:33 UTC
Thanks zentara, will take a look and see if that helps at all	[reply]
Re^2: Reading binary file performance by oneill (Initiate) on Mar 27, 2014 at 10:39 UTC
I rewrote the parser using substr as suggested in the thread you provided. It doesn't seem to have made much of a noticeable difference at all in terms of the speed. Have I fundamentally misunderstood anything? My implementation is below. Thanks # main # my $t0 = Benchmark->new(); my $FEED; ($go{file}) ? open $FEED, "<:perlio", $go{file} : open $FEED, "<&", STDIN; binmode $FEED; my %vtmf_type = ( 0 => \&int_type, 1 => \&byte_type, 2 => \&var_length, 3 => \&long_type, 4 => sub { return 0 }, 5 => \&subtype_type, 6 => \&var_length, 7 => \&price_type, __DEFAULT__ => sub { $log->warn("skipping message"); return ('', 2 +55) } ); my $message_length_bin; while (read $FEED, $message_length_bin, 2) { my $message_length = unpack 'n', $message_length_bin; my $offset = 0; read $FEED, my($message), $message_length; while ($offset < $message_length) { my $tag_bin = substr $message, $offset, 2; $offset += 2; my $tag_field = unpack 'B16', $tag_bin; my $type = oct('0b' . substr $tag_field, 0, 3); my $tag = substr $tag_field, 3, 16; my $value; ($value, $offset) = ($vtmf_type{$type} \|\| $vtmf_type{__DEFAULT +__})->($message, $offset); print oct('0b' . $tag) . '=' . $value . ','; } print "\n"; } my $t1 = Benchmark->new(); if ($go{time}) { my $td = timediff($t1, $t0); $log->info("The code took: ", timestr($td)); } exit 0; # subroutines # sub int_type { my ($message, $offset) = @_; my $num_bin = substr $message, $offset, 4; $offset += 4; my $value = hex unpack "H", $num_bin; return ($value, $offset); } sub long_type { my ($message, $offset) = @_; my $num_bin = substr $message, $offset, 8; $offset += 8; my ($hi, $lo) = map { hex $_ } (unpack 'A8A8', unpack 'H16', $num +_bin); my $num = $hi * 232 + $lo; return ($num, $offset); } sub price_type { my ($message, $offset) = @_; my $num_bin = substr $message, $offset, 8; $offset += 8; my ($hi, $lo) = map { hex $_ } (unpack 'A8A8', unpack 'H16', $num +_bin); my $num = $hi << 8 + $lo; $num /= 107; return ($num, $offset); } sub byte_type { my ($message, $offset) = @_; my $num_bin = substr $message, $offset, 1; $offset += 1; my $val = unpack 'A', $num_bin; return ($val, $offset); } sub subtype_type { my ($message, $offset) = @_; my $type = substr $message, $offset, 1; my ($value, $length) = ('', 0); if ($type eq 'B') { my $length_bin = substr $message, $offset, 2; $offset += 2; my $length = hex unpack 'H4', $length_bin; my $value_bin = substr $message, $offset, $length; $offset += $length; } elsif ($type eq 'S') { my $length_bin = substr $message, $offset, 2; $offset += 2; } elsif ($type eq 'P') { #skip 2 bytes $offset += 2; my $num_bin = substr $message, $offset, 8; $offset += 8; my ($hi, $lo) = map { hex $_ } (unpack 'A8A8', unpack 'H16', $ +num_bin); my $num = $hi << 8 + $lo; my $exponent_bin = substr $message, $offset, 2; $offset +=2; my $exponent = hex unpack 'H4', $exponent_bin; $value = $num * 10**$exponent; $length = 13; 1; } else { } return ($value, $length); } sub var_length { my ($message, $offset) = @_; my $length_bin = substr $message, $offset, 1; $offset += 1; my $length = hex unpack 'H2', $length_bin; my $value_bin = substr $message, $offset, $length; $offset += $len +gth; my $value = unpack "A${length}", $value_bin; return ($value, $offset); } __END__ [download]	[reply] [d/l]
Re^3: Reading binary file performance by BrowserUk (Patriarch) on Mar 27, 2014 at 12:24 UTC
Do you have a sample file? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^4: Reading binary file performance by oneill (Initiate) on Mar 27, 2014 at 13:30 UTC
Re^5: Reading binary file performance by BrowserUk (Patriarch) on Mar 27, 2014 at 13:39 UTC
Some notes below your chosen depth have not been shown here
Re^3: Reading binary file performance by Anonymous Monk on Mar 27, 2014 at 11:55 UTC
Have I fundamentally misunderstood anything Yes, the three main tricks , but they're not exactly easy to spot unless you're familiar with them :) 1) one single correct unpack call with a clever template is faster than everything else 2) slurping the entire file into one large string is faster than reading byte by byte (its how hard disks work) 3) substr-ing accross a string is faster than chopping a string (or copying a string then chopping it ... aliasing and/or pass-by-reference is faster than copying What you did is replace unpack with substr+unpack -- two operations with one -- this will be slower one of the slow things about your original program is using oct/hex+unpack -- unpack can do most things by itself , see Re: ID3v2 TAG Footer Reading goes wrong (more subs),Re: hex to binary ( UInt32 / Int32 ) reducing the number of calls speeds things up	[reply]
Re^4: Reading binary file performance by oneill (Initiate) on Mar 27, 2014 at 13:44 UTC
Re: Reading binary file performance by zentara (Cardinal) on Mar 27, 2014 at 14:25 UTC
If you can't locate where the slowdown is occuring, consider running Devel::NYTProf to analyze your script for time consumers. I'm not really a human, but I play one on earth. Old Perl Programmer Haiku ................... flash japh	[reply]