oneill has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I have implemented the following code to read a binary stream and am currently getting outperformed by some Java code which is 10x faster.

turning on binmode on the file increased speed, but I am slowly running out of options to optimise this program.

There are small optimisations that can be done in the code such as moving to an if-else block rather than a dispatch table. However, this hasn't really made much of an effect on the speed.

Wise monks is there anything in this program that is causing this to not run at its best?

UPDATE: I took a look over my program and found the OUTPUT_AUTOFLUSH was set to 1, I removed this and it increased speed by about 30%

# main # my $t0 = Benchmark->new(); my $FEED; ($go{file}) ? open $FEED, "<:perlio", $go{file} : open $FEED, "<&", *STDIN; binmode $FEED; my %vtmf_type = ( 0 => \&read_int, 1 => \&read_byte, 2 => \&var_length, 3 => \&read_long, 4 => sub { return 0 }, 5 => \&escape, 6 => \&var_length, 7 => \&read_price, __DEFAULT__ => sub { $log->warn("skipping message"); return ('', 2 +55) } ); my $message_length_bin; while (read $FEED, $message_length_bin, 2) { my $message_length = unpack 'n', $message_length_bin; my $tag_bin; while (read $FEED, $tag_bin, 2) { my $tag_field = unpack 'B16', $tag_bin; my $type = oct('0b' . substr $tag_field, 0, 3); my $tag = substr $tag_field, 3, 16; my ($value, $tag_length) = ($vtmf_type{$type} || $vtmf_type{__ +DEFAULT__})->($FEED); print oct('0b' . $tag) . '=' . $value . ','; $message_length -= 2 + $tag_length; last unless $message_length; } print "\n"; } my $t1 = Benchmark->new(); if ($go{time}) { my $td = timediff($t1, $t0); $log->info("The code took: ", timestr($td)); } exit 0; # subroutines # sub read_int { my $handle = shift; read $FEED, my($num_bin), 4; my $value = hex unpack "H*", $num_bin; return ($value, 4); } sub read_long { my $handle = shift; read $handle, my($num_bin), 8; my ($hi, $lo) = map { hex $_ } (unpack 'A8A8', unpack 'H16', $num +_bin); my $num = $hi * 2**32 + $lo; return ($num, 8); } sub read_price { my $handle = shift; read $handle, my ($num_bin), 8; my ($hi, $lo) = map { hex $_ } (unpack 'A8A8', unpack 'H16', $num +_bin); my $num = $hi * 2**32 + $lo; $num /= 10**7; return ($num, 8); } sub read_byte { my $handle = shift; read $handle, my($num_bin), 1; my $val = unpack 'A', $num_bin; return ($val, 1); } sub escape { return ('', 0); } sub var_length { my $handle = shift; read $handle, my($length_bin), 1; my $length = hex unpack 'H2', $length_bin; read $handle, my($value_bin), $length; my $value = unpack "A${length}", $value_bin; return ($value, $length + 1); }

Replies are listed 'Best First'.
Re: Reading binary file performance
by zentara (Cardinal) on Mar 24, 2014 at 17:01 UTC
      Thanks zentara, will take a look and see if that helps at all

      I rewrote the parser using substr as suggested in the thread you provided. It doesn't seem to have made much of a noticeable difference at all in terms of the speed.

      Have I fundamentally misunderstood anything? My implementation is below.

      Thanks

      # main # my $t0 = Benchmark->new(); my $FEED; ($go{file}) ? open $FEED, "<:perlio", $go{file} : open $FEED, "<&", *STDIN; binmode $FEED; my %vtmf_type = ( 0 => \&int_type, 1 => \&byte_type, 2 => \&var_length, 3 => \&long_type, 4 => sub { return 0 }, 5 => \&subtype_type, 6 => \&var_length, 7 => \&price_type, __DEFAULT__ => sub { $log->warn("skipping message"); return ('', 2 +55) } ); my $message_length_bin; while (read $FEED, $message_length_bin, 2) { my $message_length = unpack 'n', $message_length_bin; my $offset = 0; read $FEED, my($message), $message_length; while ($offset < $message_length) { my $tag_bin = substr $message, $offset, 2; $offset += 2; my $tag_field = unpack 'B16', $tag_bin; my $type = oct('0b' . substr $tag_field, 0, 3); my $tag = substr $tag_field, 3, 16; my $value; ($value, $offset) = ($vtmf_type{$type} || $vtmf_type{__DEFAULT +__})->($message, $offset); print oct('0b' . $tag) . '=' . $value . ','; } print "\n"; } my $t1 = Benchmark->new(); if ($go{time}) { my $td = timediff($t1, $t0); $log->info("The code took: ", timestr($td)); } exit 0; # subroutines # sub int_type { my ($message, $offset) = @_; my $num_bin = substr $message, $offset, 4; $offset += 4; my $value = hex unpack "H*", $num_bin; return ($value, $offset); } sub long_type { my ($message, $offset) = @_; my $num_bin = substr $message, $offset, 8; $offset += 8; my ($hi, $lo) = map { hex $_ } (unpack 'A8A8', unpack 'H16', $num +_bin); my $num = $hi * 2**32 + $lo; return ($num, $offset); } sub price_type { my ($message, $offset) = @_; my $num_bin = substr $message, $offset, 8; $offset += 8; my ($hi, $lo) = map { hex $_ } (unpack 'A8A8', unpack 'H16', $num +_bin); my $num = $hi << 8 + $lo; $num /= 10**7; return ($num, $offset); } sub byte_type { my ($message, $offset) = @_; my $num_bin = substr $message, $offset, 1; $offset += 1; my $val = unpack 'A', $num_bin; return ($val, $offset); } sub subtype_type { my ($message, $offset) = @_; my $type = substr $message, $offset, 1; my ($value, $length) = ('', 0); if ($type eq 'B') { my $length_bin = substr $message, $offset, 2; $offset += 2; my $length = hex unpack 'H4', $length_bin; my $value_bin = substr $message, $offset, $length; $offset += $length; } elsif ($type eq 'S') { my $length_bin = substr $message, $offset, 2; $offset += 2; } elsif ($type eq 'P') { #skip 2 bytes $offset += 2; my $num_bin = substr $message, $offset, 8; $offset += 8; my ($hi, $lo) = map { hex $_ } (unpack 'A8A8', unpack 'H16', $ +num_bin); my $num = $hi << 8 + $lo; my $exponent_bin = substr $message, $offset, 2; $offset +=2; my $exponent = hex unpack 'H4', $exponent_bin; $value = $num * 10**$exponent; $length = 13; 1; } else { } return ($value, $length); } sub var_length { my ($message, $offset) = @_; my $length_bin = substr $message, $offset, 1; $offset += 1; my $length = hex unpack 'H2', $length_bin; my $value_bin = substr $message, $offset, $length; $offset += $len +gth; my $value = unpack "A${length}", $value_bin; return ($value, $offset); } __END__

        Do you have a sample file?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        Have I fundamentally misunderstood anything

        Yes, the three main tricks , but they're not exactly easy to spot unless you're familiar with them :)
        1) one single correct unpack call with a clever template is faster than everything else
        2) slurping the entire file into one large string is faster than reading byte by byte (its how hard disks work)
        3) substr-ing accross a string is faster than chopping a string (or copying a string then chopping it ...
        aliasing and/or pass-by-reference is faster than copying

        What you did is replace unpack with substr+unpack -- two operations with one -- this will be slower

        one of the slow things about your original program is using oct/hex+unpack -- unpack can do most things by itself , see Re: ID3v2 TAG Footer Reading goes wrong (more subs),Re: hex to binary ( UInt32 / Int32 )

        reducing the number of calls speeds things up

Re: Reading binary file performance
by zentara (Cardinal) on Mar 27, 2014 at 14:25 UTC