Re: How to process each byte in a binary file?
by particle (Vicar) on Aug 12, 2002 at 19:54 UTC
my $file = 'myverylargefile.bin';
{
    local *INPUT;
    local $/ = \1;
    open INPUT, '<', $file or die $!;
    while (<INPUT>)
    {
        ## process here...
    }
}
From perlvar:
Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer.
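A minimal runnable sketch of this approach: it reads from an in-memory filehandle (core since perl 5.8) instead of an external file, so the example is self-contained, and uses ord() to turn each one-byte record into its numeric value.

```perl
use strict;
use warnings;

my $data = "\x00\x01\xFF";
open my $fh, '<', \$data or die $!;   # in-memory filehandle stands in for the real file
binmode $fh;

my @bytes;
{
    local $/ = \1;                    # each <$fh> now returns a single byte
    while (my $byte = <$fh>) {
        push @bytes, ord $byte;       # numeric value of the byte
    }
}
close $fh;
print "@bytes\n";                     # prints: 0 1 255
```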
~Particle *accelerates*
It looks like $/ doesn't have any effect on IO::Scalar. I see that if the input is already in a file, and really is a primitive file handle, this saves the trouble of reading it in first. But I wonder whether the overhead of one read at a time is still high compared to reading a chunk at a time and processing the chunks with one of the other methods.
Re: How to process each byte in a binary file?
by kschwab (Vicar) on Aug 12, 2002 at 20:58 UTC
How about IO::Scalar or IO::String? You could then seek() and tell() around the string, or read() in 1-byte increments. I'm not sure what's under the covers, but both seem elegant from the outside.
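A sketch of the seek()/tell()/read() style described here. IO::Scalar and IO::String wrap a string in a filehandle object; a plain in-memory filehandle (core since perl 5.8) offers the same interface, so that is used below to keep the example dependency-free.

```perl
use strict;
use warnings;

my $string = "ABC";
open my $fh, '<', \$string or die $!;   # filehandle over an in-memory string

# read() in 1-byte increments, collecting each byte's numeric value
my @bytes;
while (read $fh, my $byte, 1) {
    push @bytes, ord $byte;
}
print "@bytes\n";                       # prints: 65 66 67

# seek() and tell() work on the string exactly as on a real file
seek $fh, 1, 0;                         # jump to offset 1
read $fh, my $b, 1;
printf "%s at offset %d\n", $b, tell($fh) - 1;   # prints: B at offset 1
```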
Re: How to process each byte in a binary file?
by jmcnamara (Monsignor) on Aug 12, 2002 at 22:42 UTC
I'd guess that unpack is the fastest, but if you are looking for alternatives to benchmark, you could try this:
for (split //, $str, length $str) { ... }
Regardless of the method you choose, it would probably be best to read and process the file in chunks. Playing around with the buffer size might let you find a good balance between the size of each read and the size of the data to process:
#!/usr/bin/perl -w
use strict;

open FILE, 'reload.xls' or die "Error message here: $!";
binmode FILE;    # as required

my $buffer = 4096;
my $str;

while (read FILE, $str, $buffer) {
    for (split //, $str, $buffer) {
        # Your code here
    }
}
--
John.
Benchmark Results
by John M. Dlugosz (Monsignor) on Aug 12, 2002 at 21:48 UTC
Thus far, unpack "C" is the fastest. vec is 9% slower on a small input, 2% on a larger input, so there may be setup overhead there?
substr is about 10% slower than vec.
The regex //g approach is 1/3 to 1/2 the speed of substr. And using IO::Scalar is ten times slower than that!
—John
Re: How to process each byte in a binary file?
by kschwab (Vicar) on Aug 12, 2002 at 22:32 UTC
#!/usr/bin/perl
use Benchmark;

my $string = "X" x 102400;

timethese(100, {
    'split' => sub {
        for (split(//, $string)) { }
    },
    'unpack' => sub {
        for (unpack("C*", $string)) { }
    },
    'regex' => sub {
        while ($string =~ /./sg) { }
    },
    'substr' => sub {
        for (my $i = 0; $i < length($string); $i++) {
            substr($string, $i, 1);
        }
    },
});
Gives me:
$ perl foo
Benchmark: timing 100 iterations of regex, split, substr, unpack...
regex: 44 wallclock secs (43.13 usr + 0.00 sys = 43.13 CPU)
split: 49 wallclock secs (47.90 usr + 0.04 sys = 47.94 CPU)
substr: 58 wallclock secs (55.70 usr + 0.00 sys = 55.70 CPU)
unpack: 27 wallclock secs (26.48 usr + 0.00 sys = 26.48 CPU)
Update: Reposted results after correcting typo.
I get similar results: split falls between regex and substr. Makes me wonder, though: since split // is a "special case" that splits on every character, why isn't it simply as fast as unpack?
—John
Re: How to process each byte in a binary file?
by Anonymous Monk on Aug 12, 2002 at 20:23 UTC
vec EXPR,OFFSET,BITS
Treats the string in EXPR as a bit vector made up of elements of width BITS, and returns the value of the element specified by OFFSET as an unsigned integer. BITS therefore specifies the number of bits that are reserved for each element in the bit vector. This must be a power of two from 1 to 32 (or 64, if your platform supports that).
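A short sketch of vec() used this way: with BITS = 8, the string is treated as a vector of unsigned 8-bit elements, i.e. one element per byte, so OFFSET indexes bytes directly.

```perl
use strict;
use warnings;

my $str = "\x10\x20\xFF";
my @bytes;
for my $i (0 .. length($str) - 1) {
    push @bytes, vec($str, $i, 8);    # i-th byte as an unsigned integer
}
print "@bytes\n";                     # prints: 16 32 255
```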
Maybe this is what you mean?