Re: How do I search this binary file?

All you need to do is a little buffering like:

my $last4 = '';
my $buffer = '';
my $find = qr /1234/;
my $chunk = 7;  # 2**20 or some such is more logical!
while ( read ( DATA, $buffer, $chunk ) ) {
    print "got last-$last4\tbuffer-$buffer\n";
    print "simple match\n" while $buffer =~ s/$find//;
    $buffer = $last4 . $buffer;
    $last4 = substr $buffer, -4, 4, '';
    print "buffer match\n" if $buffer =~ m/$find/;
}
__DATA__
1234567890123456789012345678901234567890123456789012345678901234567890

__END__
got last-    buffer-1234567
simple match
got last-567    buffer-8901234
simple match
got last-7890    buffer-5678901
got last-8901    buffer-2345678
buffer match
got last-5678    buffer-9012345
simple match
got last-8905    buffer-6789012
got last-9012    buffer-3456789
buffer match
got last-6789    buffer-0123456
simple match
got last-9056    buffer-7890123
got last-0123    buffer-4567890
buffer match
got last-7890    buffer-

In this example I use a chunk size of 7 so that the carry-over buffer gets exercised often, but 2**20 (1MB) chunks work effectively with big files (I wrote a node about processing big files fast at Re: Performance Question). You should be able to get around 4-8MB/s throughput, depending on your hardware.

I'll leave it as an exercise for you as to what to do with the data between matches. Logging the positions and making a second pass through the file to get the data may well be most efficient.
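As a rough sketch of that second pass (not part of the original code): assuming the first pass has left you with a list of match offsets, the data between matches can be pulled out with seek and read. The helper names `gaps_between` and `read_span` are made up for illustration, and the in-memory filehandle only stands in for your real binary file.

```perl
use strict;
use warnings;

# Hypothetical helper: given the match offsets from a first pass, return
# [start, length] pairs for the data lying between consecutive matches.
sub gaps_between {
    my ($offsets, $match_len, $file_size) = @_;
    my @spans;
    my $start = 0;
    for my $off (@$offsets, $file_size) {
        push @spans, [ $start, $off - $start ] if $off > $start;
        $start = $off + $match_len;    # skip over the match itself
    }
    return @spans;
}

# Hypothetical helper: second pass, seek to a span and read exactly that data.
sub read_span {
    my ($fh, $start, $len) = @_;
    seek $fh, $start, 0 or die "seek failed: $!";
    my $data = '';
    my $got  = read $fh, $data, $len;
    die "short read at offset $start" unless defined $got && $got == $len;
    return $data;
}

# Usage: an in-memory file stands in for the real one (an assumption).
my $file = 'AA1234BBB1234C';
open my $fh, '<', \$file or die "cannot open in-memory file: $!";
binmode $fh;
for my $span (gaps_between([2, 9], 4, length $file)) {
    print read_span($fh, @$span), "\n";
}
```

Only the offsets are held in memory during the first pass, so the scan itself stays cheap even when the stretches between matches are large.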

Note that if we match in the freshly read buffer we s/// the match out before we prepend the previously buffered tail and retest; this avoids a double match. As written the s/// deletes the match, so replace it with a filler string of the same length if you need to maintain length. There are 7 matches available and 7 found. A dozen lines with debugging - you gotta love Perl.
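A minimal sketch of the filler idea, assuming a literal search string and a filler character that never occurs in it (both are assumptions, as is the helper name `mask_matches`). It overwrites each match with a same-length filler instead of deleting it, so the buffer keeps its length.

```perl
use strict;
use warnings;

# Hypothetical helper: overwrite each match with a same-length filler so the
# buffer length (and every later offset in it) stays unchanged.
# Returns (number of matches, masked buffer).
sub mask_matches {
    my ($buffer, $find, $fill_char) = @_;
    my $filler = $fill_char x length $find;
    my $count  = ($buffer =~ s/\Q$find\E/$filler/g);   # false when nothing matched
    return ($count || 0, $buffer);
}

my $chunk = '9012341234567';
my ($n, $masked) = mask_matches($chunk, '1234', "\0");
printf "%d matches, length %d before and %d after\n",
    $n, length $chunk, length $masked;
```

Because nothing shifts, an offset computed against the masked buffer is still valid for the original bytes.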

Update

Here is a version that records the positions of matches for you:

my $last4 = '';
my $buffer = '';
my $find = '1234';
my $find_re = qr/\Q$find\E/;  # \Q..\E so any metacharacters in $find match literally
my $length = length $find;
my $chunk = 7;  # 2**20 or some such is more logical!
my $pos = 0;
while ( my $read_length = read ( DATA, $buffer, $chunk ) ) {
    $buffer = $last4 . $buffer;
    print "got last-$last4\tbuffer-$buffer\n";
    while ($buffer =~ m/$find_re/g) {
        print "match at ", $pos - (length $last4) - $length + pos($buffer), "\n";
    }
    $last4 = substr $buffer, -$length, $length, '';
    $last4 = '' if $last4 =~ m/$find_re/;  # stop double match bug
    $pos += $read_length;
}
__DATA__
1234567890123456789012345678901234567890123456789012345678901234567890

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re^2: How do I search this binary file? by Aristotle (Chancellor) on Aug 21, 2002 at 01:47 UTC
Very good start, but I think some more can be squeezed out of it.

my $buffer;
my $delim = '1234';
my $delim_len = length $delim;
my $chunk_len = 65536 - $delim_len;
my $read_len = read $fh, $buffer, $delim_len;
my $pos = 0;
while ($read_len) {
    my $rel_pos = -1;
    print "delim at offset: ", $pos + $rel_pos
        while ($rel_pos = index $buffer, $delim, $rel_pos + 1) > -1;
    $buffer = substr $buffer, -$delim_len;
    $pos += $read_len;
    $read_len = read $fh, $buffer, $chunk_len, $delim_len;
}

A slower, but single pass alternative would be to shift the buffer back every time we find a match.

my $buffer = '';
my $delim = '1234';
my $chunk_len = 65536;
my $delim_len = length $delim;
read $fh, $buffer, $chunk_len, length $buffer;
my $rel_pos = -1;
while (length $buffer) {
    $rel_pos = index $buffer, $delim, $rel_pos + 1;
    if($rel_pos > -1) {
        do_checks_on(substr $buffer, 0, $rel_pos - 1);
        $buffer = substr $buffer, $rel_pos + $delim_len;
    }
    else {
        $buffer = substr $buffer, -$delim_len;
    }
    read $fh, $buffer, $chunk_len - length $buffer, length $buffer;
}

Warning: this is untested code. I don't see any glaring mistakes though. It pulls the match to the front of the buffer, refills the back of the buffer and then looks for where the next match is. If none is found, it takes another whole bufferfull bite out of the file.

Both of these snippets pay careful attention to always copy the last `$delim_len` bytes to the front of the buffer, reading the next load into the buffer at that offset, so any delimiters falling across the top boundary of the buffer are not a concern.

Makeshifts last the longest.
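Since the snippets above are flagged as untested, here is one way the index-based scan could look as a self-contained sketch. It is not Aristotle's code: the function name `find_offsets` is invented, it carries over only `length($delim) - 1` bytes instead of `$delim_len`, and it tracks the file offset of the buffer's first byte separately so reported positions are absolute. The in-memory filehandle is just a stand-in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offset of every occurrence of $delim
# in the stream $fh, reading $chunk_len bytes at a time.
sub find_offsets {
    my ($fh, $delim, $chunk_len) = @_;
    die "delimiter must not be empty" unless length $delim;
    my $keep   = length($delim) - 1;   # tail bytes that might begin a split match
    my $buffer = '';
    my $base   = 0;                    # file offset of the first byte in $buffer
    my @offsets;

    # 4-arg read appends the new chunk after the carried-over tail.
    while (read $fh, $buffer, $chunk_len, length $buffer) {
        my $rel = -1;
        push @offsets, $base + $rel
            while ($rel = index $buffer, $delim, $rel + 1) > -1;

        # Keep only the last $keep bytes. A whole delimiter cannot fit in
        # them, so no match is ever reported twice.
        my $drop = length($buffer) - $keep;
        if ($drop > 0) {
            substr $buffer, 0, $drop, '';
            $base += $drop;
        }
    }
    return @offsets;
}

# Usage with the sample data from the top post and its chunk size of 7.
my $data = '1234567890' x 7;
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
binmode $fh;
print join(',', find_offsets($fh, '1234', 7)), "\n";   # prints 0,10,20,30,40,50,60
```

Carrying one byte fewer than the delimiter length is what removes the need for any double-match check; a real run would use a chunk size nearer 2**20 than 7.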
Re: Re: How do I search this binary file? by John M. Dlugosz (Monsignor) on Aug 20, 2002 at 22:23 UTC
I don't understand your point. Why does double-scanning (and copying a megabyte) prevent "double matches"? Copying just the last 3 bytes (not 4!) to the beginning and then reading the next meg into the scalar after that would avoid the copy, and it can't match twice because the 4 bytes in the pattern are all different so matches can't overlap. Even without that restriction, the "junking" of previous found matches would take care of it.
Re: Re: Re: How do I search this binary file? by tachyon (Chancellor) on Aug 20, 2002 at 22:42 UTC
When you read from a disk you get a minimum of 512 bytes read (one sector) but in reality the disk reads and buffers a decent sized chunk (varies but ever wondered why disks have RAM?). Practical experimentation reveals an optimum read size (for a perl match type program) of 1-4MB as outlined in the Re: Performance Question link.

OK 3 bytes is fine if all the bytes are different and there is no overlap - you did not specify.

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print