in reply to How do I search this binary file?

All you need to do is a little buffering like:

my $last4  = '';
my $buffer = '';
my $find   = qr/1234/;
my $chunk  = 7;    # 2**20 or some such is more logical!
while ( read( DATA, $buffer, $chunk ) ) {
    print "got last-$last4\tbuffer-$buffer\n";
    # matches that lie wholly inside the fresh chunk are removed here
    print "simple match\n" while $buffer =~ s/$find//;
    # prepend the saved tail, then save a new tail for the next pass
    $buffer = $last4 . $buffer;
    $last4  = substr $buffer, -4, 4, '';
    print "buffer match\n" if $buffer =~ m/$find/;
}
__DATA__
1234567890123456789012345678901234567890123456789012345678901234567890
__END__
got last- buffer-1234567
simple match
got last-567 buffer-8901234
simple match
got last-7890 buffer-5678901
got last-8901 buffer-2345678
buffer match
got last-5678 buffer-9012345
simple match
got last-8905 buffer-6789012
got last-9012 buffer-3456789
buffer match
got last-6789 buffer-0123456
simple match
got last-9056 buffer-7890123
got last-0123 buffer-4567890
buffer match
got last-7890 buffer-

In this example I use a chunk size of 7 so that we need the buffer often but 2**20 (megabyte) chunks work effectively with big files (I wrote a node about processing big files fast at Re: Performance Question). You should be able to get around a 4-8MB/s throughput depending on your hardware.

I'll leave it as an exercise for you as to what to do with the data between matches. Logging the positions and making a second pass through the file to get the data may well be most efficient.
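As a rough illustration of the second-pass idea, here is a minimal sketch. The helper name extract_between and the in-memory demo file are assumptions for illustration only; it presumes the first pass logged a sorted list of match offsets and that the filehandle is seekable.

```perl
use strict;
use warnings;

# Hypothetical helper: given a seekable filehandle, the delimiter length and a
# sorted list of match offsets from the first pass, return the data that lies
# between consecutive delimiters.
sub extract_between {
    my ( $fh, $delim_len, @positions ) = @_;
    my @records;
    for my $i ( 0 .. $#positions - 1 ) {
        my $start = $positions[$i] + $delim_len;      # first byte after this delimiter
        my $len   = $positions[ $i + 1 ] - $start;    # bytes up to the next delimiter
        next if $len <= 0;                            # adjacent delimiters: nothing between
        seek( $fh, $start, 0 ) or die "seek failed: $!";
        my $got = read( $fh, my $record, $len );
        die "short read at offset $start" unless defined $got && $got == $len;
        push @records, $record;
    }
    return @records;
}

# Demo on an in-memory "file" holding the sample data from this thread.
my $data = '1234567890' x 7;
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
binmode $fh;
my @positions = map { $_ * 10 } 0 .. 6;    # offsets a first pass would log
print "$_\n" for extract_between( $fh, 4, @positions );
```

Data after the last logged delimiter is not returned by this sketch; whether that tail matters depends on the file format.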

Note that if we match in the freshly read buffer, we s/// the match out before we prepend the previously buffered tail and retest (this avoids a double match). Replace the match with a filler string instead if you need to maintain length. There are 7 matches available and 7 found. A dozen lines with debugging - you gotta love Perl.
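The filler idea might look something like the sketch below. The helper name blank_matches and the choice of filler byte are assumptions, not part of the code above; it also assumes the filler byte cannot itself form part of a match.

```perl
use strict;
use warnings;

# Hypothetical helper: overwrite every literal occurrence of $find in $buffer
# with a same-length filler, so offsets into the buffer stay valid.
# Assumes the filler byte ("\0" by default) cannot itself be part of a match.
sub blank_matches {
    my ( $buffer, $find, $fill_byte ) = @_;
    $fill_byte = "\0" unless defined $fill_byte;
    my $filler = $fill_byte x length($find);
    # s///g returns the number of substitutions, or false if there were none
    my $count = ( $buffer =~ s/\Q$find\E/$filler/g ) || 0;
    return ( $buffer, $count );
}

my ( $blanked, $count ) = blank_matches( '8901234567', '1234', '#' );
print "$count match(es): $blanked\n";
```

Because the buffer keeps its length, any position arithmetic done after the substitution still lines up with the file.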

Update

Here is a version that records the positions of matches for you

my $last4   = '';
my $buffer  = '';
my $find    = '1234';
my $find_re = qr/$find/;
my $length  = length $find;
my $chunk   = 7;    # 2**20 or some such is more logical!
my $pos     = 0;
while ( my $read_length = read( DATA, $buffer, $chunk ) ) {
    $buffer = $last4 . $buffer;
    print "got last-$last4\tbuffer-$buffer\n";
    while ( $buffer =~ m/$find_re/g ) {
        # pos() is the end of the match within $buffer
        print "match at ", $pos - ( length $last4 ) - $length + pos($buffer), "\n";
    }
    $last4 = substr $buffer, -$length, $length, '';
    $last4 = '' if $last4 =~ m/$find_re/;    # stop double match bug
    $pos += $read_length;
}
__DATA__
1234567890123456789012345678901234567890123456789012345678901234567890

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re^2: How do I search this binary file?
by Aristotle (Chancellor) on Aug 21, 2002 at 01:47 UTC
    Very good start, but I think some more can be squeezed out of it.
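    my $buffer    = '';
    my $delim     = '1234';
    my $delim_len = length $delim;
    my $chunk_len = 65536 - $delim_len;
    my $pos       = 0;    # file offset of the first byte held in $buffer
    my $carried   = 0;    # bytes kept from the previous read
    # $fh is assumed to be an already opened filehandle in binmode
    while ( read( $fh, $buffer, $chunk_len, $carried ) ) {
        # a delimiter filling the whole carried tail was already reported
        my $rel_pos = $carried >= $delim_len ? 0 : -1;
        print "delim at offset: ", $pos + $rel_pos, "\n"
            while ( $rel_pos = index( $buffer, $delim, $rel_pos + 1 ) ) > -1;
        my $drop = length($buffer) - $delim_len;
        if ( $drop > 0 ) {
            $buffer = substr $buffer, $drop;    # keep the last $delim_len bytes
            $pos += $drop;
        }
        $carried = length $buffer;
    }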
    A slower, but single pass alternative would be to shift the buffer back every time we find a match.
    my $buffer    = '';
    my $delim     = '1234';
    my $chunk_len = 65536;
    my $delim_len = length $delim;
    # $fh is assumed to be an already opened filehandle in binmode
    my $eof = !read( $fh, $buffer, $chunk_len );
    while ( length $buffer ) {
        my $rel_pos = index $buffer, $delim;    # buffer was shifted, so search from 0
        if ( $rel_pos > -1 ) {
            do_checks_on( substr $buffer, 0, $rel_pos );    # data before the delimiter
            $buffer = substr $buffer, $rel_pos + $delim_len;
        }
        else {
            last if $eof;    # no delimiter and nothing left to read
            $buffer = substr $buffer, -$delim_len if length $buffer > $delim_len;
        }
        $eof = !read( $fh, $buffer, $chunk_len - length($buffer), length $buffer )
            unless $eof;
    }

    Warning: this is untested code. I don't see any glaring mistakes though. It pulls the match to the front of the buffer, refills the back of the buffer and then looks for where the next match is. If none is found, it takes another whole bufferful bite out of the file.

    Both of these snippets pay careful attention to always copy the last $delim_len bytes to the front of the buffer, reading the next load into the buffer at that offset, so any delimiters falling across the top boundary of the buffer are not a concern.

    Makeshifts last the longest.

Re: Re: How do I search this binary file?
by John M. Dlugosz (Monsignor) on Aug 20, 2002 at 22:23 UTC
    I don't understand your point. Why does double-scanning (and copying a megabyte) prevent "double matches"? Copying just the last 3 bytes (not 4!) to the beginning and then reading the next meg into the scalar after that would avoid the copy, and it can't match twice because the 4 bytes in the pattern are all different so matches can't overlap. Even without that restriction, the "junking" of previous found matches would take care of it.
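A minimal sketch of that suggestion, for illustration only: the function name find_offsets and the in-memory demo file are assumptions. It carries over length($delim) - 1 bytes, reads the next chunk in after them with four-argument read, and searches with index.

```perl
use strict;
use warnings;

# Hypothetical helper: return the file offsets of every occurrence of $delim,
# reading $chunk_len bytes at a time and carrying over length($delim) - 1 bytes.
sub find_offsets {
    my ( $fh, $delim, $chunk_len ) = @_;
    my $keep   = length($delim) - 1;    # longest tail that cannot hold a full match
    my $buffer = '';
    my $base   = 0;                     # file offset of the first byte in $buffer
    my @offsets;
    # Four-argument read appends the new chunk after the carried-over tail.
    while ( read( $fh, $buffer, $chunk_len, length $buffer ) ) {
        my $at = -1;
        push @offsets, $base + $at
            while ( $at = index( $buffer, $delim, $at + 1 ) ) > -1;
        my $drop = length($buffer) - $keep;
        if ( $drop > 0 ) {
            $base += $drop;                     # buffer now starts $drop bytes later
            $buffer = substr $buffer, $drop;    # keep only the tail
        }
    }
    return @offsets;
}

# Demo on an in-memory "file" holding the sample data from this thread.
my $data = '1234567890' x 7;
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
binmode $fh;
print join( ',', find_offsets( $fh, '1234', 7 ) ), "\n";
```

Since the carried tail is one byte shorter than the pattern, a match already reported can never reappear whole in the next buffer, so no junking of found matches is needed. Stepping index forward by one byte also means self-overlapping patterns are still found.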

      When you read from a disk you get a minimum of 512 bytes (one sector), but in reality the disk reads and buffers a decent-sized chunk (it varies, but ever wondered why disks have RAM?). Practical experimentation reveals an optimum read size (for a Perl match-type program) of 1-4MB, as outlined in the Re: Performance Question link.

      OK 3 bytes is fine if all the bytes are different and there is no overlap - you did not specify.

      cheers

      tachyon
