John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

I want to find a chunk of data that's delimited on each side by a specific 4-byte sequence. This can be embedded anywhere in an arbitrary binary file.

My first idea is to set the $/ to this 4-byte sequence and then read using the normal <FILEHANDLE> syntax, examining each "record" until I see the one that looks like the one I want.

However, the random breaks in the file other than the two I want could be any length apart. I can imagine Perl getting bogged down reading gigabyte strings into contiguous memory while scanning for the record separator, only to have me throw the data away as uninteresting!

If I could set a maximum length and a separator, or had a way to skip records without reading them, that would be great. But I don't see those abilities in Perl's built-in stream support.

Any suggestions on a portable, interesting, instructive, and efficient way to accomplish this?

—John

Re: How do I search this binary file?
by tachyon (Chancellor) on Aug 20, 2002 at 19:45 UTC

    All you need to do is a little buffering like:

    my $last4  = '';
    my $buffer = '';
    my $find   = qr/1234/;
    my $chunk  = 7;    # 2**20 or some such is more logical!
    while ( read( DATA, $buffer, $chunk ) ) {
        print "got last-$last4\tbuffer-$buffer\n";
        print "simple match\n" while $buffer =~ s/$find//;
        $buffer = $last4 . $buffer;
        $last4  = substr $buffer, -4, 4, '';
        print "buffer match\n" if $buffer =~ m/$find/;
    }
    __DATA__
    1234567890123456789012345678901234567890123456789012345678901234567890
    __END__
    got last-      buffer-1234567
    simple match
    got last-567   buffer-8901234
    simple match
    got last-7890  buffer-5678901
    got last-8901  buffer-2345678
    buffer match
    got last-5678  buffer-9012345
    simple match
    got last-8905  buffer-6789012
    got last-9012  buffer-3456789
    buffer match
    got last-6789  buffer-0123456
    simple match
    got last-9056  buffer-7890123
    got last-0123  buffer-4567890
    buffer match
    got last-7890  buffer-

    In this example I use a chunk size of 7 so that the carry-over buffer gets exercised often, but 2**20 (one megabyte) chunks work well with big files (I wrote a node about processing big files fast at Re: Performance Question). You should be able to get around 4-8 MB/s throughput depending on your hardware.

    I'll leave it as an exercise for you as to what to do with the data between matches. Logging the positions and making a second pass through the file to get the data may well be most efficient.

    Note that if we match within the freshly read chunk, we s/// the match out before prepending the previously buffered bytes and retesting (this avoids a double match) - replace with a filler string of the same length if you need to preserve offsets. There are 7 matches available and 7 found. A dozen lines with debugging - you gotta love Perl.

    Update

    Here is a version that records the positions of matches for you

    my $last4   = '';
    my $buffer  = '';
    my $find    = '1234';
    my $find_re = qr/$find/;
    my $length  = length $find;
    my $chunk   = 7;    # 2**20 or some such is more logical!
    my $pos     = 0;
    while ( my $read_length = read( DATA, $buffer, $chunk ) ) {
        $buffer = $last4 . $buffer;
        print "got last-$last4\tbuffer-$buffer\n";
        while ( $buffer =~ m/$find_re/g ) {
            print "match at ", $pos - (length $last4) - $length + pos $buffer, "\n";
        }
        $last4 = substr $buffer, -$length, $length, '';
        $last4 = '' if $last4 =~ m/$find_re/;    # stop double match bug
        $pos += $read_length;
    }
    __DATA__
    1234567890123456789012345678901234567890123456789012345678901234567890

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Very good start, but I think some more can be squeezed out of it.
      my $buffer;
      my $delim     = '1234';
      my $delim_len = length $delim;
      my $chunk_len = 65536 - $delim_len;
      my $pos       = 0;    # file offset of the first byte in $buffer
      my $read_len  = read $fh, $buffer, $delim_len;    # prime the carry-over
      while ( $read_len = read $fh, $buffer, $chunk_len, $delim_len ) {
          my $rel_pos = -1;
          print "delim at offset: ", $pos + $rel_pos, "\n"
              while ( $rel_pos = index $buffer, $delim, $rel_pos + 1 ) > -1;
          # NB: a delimiter ending flush with the buffer is found again after
          # the carry; trim or dedup the report if that matters
          $buffer = substr $buffer, -$delim_len;
          $pos   += $read_len;
      }
      A slower, but single pass alternative would be to shift the buffer back every time we find a match.
      my $buffer    = '';
      my $delim     = '1234';
      my $chunk_len = 65536;
      my $delim_len = length $delim;
      my $eof       = !read $fh, $buffer, $chunk_len;
      while ( length $buffer >= $delim_len ) {
          my $rel_pos = index $buffer, $delim;
          if ( $rel_pos > -1 ) {
              do_checks_on( substr $buffer, 0, $rel_pos );
              $buffer = substr $buffer, $rel_pos + $delim_len;
          }
          else {
              last if $eof;    # no match left and nothing more to read
              $buffer = substr $buffer, -$delim_len;
          }
          $eof = !read $fh, $buffer, $chunk_len - length $buffer, length $buffer;
      }

      Warning: this is untested code. I don't see any glaring mistakes though. It pulls the data after a match to the front of the buffer, refills the back of the buffer, and then looks for the next match. If none is found, it takes another whole bufferful bite out of the file.

      Both of these snippets pay careful attention to always copy the last $delim_len bytes to the front of the buffer, reading the next load into the buffer at that offset, so any delimiters falling across the top boundary of the buffer are not a concern.

      Makeshifts last the longest.

      I don't understand your point. Why does double-scanning (and copying a megabyte) prevent "double matches"? Copying just the last 3 bytes (not 4!) to the beginning and then reading the next meg into the scalar after that would avoid the copy, and it can't match twice because the 4 bytes in the pattern are all different so matches can't overlap. Even without that restriction, the "junking" of previous found matches would take care of it.

        When you read from a disk you get a minimum of 512 bytes (one sector), but in reality the disk reads and buffers a decent-sized chunk (it varies - ever wondered why disks have RAM?). Practical experimentation reveals an optimum read size (for a Perl match-type program) of 1-4 MB, as outlined in the Re: Performance Question link.

        OK 3 bytes is fine if all the bytes are different and there is no overlap - you did not specify.

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: How do I search this binary file?
by bluto (Curate) on Aug 20, 2002 at 19:20 UTC
    The only way I can think of doing this *scalably* wrt memory seems pretty boring (i.e. you have to read in fixed sized blocks or something similar).

    Could you just use sysopen/sysread to read blocks of data in and then remember the offsets of the 4-byte sequences? You could then go back and seek directly to the remembered offset and sysread the data in, repeating as needed.

    You do have to worry about the case where the 4-bytes overlapped the end of a block. You could handle this by always reading ahead 1 block and then copying 3 bytes from the beginning of the next block to the end of the one you are working on before you check for the 4-byte sequence. (I've written too much C to be able to come up with anything more elegant.)
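    A minimal, untested sketch of that scheme - the mark, data, and block size are stand-ins, and the in-memory filehandle just keeps the demo self-contained:

```perl
use strict;
use warnings;

# bluto's suggestion: read fixed-size blocks, glue the last 3 bytes of
# each block onto the front of the next, and record the absolute
# offsets of every occurrence of the 4-byte sequence.
my $mark = 'MARK';
my $data = 'aaMARKbbbbbbMARKcc';        # second mark straddles a block
open my $fh, '<', \$data or die $!;     # in-memory handle for the demo
binmode $fh;

my $blocksize = 8;
my ( $tail, $base, @offsets ) = ( '', 0 );
while ( my $n = read $fh, my $block, $blocksize ) {
    my $buf = $tail . $block;
    my $p   = -1;
    while ( ( $p = index $buf, $mark, $p + 1 ) > -1 ) {
        push @offsets, $base - length($tail) + $p;
    }
    $tail  = substr $buf, -( length($mark) - 1 );    # 3-byte overlap
    $base += $n;
}
print "@offsets\n";    # marks start at these file offsets
```

    With the offsets recorded, a second pass can seek straight to each one and sysread only the chunks worth keeping.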

    bluto

Re: How do I search this binary file?
by derby (Abbot) on Aug 20, 2002 at 19:29 UTC
    Or you could depend on the buffering power of the IO libraries and inspect your data four bytes at a time, rewinding three bytes if no match is found. Not as fast as fixed-size records, but not as bad as you probably think it is.

    #!/usr/bin/perl -wd
    $file = shift @ARGV || die "$0 <filename>\n";
    open( FH, "<$file" ) or die "cannot open file $!\n";

    #MARK
    $in_read = 0;
    while ( read( FH, $four_bytes, 4 ) > 0 ) {
        last if ( length($four_bytes) != 4 );    # EOF?
        if ( $four_bytes eq "MARK" ) {
            print "buffer was ", $buffer, "\n" if $in_read;
            $in_read = $in_read ? 0 : 1;
            $buffer  = "";
        }
        else {
            $buffer .= substr( $four_bytes, 0, 1 ) if $in_read;
            # back up three bytes
            $pos = tell(FH) - 3;
            seek( FH, $pos, 0 );
        }
    }
    # catch any trailing?
    $buffer .= $four_bytes if $in_read;
    # do something with the last buffer
    print "dangling buffer: ", $buffer, "\n" if $buffer;

    -derby

      Reading 4 bytes at a time is incredibly inefficient and slow. See my answer below for a solution and Re: Performance Question for an explanation

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: How do I search this binary file?
by Zaxo (Archbishop) on Aug 20, 2002 at 23:25 UTC

    You've gotten plenty of good suggestions for the processing. I have one to add on the size of chunk to read. In list context, the stat call returns a preferred block size at index eleven.

    sub strings_from_bin {
        my ($file, $dlim) = @_;
        local *BIN;
        open BIN, "< " . $file or die $!;
        binmode BIN;
        local $/ = \(stat BIN)[11];
        my @strings;
        # ... populate @strings from <BIN>
        close BIN or die $!;
        @strings;
    }

    Update: I believe that the block size on fat filesystems can vary, but I have been unable to track down a way of reading it. I seem to recall that 4k and 32k were popular. Can anyone more familiar with win32 help?

    After Compline,
    Zaxo

      I've never noticed those fields before.

      Hmm, on Win32 stat only returned 11 elements, not 13.

      Found "Platforms that do not have rdev, blksize, or blocks will return these as '', so numeric comparison or manipulation of these fields may cause 'not numeric' warnings."

      So, it should really be something like: $blocksize= (stat BIN)[11] || 2048;

Re: How do I search this binary file?
by kschwab (Vicar) on Aug 20, 2002 at 19:42 UTC
    How about a find/rewind approach ? Go ahead and set $/ to your delimiter, but don't assign to a variable as you read. (for the sake of conserving memory on huge records)

    Once you have the record boundaries, then you could seek() back and process it however you need to.

    This is making the assumption that the bare <FILE> construct isn't using memory. I'm not clear on whether that's true or not.

    #!/usr/bin/perl
    my $delim = "asdf";
    $/ = $delim;
    open( FILE, "/some/big/file" );
    my $lastoffset = 0;
    for (;;) {
        <FILE>;
        my $offset = tell(FILE);
        last if eof(FILE);
        print "this record spans [$lastoffset] to [$offset], including ";
        print "the delimiter\n";
        print " this means you could seek to $lastoffset and read for ";
        printf "%d bytes to get the record\n",
            $offset - $lastoffset - length($delim);
        $lastoffset = $offset;
    }
      If <HANDLE> in void context functioned as a scan rather than a read, that would be terrific and elegant.
        It is a read, but in void context, it doesn't appear to be using excessive memory ( I suppose since it's not filling $_ or any other variable ).

        I did try this on a fairly large file (with large "records"), and didn't notice a significant growth in memory footprint compared to a smaller file.

        Did you have a different experience ? Update: Tried it on a really large file, and sure enough, it eats up VM :) Never mind...sounds like asking the gods to map <HANDLE> in a void context into a scan might be a good idea though...

Re: How do I search this binary file?
by talexb (Chancellor) on Aug 20, 2002 at 19:46 UTC
    If I were faced with this challenge, I would re-use a technique from my days as a C programmer, and read a chunk of N bytes into a buffer from the file. I'd search my way through the buffer, stopping when I went past the half way point.

    I'd then shift the buffer and where I'm searching in the buffer down by N/2 bytes, read another N/2 bytes into the now free half of the buffer, and continue until the pattern was found or the file is exhausted.

    I can't provide any working code at this time (paying work intervenes) but that should get you going.
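    Something along these lines, perhaps - a rough, untested sketch with placeholder pattern, data, and buffer size:

```perl
use strict;
use warnings;

# Sliding half-buffer: search the buffer, then keep the back half,
# shift it to the front, and refill the freed half from the file.
my $pat  = 'MARK';
my $data = ( 'x' x 20 ) . 'MARK' . ( 'y' x 20 );
open my $fh, '<', \$data or die $!;    # in-memory handle for the demo
binmode $fh;

my $N     = 16;          # buffer size (make this much larger in practice)
my $base  = 0;           # file offset of $buffer's first byte
my $found = -1;
read $fh, my $buffer, $N;
while ( length $buffer ) {
    my $p = index $buffer, $pat;
    if ( $p > -1 ) {
        $found = $base + $p;
        last;
    }
    my $half = int( $N / 2 );
    last if length($buffer) <= $half;    # file exhausted
    $base  += length($buffer) - $half;   # slide down by N/2
    $buffer = substr $buffer, -$half;
    read $fh, $buffer, $half, length $buffer;
}
print $found > -1 ? "found at offset $found\n" : "not found\n";
```

    Keeping half the buffer on each slide means a match can never be split across a refill, as long as N/2 is at least the pattern length.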

    --t. alex
    but my friends call me T.

Re: How do I search this binary file?
by dpuu (Chaplain) on Aug 20, 2002 at 20:04 UTC
    An alternative approach is the old-fashioned state machine:
    1. set $/ to the first character of the separator (this is less likely to use gigabytes than looking for all 4). Read and discard a block.
    2. now read the next character: if it's the first char of the separator then repeat; if it's not the second, then goto step 1
    3. now read the next: if it's not the third, then goto step 1
    4. read the next: if it's the fourth, then you've found the start of your block. Otherwise, goto 1
    Now repeat a similar state machine for the end separator (the only difference is that you don't discard anything while you're in the middle of your block).
    --Dave
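    A byte-at-a-time rendering of that state machine might look like this (illustrative only - getc goes through stdio buffering, so it is slower than block reads but not as bad as it looks; the separator and data are stand-ins):

```perl
use strict;
use warnings;

# $state counts how many separator bytes have matched so far.  The
# first full separator records where the payload starts; the second
# one triggers a seek back and a read of just the payload.
my $sep  = 'MARK';
my $data = 'junkMARKpayloadMARKtrailing';
open my $fh, '<', \$data or die $!;    # in-memory handle for the demo
binmode $fh;

my @sep   = split //, $sep;
my $state = 0;
my $start = -1;    # file offset just past the opening separator
my $payload;
while ( defined( my $c = getc $fh ) ) {
    if ( $c eq $sep[$state] ) {
        next unless ++$state == @sep;    # full separator matched?
        if ( $start < 0 ) {
            $start = tell $fh;           # opening cap found
        }
        else {                           # closing cap found
            my $len = tell($fh) - length($sep) - $start;
            seek $fh, $start, 0;
            read $fh, $payload, $len;
            last;
        }
        $state = 0;
    }
    else {
        $state = ( $c eq $sep[0] ) ? 1 : 0;    # restart the match
    }
}
print defined $payload ? "payload: $payload\n" : "no block found\n";
```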
Re: How do I search this binary file?
by sauoq (Abbot) on Aug 20, 2002 at 19:02 UTC

    Do you have any idea how long your delimited chunk of data will be? If not, how about maximum and/or minimum lengths for it?

    Once you find the chunk you are searching for, how will you differentiate it from any other delimited chunk included in this "arbitrary binary file?" I ask because if you have some criteria by which you can evaluate it as being the one you want, that might help you in the actual search.

    Update: It occurred to me that since you seem to have a maximum size buffer in mind anyway, the best approach might be to simply read in a buffer of that size, search it with index for your delimiter, discard everything up to the delimiter, and use read to fill the buffer back up to the max size. That's more C-ish than Perlish I suppose but does it matter?

    Update 2: bluto's point below about the delimiter crossing the block boundary is very well-taken. That would have been a nasty bug in the method I described.
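    A rough sketch of that read/index/refill loop, with the boundary case handled by keeping a 3-byte tail (the delimiter, data, and buffer size here are stand-ins):

```perl
use strict;
use warnings;

# Fill a fixed-size buffer, index() for the delimiter, discard all but
# the last (delim length - 1) bytes on a miss, and refill to max size.
my $delim = 'MARK';
my $data  = ( 'a' x 14 ) . 'MARK' . ( 'b' x 5 );   # delim straddles reads
open my $fh, '<', \$data or die $!;    # in-memory handle for the demo
binmode $fh;

my $max   = 16;    # maximum buffer size
my $base  = 0;     # file offset of the buffer's first byte
my $found = -1;
read $fh, my $buf, $max;
while ( length $buf >= length $delim ) {
    my $p = index $buf, $delim;
    if ( $p > -1 ) { $found = $base + $p; last }
    # keep only the last 3 bytes, then fill back up to $max
    $base += length($buf) - ( length($delim) - 1 );
    $buf   = substr $buf, -( length($delim) - 1 );
    last unless read $fh, $buf, $max - length $buf, length $buf;
}
print $found > -1 ? "delimiter at offset $found\n" : "not found\n";
```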

    -sauoq
    "My two cents aren't worth a dime.";
    
      re: Do you have any idea how long your delimited chunk of data will be? If not, how about maximum and/or minimum lengths for it?

      The minimum length (of what's between the two 4-byte endcaps) is only a couple of bytes; I don't have my notes, but it's a very small number, maybe 5. The largest possible value is 128 bytes greater than that.

      re: Once you find the chunk you are searching for, how will you differentiate it from any other delimited chunk included in this "arbitrary binary file?" I ask because if you have some criteria by which you can evaluate it as being the one you want, that might help you in the actual search.

      OK, I'll go into more detail. Between the two endcaps there will be another sequence of 3 or 4 specific bytes. Before that mark I expect 0-128 bytes of legal UTF-8 text. After that, the data also has internal format consistencies. Furthermore, the byte following the ending 4-byte delimiter is a hash "checksum" of the data block.

      I consider it a valid "hit" if the middle mark is present, there are no FE or FF bytes before that mark, and the checksum checks out. (After I accept it, I decide if the other data values are any good.)
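      That validity test could be sketched as follows - the middle mark and the checksum routine here are hypothetical stand-ins for the real format:

```perl
use strict;
use warnings;

# Stand-in checksum: XOR of all bytes.  The real one is whatever the
# format actually specifies.
sub my_checksum {
    my $sum = 0;
    $sum ^= ord $_ for split //, $_[0];
    return $sum;
}

# $chunk is the bytes between the two 4-byte endcaps; $check_byte is
# the byte that follows the closing endcap in the file.
sub looks_valid {
    my ( $chunk, $check_byte ) = @_;
    my $mid = "\xAB\xCD\xEF";    # placeholder for the 3-4 byte mid-mark
    my $at  = index $chunk, $mid;
    return 0 if $at < 0;         # middle mark must be present
    return 0 if $at > 128;       # at most 128 bytes of text before it
    return 0 if substr( $chunk, 0, $at ) =~ /[\xFE\xFF]/;  # no FE/FF bytes
    return my_checksum($chunk) == $check_byte;
}
```

      With the real checksum plugged in, each candidate found by any of the scanning loops above can be fed through looks_valid() before being accepted.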

      Thanks;
      —John

        I think I would read the file in one block at a time using the blocksize returned by stat as suggested by Zaxo in a post below (Zaxo++) as long as it was larger than your max chunk size (I can't imagine it wouldn't be.)

        I'd use a regular expression to search for the whole chunk. If found, great; process it. If not, I'd keep the last 150 or so bytes (one byte less than your max "chunk" size would do it) and use a four-argument read to append the next block to that leftover. Then search again... etc. etc.

        I don't know how well this approach would do next to some of the other suggestions. It has the advantage of looking for the whole chunk at once and using the regex engine to do it. Presumably that will be pretty quick. It has the disadvantage that you'll be searching through some fraction of the file twice. If you search the whole file, the number of bytes you'd search through twice would be approximately equal to the max chunk size times the size of the file in blocks. Given that, you might be able to improve it by increasing the size of the block you read. If you keep it a multiple of the preferred size it shouldn't hurt anything.
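        A sketch of that regex-per-block idea (the endcap, sizes, and data are stand-ins; per the numbers above, at most 133 bytes sit between the endcaps):

```perl
use strict;
use warnings;

# Search each block for a whole delimited chunk; on a miss, carry the
# last (max chunk size - 1) bytes forward so a chunk split across two
# blocks turns up on the next pass.
my $cap       = 'MARK';    # stand-in 4-byte endcap
my $max_inner = 133;       # max bytes between the endcaps
my $data      = ( 'x' x 200 ) . 'MARKhelloMARK' . ( 'y' x 50 );
open my $fh, '<', \$data or die $!;    # in-memory handle for the demo
binmode $fh;

my $carry = 2 * length($cap) + $max_inner - 1;    # max chunk size - 1
my $re    = qr/\Q$cap\E(.{1,$max_inner}?)\Q$cap\E/s;
my ( $buf, $hit ) = ( '', undef );
while ( read $fh, $buf, 4096 - length($buf), length $buf ) {
    if ( $buf =~ $re ) {
        $hit = $1;
        last;
    }
    $buf = substr $buf, -$carry if length($buf) > $carry;
}
print defined $hit ? "chunk payload: $hit\n" : "no chunk found\n";
```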

        -sauoq
        "My two cents aren't worth a dime.";