in reply to How do I search this binary file?

Do you have any idea how long your delimited chunk of data will be? If not, how about maximum and/or minimum lengths for it?

Once you find the chunk you are searching for, how will you differentiate it from any other delimited chunk included in this "arbitrary binary file?" I ask because if you have some criteria by which you can evaluate it as being the one you want, that might help you in the actual search.

Update: It occurred to me that since you seem to have a maximum buffer size in mind anyway, the best approach might be to simply read in a buffer of that size, search it with index for your delimiter, discard everything up to the delimiter, and use read to fill the buffer back up to the max size. That's more C-ish than Perlish, I suppose, but does it matter?

Update 2: bluto's point below about the delimiter crossing the block boundary is very well-taken. That would have been a nasty bug in the method I described.
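
For what it's worth, here is a minimal sketch of that read-search-refill loop with the boundary problem patched: before refilling, it keeps the last len(delim) - 1 bytes so a delimiter split across two reads is still seen whole. It is written in Python purely for illustration (bytes.find stands in for index, and stream.read for read). The names find_delimiter and bufsize and the sample delimiter are assumptions, not anything from the original question.

```python
import io

def find_delimiter(stream, delim, bufsize=4096):
    """Return the absolute offset of the first occurrence of delim, or -1."""
    if not delim:
        raise ValueError("delimiter must not be empty")
    keep = len(delim) - 1   # longest tail that could begin a split delimiter
    buf = b""
    base = 0                # absolute offset of buf[0] within the stream
    while True:
        block = stream.read(bufsize)
        if not block:
            return -1       # end of input, delimiter never seen
        buf += block
        pos = buf.find(delim)
        if pos != -1:
            return base + pos
        # Discard everything except the tail that might start a delimiter.
        drop = max(len(buf) - keep, 0)
        base += drop
        buf = buf[drop:]

# Hypothetical delimiter that straddles the 12-byte read boundary.
delim = b"\xde\xad\xbe\xef"
data = b"x" * 10 + delim + b"y" * 5
offset = find_delimiter(io.BytesIO(data), delim, bufsize=12)
```

Without the kept tail, the first read would end in the middle of the delimiter and the search would miss it.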

-sauoq
"My two cents aren't worth a dime.";

Re: Re: How do I search this binary file?
by John M. Dlugosz (Monsignor) on Aug 20, 2002 at 22:07 UTC
    re: Do you have any idea how long your delimited chunk of data will be? If not, how about maximum and/or minimum lengths for it?

    The minimum length (of what's between the two 4-byte endcaps) is only a couple of bytes. I don't have my notes, but it's a very small number, maybe 5. The largest possible value is 128 bytes greater than that.

    re: Once you find the chunk you are searching for, how will you differentiate it from any other delimited chunk included in this "arbitrary binary file?" I ask because if you have some criteria by which you can evaluate it as being the one you want, that might help you in the actual search.

    OK, I'll go into more detail. Between the two endcaps, there will be another sequence of 3 or 4 specific bytes. Before that mark I expect 0-128 bytes of legal UTF-8 text. After that, the data also has internal format consistencies. Furthermore, the byte following the ending 4-byte delimiter is a hash "checksum" of the data block.

    I consider it a valid "hit" if the middle mark is present, there are no FE or FF bytes before that mark, and the checksum checks. (After I accept it, I decide if the other data values are any good.)
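
    Those three tests could be sketched roughly as below (in Python, for illustration only). The post gives neither the bytes of the middle mark nor the checksum algorithm, so MID and toy_checksum here are placeholders to be replaced with the real ones, and is_valid_hit is a hypothetical name. The FE/FF test works as a cheap filter because those two byte values never occur in valid UTF-8.

```python
MID = b"\x1e\x1f\x1d"   # placeholder middle mark; the real bytes are not given

def toy_checksum(body: bytes) -> int:
    """Placeholder one-byte checksum (sum mod 256); NOT the real algorithm."""
    return sum(body) & 0xFF

def is_valid_hit(body: bytes, check_byte: int, mid=MID, checksum=toy_checksum) -> bool:
    """Apply the three acceptance tests to the bytes between the endcaps."""
    pos = body.find(mid)
    if pos == -1:
        return False                      # middle mark missing
    text = body[:pos]
    if b"\xfe" in text or b"\xff" in text:
        return False                      # cannot be legal UTF-8 text
    return checksum(body) == check_byte   # byte after the ending delimiter

body = b"hello" + MID + b"\x10\x20"
good = is_valid_hit(body, toy_checksum(body))
```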

    Thanks;
    —John

      I think I would read the file in one block at a time using the blocksize returned by stat as suggested by Zaxo in a post below (Zaxo++) as long as it was larger than your max chunk size (I can't imagine it wouldn't be.)

      I'd use a regular expression to search for the whole chunk. If found, great; process it. If not, I'd keep the last 150 or so bytes of the buffer (one byte less than your max "chunk" size would do it) and use a four-argument read to append the next block to that leftover. Then search again, and so on.

      I don't know how well this approach would do next to some of the other suggestions. It has the advantage of looking for the whole chunk at once and using the regex engine to do it. Presumably that will be pretty quick. It has the disadvantage that you'll be searching through some fraction of the file twice. If you search the whole file, the number of bytes you'd search through twice would be approximately equal to the max chunk size times the size of the file in blocks. Given that, you might be able to improve it by increasing the size of the block you read. If you keep it a multiple of the preferred size it shouldn't hurt anything.
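
      A minimal sketch of that loop, in Python for illustration (appending to the leftover plays the role of Perl's four-argument read). The endcap bytes, the 5..133 body length range, and the name find_chunks are all assumptions based on the numbers mentioned earlier in the thread, not the real format.

```python
import io
import re

START = b"\xaa\xbb\xcc\xdd"      # hypothetical 4-byte begin endcap
END = b"\x11\x22\x33\x44"        # hypothetical 4-byte end endcap
MIN_BODY, MAX_BODY = 5, 133      # assumed: minimum 5, maximum 128 more
MAX_CHUNK = len(START) + MAX_BODY + len(END) + 1   # +1 for the checksum byte

# Whole chunk in one pattern: begin endcap, body, end endcap, checksum byte.
CHUNK_RE = re.compile(
    re.escape(START) + b"(.{%d,%d}?)" % (MIN_BODY, MAX_BODY)
    + re.escape(END) + b"(.)",
    re.DOTALL,
)

def find_chunks(stream, blocksize=4096):
    """Yield (body, checksum_byte) for every whole chunk in the stream."""
    buf = b""
    while True:
        block = stream.read(blocksize)
        if not block:
            break
        buf += block                       # append new block to the leftover
        last_end = 0
        for m in CHUNK_RE.finditer(buf):
            yield m.group(1), m.group(2)[0]
            last_end = m.end()
        # Keep at most MAX_CHUNK - 1 trailing bytes, and never re-scan
        # bytes that already belonged to a reported chunk.
        buf = buf[max(last_end, len(buf) - (MAX_CHUNK - 1)):]

# A chunk that straddles the first 4096-byte block boundary.
data = b"\x00" * 4090 + START + b"hello world" + END + b"\x7f" + b"\x00" * 50
hits = list(find_chunks(io.BytesIO(data)))
```

      The leftover is what gets searched twice: at most MAX_CHUNK - 1 bytes per block read, which is why a larger block size reduces the overhead.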

      -sauoq
      "My two cents aren't worth a dime.";
      
        I can see the benefit of that approach, in that the logic is simple and easy to write correctly.

        It can be further optimized by only overlapping a possible partial match: if the begin delimiter is present towards the end of the buffer, copy from there through the end. Otherwise, don't bother carrying anything over. A single regex can capture the begin marker and optionally match the remainder.
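
        One way that might look, sketched in Python with the same made-up endcap bytes and length limits as assumptions: the pattern always matches the begin marker and makes the rest of the chunk optional, so a match with an empty capture marks a candidate that may still be incomplete. The scan function and its return shape are hypothetical.

```python
import re

START = b"\xaa\xbb\xcc\xdd"      # hypothetical begin endcap
END = b"\x11\x22\x33\x44"        # hypothetical end endcap
MIN_BODY, MAX_BODY = 5, 133      # assumed body length limits
MAX_CHUNK = len(START) + MAX_BODY + len(END) + 1

# Begin marker is required; body, end marker and checksum byte are optional.
PROBE_RE = re.compile(
    re.escape(START) + b"(?:(.{%d,%d}?)" % (MIN_BODY, MAX_BODY)
    + re.escape(END) + b"(.))?",
    re.DOTALL,
)

def scan(buf: bytes):
    """Return (hits, tail): whole chunks found, and the bytes worth keeping."""
    hits = []
    pos = 0
    while True:
        m = PROBE_RE.search(buf, pos)
        if m is None:
            break
        if m.group(1) is not None:          # whole chunk matched
            hits.append((m.group(1), m.group(2)[0]))
            pos = m.end()
        elif len(buf) - m.start() < MAX_CHUNK:
            return hits, buf[m.start():]    # may be completed by the next read
        else:
            pos = m.start() + 1             # enough data followed; false marker
    # No pending begin marker: keep only a marker prefix cut off at the end.
    rest = buf[pos:]
    for n in range(len(START) - 1, 0, -1):
        if rest.endswith(START[:n]):
            return hits, rest[-n:]
    return hits, b""

hits, tail = scan(b"junk" + START + b"hello" + END + b"\x01" + b"more" + START + b"he")
```

        The caller would prepend tail to the next block before scanning again; when tail is empty, nothing is searched twice.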