in reply to Re: Search hex string in very large binary file
in thread Search hex string in very large binary file

It's a point; but I wonder how many .m4v files you'd have to search before you found "hdvd" and "data" separated by exactly 4 bytes somewhere other than in the required 24-byte sequence?

To clarify: in totally random data, there are 256**24 (6.2771e+57) possible values for any given 24-byte sequence.

A 20GB file contains 21474836457 overlapping 24-byte sequences (20 * 2**30 - 23).

So the odds of any one of them being a false hit are 21474836457 / 256**24 = 3.4211e-48 (0.00000000000000000000000000000000000000000000034211%); and every additional restriction on those bytes makes a false hit even less likely.
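
For anyone who wants to check the arithmetic, here's a quick sketch in plain Perl (ordinary doubles are fine at this scale; nothing about it is specific to the OP's file):

    #!/usr/bin/perl
    use strict; use warnings;

    # Back-of-envelope check of the numbers above.
    my $windows = 20 * 2**30 - 23;   # overlapping 24-byte sequences in a 20GB file
    my $space   = 256**24;           # possible 24-byte values, ~6.2771e57
    printf "sequences:      %.0f\n", $windows;
    printf "false-hit odds: %.4e\n", $windows / $space;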

Pretty good odds that any hit is a good one.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Replies are listed 'Best First'.
Re^3: Search hex string in very large binary file
by Anonymous Monk on Feb 07, 2015 at 15:57 UTC

    Agreed!

    The point was also meant more generally, about how one selects a solution: personally, my Plan A would be "see if there's a module to do it 'right'", and Plan B would be "meh, I'll just grep the whole file", not the other way around (as the OP seems to imply).

      Hm. I see this in the same way I see extracting one or two pieces of information from a web page. I can either:

      1. Laboriously parse the entire structure of the document into a complex data structure and then traverse it to obtain the bits; or
      2. Treat the whole thing as unstructured data and just grab the bits I need.

      In the OP's case, given he only wants a yea-or-nay answer, and the odds of a false positive are so minuscule, parsing the entire file is a waste of CPU cycles, time, and effort. A simple scan, sketched below, does the job.
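
      Something along these lines would do; a minimal sketch only, since the exact byte pattern, chunk size, and overlap are illustrative assumptions rather than the OP's actual spec:

          #!/usr/bin/perl
          # Minimal sketch of the "treat it as unstructured data" scan: read the
          # file in large chunks and look for "hdvd", any 4 bytes, then "data".
          use strict; use warnings;

          my $file = shift or die "usage: $0 file\n";
          open my $fh, '<:raw', $file or die "open '$file': $!";

          my $tail = '';
          while (read $fh, my $chunk, 64 * 1024**2) {    # 64MB at a time
              my $buf = $tail . $chunk;
              if ($buf =~ /hdvd.{4}data/s) {             # /s: '.' must match any byte
                  print "Match! $file\n";
                  exit 0;
              }
              # Keep a 23-byte overlap so a match cannot straddle a chunk boundary.
              $tail = length($buf) > 23 ? substr($buf, -23) : $buf;
          }
          print "No match.\n";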


        In the OP's case, given he only wants a yea-or-nay answer ... parsing the entire file is a waste of CPU cycles, time, and effort.

        You probably have much more experience working with big data than I do, but in this case: on my machine, on a single 1.6GB video file with no match, your code takes 1.6s to complete (when working from disk cache), whereas a patched MP4::Info comes up with an answer in less than 0.1s; it scans 41GB of video files in under a second. Example code:

        use MP4::Info 'get_mp4tag';

        my $tag = get_mp4tag($ARGV[0]);
        print $tag && $tag->{HDVD} && $tag->{HDVD} == 2
            ? "Match! $ARGV[0]\n"
            : "No Match!\n";
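
        And to scan a whole tree the same way, a sketch assuming the same patched MP4::Info (the HDVD key comes from that patch; the extension filter is an assumption):

            use File::Find 'find';
            use MP4::Info 'get_mp4tag';

            # Walk a tree and report files whose (patched) HDVD tag equals 2.
            find({ no_chdir => 1, wanted => sub {
                return unless /\.(?:mp4|m4v)$/i;   # assumed extensions of interest
                my $tag = get_mp4tag($_);          # $_ is the full path (no_chdir)
                print "Match! $_\n"
                    if $tag && $tag->{HDVD} && $tag->{HDVD} == 2;
            }}, $ARGV[0] // '.');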

        In the general case, I tend to think the right tool for the job is much more likely to be a module (if it exists) - except maybe in the case of large amounts of input data, where optimizations may be necessary.

        P.S. I'm sure you've read You can't parse (X)HTML with regex :-)