in reply to Search hex string in vary large binary file

Just to point out an issue with your approach: If you were simply looking for a specific sequence of 24 bytes within up to 20,000,000,000 bytes, what about false positives? To avoid that, you'd actually have to parse the file and only look in the appropriate places for that flag. Which, if you were to DIY, would be a lot of reading specs and writing code, so it really is best to use an existing tool.
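To give a feel for what "only look in the appropriate places" means, here is a minimal sketch in Python of the very first step: walking the top-level boxes (atoms) of an MP4-style file and searching only inside `moov`, so that a chance match in the bulk `mdat` media data is never seen. The function names, the idea that the tag sits somewhere under `moov`, and the `hdvd`/4 bytes/`data` pattern are assumptions for illustration; a real parser (such as the patched MP4::Info) would descend further into the nested boxes rather than regex-matching the `moov` payload.

```python
import io
import re
import struct

# Assumption: the marker is b"hdvd", any 4 bytes, then b"data".
MARKER = re.compile(rb"hdvd.{4}data", re.DOTALL)


def iter_top_level_boxes(fh):
    """Yield (box_type, payload_offset, payload_size) for each top-level box."""
    fh.seek(0, io.SEEK_END)
    end = fh.tell()
    pos = 0
    while pos + 8 <= end:
        fh.seek(pos)
        size, box_type = struct.unpack(">I4s", fh.read(8))
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit size follows the type field.
            ext = fh.read(8)
            if len(ext) < 8:
                break
            size = struct.unpack(">Q", ext)[0]
            header = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of the file.
            size = end - pos
        if size < header:
            break  # malformed box; stop rather than loop forever
        yield box_type, pos + header, size - header
        pos += size


def marker_in_moov(fh):
    """Return True if MARKER occurs inside a top-level 'moov' box."""
    for box_type, offset, size in iter_top_level_boxes(fh):
        if box_type == b"moov":
            fh.seek(offset)
            if MARKER.search(fh.read(size)):
                return True
    return False
```

Called as `marker_in_moov(open(path, "rb"))`, this reads only the box headers plus the (usually small) `moov` payload, never the media data itself.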

You're in luck! Someone actually submitted a patch for MP4::Info to add support for the HDVD tag: https://rt.cpan.org/Public/Bug/Display.html?id=101016

There's a quick & really dirty way to patch the module on your system: "wget -nv https://rt.cpan.org/Ticket/Attachment/1444239/767837/0001-add-support-for-HDVD-tag.patch -O- | patch `perldoc -l MP4::Info`" (you'll probably need to do this as root). However, a somewhat cleaner way would be to patch the module before installation:

# in the shell:
$ cd /tmp
$ wget http://www.cpan.org/authors/id/J/JH/JHAR/MP4-Info-1.13.tar.gz
$ tar xzf MP4-Info-1.13.tar.gz
$ cd MP4-Info-1.13/
$ wget -nv https://rt.cpan.org/Ticket/Attachment/1444239/767837/0001-add-support-for-HDVD-tag.patch -O- | patch

... and then install to a local module repository separate from your system's modules. For example, see the instructions under "I don't have permission to install a module on the system!" in A Guide to Installing Modules.

Re^2: Search hex string in vary large binary file
by BrowserUk (Patriarch) on Feb 07, 2015 at 15:03 UTC

    It's a point; but I wonder how many .m4vs you'd have to search before you found "hdvd" & "data" separated by exactly 4 bytes that wasn't part of the required 24 bytes?

    To clarify, in totally random data, there are 256**24 (6.2771e+57) permutations of 24 bytes.

    A 20GB file (20 * 2**30 = 21474836480 bytes) has 21474836457 overlapping sets of 24 bytes.

    So the odds of one of them being a false hit are about 3.4211e-48 (3.4211e-46%). And every restriction on those bytes increases the odds.

    Pretty good odds that any hit is a good one.
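The arithmetic above can be checked with a few lines of Python. This is a sketch under two assumptions: that "20GB" means 20 * 2**30 bytes, and that every overlapping 24-byte window in uniformly random data counts as one chance of a false hit. Strictly speaking the result is the expected number of false hits, which at this magnitude is indistinguishable from the probability of at least one.

```python
FILE_BYTES = 20 * 2**30   # assumption: "20GB" means 20 GiB
PATTERN_LEN = 24          # length of the byte sequence being searched for

# Number of distinct 24-byte sequences in totally random data.
sequences = 256 ** PATTERN_LEN

# Number of overlapping 24-byte windows in the file.
windows = FILE_BYTES - PATTERN_LEN + 1

# Expected number of windows that match purely by chance.
expected_false_hits = windows / sequences

print(f"possible sequences : {sequences:.4e}")
print(f"windows in file    : {windows}")
print(f"expected false hits: {expected_false_hits:.4e}")
```

Python's arbitrary-precision integers keep `256 ** 24` exact, so the only rounding happens in the final division.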


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      Agreed!

      The point was also meant to be more general about the selection of the solution: personally, my Plan A would be "see if there's a module to do it 'right'", and Plan B would be "meh, I'll just grep the whole file", not the other way around (as the OP seems to imply).

        Hm. I see this in the same way I see extracting one or two pieces of information from a web page. I can either:

        1. Laboriously parse the entire structure of the document into a complex data structure and then traverse it to obtain the bits;
        2. Or I can treat the whole thing as unstructured data and just grab the bits I need.

        In the OP's case, given he only wants a yea or nay answer, and the odds of a false positive are so minuscule, parsing the entire file is a waste of CPU cycles, time, and effort.
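For comparison, the "treat it as unstructured data" approach can be sketched in Python as a chunked scan that never holds more than one buffer in memory. The pattern (`hdvd`, any 4 bytes, `data`) is taken from the description earlier in the thread rather than from the OP's full 24 bytes, and the function name and chunk size are assumptions for illustration.

```python
import re

# Assumption: the marker is b"hdvd", any 4 bytes, then b"data" (12 bytes in all).
PATTERN = re.compile(rb"hdvd.{4}data", re.DOTALL)
PATTERN_LEN = 12


def has_marker(path, chunk_size=1 << 20):
    """Return True if PATTERN occurs anywhere in the file at `path`."""
    overlap = PATTERN_LEN - 1
    tail = b""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return False
            # Prepend the end of the previous buffer so a match that
            # straddles a chunk boundary is still found.
            window = tail + chunk
            if PATTERN.search(window):
                return True
            tail = window[-overlap:]
```

Carrying over the last `PATTERN_LEN - 1` bytes is what makes the yes/no answer independent of where the chunk boundaries happen to fall.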

