in reply to Re^2: Search hex string in vary large binary file
in thread Search hex string in vary large binary file

Agreed!

The point was also meant to be more general about the selection of the solution: personally, my Plan A would be "see if there's a module to do it 'right'", and Plan B would be "meh, I'll just grep the whole file", not the other way around (as the OP seems to imply).


Re^4: Search hex string in vary large binary file
by BrowserUk (Patriarch) on Feb 07, 2015 at 16:18 UTC

    Hm. I see this in the same way I see extracting one or two pieces of information from a web page. I can either:

    1. Laboriously parse the entire structure of the document into a complex data structure and then traverse it to obtain the bits;
    2. Or I can treat the whole thing as unstructured data and just grab the bits I need.

    In the OP's case, given he only wants a yay or nay answer, and the odds of a false positive are so minuscule, parsing the entire file is a waste of CPU cycles, time, and effort.
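    For what it's worth, a minimal sketch of option 2 for a file too big to slurp: read it in fixed-size chunks and carry the last few bytes over, so a match that straddles two reads is not missed. The helper name find_bytes, the 1 MiB chunk size, and the hex string are assumptions made up for illustration; this is not the code that was timed in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the byte offset of the first occurrence of
# $needle in the file at $path, or -1 if it is absent. Reading in chunks
# keeps memory use flat however large the file is.
sub find_bytes {
    my ($path, $needle, $chunk_size) = @_;
    $chunk_size ||= 1 << 20;              # 1 MiB per read (arbitrary choice)
    my $overlap = length($needle) - 1;    # bytes carried over between reads

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my ($buf, $base) = ('', 0);           # $base = file offset of $buf's first byte
    while (read($fh, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        my $pos = index($buf, $needle);
        return $base + $pos if $pos >= 0;

        # Keep only the tail, so a match straddling two reads is still found.
        my $drop = length($buf) - $overlap;
        if ($drop > 0) {
            substr($buf, 0, $drop, '');
            $base += $drop;
        }
    }
    return -1;
}

# Usage: perl find_bytes.pl bigfile.bin
if (@ARGV) {
    my $needle = pack 'H*', '68647664';   # placeholder hex string, not the OP's
    my $offset = find_bytes($ARGV[0], $needle);
    print $offset >= 0 ? "Match at offset $offset\n" : "No match\n";
}
```

    index() on raw bytes is used rather than a regex because the needle is a fixed string; for a yes/no answer, testing the return value against -1 is all that is needed.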


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      In the OP's case, given he only wants a yay or nay answer ... parsing the entire file is a waste of CPU cycles, time, and effort.

      You probably have much more experience working with big data than I do, but in this case, on my machine, on a single 1.6GB video file where there is no match, your code takes 1.6s to complete (when working from disk cache), whereas a patched MP4::Info comes up with an answer in less than 0.1s. And MP4::Info scans 41GB of video files in under a second. Example code:

      use MP4::Info 'get_mp4tag';

      # Assumes the patched MP4::Info mentioned above
      my $tag = get_mp4tag($ARGV[0]);
      print $tag && $tag->{HDVD} && $tag->{HDVD} == 2
          ? "Match! $ARGV[0]\n"
          : "No Match!\n";

      In the general case, I tend to think the right tool for the job is much more likely to be a module (if it exists) - except maybe in the case of large amounts of input data, where optimizations may be necessary.

      P.S. I'm sure you've read "You can't parse (X)HTML with regex" :-)

        your code takes 1.6s to complete ... patched MP4::Info less than 0.1s.

        Hm. Sample code? The patch?

        P.S. I'm sure you've read "You can't parse (X)HTML with regex" :-)

        Tim Bray, one of the guys who put together the XML spec, does (and apparently prefers to); but that's by the by ...

        "Parse", in the sense of read-tokenise-build a structure that represents the entire document: I probably could, but it'd be more work than I'd take on. Especially when there are free modules that will do that for me.

        But if you want to extract a few values from within a jumble of text for which there is no parser, regex is the way to go.

        So, if I don't give a fig for the structure of the document, I treat it as a "jumble of text"; and get the job done.

        All I need is a unique anchor. And there *always* is one.
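        To make the anchor idea concrete, here is a small sketch. The page fragment and the id="price" anchor are invented for illustration; the point is only that the regex ignores the document's structure and keys on one string that occurs exactly once.

```perl
use strict;
use warnings;

# Made-up page fragment; only the id="price" anchor matters here.
my $html = <<'HTML';
<div class="x"><span>junk</span>
<td id="price" class="num">  42.50 </td>
<p>more junk</p></div>
HTML

# Anchor on the unique id, skip to the end of that tag, then capture the
# text up to the next tag, trimming surrounding whitespace.
my ($price) = $html =~ m{id="price"[^>]*>\s*([^<]*?)\s*<};

print defined $price ? "price=$price\n" : "anchor not found\n";
```

        If the anchor ever stops being unique, or the value starts containing markup, this breaks; that is the trade-off against a full parse.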

