Hm. I see this in the same way I see extracting one or two pieces of information from a web page. I can either:
- Laboriously parse the entire structure of the document into a complex data structure and then traverse it to obtain the bits;
- Or I can treat the whole thing as unstructured data and just grab the bits I need.
In the OP's case, given that he only wants a yay or nay answer, and the odds of a false positive are so minuscule, parsing the entire file is a waste of CPU cycles, time, and effort.
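For illustration only, here is a minimal sketch of the "treat it as unstructured data" approach for a yes/no check on a large file. It is not the code timed elsewhere in this thread. The sub name `file_contains`, the chunk size, and the anchor string are all assumptions made for the example.

```perl
use strict;
use warnings;

# Return 1 if $anchor occurs anywhere in the file, 0 otherwise.
# The file is read in fixed-size chunks; the last few bytes of each
# window are carried over so that an anchor straddling a chunk
# boundary is still found.
sub file_contains {
    my ($path, $anchor, $chunk_size) = @_;
    die "Empty anchor\n" unless length $anchor;
    $chunk_size ||= 1 << 20;               # 1 MiB per read (arbitrary choice)
    my $keep = length($anchor) - 1;        # bytes to carry into the next window

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $tail = '';
    while (read($fh, my $chunk, $chunk_size)) {
        my $window = $tail . $chunk;
        return 1 if index($window, $anchor) >= 0;
        my $len = length $window;
        $tail = $keep == 0   ? ''
              : $len > $keep ? substr($window, $len - $keep)
              :                $window;
    }
    return 0;
}
```

A call such as `file_contains($ARGV[0], $some_anchor)` gives the yay or nay answer without building any structure. Whether a given byte string is unique enough to serve as the anchor is something you would have to check against the file format in question.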
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the OP's case, given he only wants a yay or nay answer ... parsing the entire file is a waste of CPU cycles, time, and effort.
You probably have much more experience with working with big data than I do, but in this case: On my machine, on a single 1.6GB video file where there is no match, your code takes 1.6s to complete (when working from disk cache), whereas a patched MP4::Info comes up with an answer in less than 0.1s. And MP4::Info scans 41GB of video files in under a second. Example code:
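use strict;
use warnings;
use MP4::Info 'get_mp4tag';

# The HDVD key is only available with the patched MP4::Info mentioned above
my $tag = get_mp4tag($ARGV[0]);
print $tag && $tag->{HDVD} && $tag->{HDVD} == 2
    ? "Match! $ARGV[0]\n" : "No Match!\n";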
In the general case, I tend to think the right tool for the job is much more likely to be a module (if it exists) - except maybe in the case of large amounts of input data, where optimizations may be necessary.
P.S. I'm sure you've read You can't parse (X)HTML with regex :-)
your code takes 1.6s to complete ... patched MP4::Info less than 0.1s.
Hm. Sample code? The patch?
P.S. I'm sure you've read You can't parse (X)HTML with regex :-)
Tim Bray, one of the guys who put together the XML spec, does (and apparently prefers to); but that's by the by ...
"Parse", in the sense of read-tokenise-build a structure that represents the entire document: I probably could, but it'd be more work than I'd take on. Especially when there are free modules that will do that for me.
But if you want to extract a few values from within a jumble of text for which there is no parser, regex is the way to go.
So, if I don't give a fig for the structure of the document, I treat it as a "jumble of text"; and get the job done.
All I need is a unique anchor. And there *always* is one.
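As a rough sketch of what "a unique anchor" means in practice: the sample text, the `id="price"` anchor, and the variable names below are invented for the example and do not come from any real page.

```perl
use strict;
use warnings;

# Hypothetical input: a blob of markup whose overall structure we ignore.
my $text = <<'END';
<div class="x"><span>noise</span>
<td id="price">Total: <b>42.50</b> EUR</td>
</div>
END

# Anchor on the unique id, then capture the first number that follows it.
# /s lets .*? cross newlines; the lazy quantifier stops at the nearest number.
my ($price) = $text =~ /id="price".*?(\d+(?:\.\d+)?)/s;

print defined $price ? "Found: $price\n" : "No match\n";
```

This only holds up as long as the anchor really is unique in the input; if the page could contain `id="price"` twice, the regex silently takes the first one.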