RE: RE: RE: Re: A grammar for HTML matching

HTML::TokeParser uses HTML::Parser to parse the document for it. So while you can tell HTML::TokeParser to give you the next <img> tag, in reality what it does is lets HTML::Parser call HTML::TokeParser::start for every tag, and then HTML::TokeParser::get_tag examines the tree generated by calling start/end all these times, and only returns the tag you're looking for. It's all these function calls (one for each tag in the document) that make it slower than other possible implementations that look for something specific.

Thus HTML::TokeParser cannot possibly be faster than HTML::Parser. (Please correct me if I'm wrong...looking at TokeParser source here...looks like it will do parsing in 512 byte chunks if passed a file rather than a scalar...but the speed hit comes from all the HTML::Parser callbacks for tags you don't care about...so parsing in chunks shouldn't help).

If I gave the impression that a regex on an entire document is faster than HTML::Parser (and friends) I apologize, this is obviously incorrect. I have written an implementation for a finder which looks for something specific by traversing the document using index/rindex, then pos() and m/\G/g. This is faster (when looking for something specific) than an equivalent HTML::Parser or HTML::TokeParser implementation.

But as I've mentioned elsewhere, it's not the implementation I'm interested in at this point, it's the grammar (and the utility of such a grammar). I'd be perfectly happy with a HTML::TokeParser implementation. (and in fact I will write one -- no use speculating about what is faster when you can write it and measure it)

Comment on RE: RE: RE: Re: A grammar for HTML matching