RE: Re: A grammar for HTML matching

I've tested it. For my application a regex is often 100 times faster. The difference is that HTML::Parser must examine (and copy out) every single tag in the document. For a typical document, that is ~1000 matches and ~1000 function calls (start/end). A single regex (or even several regexes) that is only interested in a small part of the document is much faster. C may be faster, but HTML::Parser does far more work than I need. I'm not trying to re-implement all of HTML::Parser in Perl.

Obviously, if you want to grab lots of different data out of the HTML file, my suggested approach would be inappropriate, and HTML::Parser would be better. But say you just want to write a quick script to grab the temperature in Seattle, WA and put it on your home page. Or say you want to mine weather.com for the temperature in various cities, and each temperature appears on a different page. Running HTML::Parser on each page would be painfully slow.

But this debate over the speed of HTML::Parser has gotten away from my original intent. This HTML-matching idea could be implemented using HTML::Parser, and it would still provide an easier, faster-to-write, and more readable mechanism for matching HTML than writing a new HTML::Parser script. So I submit this idea to all of you: is this interesting and/or useful?

BTW, for FilterProxy I use index/rindex to traverse the document and find the interesting portion, then use pos() and m/\G/g from there, which is much faster than a single regex over the entire document. But again, implementation is not what I'm interested in at this point.
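
For anyone curious what the index/rindex plus pos() and m/\G/g approach looks like, here is a minimal sketch. It is not the FilterProxy code. The page markup, the `class="temp"` marker, and the `find_temp` helper are all invented for illustration, and a real page would need its own markers.

```perl
use strict;
use warnings;

# Hypothetical page: the markup and the marker strings are made up.
my $html = join '',
    '<html><body>',
    ('<p>filler</p>') x 500,
    '<div id="weather"><span class="city">Seattle, WA</span>',
    '<span class="temp">54&deg;F</span></div>',
    ('<p>more filler</p>') x 500,
    '</body></html>';

# find_temp (hypothetical helper): jump to the interesting spot with
# index/rindex, then run a small anchored regex from that offset only.
sub find_temp {
    my ($doc) = @_;

    # Step 1: cheap substring search for a marker inside the tag we want.
    my $hit = index($doc, 'class="temp"');
    return if $hit < 0;

    # Step 2: walk back to the '<' that opens the enclosing tag.
    my $tag_start = rindex($doc, '<', $hit);
    return if $tag_start < 0;

    # Step 3: set pos() so that \G anchors the regex at that offset.
    pos($doc) = $tag_start;
    if ($doc =~ m/\G<span\b[^>]*>\s*(-?\d+)/gc) {
        return $1;
    }
    return;
}

print find_temp($html), "\n";   # prints 54
```

The regex never looks at the filler before the marker, which is where the saving comes from when you only want one value out of a large page.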

RE: RE: Re: A grammar for HTML matching by little (Curate) on Nov 01, 2000 at 12:52 UTC
Ehm, no. You can specify that the parser shall give back everything up to the next occurrence of a specified token (tag or text or ..). There is no faster way of doing this, except comparing the whole document with your regex, which can't be faster. :-) So the easiest approach seems to be telling the parser to look for `<tr>, next <td>` and so on. See the docs for HTML::Parser and HTML::TokeParser.

Have a nice day
All decision is left to your taste
RE: RE: RE: Re: A grammar for HTML matching by Anonymous Monk on Nov 02, 2000 at 01:44 UTC
HTML::TokeParser uses HTML::Parser to parse the document for it. So while you can tell HTML::TokeParser to give you the next <img> tag, in reality what it does is let HTML::Parser call HTML::TokeParser::start for every tag, and then HTML::TokeParser::get_tag examines the tree generated by calling start/end all these times and returns only the tag you're looking for. It's all these function calls (one for each tag in the document) that make it slower than other possible implementations that look for something specific. Thus HTML::TokeParser cannot possibly be faster than HTML::Parser. (Please correct me if I'm wrong. Looking at the TokeParser source here, it looks like it will do parsing in 512-byte chunks if passed a file rather than a scalar, but the speed hit comes from all the HTML::Parser callbacks for tags you don't care about, so parsing in chunks shouldn't help.)

If I gave the impression that a regex on an entire document is faster than HTML::Parser (and friends), I apologize; this is obviously incorrect. I have written an implementation of a finder which looks for something specific by traversing the document using index/rindex, then pos() and m/\G/g. This is faster (when looking for something specific) than an equivalent HTML::Parser or HTML::TokeParser implementation.

But as I've mentioned elsewhere, it's not the implementation I'm interested in at this point; it's the grammar (and the utility of such a grammar). I'd be perfectly happy with an HTML::TokeParser implementation. (In fact I will write one: no use speculating about what is faster when you can write it and measure it.)