RE: RE: Re: A grammar for HTML matching

Ehm, no.
You can tell the parser to give back everything up to the next occurrence of a specified token (a tag, text, and so on).
There's no faster way of doing this, short of matching the whole document against your regex, and that can't be faster. :-)
So the easiest approach seems to be telling the parser to look for <tr>, then the next <td>, and so on.
See the docs for HTML::Parser and HTML::TokeParser.
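
For what it's worth, here is a minimal sketch of that "look for <tr>, then <td>" idea with HTML::TokeParser. The sample table and the @rows variable are made up for illustration, and it assumes HTML::TokeParser (part of the HTML-Parser distribution) is installed; treat it as one possible way to do it, not the only one.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample document, for illustration only.
my $html = <<'HTML';
<table>
<tr><td>alpha</td><td>beta</td></tr>
<tr><td>gamma</td></tr>
</table>
HTML

# new() accepts a reference to a scalar holding the document.
my $p = HTML::TokeParser->new(\$html) or die "cannot set up parser";

my @rows;
# get_tag() with arguments skips tokens until one of the named tags shows up.
while (my $tag = $p->get_tag('tr', 'td')) {
    if ($tag->[0] eq 'tr') {
        push @rows, [];                  # a new row starts here
    }
    else {
        push @rows, [] unless @rows;     # tolerate a <td> with no <tr> before it
        # Collect the cell text up to the closing </td>.
        push @{ $rows[-1] }, $p->get_trimmed_text('/td');
    }
}

print join(' | ', @$_), "\n" for @rows;
```

Everything between the tags you ask for is skipped by get_tag(), so the calling code only ever sees rows and cells.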

Have a nice day
All decision is left to your taste


Replies are listed 'Best First'.
RE: RE: RE: Re: A grammar for HTML matching by Anonymous Monk on Nov 02, 2000 at 01:44 UTC
HTML::TokeParser uses HTML::Parser to parse the document for it. So while you can tell HTML::TokeParser to give you the next <img> tag, in reality what it does is lets HTML::Parser call HTML::TokeParser::start for every tag, and then HTML::TokeParser::get_tag examines the tree generated by calling start/end all these times, and only returns the tag you're looking for. It's all these function calls (one for each tag in the document) that make it slower than other possible implementations that look for something specific.

Thus HTML::TokeParser cannot possibly be faster than HTML::Parser. (Please correct me if I'm wrong...looking at TokeParser source here...looks like it will do parsing in 512 byte chunks if passed a file rather than a scalar...but the speed hit comes from all the HTML::Parser callbacks for tags you don't care about...so parsing in chunks shouldn't help).

If I gave the impression that a regex on an entire document is faster than HTML::Parser (and friends) I apologize, this is obviously incorrect. I have written an implementation for a finder which looks for something specific by traversing the document using index/rindex, then pos() and m/\G/g. This is faster (when looking for something specific) than an equivalent HTML::Parser or HTML::TokeParser implementation.

But as I've mentioned elsewhere, it's not the implementation I'm interested in at this point, it's the grammar (and the utility of such a grammar). I'd be perfectly happy with a HTML::TokeParser implementation. (and in fact I will write one -- no use speculating about what is faster when you can write it and measure it)
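
To make the index/pos()/\G idea above a bit more concrete, here is a rough sketch of that style of finder. It is not the implementation mentioned in the post; find_img_srcs and the sample string are hypothetical, and it makes simplifying assumptions (lowercase "<img", double-quoted attribute values only, no handling of comments or scripts).

```perl
use strict;
use warnings;

# Hypothetical sample document, for illustration only.
my $html = '<p>intro</p><img src="a.png" alt="A"><p>more</p><img alt="B" src="b.png">';

# Return the src attribute of every <img> tag found in $doc.
sub find_img_srcs {
    my ($doc) = @_;
    my @srcs;
    my $from = 0;

    # index() jumps straight to the next candidate; no callback per tag.
    while ((my $at = index($doc, '<img', $from)) >= 0) {
        pos($doc) = $at + 4;    # resume regex matching just past "<img"

        my $src;
        # \G anchors each match at pos(); /c keeps pos() when a match fails,
        # so the loop walks the tag one attribute at a time and then stops.
        while ($doc =~ /\G\s+([\w-]+)\s*=\s*"([^"]*)"/gc) {
            $src = $2 if lc($1) eq 'src';
        }

        push @srcs, $src if defined $src;
        $from = pos($doc);      # continue the search after this tag's attributes
    }
    return @srcs;
}

print join(', ', find_img_srcs($html)), "\n";
```

The point of the design is that text between interesting tags is never tokenized at all; whether that actually beats a TokeParser version is something to measure rather than assume.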
