HTML::TokeParser uses HTML::Parser to parse the document for it. So while you can ask HTML::TokeParser for the next <img> tag, what actually happens is that HTML::Parser calls HTML::TokeParser::start for every start tag (and the matching end handler for every end tag), and HTML::TokeParser::get_tag then walks the tokens accumulated by all those calls and returns only the tag you asked for. It is these function calls, one for each tag in the document, that make it slower than an implementation that looks only for something specific.
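To make the point concrete, here is a minimal sketch of the "give me the next <img>" usage. The sub name img_srcs_tokeparser and the sample markup are mine, purely for illustration, and it assumes HTML::TokeParser is installed from CPAN. The calling code only ever sees <img> tags, but every other tag in the string has still gone through HTML::Parser's handlers before get_tag discards it.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: collect the src attribute of every <img> in $html.
sub img_srcs_tokeparser {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "could not create parser";
    my @srcs;
    # get_tag('img') skips everything else, but the skipped tags
    # were still tokenized by HTML::Parser on the way.
    while (my $tag = $p->get_tag('img')) {
        my $attr = $tag->[1];    # hash ref of attributes for a start tag
        push @srcs, $attr->{src} if defined $attr->{src};
    }
    return @srcs;
}

my $html = q{<p>intro <a href="/x">link</a> <img src="a.png" alt="A"> <b>bold</b></p>};
print "img src: $_\n" for img_srcs_tokeparser($html);
```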

Thus HTML::TokeParser cannot possibly be faster than HTML::Parser. (Please correct me if I'm wrong. Looking at the TokeParser source here, it appears to parse in 512-byte chunks when passed a file rather than a scalar, but the speed hit comes from all the HTML::Parser callbacks for tags you don't care about, so parsing in chunks shouldn't help.)

If I gave the impression that a regex run over an entire document is faster than HTML::Parser (and friends), I apologize; that is obviously incorrect. What I have written is a finder that looks for something specific by moving through the document with index/rindex, then setting pos() and matching with m/\G/g from that point. When you are looking for something specific, this is faster than an equivalent HTML::Parser or HTML::TokeParser implementation.
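For anyone who hasn't used the index + pos() + \G combination, a rough sketch of the idea follows. This is not my actual finder: the sub name find_img_srcs is made up for illustration, it assumes lowercase tag names and quoted attribute values, and it makes no attempt to skip comments or script blocks. rindex isn't needed in an example this small.

```perl
use strict;
use warnings;

# Hypothetical helper: return the src of every <img> tag in $html.
# Assumes lowercase tags and quoted attribute values.
sub find_img_srcs {
    my ($html) = @_;
    my @srcs;
    my $from = 0;
    # index() jumps straight to the next candidate; no code runs
    # for any of the tags in between.
    while ((my $at = index($html, '<img', $from)) >= 0) {
        pos($html) = $at;    # make \G match exactly at the candidate
        if ($html =~ /\G<img\b[^>]*?\ssrc\s*=\s*(["'])(.*?)\1/gc) {
            push @srcs, $2;
            $from = pos($html);    # resume after the matched attribute
        }
        else {
            $from = $at + 4;       # not a usable <img>; step past '<img'
        }
    }
    return @srcs;
}

my $doc = q{<p><a href="x">y</a><img alt="" src="a.png"><img src='b.gif'></p>};
print "$_\n" for find_img_srcs($doc);
```

The regex engine is only started at positions index() has already found, which is where the savings come from when the target is rare in a large document.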

But as I've mentioned elsewhere, it's not the implementation I'm interested in at this point; it's the grammar (and the utility of such a grammar). I'd be perfectly happy with an HTML::TokeParser implementation, and in fact I will write one. There's no use speculating about what is faster when you can write it and measure it.


In reply to RE: RE: RE: Re: A grammar for HTML matching by Anonymous Monk
in thread A grammar for HTML matching by mcelrath
