in reply to Re: HTML::Parser to extract link text?
in thread HTML::Parser to extract link text?

Thank you for your opinion.

I put quite some effort into benchmarking the various versions against each other, and now that I am done I would rather not go back and do it all again.

My results were that using HTML::Parser directly is the fastest solution, beating both HTML::LinkExtor and HTML::LinkExtractor. My guess is that it is so fast because both link-specific modules are built on top of HTML::Parser, so going straight to the underlying library cuts out their overhead.
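
For anyone who wants to reproduce a comparison along these lines, here is a rough sketch of how the three modules could be timed against one another with the core Benchmark module. The file name, the per-iteration callback logic and the run length are illustrative assumptions, not the benchmark that produced these numbers.

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);
    use HTML::Parser;
    use HTML::LinkExtor;
    use HTML::LinkExtractor;

    # Read one "average" page and feed the same string to every extractor.
    my $html = do {
        open my $fh, '<', 'page.html' or die "page.html: $!";
        local $/;
        <$fh>;
    };

    cmpthese( -5, {    # run each candidate for about 5 CPU seconds
        'HTML::Parser' => sub {
            my @href;
            my $p = HTML::Parser->new(
                api_version => 3,
                start_h     => [
                    sub {
                        my ( $tag, $attr ) = @_;
                        push @href, $attr->{href}
                            if $tag eq 'a' && defined $attr->{href};
                    },
                    'tagname, attr',
                ],
            );
            $p->parse($html);
            $p->eof;
        },
        'HTML::LinkExtor' => sub {
            my @href;
            my $p = HTML::LinkExtor->new(
                sub {
                    my ( $tag, %attr ) = @_;
                    push @href, $attr{href} if $tag eq 'a';
                }
            );
            $p->parse($html);
            $p->eof;
        },
        'HTML::LinkExtractor' => sub {
            my $lx = HTML::LinkExtractor->new;
            $lx->parse( \$html );
            my @links = @{ $lx->links };
        },
    } );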

Now, applying this knowledge, my feeling is that using HTML::TreeBuilder would again hurt performance. Right?

So, back to the original question: any comments on how I get HTML::Parser to give me text instead of dtext?
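
For context, the difference comes down to the argspec handed to the text handler: 'dtext' yields entity-decoded text, 'text' the literal source text. Below is a minimal sketch of that style of handler setup, assuming HTML::Parser's version-3 API; the sample markup is made up.

    use strict;
    use warnings;
    use HTML::Parser;

    my $html = '<p>See <a href="http://example.com/">the &quot;example&quot; site</a>.</p>';

    my ( @links, $current );

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ( $tag, $attr ) = @_;
                $current = { href => $attr->{href}, text => '' }
                    if $tag eq 'a' && defined $attr->{href};
            },
            'tagname, attr',
        ],
        # 'text' passes the raw source text, entities and all;
        # swap in 'dtext' here to receive decoded text instead.
        text_h => [
            sub { $current->{text} .= $_[0] if $current },
            'text',
        ],
        end_h => [
            sub {
                push @links, $current if $_[0] eq 'a' && $current;
                undef $current;
            },
            'tagname',
        ],
    );

    $p->parse($html);
    $p->eof;

    # With 'text' this prints the &quot; entities literally;
    # with 'dtext' they would come out as plain quote characters.
    printf "%s => %s\n", $_->{href}, $_->{text} for @links;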

Re^3: HTML::Parser to extract link text?
by Juerd (Abbot) on Jun 19, 2007 at 21:23 UTC

    Now, applying this knowledge, my feeling is that using HTML::TreeBuilder would again hurt performance. Right?

    Probably. I'm curious: how many HTML pages will you be parsing per second in your finished product?

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      This question does not exactly apply. ;-) But for the link-extraction part, Parser took around 0.08 s, where Extor took 0.12 s and Extractor 0.28 s (for an average HTML page).

      Any help with my Parser question?

        This question does not exactly apply. ;-) But for the link-extraction part, Parser took around 0.08 s, where Extor took 0.12 s and Extractor 0.28 s (for an average HTML page).

        Okay, let me rephrase then... Why do you need such high performance for your project?

        Any help with my Parser question?

        No, sorry, I kind of promised myself to no longer waste time with HTML::Parser. It is a great module, but low-level, and I can solve any HTML parsing problem much faster with higher-level modules like HTML::TreeBuilder.

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
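
        As a point of comparison with the low-level handler approach discussed above, here is a minimal sketch of link-text extraction with HTML::TreeBuilder, using its look_down and as_text methods; the sample markup is made up.

            use strict;
            use warnings;
            use HTML::TreeBuilder;

            my $html = '<p>See <a href="http://example.com/">the example site</a>.</p>';

            my $tree = HTML::TreeBuilder->new_from_content($html);

            # Print every <a> element that carries an href, with its flattened text.
            for my $a ( $tree->look_down( _tag => 'a' ) ) {
                next unless defined $a->attr('href');
                printf "%s => %s\n", $a->attr('href'), $a->as_text;
            }

            $tree->delete;    # break the tree's circular references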