in reply to HTML::Parser fun

Just to show you how darn easy, accurate, deep, and fast some of this is with XML::LibXML...

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::LibXML;

    # This is a shortcut, see the docs for more formal usage.
    my $doc  = XML::LibXML->new->parse_html_fh(*DATA);
    my $root = $doc->getDocumentElement;

    my ( $head ) = $root->findnodes("head");
    my ( $body ) = $root->findnodes("body");

    print "Head stuff...\n";
    for my $refresh ( $head->findnodes('meta[@http-equiv]') ) {
        print "\t", $refresh->getAttribute("content"), "\n";
    }

    print "\nBody stuff...\n";
    # .//a finds links at any depth; a bare a[@href] only matches
    # links that are direct children of <body>.
    for my $link ( $body->findnodes('.//a[@href]') ) {
        printf("%25s --> %s\n",
               $link->textContent || $link->getAttribute("title") || "n/a",
               $link->getAttribute("href")
              );
    }

    # print $doc->serialize(1);

    __DATA__
    PUT YOUR HTML DOCUMENT DOWN HERE. Took it out for space.

Reproducing the same output/report format you want is left as an exercise for the reader. :) The docs for the family of modules are terse but quite good once you see the big picture. There are options to allow more liberal/broken HTML to be parsed (or attempted anyway).

Re^2: HTML::Parser fun
by FreakyGreenLeaky (Sexton) on Jun 05, 2008 at 14:54 UTC
    Thanks for the info, Your Mother

    I've been testing XML::LibXML against various HTML files (our corpus covers a range of sizes) to get some benchmarks, and I must say it's surprisingly quick (except on really large files, which aren't relevant in my case). However:
    • This is a deal-killer: the HTML must be balanced with nice </x> closing tags (which it often isn't in the real world), or else it croaks without producing any output.
    HTML::Parser soldiers on despite missing tags, etc., and still produces useful output, which our app requires.

    Some (unscientific) benchmarks:

    104KB HTML file processed 100 times (average of 3 runs)
    HTML::Parser: ~20s
    XML::LibXML: ~13s

    371KB HTML file processed 100 times
    HTML::Parser: ~51s
    XML::LibXML: ~30s

    550KB HTML file processed 100 times
    HTML::Parser: ~73s
    XML::LibXML: ~49s

    4.3MB HTML file processed once (silly, but interesting in a huh? kind of way)
    HTML::Parser: ~4s
    XML::LibXML: ~85s
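
Numbers like these could be gathered with something along the following lines. This is only a rough sketch, not the harness behind the figures above. It assumes the core Benchmark module, an optional file path on the command line (a small synthetic document is used otherwise), and an iteration count of 20 picked arbitrarily. The two branches do roughly comparable work (collecting text), so treat the result as indicative at best.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);
use HTML::Parser;
use XML::LibXML;

# Input: a file named on the command line, or a synthetic document
# (the synthetic markup is an assumption made purely for illustration).
my $html;
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    local $/;    # slurp mode
    $html = <$fh>;
}
else {
    $html = '<html><body>'
          . ( '<p>text <a href="/x">link</a></p>' x 1000 )
          . '</body></html>';
}

my $results = timethese( 20, {
    'HTML::Parser' => sub {
        my $chars = 0;
        my $p = HTML::Parser->new(
            api_version => 3,
            # Count decoded text so the parser does some real work.
            text_h => [ sub { $chars += length $_[0] }, 'dtext' ],
        );
        $p->parse($html);
        $p->eof;
    },
    'XML::LibXML' => sub {
        # Dies on markup it cannot parse; no recovery is enabled here.
        my $doc  = XML::LibXML->new->parse_html_string($html);
        my $text = $doc->documentElement->textContent;
    },
} );
```

timethese prints one timing line per entry and returns a hash of Benchmark objects keyed by the names given.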

    Conclusion: it looks like XML::LibXML is the way to go. My only concern (the reason preventing me from switching over to XML::LibXML) is how to get it to be tolerant of lazy/broken HTML the way HTML::Parser is.

    I've had a gander at XML::LibXML but cannot see how to code it to be real-world HTML tolerant (so I can test how tolerant it is).

      Sorry I didn't include it in the first round. I had to look it up in the parser doc, XML::LibXML::Parser, under the HTML options. There are other options, but recover is probably what you need (recover_silently does the same without any warnings to STDERR). It can be passed as an argument to new or called as a method.

          # file named 'libxml-html-forgiving'
          use warnings;
          use strict;
          use XML::LibXML;

          my $corpus = join "", <DATA>;

          my $parser = XML::LibXML->new();

          # give command line an argument to hide errors
          @ARGV ? $parser->recover_silently(1) : $parser->recover(1);

          my $doc = $parser->parse_html_string($corpus);

          print "-" x 60, "\n";
          print "parse_html rendered with serialize_html\n";
          print "-" x 60, "\n";
          print $doc->serialize_html();

          print "-" x 60, "\n";
          print "parse rendered with serialize_html\n";
          print "-" x 60, "\n";
          my $doc2 = $parser->parse_string($corpus);
          print $doc2->serialize_html();

          __END__
          <p>
          Some HTML & a <b>problem with it > normal but deadly;
          <p>

      Then run it with an arg to suppress errors (they go to STDERR, so they don't interfere with the real output either way):

          moo@cow[48]~/bin>perl libxml-html-forgiving 1
          ------------------------------------------------------------
          parse_html rendered with serialize_html
          ------------------------------------------------------------
          <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
          <html><body><p>
          Some HTML &amp; a <b>problem with it &gt; normal but deadly;
          <p></p></b></p></body></html>
          ------------------------------------------------------------
          parse rendered with serialize_html
          ------------------------------------------------------------
          <p>
          Some HTML a problem with it &gt; normal but deadly;
          </p>

      Or without an arg to see all the feedback:

          moo@cow[49]~/bin>perl libxml-html-forgiving
          HTML parser error : htmlParseEntityRef: no name
          Some HTML & a <b>problem with it > normal but deadly;
                     ^
          ------------------------------------------------------------
          parse_html rendered with serialize_html
          ------------------------------------------------------------
          <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
          <html><body><p>
          Some HTML &amp; a <b>problem with it &gt; normal but deadly;
          <p></p></b></p></body></html>
          ------------------------------------------------------------
          parse rendered with serialize_html
          ------------------------------------------------------------
          :2: parser error : xmlParseEntityRef: no name
          Some HTML & a <b>problem with it > normal but deadly;
                     ^
          :4: parser error : Premature end of data in tag p line 3

          ^
          :4: parser error : Premature end of data in tag b line 2

          ^
          :4: parser error : Premature end of data in tag p line 1

          ^
          <p>
          Some HTML a problem with it &gt; normal but deadly;
          </p>
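
For what it's worth, recover can also be handed straight to new instead of being called as a method. A minimal sketch, assuming an XML::LibXML recent enough to accept options in the constructor; the sample string and variable names are made up for illustration. A value of 2 is the constructor-option equivalent of recover_silently(1).

```perl
use strict;
use warnings;
use XML::LibXML;

# Deliberately broken HTML: a bare ampersand plus unclosed <b> and <p> tags.
my $html = '<p>Some HTML & a <b>problem with it<p>';

# recover => 1 keeps parsing and reports problems on STDERR;
# recover => 2 keeps parsing and stays quiet.
my $parser = XML::LibXML->new( recover => 2 );
my $doc    = $parser->parse_html_string($html);

# Count the elements libxml2 managed to rebuild from the broken input.
my @paragraphs = $doc->findnodes('//p');
my @bold       = $doc->findnodes('//b');
printf "paragraphs: %d, bold: %d\n", scalar @paragraphs, scalar @bold;
```

The exact tree you get back depends on how the underlying libxml2 decides to repair the markup, so check the result against your own corpus before relying on it.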
        Thank you very much, Your Mother. I must have glossed over that bit in the docs with bleary eyes glassified by bashing my head against the trees of the forest... or something like that.

        I'm going to play around with this over the weekend to get comfy with the idea and if all goes well, it looks like I'll be retooling with XML::LibXML.

        Even though I didn't get my original question about HTML::Parser answered, it looks like I've learnt something new and better!
      I've had a gander at XML::LibXML but cannot see how to code it to be real-world HTML tolerant (so I can test it and see how tolerant it is).

      You can't. At least not in Perl. XML::LibXML uses libxml2, which does the XML, and HTML, parsing. That's what you would need to change.

      For the record, when I wanted to add HTML parsing to XML::Twig, I looked at HTML::Parser, XML::LibXML and tidy, and settled on HTML::Parser as the most robust and easy to use solution to get well-formed XML out of random HTML.

        Yes, creamygoodness put me onto HTML::Parser some time ago, and I'm finding it hard to look back.

        I then wonder why Your Mother suggested "There are options to allow more liberal/broken HTML to be parsed (or attempted anyway)."?

        I wonder what options he/she was referring to?

        Any idea?
Re^2: HTML::Parser fun
by FreakyGreenLeaky (Sexton) on Jun 05, 2008 at 09:35 UTC
    Thanks! I'll give that a try to see how it stacks up against HTML::Parser when crunching MBs of test HTML.