Sorry I didn't include it in the first round. I had to look it up in the parser docs, XML::LibXML::Parser, under the HTML options. There are other options, but recover is probably what you need (recover_silently does the same thing without sending any warnings to STDERR). It can be passed as an argument to new or called as a method.
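For what it's worth, the two spellings might look like the sketch below. It assumes a reasonably recent XML::LibXML (1.70 or later, as far as I know), where new accepts option => value pairs and recover => 2 is the documented equivalent of recover_silently. The $html sample and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample: a bare ampersand and unclosed tags.
my $html = "<p>Some HTML & a <b>problem";

# Option passed to the constructor (assumption: 2 means "recover silently").
my $by_arg = XML::LibXML->new( recover => 2 );

# The same setting applied afterwards through the method.
my $by_method = XML::LibXML->new();
$by_method->recover_silently(1);

# Both parsers should build a document from the broken input instead of dying.
for my $parser ( $by_arg, $by_method ) {
    my $doc = $parser->parse_html_string($html);
    print $doc->findvalue('//b'), "\n";
}
```

Either form configures the parser object once, so every later parse_html_string call on it is forgiving.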

# file named 'libxml-html-forgiving'
use warnings;
use strict;
use XML::LibXML;

my $corpus = join "", <DATA>;
my $parser = XML::LibXML->new();

# give command line an argument to hide errors
@ARGV ? $parser->recover_silently(1) : $parser->recover(1);

my $doc = $parser->parse_html_string($corpus);

print "-" x 60, "\n";
print "parse_html rendered with serialize_html\n";
print "-" x 60, "\n";
print $doc->serialize_html();

print "-" x 60, "\n";
print "parse rendered with serialize_html\n";
print "-" x 60, "\n";
my $doc2 = $parser->parse_string($corpus);
print $doc2->serialize_html();

__END__
<p>
Some HTML & a <b>problem with it > normal but deadly;
<p>

Then run it with an argument to suppress the errors (they go to STDERR, so they don't interfere with the real output either way):

moo@cow[48]~/bin>perl libxml-html-forgiving 1
------------------------------------------------------------
parse_html rendered with serialize_html
------------------------------------------------------------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p> Some HTML &amp; a <b>problem with it &gt; normal but deadly; <p></p></b></p></body></html>
------------------------------------------------------------
parse rendered with serialize_html
------------------------------------------------------------
<p> Some HTML a problem with it &gt; normal but deadly; </p>

Or run it without an argument to see all the feedback:

moo@cow[49]~/bin>perl libxml-html-forgiving
HTML parser error : htmlParseEntityRef: no name
Some HTML & a <b>problem with it > normal but deadly;
           ^
------------------------------------------------------------
parse_html rendered with serialize_html
------------------------------------------------------------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p> Some HTML &amp; a <b>problem with it &gt; normal but deadly; <p></p></b></p></body></html>
------------------------------------------------------------
parse rendered with serialize_html
------------------------------------------------------------
:2: parser error : xmlParseEntityRef: no name
Some HTML & a <b>problem with it > normal but deadly;
           ^
:4: parser error : Premature end of data in tag p line 3
^
:4: parser error : Premature end of data in tag b line 2
^
:4: parser error : Premature end of data in tag p line 1
^
<p> Some HTML a problem with it &gt; normal but deadly; </p>

Re^4: HTML::Parser fun
by FreakyGreenLeaky (Sexton) on Jun 06, 2008 at 17:48 UTC
    Thank you very much, Your Mother. I must have glossed over that bit in the docs, with bleary eyes glassified by bashing my head against the trees of the forest... or something like that.

    I'm going to play around with this over the weekend to get comfy with the idea and if all goes well, it looks like I'll be retooling with XML::LibXML.

    Even though I didn't get my original question about HTML::Parser answered, it looks like I've learnt something new and better!

      You're most welcome. I don't know if XML::LibXML is a cure-all, but it's all I've been using for a couple of years for parsing (X)HTML when I don't need a stream (which is most of the time; otherwise I like HTML::TokeParser). It'll even validate documents against DTDs. And as a side effect of picking it up, you'll find you learn other useful stuff like XPath and JS/DOM hacking. My own skills there improved considerably through learning it.
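      To give a taste of the XPath side, here is a minimal sketch that pulls links out of sloppy HTML. The markup and variable names are invented for illustration, and it assumes an XML::LibXML recent enough to accept recover => 2 as an option to new.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input: the <li> elements are never closed.
my $html = '<ul><li><a href="/one">One</a><li><a href="/two">Two</a></ul>';

# Forgiving parse, then query the resulting DOM with XPath.
my $doc   = XML::LibXML->new( recover => 2 )->parse_html_string($html);
my @links = $doc->findnodes('//li/a[@href]');

# Print each link's text and target.
for my $link (@links) {
    printf "%s -> %s\n", $link->textContent, $link->getAttribute('href');
}
```

      The same findnodes call works on any node, so you can narrow down to a subtree first and query relative to it.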