SilasTheMonk has asked for the wisdom of the Perl Monks concerning the following question:

My current piece of work is scraping various bits of information off a web page. I found that the HTML is far from XHTML, and XML::LibXML chokes on it. I have probably gone too far down the regular-expression route by now, but was there a better way of handling this?

Replies are listed 'Best First'.
Re: Parsing badly formed HTML
by GrandFather (Saint) on Oct 07, 2008 at 00:12 UTC

    Depends somewhat on what you want to do with the data, but HTML::TreeBuilder may be a bit more tolerant of messy HTML. Alternatively, you could run the HTML through HTML::Tidy first to clean it up for subsequent parsing.
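    As a minimal sketch of what that looks like (the table markup here is invented for illustration): HTML::TreeBuilder implies the close tags that XML::LibXML would reject outright, so the resulting tree is regular and easy to walk.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Deliberately messy, non-XHTML markup: unclosed <td> and <tr> tags and
# an unquoted attribute -- the sort of thing XML::LibXML chokes on.
my $messy_html = <<'HTML';
<table border=1>
<tr><td>Widget<td>42
<tr><td>Gadget<td>17
</table>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse_content($messy_html);

# TreeBuilder has supplied the missing close tags, so each row and cell
# is a proper element in the tree.
for my $row ($tree->find_by_tag_name('tr')) {
    my @cells = map { $_->as_text } $row->find_by_tag_name('td');
    print join(' => ', @cells), "\n";   # Widget => 42, Gadget => 17
}

$tree->delete;   # free the (self-referential) tree
```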


    Perl reduces RSI - it saves typing
      Actually I am using HTML::TreeBuilder, and it gives me a string I can work with; it's after that point that I resort to regular expressions. In a few cases I'm parsing javascript, so by that stage I would need a regular expression anyway. What concerns me is that XPath would be so much more robust and elegant, though possibly harder to get right in the first instance. I tried HTML::Tidy but it did not help (I can't remember why just now). The HTML has fewer than 300 <tr> elements of interest to me, and several of those are perhaps more robustly parsed by regular expression. On the other hand, I am likely to be caught out by unexpected attributes and elements.
        If you could give a cut-down example of the HTML you are interested in and are having trouble with, it would give us something to go on.

        HTML Tidy/HTML::TreeBuilder is a powerful combination in these cases.

        Let's see... you want to use XPath with HTML::TreeBuilder? How about HTML::TreeBuilder::XPath then? ;--) (OK, I'll admit it was easy for me to know about it)
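        A rough sketch of how that might look (the `class=data` rows here are a made-up stand-in for the scraped page): HTML::TreeBuilder::XPath builds the same tolerant tree as HTML::TreeBuilder, then lets you query it with `findnodes`/`findvalue`, much as you would with XML::LibXML.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical messy markup standing in for the real page.
my $html = <<'HTML';
<table>
<tr class=data><td>alpha<td>1
<tr class=data><td>beta<td>2
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# XPath over the repaired tree; relative paths work from each node.
for my $row ($tree->findnodes('//tr[@class="data"]')) {
    print $row->findvalue('td[1]'), ' = ', $row->findvalue('td[2]'), "\n";
}

$tree->delete;
```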

Re: Parsing badly formed HTML
by almut (Canon) on Oct 07, 2008 at 00:14 UTC
Re: Parsing badly formed HTML
by smiffy (Pilgrim) on Oct 07, 2008 at 03:31 UTC

    If memory serves me correctly, the ability to handle poorly formed markup is one of the features of HTML::Parser and its children, courtesy of Gisle Aas. I have used this family on a few occasions to extract information from some pretty ghastly markup and have never had any problems.
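    A small sketch of that event-driven style (the link-extraction task is invented for illustration): HTML::Parser never dies on bad input; it simply reports the start, end, and text events it can make sense of, so broken markup degrades gracefully rather than aborting the parse.

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect the text of every <a>, even amid badly broken markup.
my @links;
my $in_a = 0;

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { $in_a = 1 if $_[0] eq 'a' }, 'tagname' ],
    end_h   => [ sub { $in_a = 0 if $_[0] eq 'a' }, 'tagname' ],
    text_h  => [ sub { push @links, $_[0] if $in_a }, 'dtext' ],
);

# Unquoted attributes, an unclosed <a>, a truncated <p -- no complaints.
$p->parse('<p><a href=foo.html>first<a href=bar.html>second</a><p');
$p->eof;

print "$_\n" for @links;   # first, second
```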

    I never even bother trying to use XML::Parser unless I know that the markup is going to be well-formed. (I lie - sometimes I actually use XML::Parser just to check whether a document is well-formed.)
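    That well-formedness trick might look something like this (a sketch, since XML::Parser dies on the first violation, it just needs wrapping in `eval`):

```perl
use strict;
use warnings;
use XML::Parser;

# Use XML::Parser purely as a well-formedness check: parse() dies on the
# first violation, so trap that with eval and report pass/fail.
sub is_well_formed {
    my ($doc) = @_;
    return eval { XML::Parser->new->parse($doc); 1 } ? 1 : 0;
}

print is_well_formed('<root><item/></root>') ? "well-formed\n" : "broken\n";
print is_well_formed('<root><item></root>')  ? "well-formed\n" : "broken\n";
```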

Re: Parsing badly formed HTML
by Lawliet (Curate) on Oct 07, 2008 at 00:14 UTC

    The first rule for parsing markup languages is 'CPAN, CPAN, CPAN'. Or I guess that would be the first three rules.

    I'm so adjective, I verb nouns!

    chomp; # nom nom nom

Re: Parsing badly formed HTML
by JavaFan (Canon) on Oct 07, 2008 at 00:27 UTC
    Well, it's hard to say whether you could have done better. Depending on how badly the HTML is formatted (assuming you mean "incorrect" where you say "bad"), no CPAN module may be able to help you. And even if you find a CPAN module that accepts the first 100 incorrectly formatted HTML documents, it may choke on the next one you give it.