Re: Parsing badly formed HTML
by GrandFather (Saint) on Oct 07, 2008 at 00:12 UTC
Depends somewhat on what you want to do with the data, but HTML::TreeBuilder may be a bit more tolerant of messy HTML. Alternatively, you could run the HTML through HTML::Tidy first to clean it up for subsequent parsing.
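A minimal sketch of the HTML::TreeBuilder route, assuming the data of interest lives in ordinary table rows. The sample markup and variable names are invented for illustration; real pages will need their own selection criteria.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Deliberately sloppy sample markup (hypothetical): no closing tags at all.
my $html = '<table><tr><td>alpha<td>1<tr><td>beta<td>2';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;    # signal end of input so the tree gets completed

my @rows;
for my $tr ( $tree->look_down( _tag => 'tr' ) ) {
    # Collect the text of every cell found under this row.
    push @rows, [ map { $_->as_text } $tr->look_down( _tag => 'td' ) ];
}

print join( ' | ', @$_ ), "\n" for @rows;

$tree->delete;    # release the tree explicitly
```

Working on the tree rather than on a re-serialised string keeps the later extraction step independent of attribute order and whitespace.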
Perl reduces RSI - it saves typing
Actually I am using HTML::TreeBuilder, and it gives me a string I can work with. It's only after that that I resort to regular expressions. In a few cases I'm parsing JavaScript, so by that stage I would need a regular expression anyway. What concerns me is that XPath would be so much more robust and elegant, though possibly harder to get right in the first instance. I tried HTML::Tidy but it did not help (can't remember why just now). The HTML has fewer than 300 <tr> elements of interest to me, but several of those are perhaps more robustly parsed by regular expression. On the other hand, I am likely to be caught out by unexpected attributes and elements.
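For what it's worth, XPath can be tried without leaving the TreeBuilder family, via the separate CPAN module HTML::TreeBuilder::XPath. A rough sketch follows; the sample markup and the "item" class name are assumptions made up for the example, not taken from the real pages.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample: the second row carries an attribute we did not expect.
my $html = '<table>'
         . '<tr class="item"><td>alpha<td>1'
         . '<tr class="item" bgcolor=red><td>beta<td>2'
         . '</table>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Select the first cell of every row with class "item".
# Extra attributes on the <tr> do not affect the match.
my @names = map { $_->as_text }
            $tree->findnodes('//tr[@class="item"]/td[1]');

print "$_\n" for @names;

$tree->delete;
```

An expression like this ignores attributes it does not mention, which is the kind of surprise that tends to break a hand-written regular expression.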
Re: Parsing badly formed HTML
by almut (Canon) on Oct 07, 2008 at 00:14 UTC
Re: Parsing badly formed HTML
by smiffy (Pilgrim) on Oct 07, 2008 at 03:31 UTC
If memory serves me correctly, the ability to handle poorly formed markup is one of the features of HTML::Parser and its children, courtesy of Gisle Aas. I have used this family on a few occasions to extract information from some pretty ghastly markup and have never had any problems.
I never even bother trying to use XML::Parser unless I know that the markup is going to be well-formed. (I lie - sometimes I actually use XML::Parser just to see if code is well-formed.)
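The well-formedness check mentioned above can be sketched roughly like this. The helper name is_well_formed is hypothetical; the only assumption about XML::Parser is its documented behaviour of dying on the first well-formedness error.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: true if $markup parses as well-formed XML.
sub is_well_formed {
    my ($markup) = @_;
    my $parser = XML::Parser->new;

    # parse() dies on the first well-formedness error, so trap it with eval.
    return eval { $parser->parse($markup); 1 } ? 1 : 0;
}

for my $sample ( '<p>ok</p>', '<p><b>bad</p>' ) {
    printf "%-16s %s\n", $sample,
        is_well_formed($sample) ? 'well-formed' : 'not well-formed';
}
```

When the check fails, $@ holds the parser's error message, which usually includes the line and column of the problem.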
Re: Parsing badly formed HTML
by Lawliet (Curate) on Oct 07, 2008 at 00:14 UTC
The first rule for parsing markup languages is 'CPAN, CPAN, CPAN'. Or I guess that would be the first three rules.
I'm so adjective, I verb nouns! chomp; # nom nom nom
Re: Parsing badly formed HTML
by JavaFan (Canon) on Oct 07, 2008 at 00:27 UTC
Well, it's hard to say whether you could have done better. Depending on how badly the HTML is formatted (assuming you mean "incorrect" where you say "bad"), no CPAN module may be able to help you. And even if you find a CPAN module that accepts the first 100 incorrectly formatted HTML documents, it may choke on the next one you give it.