in reply to Parsing badly formed HTML

Depends somewhat on what you want to do with the data, but HTML::TreeBuilder may be a bit more tolerant of messy HTML. Alternatively, you could run the HTML through HTML::Tidy first to clean it up for subsequent parsing.
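For example, a minimal sketch of what I mean by tolerant: HTML::TreeBuilder will happily accept table rows and cells whose close tags are missing, which is one of the most common forms of bad HTML. (The sample markup here is made up for illustration.)

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Deliberately sloppy markup: no </td>, </tr>, or quoting anywhere
my $html = '<table><tr><td>cell 1<tr><td>cell 2</table>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# TreeBuilder has implied the missing close tags, so both cells
# show up as proper <td> elements in the tree
my @cells = map { $_->as_text } $tree->look_down( _tag => 'td' );
print "$_\n" for @cells;

$tree->delete;    # free the tree's circular references
```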


Perl reduces RSI - it saves typing


Re^2: Parsing badly formed HTML
by SilasTheMonk (Chaplain) on Oct 07, 2008 at 06:42 UTC
    Actually I am using HTML::TreeBuilder, and it gives me a string I can work with; it's after that point that I resort to regular expressions. In a few cases I'm parsing javascript, so by that stage I would need a regular expression anyway. What concerns me is that XPath would be so much more robust and elegant, though possibly harder to get right in the first instance. I tried HTML::Tidy but it did not help (I can't remember why just now). The HTML has fewer than 300 <tr> elements of interest to me, but several of those are perhaps more robustly parsed by regular expression. On the other hand, I am likely to be caught out by unexpected attributes and elements.
      If you could give a cut-down example of the HTML you are interested in and are having trouble with, it would give us something to go on.

      HTML Tidy/HTML::TreeBuilder is a powerful combination in these cases.
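      The combination could be sketched roughly like this: one pass through HTML::Tidy to repair the markup, then HTML::TreeBuilder on the cleaned output. (The fragment and the `show_body_only` option here are just illustrative assumptions; tune the tidy options to your input.)

```perl
use strict;
use warnings;
use HTML::Tidy;
use HTML::TreeBuilder;

# A made-up fragment: unclosed <p> tags and a stray close tag
my $messy = '<p>first<p>second</i>';

# Pass 1: let tidy repair the markup; show_body_only keeps it a fragment
my $tidy  = HTML::Tidy->new( { show_body_only => 1 } );
my $clean = $tidy->clean($messy);

# Pass 2: parse the repaired markup into a tree and walk it normally
my $tree  = HTML::TreeBuilder->new_from_content($clean);
my @paras = map { $_->as_text } $tree->look_down( _tag => 'p' );
print "$_\n" for @paras;

$tree->delete;
```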

      Let's see... you want to use XPath with HTML::TreeBuilder? How about HTML::TreeBuilder::XPath then? ;--) (OK, I'll admit it was easy for me to know about it)
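        To show roughly what that buys you, here is a small sketch using HTML::TreeBuilder::XPath: the same tolerant parse as HTML::TreeBuilder, but queried with an XPath expression instead of hand-rolled tree walking. (The table markup is invented for the example.)

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $html = '<table><tr><td>alpha</td></tr><tr><td>beta</td></tr></table>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# findvalues returns the text content of every node matching the XPath
my @cells = $tree->findvalues('//tr/td');
print "$_\n" for @cells;

$tree->delete;
```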

        I have parsed the pages now, so I am happy. However, I would NOT want to use that approach again, so I really appreciate all the advice. HTML::TreeBuilder::XPath looks like the way to go.

        Edit: 27-1-2009 I need to write a similar script, and this time I will have to deal with nested tables. It sounds like this module is going to be essential.