in reply to Re: Parsing badly formed HTML
in thread Parsing badly formed HTML

Actually, I am using HTML::TreeBuilder, and it gives me a string I can work with. It is only after that point that I resort to regular expressions. In a few cases I'm parsing JavaScript, so by that stage I would need a regular expression anyway. What concerns me is that XPath would be so much more robust and elegant, though possibly harder to get right in the first instance. I tried HTML::Tidy but it did not help (I can't remember why just now). The HTML has fewer than 300 <tr> elements of interest to me, and several of those are perhaps actually parsed more robustly by regular expression. On the other hand, I am likely to be caught out by unexpected attributes and elements.

Re^3: Parsing badly formed HTML
by wfsp (Abbot) on Oct 07, 2008 at 07:50 UTC
    If you could give a cut-down example of the HTML you are interested in and are having trouble with, it would give us something to go on.

    HTML Tidy/HTML::TreeBuilder is a powerful combination in these cases.

Re^3: Parsing badly formed HTML
by mirod (Canon) on Oct 07, 2008 at 12:07 UTC

    Let's see... you want to use XPath with HTML::TreeBuilder? How about HTML::TreeBuilder::XPath then? ;--) (OK, I'll admit it was easy for me to know about it)
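    For anyone landing here later, a minimal sketch of what that might look like. The markup, the class names and the variable names below are made up for illustration, and the unclosed tags are there only to mimic badly formed input; this is not the original poster's page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical, deliberately sloppy markup: <td> and <tr> are never closed.
my $html = <<'HTML';
<table>
<tr class="item"><td>Widget<td>4.50
<tr class="item"><td>Gadget<td>12.00
<tr class="total"><td>Total<td>16.50
</table>
HTML

# HTML::TreeBuilder::XPath is a subclass of HTML::TreeBuilder,
# so the usual new/parse/eof sequence applies.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Select only the rows of interest, then collect the text of each cell.
my @rows;
for my $tr ( $tree->findnodes('//tr[@class="item"]') ) {
    my @cells = map { $_->as_trimmed_text } $tr->findnodes('./td');
    push @rows, \@cells;
}

print join( ' | ', @$_ ), "\n" for @rows;

# Free the tree explicitly, as recommended for HTML::TreeBuilder objects.
$tree->delete;
```

    Because the XPath expression names the attribute it cares about, extra attributes or unexpected elements elsewhere in the row should not break the match the way they tend to with a regular expression.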

      I have parsed the pages now, so I am happy. However, I would NOT want to use that approach again, so I really appreciate all the advice. HTML::TreeBuilder::XPath looks like the way to go.

      Edit: 27-1-2009 I need to write a similar script and this time I will have to deal with nested tables. Sounds like this module is going to be essential.
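      For the nested-table case, one possible approach is to tell outer rows from nested rows by counting <table> ancestors. This is only a sketch under assumptions: the markup and the id values are invented, and a real page may nest more deeply or need a different predicate.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup: one table nested inside a cell of another.
my $html = <<'HTML';
<table id="outer">
  <tr><td>Row A</td></tr>
  <tr><td>
    <table id="inner">
      <tr><td>Nested 1</td></tr>
      <tr><td>Nested 2</td></tr>
    </table>
  </td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# A row with exactly one <table> ancestor belongs to the outer table;
# a row with two belongs to a table nested one level down.
my @outer_rows = $tree->findnodes('//tr[count(ancestor::table) = 1]');
my @inner_rows = $tree->findnodes('//tr[count(ancestor::table) = 2]');

printf "outer rows: %d, nested rows: %d\n",
    scalar @outer_rows, scalar @inner_rows;
```

      Counting ancestors rather than matching direct children means the expression does not depend on whether a <tbody> sits between the table and its rows.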