in reply to Parsing badly formed HTML

Depends somewhat on what you want to do with the data, but HTML::TreeBuilder may be a bit more tolerant of messy HTML. Alternatively, you could run the HTML through HTML::Tidy first to clean it up for subsequent parsing.
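For example, a minimal sketch of what I mean by tolerant: HTML::TreeBuilder will happily accept table rows and cells whose close tags are missing, which is one of the most common forms of bad HTML. (The sample markup here is made up for illustration.)

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Deliberately sloppy markup: no </td>, </tr>, or quoting anywhere
my $html = '<table><tr><td>cell 1<tr><td>cell 2</table>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# TreeBuilder has implied the missing close tags, so both cells
# show up as proper <td> elements in the tree
my @cells = map { $_->as_text } $tree->look_down( _tag => 'td' );
print "$_\n" for @cells;

$tree->delete;    # free the tree's circular references
```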


Perl reduces RSI - it saves typing


Re^2: Parsing badly formed HTML
by SilasTheMonk (Chaplain) on Oct 07, 2008 at 06:42 UTC
    Actually I am using HTML::TreeBuilder, and it gives me a string I can work with; it's after that point that I resort to regular expressions. In a few cases I'm parsing javascript, so by that stage I would need a regular expression anyway. What concerns me is that XPath would be so much more robust and elegant, though possibly harder to get right in the first instance. I tried HTML::Tidy but it did not help (I can't remember why just now). The HTML has fewer than 300 <tr> elements of interest to me, but several of those are perhaps more robustly parsed by regular expression. On the other hand, I am likely to be caught out by unexpected attributes and elements.
      If you could give a cut-down example of the HTML you are interested in and are having trouble with, it would give us something to go on.

      HTML Tidy/HTML::TreeBuilder is a powerful combination in these cases.
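      The combination could be sketched roughly like this: one pass through HTML::Tidy to repair the markup, then HTML::TreeBuilder on the cleaned output. (The fragment and the `show_body_only` option here are just illustrative assumptions; tune the tidy options to your input.)

```perl
use strict;
use warnings;
use HTML::Tidy;
use HTML::TreeBuilder;

# A made-up fragment: unclosed <p> tags and a stray close tag
my $messy = '<p>first<p>second</i>';

# Pass 1: let tidy repair the markup; show_body_only keeps it a fragment
my $tidy  = HTML::Tidy->new( { show_body_only => 1 } );
my $clean = $tidy->clean($messy);

# Pass 2: parse the repaired markup into a tree and walk it normally
my $tree  = HTML::TreeBuilder->new_from_content($clean);
my @paras = map { $_->as_text } $tree->look_down( _tag => 'p' );
print "$_\n" for @paras;

$tree->delete;
```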

      Let's see... you want to use XPath with HTML::TreeBuilder? How about HTML::TreeBuilder::XPath then? ;--) (OK, I'll admit it was easy for me to know about it)
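        To show roughly what that buys you, here is a small sketch using HTML::TreeBuilder::XPath: the same tolerant parse as HTML::TreeBuilder, but queried with an XPath expression instead of hand-rolled tree walking. (The table markup is invented for the example.)

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $html = '<table><tr><td>alpha</td></tr><tr><td>beta</td></tr></table>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# findvalues returns the text content of every node matching the XPath
my @cells = $tree->findvalues('//tr/td');
print "$_\n" for @cells;

$tree->delete;
```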

        I have parsed the pages now, so I am happy. However, I would NOT want to use that approach again, so I really appreciate all the advice. HTML::TreeBuilder::XPath looks like the way to go.

        Edit: 27-1-2009 I need to write a similar script, and this time I will have to deal with nested tables. It sounds like this module is going to be essential.