SilasTheMonk has asked for the wisdom of the Perl Monks concerning the following question:

My current piece of work is scraping various bits of information off a web page. I found that the HTML is far from XHTML, and XML::LibXML chokes on it. I have probably gone too far down the regular-expression route by now, but was there a better way of handling this?

Replies are listed 'Best First'.
Re: Parsing badly formed HTML
by GrandFather (Saint) on Oct 07, 2008 at 00:12 UTC

    Depends somewhat on what you want to do with the data, but HTML::TreeBuilder may be a bit more tolerant of messy HTML. Alternatively, you could run the HTML through HTML::Tidy first to clean it up for subsequent parsing.
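    As a minimal sketch of what that looks like (the table markup here is invented for illustration): HTML::TreeBuilder implies the close tags that XML::LibXML would reject outright, so the resulting tree is regular and easy to walk.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Deliberately messy, non-XHTML markup: unclosed <td> and <tr> tags and
# an unquoted attribute -- the sort of thing XML::LibXML chokes on.
my $messy_html = <<'HTML';
<table border=1>
<tr><td>Widget<td>42
<tr><td>Gadget<td>17
</table>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse_content($messy_html);

# TreeBuilder has supplied the missing close tags, so each row and cell
# is a proper element in the tree.
for my $row ($tree->find_by_tag_name('tr')) {
    my @cells = map { $_->as_text } $row->find_by_tag_name('td');
    print join(' => ', @cells), "\n";   # Widget => 42, Gadget => 17
}

$tree->delete;   # free the (self-referential) tree
```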


    Perl reduces RSI - it saves typing
      Actually I am using HTML::TreeBuilder, and it gives me a string I can work with; it's after that point that I resort to regular expressions. In a few cases I'm parsing javascript, so by that stage I would need a regular expression anyway. What concerns me is that XPath would be so much more robust and elegant, though possibly harder to get right in the first instance. I tried HTML::Tidy but it did not help (I can't remember why just now). The HTML has fewer than 300 <tr> elements of interest to me, and several of those are perhaps more robustly parsed by regular expression. On the other hand, I am likely to be caught out by unexpected attributes and elements.
        If you could give a cut-down example of the HTML you are interested in and are having trouble with, it would give us something to go on.

        HTML Tidy/HTML::TreeBuilder is a powerful combination in these cases.

        Let's see... you want to use XPath with HTML::TreeBuilder? How about HTML::TreeBuilder::XPath then? ;--) (OK, I'll admit it was easy for me to know about it)
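        A rough sketch of how that might look (the `class=data` rows here are a made-up stand-in for the scraped page): HTML::TreeBuilder::XPath builds the same tolerant tree as HTML::TreeBuilder, then lets you query it with `findnodes`/`findvalue`, much as you would with XML::LibXML.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical messy markup standing in for the real page.
my $html = <<'HTML';
<table>
<tr class=data><td>alpha<td>1
<tr class=data><td>beta<td>2
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# XPath over the repaired tree; relative paths work from each node.
for my $row ($tree->findnodes('//tr[@class="data"]')) {
    print $row->findvalue('td[1]'), ' = ', $row->findvalue('td[2]'), "\n";
}

$tree->delete;
```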

Re: Parsing badly formed HTML
by almut (Canon) on Oct 07, 2008 at 00:14 UTC
Re: Parsing badly formed HTML
by smiffy (Pilgrim) on Oct 07, 2008 at 03:31 UTC

    If memory serves me correctly, the ability to handle poorly formed markup is one of the features of HTML::Parser and its children, courtesy of Gisle Aas. I have used this family on a few occasions to extract information from some pretty ghastly markup and have never had any problems.
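    A small sketch of that event-driven style (the link-extraction task is invented for illustration): HTML::Parser never dies on bad input; it simply reports the start, end, and text events it can make sense of, so broken markup degrades gracefully rather than aborting the parse.

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect the text of every <a>, even amid badly broken markup.
my @links;
my $in_a = 0;

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { $in_a = 1 if $_[0] eq 'a' }, 'tagname' ],
    end_h   => [ sub { $in_a = 0 if $_[0] eq 'a' }, 'tagname' ],
    text_h  => [ sub { push @links, $_[0] if $in_a }, 'dtext' ],
);

# Unquoted attributes, an unclosed <a>, a truncated <p -- no complaints.
$p->parse('<p><a href=foo.html>first<a href=bar.html>second</a><p');
$p->eof;

print "$_\n" for @links;   # first, second
```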

    I never even bother trying to use XML::Parser unless I know that the markup is going to be well-formed. (I lie - sometimes I actually use XML::Parser just to check whether a document is well-formed.)
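    That well-formedness trick might look something like this (a sketch, since XML::Parser dies on the first violation, it just needs wrapping in `eval`):

```perl
use strict;
use warnings;
use XML::Parser;

# Use XML::Parser purely as a well-formedness check: parse() dies on the
# first violation, so trap that with eval and report pass/fail.
sub is_well_formed {
    my ($doc) = @_;
    return eval { XML::Parser->new->parse($doc); 1 } ? 1 : 0;
}

print is_well_formed('<root><item/></root>') ? "well-formed\n" : "broken\n";
print is_well_formed('<root><item></root>')  ? "well-formed\n" : "broken\n";
```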

Re: Parsing badly formed HTML
by Lawliet (Curate) on Oct 07, 2008 at 00:14 UTC

    The first rule for parsing markup languages is 'CPAN, CPAN, CPAN'. Or I guess that would be the first three rules.

    I'm so adjective, I verb nouns!

    chomp; # nom nom nom

Re: Parsing badly formed HTML
by JavaFan (Canon) on Oct 07, 2008 at 00:27 UTC
    Well, it's hard to say whether you could have done better. Depending on how badly the HTML is formatted (assuming you mean "incorrect" where you say "bad"), no CPAN module may be able to help you. And even if you find a CPAN module that accepts the first 100 incorrectly formatted HTML documents, it may choke on the next one you give it.