Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have a question regarding HTML::TreeBuilder's look_down. It's not working when we refer to <header> or <article> etc. What could be the issue?

Regards,

Aanamayakki

Re: HTML::Treebuilder look_down not working with <header>, <article> etc
by NetWallah (Canon) on Apr 19, 2013 at 04:17 UTC
            It's not working when we refer to <header> or <article> etc. What could be the issue?

    We wonder too :

    What could be the code that causes this "issue" ?
    What could be the meaning of "not working" ?

                 "I'm fairly sure if they took porn off the Internet, there'd only be one website left, and it'd be called 'Bring Back the Porn!'"
            -- Dr. Cox, Scrubs

      To clarify: I'm not the original poster, but I have the same problem.
      my $tree = HTML::TreeBuilder->new_from_content($webcrawler->content());
      if (my $div = $tree->look_down(_tag => "article")) {
          print $div->as_text(), "\n";
      }
      else {
          print "Not found";
      }
      This code prints "Not found" for this article: http://www.sueddeutsche.de/politik/thailand-regierungschefin-yingluck-verliert-ihr-amt-1.1953299 although the page does contain an article tag. To test the code, I changed it to grab an element inside the article tag instead:
      if (my $div = $tree->look_down(_tag => "p", class => "article entry-summary")) {
          print $div->as_text(), "\n";
      }
      else {
          print "Not found";
      }
      It worked as expected and printed "Das höchste Gericht in Thailand hat entschieden: Regierungschefin Yingluck Shinawatra ist des Verfassungsbruchs schuldig. Sie wurde sofort ihres Amtes enthoben." So I can't seem to grab the article tag itself. Since article is an HTML5 tag, this might be the problem, but how can I solve it another way?
        And you're not telling treebuilder to keep unknown tags because?
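        A minimal sketch of what that hint could look like, assuming the documented ignore_unknown option of HTML::TreeBuilder is what is meant. Since new_from_content parses immediately, the sketch builds the tree with new, switches the option off, and only then parses. The $html string is a made-up stand-in for the fetched page, not the real article.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup standing in for $webcrawler->content().
my $html = '<html><body><article><p class="article entry-summary">'
         . 'Summary text.</p></article></body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->ignore_unknown(0);       # keep tags the parser does not recognise
$tree->parse_content($html);    # parse only after the option is set

my $article = $tree->look_down(_tag => 'article');
my $text    = defined $article ? $article->as_text : 'Not found';
print "$text\n";

$tree->delete;                  # free the tree when done
```

        Whether this is needed may depend on the installed versions of HTML::TreeBuilder and HTML::Tagset, so treat it as something to try rather than a guaranteed fix.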
Re: HTML::Treebuilder look_down not working with <header>, <article> etc
by Anonymous Monk on Apr 19, 2013 at 08:10 UTC
      Erm. Yes. I still don't get it. Can anyone explain that code further?

        <header> and <article> are not HTML tags. HTML::TreeBuilder seems to ignore tags that are not HTML tags. So I would guess that HTML::TreeBuilder is the wrong tool to process things that are not HTML.

        The rest of the post of Anonymous Monk shows you an alternative approach using XML::LibXML. Maybe you should pursue that one.
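        The XML::LibXML code from that post is not reproduced here, so the following is only a rough sketch of the general idea, not the original. It assumes XML::LibXML's load_html with the recover option, and the $html string is a hypothetical stand-in for the real page.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample markup standing in for the fetched page.
my $html = '<html><body><article><p>Summary text.</p></article></body></html>';

# recover => 2 asks the parser to carry on past markup it complains
# about (such as tags it does not know) without printing warnings.
my $doc = XML::LibXML->load_html(
    string  => $html,
    recover => 2,
);

# XPath lookup of the first <article> element, if there is one.
my ($article) = $doc->findnodes('//article');
my $text = defined $article ? $article->textContent : 'Not found';
print "$text\n";
```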

        Erm. Yes. I still don't get it. Can anyone explain that code further?

        Sure, if you can explain what it is you don't get. It's like saying "doesn't work" ... great, ok then, have a nice day