HTML::Treebuilder look_down not working with <header>, <article> etc

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: HTML::Treebuilder look_down not working with <header>, <article> etc by NetWallah (Canon) on Apr 19, 2013 at 04:17 UTC
"It's not working when we refer to <header> or <article> etc.... What could be the issue?"

We wonder too: What could be the code that causes this "issue"? What could be the meaning of "not working"?

"I'm fairly sure if they took porn off the Internet, there'd only be one website left, and it'd be called 'Bring Back the Porn!'" -- Dr. Cox, Scrubs
Re^2: HTML::Treebuilder look_down not working with <header>, <article> etc by Anonymous Monk on May 07, 2014 at 11:38 UTC
To clarify: I'm not the original poster, but I have the same problem.

`my $tree = HTML::TreeBuilder->new_from_content($webcrawler->content()); if (my $div = $tree->look_down(_tag => "article")) { print $div->as_text(), "\n"; } else { print "Not found"; }`

This piece of code gives "Not found" on this article: http://www.sueddeutsche.de/politik/thailand-regierungschefin-yingluck-verliert-ihr-amt-1.1953299 although the page contains an article tag.

To test the code, I changed it to grab an element inside the article tag instead:

`if (my $div = $tree->look_down(_tag => "p", class => "article entry-summary")) { print $div->as_text(), "\n"; } else { print "Not found"; }`

It worked as expected and printed "Das höchste Gericht in Thailand hat entschieden: Regierungschefin Yingluck Shinawatra ist des Verfassungsbruchs schuldig. Sie wurde sofort ihres Amtes enthoben." (In English: "Thailand's highest court has ruled: Prime Minister Yingluck Shinawatra is guilty of violating the constitution. She was removed from office immediately.")

So I can't seem to grab the article tag itself. Since article is an HTML5 tag, this might be the problem, but how can I solve this another way?
Re^3: HTML::Treebuilder look_down not working with <header>, <article> etc by Anonymous Monk on May 07, 2014 at 12:45 UTC
And you're not telling TreeBuilder to keep unknown tags because?
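For anyone landing here later, a minimal sketch of what "keep unknown tags" could look like, assuming the documented `ignore_unknown` option of HTML::TreeBuilder. Since `new_from_content` parses right away with the default settings, the sketch builds the tree with `new`, sets the option, and only then parses. The helper name `first_article_text` and the sample markup are made up for illustration; the markup stands in for `$webcrawler->content()`.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of the first <article> element,
# or undef if the document has none.
sub first_article_text {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->ignore_unknown(0);       # keep elements TreeBuilder does not know
    $tree->parse_content($html);    # parse only after the option is set

    my $article = $tree->look_down( _tag => 'article' );
    my $text    = $article ? $article->as_text : undef;

    $tree->delete;                  # release the tree explicitly
    return $text;
}

# Made-up sample markup, standing in for $webcrawler->content().
my $html = '<html><body><article>'
         . '<p class="article entry-summary">Hello</p>'
         . '</article></body></html>';

my $text = first_article_text($html);
print defined $text ? "$text\n" : "Not found\n";
```

With the default setting (`ignore_unknown` on), the same `look_down` call would not find the element, which matches the "Not found" reported above.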
Re^4: HTML::Treebuilder look_down not working with <header>, <article> etc by Anonymous Monk on May 07, 2014 at 14:36 UTC
Re: HTML::Treebuilder look_down not working with <header>, <article> etc by Anonymous Monk on Apr 19, 2013 at 08:10 UTC
"What could be the issue?"

Not enough manual reading/understanding.

$ perl -MHTML::Tree -e " print HTML::Tree->new_from_content(q{<header><article>})->as_HTML"
<html><head></head><body></body></html>

$ perl -MHTML::Tree -le " print HTML::Tree->new(qw{ ignore_unknown 0 })->parse_content(q{<header><article>})->as_HTML"
<html><head></head><body></body><header><article></article></header></html>

$ perl -MHTML::HTML5::Parser -le " print HTML::HTML5::Parser->load_html( string => \q{<header><article>}) "
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body><header><article/></header></body></html>

$ perl -MMojo::DOM -le " print Mojo::DOM->new( q{<header><article>} )"
<header><article></article></header>

$ perl -MXML::LibXML -le " print XML::LibXML->new( qw/ recover 2 / )->load_html( string => q{<header><article>} ); "
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><header><article/></header></body></html>

htmltreexpather.pl, Parsing HTML / Re^4: Parsing HTML, A regex question, NASA's Astronomy Picture of the Day / Re: NASA's Astronomy Picture of the Day, Re: Extracting HTML content between the h tags, Re^2: Help With Online Table Scraper, Re^4: web::scraper using an xpath, .... HTML Parser suggestions, Extract Portion of HTML
Re^2: HTML::Treebuilder look_down not working with <header>, <article> etc by Anonymous Monk on May 07, 2014 at 10:51 UTC
Erm. Yes. I still don't get it. Can anyone explain that code further?
Re^3: HTML::Treebuilder look_down not working with <header>, <article> etc by Corion (Patriarch) on May 07, 2014 at 10:56 UTC
`<header>` and `<article>` are not HTML 4 tags; they were introduced with HTML5. HTML::TreeBuilder seems to ignore tags it does not know. So I would guess that HTML::TreeBuilder is the wrong tool to process markup that goes beyond the HTML it knows.

The rest of Anonymous Monk's post shows you an alternative approach using XML::LibXML. Maybe you should pursue that one.
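A rough sketch of that XML::LibXML route, using the `load_html` call and `recover` setting from the one-liner earlier in the thread plus an XPath query for the element. The helper name `article_texts` and the sample markup are hypothetical, only there to keep the example self-contained.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return the text content of every <article>
# element found in an HTML string.
sub article_texts {
    my ($html) = @_;

    # recover => 2 asks libxml2 to keep going past markup it considers
    # invalid, without printing warnings about it.
    my $doc = XML::LibXML->load_html( string => $html, recover => 2 );

    return map { $_->textContent } $doc->findnodes('//article');
}

# Made-up sample markup for illustration.
my $html = '<html><body><header>Top</header>'
         . '<article><p>Hello</p></article></body></html>';

print "$_\n" for article_texts($html);
```

The XPath expression `//article` matches the element wherever it sits in the tree, so the same query should work on a real page where the article is nested inside other containers.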
Re^4: HTML::Treebuilder look_down not working with <header>, <article> etc by Anonymous Monk on May 07, 2014 at 11:08 UTC
Re^3: HTML::Treebuilder look_down not working with <header>, <article> etc by Anonymous Monk on May 07, 2014 at 10:54 UTC
"Erm. Yes. I still don't get it. Anyone can explain that code further?"

Sure, if you can explain what -- it's like saying "doesn't work" ... great, ok then, have a nice day