easy HTML::TokeParser help request

2ge has asked for the wisdom of the Perl Monks concerning the following question:

Re: easy HTML::TokeParser help request by Fletch (Bishop) on Aug 03, 2006 at 13:16 UTC
Not a TokeParser solution, but using HTML::TreeBuilder I'd use `$t->look_down( _tag => "div", class => "full" )` to get a list of the divs you're interested in and then call `$div->look_down( _tag => 'a' )` on each of those. Sometimes the tree solution's just conceptually easier to get your brane around.
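A minimal sketch of the tree approach described above, assuming HTML::TreeBuilder is installed. The sub name `hrefs_via_tree` and the sample markup are made up for illustration; `new_from_content`, `look_down`, `attr` and `delete` are the module's documented methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the href of every <a> found
# anywhere inside a <div class="full">.
sub hrefs_via_tree {
    my ($html) = @_;
    my $t = HTML::TreeBuilder->new_from_content($html);
    my @hrefs;

    # In list context look_down returns every matching element.
    for my $div ( $t->look_down( _tag => 'div', class => 'full' ) ) {
        for my $link ( $div->look_down( _tag => 'a' ) ) {
            my $href = $link->attr('href');
            push @hrefs, $href if defined $href;
        }
    }

    $t->delete;    # release the tree explicitly
    return @hrefs;
}

# Sample markup, trimmed from the kind of data discussed in this thread.
my $html = <<'HTML';
<div class="full"><div class="content"><ul class="topics">
<li><a href="foobar">foobar</a></li>
<li><a href="foobr2">fobar2</a></li>
</ul></div></div>
<div class="otherclass"><div class="content"><ul class="topics">
<li><a href="fbaor">fbaor</a></li>
</ul></div></div>
HTML

print "$_\n" for hrefs_via_tree($html);
```

Note that the `class => 'full'` criterion is an exact string match on the attribute, so a div with `class="full wide"` would not be picked up by this sketch.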
Re^2: easy HTML::TokeParser help request by 2ge (Scribe) on Aug 03, 2006 at 13:27 UTC
Thanks, I know HTML::TreeBuilder; I used it some time ago. But I'd like to use only one parser module if possible, and I've decided on HTML::TokeParser. I hope there is a solution with this module; I also don't want to resort to hacks with regular expressions and the like. Maybe I have to use `$p->unget_token( @tokens )` to get the desired links.
Re^3: easy HTML::TokeParser help request by GrandFather (Saint) on Aug 03, 2006 at 19:35 UTC
Of about the last 10 "I need to parse this HTML/XML structure" questions asked here, nine of the answers were trivial using ::TreeBuilder (there are XML and HTML versions) and the other was trivial using XML::Twig. Personally I use TreeBuilder more often in an HTML context and XML::Twig for XHTML and XML. XML::Twig is very powerful for editing; TreeBuilder is very good at looking stuff up. At the end of the day, the more modules you know a little bit about, the more quickly and reliably you get stuff done. Don't be afraid to read documentation! Sometimes a quick question in the CB can save a huge amount of time, if you have a general idea where you are headed in the first place. Limiting yourself to a single module is ... limiting! There is no one tool that does every job, not even computers.

DWIM is Perl's answer to Gödel
Re^3: easy HTML::TokeParser help request by Fletch (Bishop) on Aug 03, 2006 at 13:33 UTC
If you're dead set on tokeing it, build a state machine:

- start in `looking_for_full` until you see a div with class full, when you transition to `looking_for_content`
- when you see a div with class content in state `looking_for_content`, transition to `looking_for_anchors`
- when you see an anchor in `looking_for_anchors`, save the href attribute
- when you see a `</div>` in `looking_for_anchors`, go back to `looking_for_full`

Additional: Little note on implementation: you'd have a `$state` variable which keeps track of which state you're in (start with `my $state = 'looking_for_full';`). You'd then have a `while( my $t = $stream->get_token ) { ... }` loop, inside of which you'd implement the above behaviors. Any non-interesting token for the current state would be ignored (e.g. just `next` back to fetch the next token).
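The state machine described above could be sketched roughly like this with plain HTML::TokeParser. This is one possible reading of the description, not a tested production solution: the sub name `hrefs_in_full` and the sample markup are invented for illustration, and the code assumes Perl 5.10 or later for the `//` operator.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper implementing the three states described above.
sub hrefs_in_full {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new( \$html )
        or die "can't create parser";
    my $state = 'looking_for_full';
    my @hrefs;

    while ( my $t = $stream->get_token ) {
        # Start tags are ["S", $tag, \%attr, ...], end tags ["E", $tag, ...].
        my ( $type, $tag ) = @$t;
        next unless $type eq 'S' or $type eq 'E';    # ignore text, comments, etc.

        my $attr  = $type eq 'S' ? $t->[2] : {};
        my $class = $attr->{class} // '';

        if ( $state eq 'looking_for_full' ) {
            $state = 'looking_for_content'
                if $type eq 'S' and $tag eq 'div' and $class eq 'full';
        }
        elsif ( $state eq 'looking_for_content' ) {
            $state = 'looking_for_anchors'
                if $type eq 'S' and $tag eq 'div' and $class eq 'content';
        }
        elsif ( $state eq 'looking_for_anchors' ) {
            if ( $type eq 'S' and $tag eq 'a' ) {
                push @hrefs, $attr->{href} if defined $attr->{href};
            }
            elsif ( $type eq 'E' and $tag eq 'div' ) {
                $state = 'looking_for_full';    # first </div> ends the content div
            }
        }
    }
    return @hrefs;
}

# Sample markup, trimmed from the kind of data discussed in this thread.
my $html = <<'HTML';
<div class="full"><div class="content"><ul class="topics">
<li><a href="foobar">foobar</a></li>
<li><a href="foobr2">fobar2</a></li>
</ul></div></div>
<div class="otherclass"><div class="content"><ul class="topics">
<li><a href="fbaor">fbaor</a></li>
</ul></div></div>
HTML

print "$_\n" for hrefs_in_full($html);
```

With the sample above this prints `foobar` and `foobr2`. Like the description it follows, the sketch does not count nesting: a div nested inside the content div would end anchor collection at its closing tag, and a full div without a content div would leave the machine waiting for the next content div anywhere in the page.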
Re^4: easy HTML::TokeParser help request by 2ge (Scribe) on Aug 03, 2006 at 14:21 UTC
Re^5: easy HTML::TokeParser help request by Fletch (Bishop) on Aug 03, 2006 at 14:42 UTC
Some notes below your chosen depth have not been shown here
Re: easy HTML::TokeParser help request by wfsp (Abbot) on Aug 03, 2006 at 16:21 UTC
With HTML::TokeParser::Simple (assumes all links are within class="full"):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $html = do {local $/; <DATA>};
    my $p = HTML::TokeParser::Simple->new(\$html)
      or die "can't parse: $!";

    my ($in_full, @href);
    while (my $t = $p->get_token){
      if ($t->is_start_tag('div')){
        # treat a missing class as empty to avoid "uninitialized" warnings
        my $class = $t->get_attr('class') || '';
        # the inner "content" div leaves the current state alone
        next if $class eq 'content';
        # any other div switches collecting on ("full") or off
        $in_full = $class eq 'full' ? 1 : 0;
        next;
      }
      next unless $in_full;
      push @href, $t->get_attr('href')
        if $t->is_start_tag('a');
    }
    print "$_\n" for @href;

    __DATA__
    <!-- some other divs and so here -->
    <div class="full">
      <div class="content">
        <ul class="topics">
          <-- I want extract these links div class "full" only -->
          <li><a href="foobar">foobar</a></li>
          <li><a href="foobr2">fobar2</a></li>
          <li><a href="fobar3">foobr3</a></li>
        </ul>
      </div>
    </div>
    <div class="otherclass">
      <div class="content">
        <ul class="topics">
          <-- I DO NOT WANT these links -->
          <li><a href="fbaor">fbaor</a></li>
          <li><a href="fabar2">fabar2</a></li>
          <li><a href="fbar3">fbar3</a></li>
        </ul>
      </div>
    </div>
    <!-- some other divs and so here -->
    <div class="full">
      <div class="content">
        <ul class="topics">
          <-- I want extract these links div class "full" only -->
          <li><a href="foobar4">foobar4</a></li>
          <li><a href="foobr5">fobar5</a></li>
          <li><a href="fobar6">foobr6</a></li>
        </ul>
      </div>
    </div>
    <div class="otherclass">
      <div class="content">
        <ul class="topics">
          <-- I DO NOT WANT these links -->
          <li><a href="fbaor">fbaor</a></li>
          <li><a href="fabar2">fabar2</a></li>
          <li><a href="fbar3">fbar3</a></li>
        </ul>
      </div>
    </div>
    <!-- some other divs and so here -->
Re: easy HTML::TokeParser help request by un-chomp (Scribe) on Aug 03, 2006 at 14:56 UTC
Relies on well-formed HTML:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser;

    my $doc = do { local $/; <DATA> };
    my $p = HTML::TokeParser->new( \$doc );

    while ( my $outer = $p->get_tag("div") ) {
        # a div without a class attribute is simply skipped
        next unless ( $outer->[1]{class} || '' ) eq "full";
        my $nested_div = 0;
        while ( my $inner = $p->get_tag ) {
            # keep count of nested divs
            $nested_div++ if $inner->[0] eq "div";
            $nested_div-- if $inner->[0] eq "/div";

            # "full" div has closed
            last if $nested_div == -1;

            # print the link text that follows each <a>
            print $p->get_text, "\n" if $inner->[0] eq "a";
        }
    }

    __DATA__
    <!-- some other divs and so here -->
    <div class="full">
      <div class="content">
        <ul class="topics">
          <-- I want extract these links div class "full" only -->
          <li><a href="foobar">foobar</a></li>
          <li><a href="foobr2">fobar2</a></li>
          <li><a href="fobar3">foobr3</a></li>
        </ul>
      </div>
    </div>
    <div class="otherclass">
      <div class="content">
        <ul class="topics">
          <-- I DO NOT WANT these links -->
          <li><a href="fbaor">fbaor</a></li>
          <li><a href="fabar2">fabar2</a></li>
          <li><a href="fbar3">fbar3</a></li>
        </ul>
      </div>
    </div>
    <!-- some other divs and so here -->
Re^2: easy HTML::TokeParser help request by 2ge (Scribe) on Aug 04, 2006 at 12:12 UTC
Thanks! A very, very nice solution: easy to understand, and it works as it should. I give you ++