2ge has asked for the wisdom of the Perl Monks concerning the following question:

Hello guys,

I am using this great module (HTML::TokeParser) for parsing. I like it and it is quite easy to use, but now I have run into trouble: I really don't know how to parse the following:
<!-- some other divs and so here -->
<div class="full">
  <div class="content">
    <ul class="topics">
      <-- I want extract these links div class "full" only -->
      <li><a href="foobar">foobar</a></li>
      <li><a href="foobr2">fobar2</a></li>
      <li><a href="fobar3">foobr3</a></li>
    </ul>
  </div>
</div>
<div class="otherclass">
  <div class="content">
    <ul class="topics">
      <-- I DO NOT WANT these links -->
      <li><a href="fbaor">fbaor</a></li>
      <li><a href="fabar2">fabar2</a></li>
      <li><a href="fbar3">fbar3</a></li>
    </ul>
  </div>
</div>
<!-- some other divs and so here -->

I have read all the nodes posted here, and the tutorial as well. The main trouble is that I don't know how to create a while() loop that covers only the <div class="full"> HTML code; if I knew that, I could write such a parser myself.

Thanks for any help.

Replies are listed 'Best First'.
Re: easy HTML::TokeParser help request
by Fletch (Bishop) on Aug 03, 2006 at 13:16 UTC

    Not a TokeParser solution, but using HTML::TreeBuilder I'd use $t->look_down( _tag => "div", class => "full" ) to get a list of the divs you're interested in and then call $div->look_down( _tag => 'a' ) on each of those. Sometimes the tree solution's just conceptually easier to get your brane around.
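
    The two look_down calls described above can be sketched roughly as follows. This is only a minimal illustration, not Fletch's actual code: the trimmed sample markup, the @hrefs array, and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup trimmed from the question (an assumption for this sketch);
# only the links inside the "full" div should be collected.
my $html = <<'HTML';
<div class="full"><div class="content"><ul class="topics">
<li><a href="foobar">foobar</a></li>
<li><a href="foobr2">fobar2</a></li>
</ul></div></div>
<div class="otherclass"><div class="content"><ul class="topics">
<li><a href="fbaor">fbaor</a></li>
</ul></div></div>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @hrefs;
# In list context look_down returns every matching element.
for my $div ( $tree->look_down( _tag => 'div', class => 'full' ) ) {
    for my $link ( $div->look_down( _tag => 'a' ) ) {
        push @hrefs, $link->attr('href');
    }
}
$tree->delete;    # free the tree once we are done with it

print "$_\n" for @hrefs;
```

    Note that class => 'full' compares the whole attribute value, so a div with class="full wide" would not match in this sketch.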

      Thanks, I know HTML::TreeBuilder; I used it some time ago. But I'd like to use only one parser module if possible, and I decided on HTML::TokeParser. I hope some solution exists for this module; I also don't want to use a hack with regular expressions or the like. Maybe I have to use $p->unget_token( @tokens ) to get the desired links.

        Of about the last 10 "I need to parse this HTML/XML structure" questions asked here, nine of the answers were trivial using ::TreeBuilder (there are XML and HTML versions) and the other was trivial using XML::Twig.

        Personally I use TreeBuilder more often in an HTML context and XML::Twig for XHTML and XML. XML::Twig is very powerful for editing; TreeBuilder is very good at looking stuff up.

        At the end of the day, the more modules you know a little bit about, the more quickly and reliably you get stuff done. Don't be afraid to read documentation! Sometimes a quick question in the CB can save a huge amount of time, if you have a general idea where you are headed in the first place.

        Limiting yourself to a single module is ... limiting! There is no one tool that does every job, not even computers.


        DWIM is Perl's answer to Gödel

        If you're dead set on tokeing it, build a state machine:

        • start in looking_for_full until you see a div with class full, when you transition to looking_for_content
        • when you see a div with class content in state looking_for_content, transition to looking_for_anchors
        • when you see an anchor in looking_for_anchors, save the href attribute
        • when you see a </div> in looking_for_anchors, go back to looking_for_full

        Addendum, a little note on implementation: you'd have a $state variable which keeps track of which state you're in (start with my $state = 'looking_for_full';). You'd then have a while( my $t = $stream->get_token ) { ... } loop, inside of which you'd implement the above behaviors. Any non-interesting token for the current state would be ignored (e.g. just call next to go back and fetch the next token).
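
        The state machine described above might look something like this with plain HTML::TokeParser. Treat it as a sketch under assumptions: the trimmed sample markup and the @hrefs array are made up for the example, and it assumes the "content" div is the first div inside each "full" div, as in the question's markup.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup trimmed from the question (an assumption for this sketch).
my $html = <<'HTML';
<div class="full"><div class="content"><ul class="topics">
<li><a href="foobar">foobar</a></li>
<li><a href="foobr2">fobar2</a></li>
</ul></div></div>
<div class="otherclass"><div class="content"><ul class="topics">
<li><a href="fbaor">fbaor</a></li>
</ul></div></div>
HTML

my $stream = HTML::TokeParser->new( \$html ) or die "cannot parse document";

my $state = 'looking_for_full';
my @hrefs;

while ( my $t = $stream->get_token ) {
    # Start tags are ["S", $tag, \%attr, ...]; end tags are ["E", $tag, $text].
    my ( $type, $tag, $attr ) = @$t;

    if ( $type eq 'S' && $tag eq 'div' ) {
        my $class = $attr->{class} || '';
        if ( $state eq 'looking_for_full' && $class eq 'full' ) {
            $state = 'looking_for_content';
        }
        elsif ( $state eq 'looking_for_content' && $class eq 'content' ) {
            $state = 'looking_for_anchors';
        }
    }
    elsif ( $type eq 'S' && $tag eq 'a' && $state eq 'looking_for_anchors' ) {
        # save the href attribute of each anchor in the wanted block
        push @hrefs, $attr->{href} if defined $attr->{href};
    }
    elsif ( $type eq 'E' && $tag eq 'div' && $state eq 'looking_for_anchors' ) {
        # the first </div> seen while collecting ends the block
        $state = 'looking_for_full';
    }
    # every other token is uninteresting in the current state and is skipped
}

print "$_\n" for @hrefs;
```

        Because the first </div> ends collection, this sketch would stop early if further divs were nested inside the "content" div; counting nesting depth would be needed for that case.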

Re: easy HTML::TokeParser help request
by wfsp (Abbot) on Aug 03, 2006 at 16:21 UTC
    With HTML::TokeParser::Simple

    (assumes all links within class="full")

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $html = do { local $/; <DATA> };
    my $p = HTML::TokeParser::Simple->new(\$html)
        or die "can't parse: $!";

    my ($in_full, @href);
    while (my $t = $p->get_token) {
        if ($t->is_start_tag('div')) {
            # a div without a class attribute counts as "not full"
            my $class = $t->get_attr('class') || '';
            # the nested "content" div leaves the current state alone
            next if $class eq 'content';
            # any other div switches collecting on ("full") or off
            $in_full = $class eq 'full' ? 1 : 0;
            next;
        }
        next unless $in_full;
        push @href, $t->get_attr('href') if $t->is_start_tag('a');
    }
    print "$_\n" for @href;

    __DATA__
    <!-- some other divs and so here -->
    <div class="full">
      <div class="content">
        <ul class="topics">
          <-- I want extract these links div class "full" only -->
          <li><a href="foobar">foobar</a></li>
          <li><a href="foobr2">fobar2</a></li>
          <li><a href="fobar3">foobr3</a></li>
        </ul>
      </div>
    </div>
    <div class="otherclass">
      <div class="content">
        <ul class="topics">
          <-- I DO NOT WANT these links -->
          <li><a href="fbaor">fbaor</a></li>
          <li><a href="fabar2">fabar2</a></li>
          <li><a href="fbar3">fbar3</a></li>
        </ul>
      </div>
    </div>
    <!-- some other divs and so here -->
    <div class="full">
      <div class="content">
        <ul class="topics">
          <-- I want extract these links div class "full" only -->
          <li><a href="foobar4">foobar4</a></li>
          <li><a href="foobr5">fobar5</a></li>
          <li><a href="fobar6">foobr6</a></li>
        </ul>
      </div>
    </div>
    <div class="otherclass">
      <div class="content">
        <ul class="topics">
          <-- I DO NOT WANT these links -->
          <li><a href="fbaor">fbaor</a></li>
          <li><a href="fabar2">fabar2</a></li>
          <li><a href="fbar3">fbar3</a></li>
        </ul>
      </div>
    </div>
    <!-- some other divs and so here -->
Re: easy HTML::TokeParser help request
by un-chomp (Scribe) on Aug 03, 2006 at 14:56 UTC
    Relies on well-formed HTML:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser;

    my $doc = do { local $/; <DATA> };
    my $p   = HTML::TokeParser->new( \$doc );

    while ( my $outer = $p->get_tag("div") ) {
        # skip divs whose class is missing or not "full"
        next unless ( $outer->[1]{class} || '' ) eq "full";
        my $nested_div = 0;
        while ( my $inner = $p->get_tag ) {
            # keep count of nested divs
            $nested_div++ if $inner->[0] eq "div";
            $nested_div-- if $inner->[0] eq "/div";
            # "full" div has closed
            last if $nested_div == -1;
            # prints the link text; the URL is in $inner->[1]{href}
            print $p->get_text, "\n" if $inner->[0] eq "a";
        }
    }

    __DATA__
    <!-- some other divs and so here -->
    <div class="full">
      <div class="content">
        <ul class="topics">
          <-- I want extract these links div class "full" only -->
          <li><a href="foobar">foobar</a></li>
          <li><a href="foobr2">fobar2</a></li>
          <li><a href="fobar3">foobr3</a></li>
        </ul>
      </div>
    </div>
    <div class="otherclass">
      <div class="content">
        <ul class="topics">
          <-- I DO NOT WANT these links -->
          <li><a href="fbaor">fbaor</a></li>
          <li><a href="fabar2">fabar2</a></li>
          <li><a href="fbar3">fbar3</a></li>
        </ul>
      </div>
    </div>
    <!-- some other divs and so here -->
      Thanks! Very very nice solution, easy to understand and it works like it should. I give you ++