in reply to easy HTML::TokeParser help request

Not a TokeParser solution, but using HTML::TreeBuilder I'd use $t->look_down( _tag => "div", class => "full" ) to get a list of the divs you're interested in and then call $div->look_down( _tag => 'a' ) on each of those. Sometimes the tree solution's just conceptually easier to get your brane around.
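
A minimal sketch of that approach, in case it helps. The sample markup and the helper name links_in_full_divs are made up for illustration (the original page isn't shown here); the calls themselves (new_from_content, look_down, attr, delete) are the standard HTML::TreeBuilder / HTML::Element ones.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup; the real page is not shown in this thread.
my $html = <<'HTML';
<div class="full">
  <div class="content">
    <a href="/one">One</a>
    <a href="/two">Two</a>
  </div>
</div>
<div class="other"><a href="/skip">Skip</a></div>
HTML

# Hypothetical helper: collect the href of every anchor found
# inside a <div class="full">.
sub links_in_full_divs {
    my ($markup) = @_;
    my $t = HTML::TreeBuilder->new_from_content($markup);
    my @hrefs;

    # In list context look_down returns every matching element.
    for my $div ( $t->look_down( _tag => 'div', class => 'full' ) ) {
        for my $anchor ( $div->look_down( _tag => 'a' ) ) {
            my $href = $anchor->attr('href');
            push @hrefs, $href if defined $href;
        }
    }

    $t->delete;    # release the tree once we're done with it
    return @hrefs;
}

print "$_\n" for links_in_full_divs($html);
```

Note that class => 'full' compares against the whole attribute value, so a div with class="full wide" would not match; a regex such as class => qr/\bfull\b/ is one way around that.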

Re^2: easy HTML::TokeParser help request
by 2ge (Scribe) on Aug 03, 2006 at 13:27 UTC
    Thanks, I know HTML::TreeBuilder; I used it some time ago. But I'd like to use only one parser module if possible, and I decided on HTML::TokeParser. I hope there is a solution with this module, and I don't want to resort to a hack with regular expressions or the like. Maybe I have to use $p->unget_token( @tokens ) to get the desired links.

      Of about the last ten "I need to parse this HTML/XML structure" questions asked here, nine of the answers were trivial using ::TreeBuilder (there are XML and HTML versions) and the other was trivial using XML::Twig.

      Personally I use TreeBuilder more often in an HTML context and XML::Twig for XHTML and XML. XML::Twig is very powerful for editing; TreeBuilder is very good at looking stuff up.

      At the end of the day, the more modules you know a little bit about, the more quickly and reliably you get stuff done. Don't be afraid to read documentation! Sometimes a quick question in the CB can save a huge amount of time, if you have a general idea where you are headed in the first place.

      Limiting yourself to a single module is ... limiting! There is no one tool that does every job, not even computers.


      DWIM is Perl's answer to Gödel

      If you're dead set on tokeing it, build a state machine:

      • start in looking_for_full until you see a div with class full, when you transition to looking_for_content
      • when you see a div with class content in state looking_for_content, transition to looking_for_anchors
      • when you see an anchor in looking_for_anchors, save the href attribute
      • when you see a </div> in looking_for_anchors, go back to looking_for_full

      Additional: Little note on implementation: you'd have a $state variable which keeps track of which state you're in (start with my $state = 'looking_for_full';). You'd then have a while( my $t = $stream->get_token ) { ... } loop, inside of which you'd implement the above behaviors. Any non-interesting token for the current state would be ignored (e.g. just next back to fetch the next token).
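
For what it's worth, the state machine above could be sketched roughly like this. The sample markup and the sub name links_via_state_machine are hypothetical, since the original page isn't shown; the token layout (["S", $tag, $attr, ...] for start tags, ["E", $tag, ...] for end tags) is what HTML::TokeParser's get_token returns.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup; the real page is not shown in this thread.
my $html = <<'HTML';
<div class="full">
  <div class="content">
    <a href="/one">One</a>
    <a href="/two">Two</a>
  </div>
</div>
<div class="other"><a href="/skip">Skip</a></div>
HTML

# Hypothetical helper implementing the four transitions listed above.
sub links_via_state_machine {
    my ($markup) = @_;
    my $stream = HTML::TokeParser->new( \$markup )
        or die "Could not create parser";

    my $state = 'looking_for_full';
    my @hrefs;

    while ( my $t = $stream->get_token ) {
        my ( $type, $tag ) = @$t;

        if ( $type eq 'S' ) {
            # Start tag: the third element is a hashref of attributes.
            my $attr  = $t->[2];
            my $class = $attr->{class} // '';

            if ( $state eq 'looking_for_full' ) {
                $state = 'looking_for_content'
                    if $tag eq 'div' && $class eq 'full';
            }
            elsif ( $state eq 'looking_for_content' ) {
                $state = 'looking_for_anchors'
                    if $tag eq 'div' && $class eq 'content';
            }
            elsif ( $state eq 'looking_for_anchors' ) {
                push @hrefs, $attr->{href}
                    if $tag eq 'a' && defined $attr->{href};
            }
        }
        elsif ( $type eq 'E' ) {
            # A closing </div> while collecting anchors ends the block.
            $state = 'looking_for_full'
                if $tag eq 'div' && $state eq 'looking_for_anchors';
        }
        # Every other token is uninteresting in every state, so we simply
        # fall through and fetch the next one.
    }

    return @hrefs;
}

print "$_\n" for links_via_state_machine($html);
```

One caveat with this sketch: any </div> seen while collecting anchors ends the block, so a div nested inside the content div would cut collection short. Tracking a nesting depth would be one way to handle that.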

        Thanks for the nice explanation, Fletch. I think this is how it should be done. But in real life it is easier to write a regexp for the content I want to parse, so if there is no simpler solution for HTML::TokeParser, I'll have to fall back on regexes.

        Also, which parser is better? HTML::TreeBuilder or ? I want to learn only one, which is able to parse these relatively easy things and has no other glitches...