Re: easy HTML::TokeParser help request

Re^2: easy HTML::TokeParser help request by 2ge (Scribe) on Aug 03, 2006 at 13:27 UTC
Thanks, I know HTML::TreeBuilder; I used it some time ago. But I'd like to use only one parser module if possible, and I decided on HTML::TokeParser. I hope there is a solution with this module, and I don't want to resort to a hack with regular expressions or the like. Maybe I have to use `$p->unget_token( @tokens )` to get the desired links.
Re^3: easy HTML::TokeParser help request by GrandFather (Saint) on Aug 03, 2006 at 19:35 UTC
Of about the last 10 "I need to parse this HTML/XML structure" questions asked here, nine of the answers were trivial using ::TreeBuilder (there are XML and HTML versions) and the other was trivial using XML::Twig. Personally I use TreeBuilder more often in an HTML context and XML::Twig for XHTML and XML. XML::Twig is very powerful for editing; TreeBuilder is very good at looking stuff up. At the end of the day, the more modules you know a little bit about, the more quickly and reliably you get stuff done. Don't be afraid to read documentation! Sometimes a quick question in the CB can save a huge amount of time, if you have a general idea where you are headed in the first place. Limiting yourself to a single module is ... limiting! There is no one tool that does every job, not even computers. DWIM is Perl's answer to Gödel
Re^3: easy HTML::TokeParser help request by Fletch (Bishop) on Aug 03, 2006 at 13:33 UTC
If you're dead set on tokeing it, build a state machine:
- start in `looking_for_full`
- stay there until you see a div with class full, at which point you transition to `looking_for_content`
- when you see a div with class content in state `looking_for_content`, transition to `looking_for_anchors`
- when you see an anchor in `looking_for_anchors`, save the href attribute
- when you see a `</div>` in `looking_for_anchors`, go back to `looking_for_full`

Additional: Little note on implementation: you'd have a `$state` variable which keeps track of which state you're in (start with `my $state = 'looking_for_full';`). You'd then have a `while( my $t = $stream->get_token ) { ... }` loop, inside of which you'd implement the above behaviors. Any non-interesting token for the current state would be ignored (e.g. just `next` back to fetch the next token).
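A minimal sketch of the state machine described above might look like the following. The sample markup, the `extract_links` name, and the exact class values are assumptions made for illustration, since the original question's HTML is not shown in this thread. The sketch also assumes the content div holds no nested divs (a nested `</div>` would end anchor collection early) and that each class attribute holds exactly one name.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup; the real page from the question is not shown here.
my $html = <<'HTML';
<div class="full">
  <a href="/skip-me">inside full, outside content</a>
  <div class="content">
    <a href="/one">One</a>
    <a href="/two">Two</a>
  </div>
</div>
<a href="/also-skipped">outside full</a>
HTML

# Hypothetical helper: returns the hrefs found inside
# <div class="full"> ... <div class="content"> ... </div>.
sub extract_links {
    my ($doc) = @_;

    # A reference to a scalar is parsed as a literal document.
    my $stream = HTML::TokeParser->new( \$doc )
        or die "Cannot create parser";

    my $state = 'looking_for_full';
    my @links;

    while ( my $t = $stream->get_token ) {
        my $type = $t->[0];

        if ( $type eq 'S' ) {    # start tag: ["S", $tag, $attr, ...]
            my ( $tag, $attr ) = @{$t}[ 1, 2 ];
            my $class = $attr->{class} // '';

            if ( $state eq 'looking_for_full' ) {
                $state = 'looking_for_content'
                    if $tag eq 'div' && $class eq 'full';
            }
            elsif ( $state eq 'looking_for_content' ) {
                $state = 'looking_for_anchors'
                    if $tag eq 'div' && $class eq 'content';
            }
            elsif ( $state eq 'looking_for_anchors' ) {
                push @links, $attr->{href}
                    if $tag eq 'a' && defined $attr->{href};
            }
        }
        elsif ( $type eq 'E' ) {    # end tag: ["E", $tag, $text]
            $state = 'looking_for_full'
                if $t->[1] eq 'div' && $state eq 'looking_for_anchors';
        }
        # Text, comments and other tokens are not interesting in any state.
    }
    return @links;
}

print "$_\n" for extract_links($html);
```

With the sample markup above, only `/one` and `/two` are collected; the anchors outside the content div are ignored because the machine is not in `looking_for_anchors` when they are seen.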
Re^4: easy HTML::TokeParser help request by 2ge (Scribe) on Aug 03, 2006 at 14:21 UTC
Thanks for the nice explanation, Fletch. I think this is how it should be done. But in real life it is easier to write a regexp for the content I want to parse, so if there is no simpler solution with HTML::TokeParser, I will have to fall back on regexes. Also, which parser is better: HTML::TreeBuilder or something else? I want to learn only one, one that can parse these relatively easy things and has no other glitches...
Re^5: easy HTML::TokeParser help request by Fletch (Bishop) on Aug 03, 2006 at 14:42 UTC
Re^6: easy HTML::TokeParser help request by 2ge (Scribe) on Aug 04, 2006 at 12:20 UTC