Thanks, I know HTML::TreeBuilder; I used it some time ago. But I'd like to use only one parser module if possible, and I decided on HTML::TokeParser. I hope there is a solution for this module, as I don't want to resort to regular-expression hacks or the like. Maybe I have to use $p->unget_token( @tokens ) to get the desired links. | [reply] |
Of about the last 10 "I need to parse this HTML/XML structure" questions asked here, nine of the answers were trivial using ::TreeBuilder (there are XML and HTML versions) and the other was trivial using XML::Twig.
Personally I use TreeBuilder more often in an HTML context and XML::Twig for XHTML and XML. XML::Twig is very powerful for editing; TreeBuilder is very good at looking stuff up.
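To illustrate the "looking stuff up" point, here is a minimal sketch using HTML::TreeBuilder's look_down. The sample markup, the class names full and content, and the helper name links_via_tree are assumptions made up for illustration; the real page may be structured differently.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup; the real page structure is an assumption.
my $tree_html = <<'HTML';
<div class="full">
  <a href="/skip-me">outside content</a>
  <div class="content">
    <a href="/one">One</a>
    <a href="/two">Two</a>
  </div>
</div>
<div class="other"><a href="/ignored">Ignored</a></div>
HTML

# Hypothetical helper: collect hrefs of anchors inside div.content inside div.full.
sub links_via_tree {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @links;
    # look_down in list context returns every matching descendant.
    # Note: class => 'full' matches the attribute value exactly.
    for my $full ( $tree->look_down( _tag => 'div', class => 'full' ) ) {
        for my $content ( $full->look_down( _tag => 'div', class => 'content' ) ) {
            push @links,
                grep { defined }
                map  { $_->attr('href') } $content->look_down( _tag => 'a' );
        }
    }
    $tree->delete;    # free the tree explicitly
    return @links;
}

print "$_\n" for links_via_tree($tree_html);
```

The nested look_down calls mirror the nesting of the markup, so no explicit state tracking is needed.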
At the end of the day the more modules you know a little bit about the more quickly and reliably you get stuff done. Don't be afraid to read documentation! Sometimes a quick question in the CB can save a huge amount of time, if you have a general idea where you are headed in the first place.
Limiting yourself to a single module is ... limiting! There is no one tool that does every job, not even computers.
DWIM is Perl's answer to Gödel
| [reply] |
If you're dead set on tokeing it, build a state machine:
- start in looking_for_full until you see a div with class full, when you transition to looking_for_content
- when you see a div with class content in state looking_for_content, transition to looking_for_anchors
- when you see an anchor in looking_for_anchors, save the href attribute
- when you see a </div> in looking_for_anchors, go back to looking_for_full
Additional: Little note on implementation: you'd have a $state variable which keeps track of which state you're in (start with my $state = 'looking_for_full';). You'd then have a while( my $t = $stream->get_token ) { ... } loop, inside of which you'd implement the above behaviors. Any non-interesting token for the current state would be ignored (e.g. just next back to fetch the next token).
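The state machine described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the sample markup and the helper name extract_links are assumptions, and class attributes are compared as exact strings.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup; the real page structure is an assumption.
my $html = <<'HTML';
<div class="full">
  <a href="/skip-me">outside content</a>
  <div class="content">
    <a href="/one">One</a>
    <a href="/two">Two</a>
  </div>
</div>
<div class="other"><a href="/ignored">Ignored</a></div>
HTML

# Hypothetical helper implementing the three states described above.
sub extract_links {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new( \$html )
        or die "Cannot create parser";
    my $state = 'looking_for_full';
    my @links;

    while ( my $t = $stream->get_token ) {
        my ( $type, $tag ) = @{$t}[ 0, 1 ];
        next unless $type eq 'S' or $type eq 'E';    # only tags are interesting

        # Start tags carry an attribute hash in slot 2; end tags do not.
        my $attr  = $type eq 'S' ? $t->[2] : {};
        my $class = $attr->{class} // '';

        if ( $state eq 'looking_for_full' ) {
            $state = 'looking_for_content'
                if $type eq 'S' and $tag eq 'div' and $class eq 'full';
        }
        elsif ( $state eq 'looking_for_content' ) {
            $state = 'looking_for_anchors'
                if $type eq 'S' and $tag eq 'div' and $class eq 'content';
        }
        elsif ( $state eq 'looking_for_anchors' ) {
            if ( $type eq 'S' and $tag eq 'a' ) {
                push @links, $attr->{href} if defined $attr->{href};
            }
            elsif ( $type eq 'E' and $tag eq 'div' ) {
                $state = 'looking_for_full';    # left the content div
            }
        }
    }
    return @links;
}

print "$_\n" for extract_links($html);
```

As written, any </div> seen while collecting anchors ends the collection, so a div nested inside the content div would cut it short; a depth counter would be one way to handle that case.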
| [reply] [d/l] [select] |
Thanks for the nice explanation, Fletch. I think this is how it should be done. But in real life it is easier to write a regexp for the content I want to parse, so if no simpler solution exists for HTML::TokeParser, I will have to fall back on regexes.
Also, which parser is better? HTML::TreeBuilder or ? I want to learn only one, which is able to parse these relatively easy things and has no other glitches...
| [reply] |