Parsing web data by tag... help?

SpacemanSpiff has asked for the wisdom of the Perl Monks concerning the following question:

In an attempt at brevity, here's what I'm using to grab text on a page between any given set of tags (<ul> being the example in this case):

#!/usr/bin/perl
use strict;
use warnings;

use WWW::Mechanize;
use HTML::TokeParser;

my $agent = WWW::Mechanize->new();
$agent->get("http://www.perlmonks.com/");

# Use the content() accessor rather than reaching into the object's hash.
my $html   = $agent->content;
my $stream = HTML::TokeParser->new(\$html);
$stream->get_tag("ul");

For all I know, that works just fine, but I want to verify it before I move on: how do I view the contents of $stream? Printing it simply gives me "HTML::TokeParser=HASH(0x22dcd84)", as I suspected it would. Past experience has taught me that "as_text" or some similar method will sometimes format an object the way I expect, but I've been unable to figure out the equivalent for TokeParser. Can someone illustrate what's happening and how I can deal with this in the future?

I'm sure my next problem will be cookie handling, but I'll leave that for another time. (Yes, I've read the manual, but being new, the answer has so far escaped me.) Thanks in advance!

Re: Parsing web data by tag... help? by davido (Cardinal) on Sep 06, 2005 at 16:28 UTC
You should print the return value of `$stream->get_text('/ul');`. There are examples in the documentation for HTML::TokeParser. If all you want to do is dump the object, you can `print Dumper $stream;` while using Data::Dumper, but I don't think this will really tell you what you need.

Update: What you're probably missing in TFM is that the way you're using `get_tag('ul')` simply sets the parsing pointer to the first `<ul>` tag it finds, but doesn't capture the subsequent text. `get_text('/ul')` will capture all of the text from the current position (the current `<ul>` tag) until the `</ul>` tag is found.

Dave
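The `get_tag` / `get_text` sequence described above can be sketched as follows. This is only an illustration: it parses an inline HTML string rather than a live page so that it needs no network access, and the sample markup and the variable names `$html` and `$text` are assumptions, not part of the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup (an assumption for illustration; not from the original post).
my $html = '<p>intro</p><ul><li>one</li> <li>two</li></ul><p>outro</p>';

# HTML::TokeParser accepts a reference to a scalar holding the document.
my $stream = HTML::TokeParser->new(\$html);

# Advance to the first <ul>; nothing is captured yet.
$stream->get_tag('ul');

# Collect all text from the current position up to </ul>,
# skipping over the <li> tags in between.
my $text = $stream->get_text('/ul');

print "$text\n";
```

Printing the return value of `get_text`, rather than the parser object itself, is what avoids the `HTML::TokeParser=HASH(...)` output.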
Re: Parsing web data by tag... help? by Zaxo (Archbishop) on Sep 06, 2005 at 16:32 UTC
According to the pod, the `get_tag` method returns an array reference, with the tag's original source text as the last element of the array. You can get that by saying `my $text = $stream->get_tag("ul")->[-1];`

After Compline,
Zaxo
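To see what that array reference holds, here is a small sketch. Per the HTML::TokeParser documentation, a start tag comes back as `[$tag, $attr, $attrseq, $text]`, where the last element is the source text of the tag itself rather than the text that follows it. The sample markup and the variable names below are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup (an assumption for illustration).
my $html   = '<ul class="menu"><li>one</li></ul>';
my $stream = HTML::TokeParser->new(\$html);

# get_tag returns an array reference describing the tag it stopped at.
my $tag = $stream->get_tag('ul');

# For a start tag the elements are: name, attribute hash,
# attribute order, and the tag's original source text.
my ( $name, $attr, $attrseq, $source ) = @$tag;

print "name:   $name\n";
print "class:  $attr->{class}\n";
print "source: $source\n";
```

To get the content between the tags, follow `get_tag` with `get_text` or `get_trimmed_text`.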
Re: Parsing web data by tag... help? by davidrw (Prior) on Sep 06, 2005 at 16:32 UTC
Looking at HTML::TokeParser's pod, it looks like a simple call to the `get_trimmed_text()` method will get what you want. Be sure to look at the two examples in the docs there; I think they're both applicable.
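A minimal sketch of `get_trimmed_text()` in use, assuming the goal is to pull each list item out of a `<ul>`. The sample markup and the `@items` array are assumptions for illustration; the loop over `<li>` tags is one possible approach, not necessarily the one the documentation examples use.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup with untidy whitespace (an assumption for illustration).
my $html = "<ul>\n  <li>  first   item </li>\n  <li>second item</li>\n</ul>";
my $stream = HTML::TokeParser->new(\$html);

my @items;

# get_tag returns undef once no further <li> is found, ending the loop.
while ( $stream->get_tag('li') ) {
    # Like get_text, but with leading and trailing whitespace removed
    # and internal runs of whitespace collapsed to single spaces.
    push @items, $stream->get_trimmed_text('/li');
}

print "$_\n" for @items;
```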
Re: Parsing web data by tag... help? by Ovid (Cardinal) on Sep 06, 2005 at 18:46 UTC
Write a "peek" function that returns the next `$count` items in the stream and uses the `unget_token` method to restore your parser to its original state. Here's my take on it. It should be easy enough to munge this for your needs.

  #!/usr/bin/perl
  use strict;
  use warnings;

  use HTML::TokeParser::Simple 3.13;

  my $url    = 'http://www.perlmonks.com/';
  my $parser = HTML::TokeParser::Simple->new( url => $url );
  $parser->get_tag("ul");
  print peek($parser);

  # Return the HTML of the next $count tokens without consuming them.
  sub peek {
      my $parser = shift;
      my $count  = shift || 5;
      my $items  = 0;
      my $html   = '';
      my @tokens;

      # Check the count before fetching, so that no token is read
      # and then dropped once the limit is reached.
      while ( $items++ < $count ) {
          my $token = $parser->get_token or last;
          $html .= $token->as_is;
          push @tokens, $token;
      }

      # Push the tokens back so the parser is where it started.
      $parser->unget_token(@tokens);
      return $html;
  }

You know what? I like this so much I should probably add it to `HTML::TokeParser::Simple`.

Cheers,
Ovid
Re^2: Parsing web data by tag... help? by SpacemanSpiff (Sexton) on Sep 06, 2005 at 21:08 UTC
Just wanted to thank everyone for the prompt replies. Even though I've read it in the manual, practical examples and explanations are helping put it all together. For whatever reason, what wasn't working before seems to work now that I try it again. Must be some divine power from this place... Anyway, thanks for the patience. I've picked up the llama and camel (the whole bookshelf CD kit at that) and hope to keep the dumb questions to a minimum on my journey to become a Perl monk(ey).