SpacemanSpiff has asked for the wisdom of the Perl Monks concerning the following question:

In an attempt at brevity, here's what I'm using to grab text on a page between any given set of tags (<ul> being the example in this case):

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTML::TokeParser;

my $agent = WWW::Mechanize->new();
$agent->get("http://www.perlmonks.com/");

my $stream = HTML::TokeParser->new(\$agent->{content});
$stream->get_tag("ul");

For all I know, that works just fine. I want to verify that before I move on, so how do I view the contents of $stream? Printing it simply gives me "HTML::TokeParser=HASH(0x22dcd84)", as I suspected it would. Past experience has taught me that sometimes "as_text" or one of a variety of other text methods will format it as I expect, but I've been unable to figure it out for TokeParser. Can someone illustrate what's happening and how I can deal with this in the future? I'm sure my next problem beyond this will be cookie handling, but I'll leave that for another time. (Yes, I've RTFM, but being new, the answer has thus far escaped me.) Thanks in advance!

Re: Parsing web data by tag... help?
by davido (Cardinal) on Sep 06, 2005 at 16:28 UTC

    You should print the return value of $stream->get_text('/ul'). There are examples in the documentation for HTML::TokeParser.

    If all you want to do is dump the object, you can use Data::Dumper and print Dumper($stream), but I don't think this will really tell you what you need.

    Update: What you're probably missing in TFM is that the way you're using get_tag('ul') simply sets the parsing pointer to the first <ul> tag it finds, but doesn't capture the subsequent text. get_text('/ul') will capture all of the text from the current position (the current <ul> tag) until the </ul> tag is found.
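To make the get_tag / get_text pairing concrete, here is a minimal sketch. It assumes an inline HTML string (the $html variable and its contents are made up for illustration) instead of a live fetch, so it can be run without network access.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup, standing in for the fetched page content.
my $html = '<p>intro</p><ul><li>One</li><li>Two</li></ul><p>outro</p>';

# HTML::TokeParser accepts a reference to a string holding the document.
my $stream = HTML::TokeParser->new( \$html )
    or die "Could not create parser";

# get_tag skips ahead to the first <ul> and returns undef if there is none.
my $tag = $stream->get_tag('ul');
die "No <ul> found" unless $tag;

# get_text('/ul') collects the text from the current position up to </ul>.
# Markup of nested tags such as <li> is dropped; only their text is kept.
my $text = $stream->get_text('/ul');

print "Text inside the first <ul>: $text\n";
```

Printing $text rather than $stream is the key point: the parser object is only a cursor over the document, while the methods return the data.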


    Dave

Re: Parsing web data by tag... help?
by Zaxo (Archbishop) on Sep 06, 2005 at 16:32 UTC

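    According to the pod, the get_tag method returns an array reference of the form [$tag, $attr, $attrseq, $text], whose last element is the original source text of the tag itself (here the literal <ul>), not the content that follows it. You can get that by saying,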

    my $text = $stream->get_tag("ul")->[-1];
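To see what get_tag hands back for a start tag, the sketch below unpacks the returned array reference. The sample markup and variable names are assumptions made for illustration; the four-element layout follows the HTML::TokeParser documentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup with one attribute on the <ul>.
my $html   = '<ul class="nav"><li>Home</li></ul>';
my $stream = HTML::TokeParser->new( \$html )
    or die "Could not create parser";

my $tag = $stream->get_tag('ul');
die "No <ul> found" unless $tag;

# For a start tag the array holds: tag name, attribute hash,
# attribute order, and the tag's original source text.
my ( $name, $attr, $attrseq, $source ) = @$tag;

print "name:         $name\n";
print "class:        $attr->{class}\n";
print "attr order:   @$attrseq\n";
print "last element: $tag->[-1]\n";
```

The last element is the markup of the tag, so the text between <ul> and </ul> still has to be fetched separately, for example with get_text('/ul').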

    After Compline,
    Zaxo

Re: Parsing web data by tag... help?
by davidrw (Prior) on Sep 06, 2005 at 16:32 UTC

    Looking at HTML::TokeParser's pod, it looks like a simple call to the get_trimmed_text() method will get what you want. Be sure to look at the two examples in the docs there; I think they're both applicable.
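As a rough illustration of how get_trimmed_text differs from get_text, the sketch below runs both over the same markup. The sample HTML (with deliberate line breaks and indentation) and the variable names are assumptions for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup containing newlines and indentation.
my $html = "<ul>\n  <li>One</li>\n  <li>Two</li>\n</ul>";

# First pass: get_text keeps the whitespace exactly as it appears.
my $raw_stream = HTML::TokeParser->new( \$html )
    or die "Could not create parser";
$raw_stream->get_tag('ul');
my $raw = $raw_stream->get_text('/ul');

# Second pass: get_trimmed_text collapses runs of whitespace to one
# space and strips leading and trailing whitespace.
my $trim_stream = HTML::TokeParser->new( \$html )
    or die "Could not create parser";
$trim_stream->get_tag('ul');
my $trimmed = $trim_stream->get_trimmed_text('/ul');

print "raw:     [$raw]\n";
print "trimmed: [$trimmed]\n";
```

Two parser objects are used because each call consumes the stream up to the end tag; a single parser could not read the same list twice without ungetting tokens.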
Re: Parsing web data by tag... help?
by Ovid (Cardinal) on Sep 06, 2005 at 18:46 UTC

    Write a "peek" function that will print the next $count items in the stream and uses the unget_token method to restore your parser to its original state. Here's my take on it. It should be easy enough to munge this for your needs.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple 3.13;

    my $url    = 'http://www.perlmonks.com/';
    my $parser = HTML::TokeParser::Simple->new( url => $url );
    $parser->get_tag("ul");
    print peek($parser);

    # Return the HTML of the next $count tokens (default 5) without
    # consuming them: every token read is pushed back with unget_token.
    sub peek {
        my $parser = shift;
        my $count  = shift || 5;
        my $items  = 0;
        my $html   = '';
        my @tokens;

        # Test the count first, so no token is fetched (and lost)
        # once the limit has been reached.
        while ( $items++ < $count && defined( my $token = $parser->get_token ) ) {
            $html .= $token->as_is;
            push @tokens, $token;
        }
        $parser->unget_token(@tokens);
        return $html;
    }

    You know what? I like this so much I should probably add it to HTML::TokeParser::Simple.
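One way to check that a peek of this kind really leaves the parser where it was is to peek and then read the next token. The sketch below is an illustration only: peek_tokens is a hypothetical helper name, and the inline $html string stands in for a fetched page (it relies on the string => ... constructor form of HTML::TokeParser::Simple).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample markup, parsed from a string instead of a URL.
my $html   = '<ul><li>One</li><li>Two</li></ul>';
my $parser = HTML::TokeParser::Simple->new( string => $html );

# Hypothetical helper: read up to $count tokens, push them all back,
# and return their original HTML.
sub peek_tokens {
    my ( $p, $count ) = @_;
    $count ||= 5;
    my @tokens;
    while ( @tokens < $count ) {
        my $token = $p->get_token;
        last unless defined $token;    # end of document
        push @tokens, $token;
    }
    $p->unget_token(@tokens);          # restore the parser position
    return join '', map { $_->as_is } @tokens;
}

$parser->get_tag('ul');                        # position just after <ul>
my $peeked = peek_tokens( $parser, 3 );        # look ahead three tokens
my $next   = $parser->get_token->as_is;        # should be the first peeked token

print "peeked: $peeked\n";
print "next:   $next\n";
```

If the token read after the peek matches the start of the peeked HTML, the lookahead did not advance the stream.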

    Cheers,
    Ovid

      Just wanted to thank everyone for the prompt replies. Even though I've read it in the manual, practical examples and explanations are helping put it all together. For whatever reason, what wasn't working before now seems to when I try it again. Must be some divine power from this place... Anyway, thanks for the patience. I've picked up the llama and camel (the whole bookshelf CD kit at that) and hope to keep the dumb questions to a minimum on my journey to become a Perl monk(ey).