Need help to parse dynamic page ....

sprakash has asked for the wisdom of the Perl Monks concerning the following question:

Gurus, I need your help: I need to be able to parse a dynamic webpage. For example, if I go to yahoo.com and search for "soup", I get a bunch of results, and the URL in the IE location bar changes to something relevant to my search. I need to write code that takes this URL (with the search criteria, etc.), finds all the URLs on the results page, and saves them to a file. I've been trying to use LWP and CGI, and to get my feet wet I started with the following code (which should get a URL's title):

#!/usr/bin/perl
use CGI;
use LWP::Simple;
use HTML::TokeParser;

$cgiobject=new CGI;
$cgiobject->use_named_parameters;
print $cgiobject->header;

print $cgiobject->start_html
                  (-title=>'Page Parser',
                   -bgcolor=>'white');
                          

print $cgiobject->startform
                  (-method=>'get',
                   -action=>'parsepage.pl');
print "URL to Analyze:".$cgiobject->textfield
                                    (-name=>'url',
                                     -size=>'40');
print "<br>".$cgiobject->submit(-value=>'Analyze');
print $cgiobject->endform;
print "<hr>";                                     


#retrieve web page
$fetchURL=$cgiobject->param("url");
unless ($fetchURL) 
 {$fetchURL="www.yahoo.com"}

$webPage=get($fetchURL);

print <<ENDHTML;
<center><h2>$fetchURL<br>$webpage<br>
has been sliced and diced,
 thus revealing:</h2></center>
ENDHTML
                                      
&parse_title;
print $cgiobject->end_html;


sub parse_title{
#parse and output page title
$parser=HTML::TokeParser->new(shift||$webPage);
$parser->get_tag("title");
print "<p><h2>Page title</h2> ".
      $parser->get_trimmed_text."</p>";
}

BUT it gives me this error: "Undefined subroutine CGI::use_named_parameters at parsepage.pl line 7". If I comment out line 7, the script doesn't do jack. Any help/advice would be greatly appreciated. Thanks, NeedPerlWisdomGuy


Replies are listed 'Best First'.
Re: Need help to parse dynamic page .... by dingus (Friar) on Nov 13, 2002 at 08:34 UTC
I think this could be your problem:

$fetchURL=$cgiobject->param("url");
unless ($fetchURL)
 {$fetchURL="www.yahoo.com"}
$webPage=get($fetchURL);

You need to specify the URL in full, as http://...

I suggest breaking the thing down into bits. First build a Perl program that can retrieve a URL and display the content. Then build a CGI script that builds the HTML form etc. and just displays the source of any web page retrieved (hint: `$content =~ s/</&lt;/g; print pre($content)` is a quick and dirty way of putting what you have retrieved into a form you can see in a browser). Finally build the processing code.

Dingus
`Enter any 47-digit prime number to continue.`
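
For what it's worth, the three steps above could be sketched roughly as follows. This is only an illustration, not code from the original script: the helper names `normalize_url`, `escape_html` and `extract_links` are made up for this example, and the sample HTML is a literal string so that the sketch runs without a network connection. In a real script the page would presumably come from `get(normalize_url($fetchURL))` via LWP::Simple.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Step 1 helper (hypothetical name): LWP wants an absolute URL,
# so prepend a scheme when the user typed a bare host name.
sub normalize_url {
    my ($url) = @_;
    $url =~ s/^\s+|\s+$//g;                                   # trim whitespace
    $url = "http://$url" unless $url =~ m{^[a-z][a-z0-9+.-]*://}i;
    return $url;
}

# Step 2 helper (hypothetical name): escape markup so the fetched
# source can be printed inside <pre> and read in a browser.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # ampersands first, or the entities below get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Step 3 helper (hypothetical name): collect the href of every <a> tag.
sub extract_links {
    my ($html) = @_;
    # A scalar reference makes HTML::TokeParser parse the string itself;
    # a plain string would be treated as a file name to open.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser: $!";
    my @links;
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href};    # $tag->[1] is the attribute hash
        push @links, $href if defined $href && length $href;
    }
    return @links;
}

# Demo on a literal string standing in for a fetched page.
my $sample = '<html><body><a href="http://example.com/one">One</a> '
           . '<a name="x">no href</a> <a href="/two">Two</a></body></html>';

print normalize_url('www.yahoo.com'), "\n";
print escape_html('<b>soup</b>'), "\n";
print "$_\n" for extract_links($sample);
```

Saving the links to a file is then just a matter of opening a filehandle for writing and printing the list to it instead of to STDOUT. Relative links such as `/two` would still need to be resolved against the page URL before they are useful on their own.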
Re: Need help to parse dynamic page .... by grantm (Parson) on Nov 13, 2002 at 10:16 UTC
You might also want to take a look at WWW::Search::Yahoo.
Re: Need help to parse dynamic page .... by Zaxo (Archbishop) on Nov 13, 2002 at 08:18 UTC
The only occurrence of `use_named_parameters()` in CGI.pm is in a comment explaining some automatic argument twiddling in the definition of `&CGI::param`. Was your script written for some elderly version of CGI? Try commenting out the bogus line 7 and see what you get. After Compline, Zaxo
Re: Need help to parse dynamic page .... by lestrrat (Deacon) on Nov 13, 2002 at 08:07 UTC
Where did you get this `use_named_parameters()` subroutine from? I looked at the CGI docs and googled, but I couldn't find anything. So I'm inclined to say that that particular error has nothing to do with the fact that your script "doesn't do jack". Just delete it. Also, "doesn't do jack" isn't exactly useful for debugging. You will probably have better luck getting responses if you describe what you did (did you just load the page? did you pass some parameters to the CGI?), what kind of results you expected, and what you actually got.
Re: Need help to parse dynamic page .... by Ryszard (Priest) on Nov 13, 2002 at 10:08 UTC
This may not be of such great help, but I'm doing something similar in concept to what you are doing, using HTML::TableExtract with LWP::Simple. It took me a bit of time to find the location of the information in the table I'm parsing (it's a complex HTML page made up of many nested tables); however, once I got that, I could extract the information quite easily. The documentation of HTML::TableExtract took me a little while to work out, but after I did, I could create my own data structure to do with what I liked.
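
As a rough illustration of that approach (not Ryszard's actual code), here is a minimal sketch that picks a table by its column headers with HTML::TableExtract and copies the rows into a plain Perl data structure. The helper name `table_rows`, the sample HTML and the header names `Item` and `Price` are all invented for this example; a real page would need its own headers, or the module's `depth` and `count` options for nested tables.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: return the data rows of every table whose header
# row contains the given column names, as a list of array references.
sub table_rows {
    my ($html, @headers) = @_;
    my $te = HTML::TableExtract->new(headers => \@headers);
    $te->parse($html);

    my @rows;
    for my $table ($te->tables) {
        for my $row ($table->rows) {
            # Copy each cell, turning empty cells into '' and trimming whitespace.
            push @rows, [
                map {
                    my $cell = defined $_ ? $_ : '';
                    $cell =~ s/^\s+|\s+$//g;
                    $cell;
                } @$row
            ];
        }
    }
    return @rows;
}

# Literal sample standing in for a page fetched with LWP::Simple.
my $sample = '<table>'
           . '<tr><th>Item</th><th>Qty</th><th>Price</th></tr>'
           . '<tr><td>soup</td><td>2</td><td>1.50</td></tr>'
           . '<tr><td>bread</td><td>1</td><td>2.25</td></tr>'
           . '</table>';

# Only the requested columns should come back, in the order requested.
for my $row (table_rows($sample, 'Item', 'Price')) {
    print join(' | ', @$row), "\n";
}
```

Selecting by header text tends to survive layout changes better than counting nested tables by hand, which may be why locating the right table is the slow part and the extraction itself is quick.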
Re: Need help to parse dynamic page .... by Fletch (Bishop) on Nov 13, 2002 at 21:36 UTC
Get a copy of Perl and LWP (ISBN 0596001789). Learn it. Live it. Love it. It covers all sorts of approaches to web "screen scraping" and how to use the myriad of modules that make it easy. After reading it I wrote several programs (that took less than 20 minutes each to whip up) that grab various pages and spit out RSS for my personal aggregator page. Well worth the money ($25-ish from Amazon; or if you're an ORA Safari subscriber you can read it online).