lampros21_7 has asked for the wisdom of the Perl Monks concerning the following question:

Hello to all Perl monks. I have written some code that is supposed to fetch a page's HTML content and store it in a string using the WWW::Robot module. The code goes like this:

use WWW::Robot;

print "Please input the URL of the site to be searched \n";
my $url_name = <STDIN>;   # The user inputs the URL to be searched

# Create an instance of the webcrawler
my $web_crawler = new WWW::Robot(
    NAME      => 'My WebCrawler',
    VERSION   => '1.000',
    USERAGENT => LWP::UserAgent->new,
    EMAIL     => 'aca03lh@sheffield.ac.uk',
);

# Below the attributes of the web crawler are set
$web_crawler->addHook('invoke-on-all-url',  \&invoke_test);
$web_crawler->addHook('follow-url-test',    \&follow_test);
$web_crawler->addHook('invoke-on-contents', \&invoke_contents);  # to be able to get contents from webpages
$web_crawler->addHook('add-url-test',       \&add_url_test);     # if url doesn't exist in array then add for visit
$web_crawler->addHook('continue-test',      \&continue_test);    # to exit loop when we run out of URL's to visit

sub invoke_contents {
    my ($webcrawler, $hook, $url, $response, $structure) = @_;
    our $contents = $structure;   # To make the string that has the contents in global
}

# Start the web crawling
$web_crawler->run($url_name);
print $contents;

*********************************

My idea is that the user first inputs the website to be processed (I use http://www.sportinglife.com/), and then the $structure variable in "sub invoke_contents" is made a global variable. I put in a print command to see whether it prints the contents, so I would know it works, but it doesn't seem to work. I have a dial-up connection (believe it or not); I left it running for about 15 minutes and it didn't print anything, although I don't think it should take that long anyway. Any idea what I am doing wrong? Thanks.
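The code above also registers four other hooks (invoke_test, follow_test, add_url_test and continue_test) that are never defined in the post. A minimal, hypothetical set of always-true stubs would look like the following; note that an always-true follow-url-test tells the robot to follow every link it finds:

# Hypothetical stubs for the hooks registered above
sub invoke_test   { return 1; }   # called for every URL the robot sees
sub follow_test   { return 1; }   # true means follow every link found
sub add_url_test  { return 1; }   # true means queue every newly discovered URL
sub continue_test { return 1; }   # true means keep crawling while URLs remain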

Edit g0n: added code tags

Replies are listed 'Best First'.
Re: Getting a website's content
by davidrw (Prior) on Jul 23, 2005 at 00:59 UTC
    It's hard to read your code (please use <code></code> tags), but it might be easier to just use WWW::Mechanize ... note especially its methods for getting links.
    use WWW::Mechanize;
    my $mech = WWW::Mechanize->new();
    $mech->get($url);
    print $mech->content;
    my @links     = $mech->all_links();
    my @someLinks = $mech->find_all_links( ... );
Re: Getting a website's content
by marnanel (Beadle) on Jul 23, 2005 at 06:53 UTC
    Does it actually terminate after 15 minutes? If not, when does it terminate?

    Incidentally, it helps to wrap code in <code> tags.

      It doesn't terminate after 15 minutes; I just get fed up and close the command prompt window. I'm not sure about WWW::Mechanize, because I want to build a web crawler, which means all the links it finds would have to be stored, and then it would check whether the first stored link has already been visited and, if it hasn't, visit it and get its HTML content too. Thanks.
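        WWW::Mechanize won't do that bookkeeping for you, but it is not much code to write yourself. A rough, untested sketch (the same-host restriction and the %visited hash are assumptions about what is wanted, not something the module provides):

        use strict;
        use warnings;
        use WWW::Mechanize;
        use URI;

        my $start = 'http://www.sportinglife.com/';
        my $host  = URI->new($start)->host;

        my $mech = WWW::Mechanize->new( autocheck => 0 );
        my %visited;
        my @queue = ($start);

        while ( my $url = shift @queue ) {
            next if $visited{$url}++;          # skip anything already fetched
            $mech->get($url);
            next unless $mech->success;
            my $html = $mech->content;         # the page's HTML; process it as needed

            for my $link ( $mech->find_all_links() ) {
                my $uri = $link->url_abs;
                next unless $uri->scheme =~ /^https?$/;   # skip mailto:, javascript:, etc.
                next unless $uri->host eq $host;          # stay on the same site
                push @queue, $uri->as_string unless $visited{ $uri->as_string };
            }
        }

        On a big site this can still run for a very long time, especially over dial-up, so you may also want a limit on how many pages it fetches.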
        It could be recursively fetching pages further and further down the hierarchy. I don't know WWW::Robot too well, but you probably want to write something for the follow-url-test hook to find out.
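        For example, a follow-url-test hook along these lines (the signature is guessed from the invoke_contents hook in the original post, and $url is stringified so it works whether the hook receives a URI object or a plain string) would at least keep the crawl on one site:

        # Hypothetical follow-url-test hook: only follow links that stay on the starting site
        sub follow_test {
            my ($robot, $hook, $url) = @_;
            return "$url" =~ m{^http://www\.sportinglife\.com/};
        }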
Re: Getting a website's content
by Anonymous Monk on Jul 23, 2005 at 09:05 UTC
    it didn't print anything, although I don't think it should take that long anyway.
    It could take forever. Use LWP::Debug to see what's going on.
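    For example, something like this near the top of the script should make LWP print each request and response as it happens:

    use LWP::Debug qw(+);   # turn on all of LWP's trace/debug/conns output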