lampros21_7 has asked for the wisdom of the Perl Monks concerning the following question:
use WWW::Mechanize;
use URI;
use HTML::TokeParser;

print "WEB CRAWLER AND HTML EXTRACTOR \n";
print "Please input the URL of the site to be searched \n";
print "Please use a full URL (eg. http://www.dcs.shef.ac.uk/) \n";

# Create an instance of the webcrawler
my $webcrawler = WWW::Mechanize->new();

my $url_name = <STDIN>;        # The user inputs the URL to be searched
my $uri = URI->new($url_name); # Process the URL and make it a URI

# Grab the contents of the URL given by the user
$webcrawler->get($uri);

# Use the HTML::TokeParser module to extract the text content from the website
my @stripped_html;
my $x = 0;

my $content = $webcrawler->content;
my $parser  = HTML::TokeParser->new(\$content);
while ($parser->get_tag) {
    $stripped_html[0] .= $parser->get_trimmed_text() . "\n";
}
$stripped_html[0] = join(" ", split " ", $stripped_html[0]); # If we have more than one whitespace in a row, leave only one
print $stripped_html[0] . "\n";

my @website_links = $webcrawler->links; # Put the links that exist in the HTML of the URL given by the user in an array

$x = $x + 1;

# The initial URL is stored in an array which will be checked against the array of URLs to see if a website has been visited before
my @visited_urls = ($uri);
my @new_uri;

while (@website_links) { # While the array still has elements (URLs), check the content for links and strip the HTML
    if ((grep { $_ eq $website_links[0] } @visited_urls) > 0) { # If the URL has been visited, don't visit it again
        shift @website_links; # Remove the URL currently being processed from the list of URLs to visit
    }
    else {
        # If the URL hasn't been visited, find the links in its content, add them to the array of URLs to visit,
        # extract its contents into a string and remove the URL from the array of URLs to visit

        # The next six lines initialize the current URL and save the links it contains in our array for later processing
        $new_uri[0] = URI->new($website_links[0]);
        $webcrawler->get($new_uri[0]);
        my @links = $webcrawler->links($new_uri[0]);
        push(@website_links, @links);      # The URLs that were put in the links array are added to the website_links array
        splice(@links, 0, scalar(@links)); # Delete all the elements of the links array
        shift(@new_uri);

        # The following extracts the HTML from the contents and leaves only the text, in the same way as above
        $content = $webcrawler->content;
        $parser  = HTML::TokeParser->new(\$content);
        while ($parser->get_tag) {
            $stripped_html[$x] .= $parser->get_trimmed_text() . "\n";
        }
        $stripped_html[$x] = join(" ", split " ", $stripped_html[$x]);

        push(@visited_urls, $new_uri[0]); # Add the link to the list of already visited URLs
        $x = $x + 1;
        shift @website_links; # Remove the URL that has just been processed and put the next one in the queue ready for processing
        print $stripped_html[$x];
        sleep(10);
    }
}
As you can see, the code first fetches the URL that was input and does what it's supposed to do. I have added a print statement so that I could check whether the code really works, and it works fine: it prints the text content of the website (it also saves the URLs in the website_links array, one URL per element).
The problem is that when I run this script it prints the contents of the URL that was input, but when it enters the while loop it does nothing. The while loop is supposed to do the same thing with all the URLs saved from that page, but it doesn't seem to. After printing the contents it gives me "Use of uninitialized value" warnings until I terminate the program.
Use of uninitialized value in split at NewWebC.pl line 83, <STDIN> line 1.
Use of uninitialized value in print at NewWebC.pl line 88, <STDIN> line 1.
Use of uninitialized value in string eq at NewWebC.pl line 63, <STDIN> line 1.
Use of uninitialized value in split at NewWebC.pl line 83, <STDIN> line 1.
Use of uninitialized value in print at NewWebC.pl line 88, <STDIN> line 1.
I know this warning crops up when a variable that is supposed to hold something holds nothing, so it looks like nothing is being written into the website_links or stripped_html arrays.
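In case it is useful, this is the sort of quick check I was planning to add right after the links call, just to see what actually ends up in @website_links (only a diagnostic sketch using Data::Dumper from core; I haven't folded it into the full script above):

use Data::Dumper;

my @website_links = $webcrawler->links;
print "Number of links found: ", scalar(@website_links), "\n";
print Dumper($website_links[0]); # see whether this is a plain URL string or some kind of object/reference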
Any help would be greatly appreciated. I have been working on this for ages but still can't fix it. Thanks.
Hi monks, I still haven't managed to get this working. I understand that what Roger is saying is correct, that I'm probably ending up with references (objects) in the array rather than plain URL strings. The thing is, I haven't got a clue how to rectify it, and I've been playing around with it for a while. The map Roger suggested doesn't seem to work for me either, and I'm getting quite desperate with this. Any help would be immensely appreciated.
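For reference, this is roughly how I understood the map suggestion: turn whatever links() returns into plain absolute URL strings before they go into @website_links, so that the eq comparison against @visited_urls compares strings rather than objects. This is only a sketch of my understanding (it assumes the link objects have a url_abs method, and I may be applying it in the wrong place in my script):

# Replace the plain $webcrawler->links call with something like this,
# so @website_links holds absolute URL strings instead of link objects.
my @website_links = map { $_->url_abs->as_string } $webcrawler->links;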
Replies are listed 'Best First'.

Re: Loop will not save into the array
by Roger (Parson) on Aug 17, 2005 at 11:21 UTC
  by graff (Chancellor) on Aug 18, 2005 at 05:05 UTC
  by lampros21_7 (Scribe) on Aug 17, 2005 at 12:22 UTC
    by Roger (Parson) on Aug 17, 2005 at 12:46 UTC
      by lampros21_7 (Scribe) on Aug 17, 2005 at 13:51 UTC

Re: Loop will not save into the array
by lidden (Curate) on Aug 17, 2005 at 11:44 UTC

Re: Loop will not save into the array
by GrandFather (Saint) on Aug 17, 2005 at 11:23 UTC
  by lampros21_7 (Scribe) on Aug 17, 2005 at 12:13 UTC