Hi to the fellow monks, I am writing a script that has two goals. Given a URL, the script has to fetch its content, find the links in it and save them in an array, and also strip the HTML from the source and save only the text in an array. That is, each URL's text content goes into one element of the array. The code is below:

use WWW::Mechanize;
 use URI;
 use HTML::TokeParser;
 
 print "WEB CRAWLER AND HTML EXTRACTOR \n";
 print "Please input the URL of the site to be searched \n";
 print "Please use a full URL (eg. http://www.dcs.shef.ac.uk/) \n";
 
 #Create an instance of the webcrawler
 my $webcrawler = WWW::Mechanize->new();
  
 my $url_name = <STDIN>; # The user inputs the URL to be searched
 
 my $uri = URI->new($url_name); # Process the URL and make it a URI
 
 #Grab the contents of the URL given by the user
 $webcrawler->get($uri);
 
 #Use the HTML::TokeParser module to extract the contents from the website
 my @stripped_html;  
 my $x = 0;
 my $content = $webcrawler->content;
 my $parser = HTML::TokeParser->new(\$content);
 while($parser->get_tag){
    $stripped_html[0] .= $parser->get_trimmed_text()."\n";
    } 
 
 $stripped_html[0] = join(" ",split " ",$stripped_html[0]); # If we have more than one whitespace in a row leave only 1
 print $stripped_html[0]."\n";
 my @website_links = $webcrawler->links; # Put the links that exist in the HTML of the URL given by the user in an array
 $x = $x + 1;
  
 #The initial URL is stored in an array which will be checked against the array of URL's to see if a website has been visited before
 my @visited_urls = ($uri);
 my @new_uri;
    
 while (@website_links) { # While the array still has elements(URL's) check the content for links and strip the HTML
    
    if ((grep {$_ eq $website_links[0] } @visited_urls) > 0) { # If the URL has been visited don't visit again
        shift @website_links; #Remove the URL currently being processed from the list of URL's to visit
    }
    else { # If the URL hasn't been visited find the links in its content, add them to the array of URL'S to visit
           #extract its contents, put them in a string and remove the URL from array of URL's to visit
        
        # The next 6 lines of code are in order to initialize the current URL and save the links it has in our array for later processing
        $new_uri[0] = URI->new($website_links[0]);
        $webcrawler->get($new_uri[0]);        
        my @links = $webcrawler->links($new_uri[0]);
        push (@website_links,@links); # The URL's that were put in the links array are added to the website_links array
        splice (@links,0,scalar(@links)); #Delete all the elements of the links array
        shift(@new_uri);
        
        # The following is to extract the HTML off the contents and leave only the text in the same way as done from line 45 onwards
        $content = $webcrawler->content;
        $parser = HTML::TokeParser->new(\$content);
            while($parser->get_tag){
                $stripped_html[$x] .= $parser->get_trimmed_text()."\n";
            } 

        $stripped_html[$x] = join(" ",split " ",$stripped_html[$x]);

        push(@visited_urls,$new_uri[0]); #Add the link to the list of already visited URL's
        $x = $x + 1;
        shift @website_links; # This will remove the URL that has just been processed and put the next one in queue ready for processing
        print $stripped_html[$x];
        sleep(10);
        }
    }

As you can see, the code first processes the inputted URL and does what it's supposed to do. I added a print statement to check whether the code really works, and it does: it prints the contents of the website (it also saves the URLs in the website_links array, one URL per element).

The problem is that when I run this script it prints the contents of the inputted URL, but once it enters the while loop it does nothing. The while loop is supposed to do the same thing for every URL found on that page, but it doesn't seem to. After printing the contents it gives me "Use of uninitialized value" warnings until I terminate the program.

Use of uninitialized value in split at NewWebC.pl line 83, <STDIN> line 1.
Use of uninitialized value in print at NewWebC.pl line 88, <STDIN> line 1.
Use of uninitialized value in string eq at NewWebC.pl line 63, <STDIN> line 1.
Use of uninitialized value in split at NewWebC.pl line 83, <STDIN> line 1.
Use of uninitialized value in print at NewWebC.pl line 88, <STDIN> line 1.

And on, and on....

I know this warning crops up when a variable has no value where one is expected, so it looks like nothing is being written to the website_links or stripped_html arrays.
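Two of the three warnings can be reproduced without any network access. The following is only a minimal sketch with made-up data, not the script itself: `@visited`, `@text`, `$printed` and the `fixed_*` names are hypothetical stand-ins. It suggests that the "string eq" warning may come from shifting `@new_uri` before pushing `$new_uri[0]`, and the "print" warning from incrementing `$x` before printing `$stripped_html[$x]`.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for @new_uri, @visited_urls and @stripped_html.
my @new_uri = ('http://example.com/a');
my @visited;
my @text;
my $x = 0;

# Pattern 1: the element is shifted away before it is recorded.
shift @new_uri;                 # @new_uri is now empty
push @visited, $new_uri[0];     # pushes undef, not the URL

# Pattern 2: the index is advanced before the element is read.
$text[$x] = 'page text';
$x = $x + 1;
my $printed = $text[$x];        # slot $x has not been filled yet

# Reordered version: record or read first, then shift or advance.
my @fixed_uri = ('http://example.com/a');
my @fixed_visited;
push @fixed_visited, $fixed_uri[0];
shift @fixed_uri;

my $y = 0;
my @fixed_text = ('page text');
my $fixed_printed = $fixed_text[$y];
$y = $y + 1;

print defined $visited[0] ? "recorded\n" : "visited slot is undef\n";
print defined $printed    ? "printed\n"  : "printed slot is undef\n";
print "fixed: $fixed_visited[0] / $fixed_printed\n";
```

An `undef` in the visited list would then be compared with `eq` on every later pass through the loop, which would match the repeating "string eq" warning.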

Any help would be greatly appreciated. I have been working on this for ages but still can't fix it. Thanks

Hi monks, I still haven't managed to get this working. I understand that what Roger is saying is correct, that I am probably working with an array reference. The thing is, I haven't got a clue how to rectify it, and I've been playing around with it for a while. The map Roger suggested doesn't work for me and I am getting quite desperate. Any help would be immensely appreciated.
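For reference, `links` in WWW::Mechanize returns WWW::Mechanize::Link objects (blessed array references), and each has a `url_abs` method, so the idea behind the map is presumably to turn those objects into plain URL strings before queueing them. The sketch below is an assumption about what was meant, not Roger's exact code. So that it runs offline it uses a hypothetical `Fake::Link` class in place of the real link objects, and the `%seen` hash and `@visit_order` array are likewise illustrative names.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WWW::Mechanize::Link so the sketch runs offline.
# The real class is also a blessed array reference and provides url_abs().
package Fake::Link;
sub new     { my ($class, $url) = @_; return bless [$url], $class }
sub url_abs { return $_[0][0] }

package main;

# Pretend this list came from $webcrawler->links.
my @link_objects = map { Fake::Link->new($_) } (
    'http://example.com/a',
    'http://example.com/b',
    'http://example.com/a',    # duplicate on purpose
);

# Turn the objects into plain strings before queueing them.
my @website_links = map { "" . $_->url_abs } @link_objects;

# A hash makes the "already visited?" test simple.
my %seen;
my @visit_order;
while (defined(my $url = shift @website_links)) {
    next if $seen{$url}++;     # skip URLs handled earlier
    push @visit_order, $url;   # the real script would fetch and parse here
}

print "$_\n" for @visit_order;
```

In the real script the same map would be applied to the links of each fetched page before pushing them onto `@website_links`, so that the `eq` comparison and `URI->new` only ever see strings.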

In reply to Loop will not save into the array by lampros21_7
