lampros21_7 has asked for the wisdom of the Perl Monks concerning the following question:

Hi to the fellow monks, I am making a script that has two goals. Given an input URL, the script has to fetch its content, find the links that exist in there and save them in an array, and also strip the HTML from the source and save only the text in an array. That is, save one URL's text content into one element of the array. The code is below:

use WWW::Mechanize;
use URI;
use HTML::TokeParser;

print "WEB CRAWLER AND HTML EXTRACTOR \n";
print "Please input the URL of the site to be searched \n";
print "Please use a full URL (eg. http://www.dcs.shef.ac.uk/) \n";

# Create an instance of the webcrawler
my $webcrawler = WWW::Mechanize->new();
my $url_name = <STDIN>;        # The user inputs the URL to be searched
my $uri = URI->new($url_name); # Process the URL and make it a URI

# Grab the contents of the URL given by the user
$webcrawler->get($uri);

# Use the HTML::TokeParser module to extract the contents from the website
my @stripped_html;
my $x = 0;
my $content = $webcrawler->content;
my $parser  = HTML::TokeParser->new(\$content);
while ($parser->get_tag) {
    $stripped_html[0] .= $parser->get_trimmed_text() . "\n";
}
$stripped_html[0] = join(" ", split " ", $stripped_html[0]); # If we have more than one whitespace in a row leave only 1
print $stripped_html[0] . "\n";

my @website_links = $webcrawler->links; # Put the links that exist in the HTML of the URL given by the user in an array
$x = $x + 1;

# The initial URL is stored in an array which will be checked against
# the array of URLs to see if a website has been visited before
my @visited_urls = ($uri);
my @new_uri;

while (@website_links) { # While the array still has elements (URLs), check the content for links and strip the HTML
    if ((grep { $_ eq $website_links[0] } @visited_urls) > 0) { # If the URL has been visited, don't visit it again
        shift @website_links; # Remove the URL currently being processed from the list of URLs to visit
    }
    else {
        # If the URL hasn't been visited, find the links in its content,
        # add them to the array of URLs to visit, extract its contents
        # into a string, and remove the URL from the array of URLs to visit

        # The next lines initialize the current URL and save the links
        # it has in our array for later processing
        $new_uri[0] = URI->new($website_links[0]);
        $webcrawler->get($new_uri[0]);
        my @links = $webcrawler->links($new_uri[0]);
        push(@website_links, @links);      # The URLs that were put in the links array are added to the website_links array
        splice(@links, 0, scalar(@links)); # Delete all the elements of the links array
        shift(@new_uri);

        # The following extracts the HTML from the contents and leaves
        # only the text, in the same way as done above
        $content = $webcrawler->content;
        $parser  = HTML::TokeParser->new(\$content);
        while ($parser->get_tag) {
            $stripped_html[$x] .= $parser->get_trimmed_text() . "\n";
        }
        $stripped_html[$x] = join(" ", split " ", $stripped_html[$x]);
        push(@visited_urls, $new_uri[0]); # Add the link to the list of already visited URLs
        $x = $x + 1;
        shift @website_links; # Remove the URL that has just been processed, putting the next one in the queue ready for processing
        print $stripped_html[$x];
        sleep(10);
    }
}
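For reference, the text-stripping part of the script can be exercised on its own against a hard-coded HTML string (the markup below is made up purely for illustration, standing in for $webcrawler->content):

```perl
use strict;
use warnings;
use HTML::TokeParser;

# A made-up snippet of HTML standing in for a fetched page
my $html = '<p>Hello   <b>world</b></p>';

my $parser = HTML::TokeParser->new(\$html);
my $text   = '';
while ($parser->get_tag) {
    # get_trimmed_text returns the text up to the next tag,
    # with runs of whitespace collapsed to single spaces
    $text .= $parser->get_trimmed_text() . "\n";
}
$text = join(" ", split " ", $text);  # collapse the newlines added above
print $text, "\n";                    # Hello world
```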

As you can see, the code first processes the inputted URL and does what it's supposed to do. I have added a print statement so that I could check whether the code really works, and it works fine; that is, it prints the contents of the website (it also saves the URLs in the website_links array, one URL per element).

The problem now is that when I run this script it prints the contents of the inputted URL, but when it enters the while loop it does nothing. The while loop is supposed to do the same thing with all the URLs saved in that array, but it doesn't seem to. After printing the contents it gives me "Use of uninitialized value" warnings until I terminate the program.

Use of uninitialized value in split at NewWebC.pl line 83, <STDIN> line 1.
Use of uninitialized value in print at NewWebC.pl line 88, <STDIN> line 1.
Use of uninitialized value in string eq at NewWebC.pl line 63, <STDIN> line 1.
Use of uninitialized value in split at NewWebC.pl line 83, <STDIN> line 1.
Use of uninitialized value in print at NewWebC.pl line 88, <STDIN> line 1.
And on, and on....

I know this error message crops up when nothing has been written to the variables in question, so it looks like nothing is being written to the website_links or the stripped_html arrays.

Any help would be greatly appreciated. I have been working on this for ages but still can't fix it. Thanks

Hi monks, I still haven't managed to get this working. I understand that what Roger is saying is correct, that I am probably calling an array reference. The thing is, I haven't got a clue what to do to rectify it, and I've been playing around with it for a while. The map Roger suggested doesn't really work, and I'm getting quite desperate with this. Any help would be immensely appreciated.

Re: Loop will not save into the array
by Roger (Parson) on Aug 17, 2005 at 11:21 UTC
    I fancy that converting your code from
    $stripped_html[$x] = join(" ",split " ",$stripped_html[$x]);
    to
    $stripped_html[$x] = join(" ",split(/\s+/,$stripped_html[$x]));
    will do the trick. The reason being that if you have more than one white spaces, say, n, the split will generate n-1 undef's in the array, thus the warning.

    Well, I would probably rewrite this using regular expression:
    $stripped_html[$x] =~ s/\s+/ /g;
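    To illustrate the substitution approach on a made-up string:

```perl
use strict;
use warnings;

my $text = "This  has\t\truns   of\n\nwhitespace";
$text =~ s/\s+/ /g;  # every run of whitespace becomes a single space
print $text, "\n";   # This has runs of whitespace
```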


      Roger: If you read far enough down in "perldoc -f split", you'll find that  split " ",$string and  split /\s+/,$string are nearly equivalent (except in how they treat initial whitespace in $string).

      Using a single quoted space character as the split pattern is one of those nifty little bits of perl magic that is meant to save unnecessary typing, giving the same behavior as a bare "split" call with no regex pattern. (But  split / /,$string -- not using quotes -- really does split on every individual space character.)

      But I think you're right to suggest that the OP should add the extra parens, to make sure there's no confusion about which args are supposed to go to "split" as opposed to "join".
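      The difference shows up on a string with leading and repeated spaces (hypothetical example):

```perl
use strict;
use warnings;

my $s = "  a  b c";
my @magic  = split " ",   $s;  # ("a", "b", "c")             - leading whitespace skipped
my @regex  = split /\s+/, $s;  # ("", "a", "b", "c")         - leading empty field kept
my @single = split / /,   $s;  # ("", "", "a", "", "b", "c") - splits on every single space
print scalar(@magic), " ", scalar(@regex), " ", scalar(@single), "\n";  # 3 4 6
```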

      Thanks for your suggestion, Roger. Unfortunately, it hasn't done the trick; it will still print the contents of the initial URL and then start giving the uninitialized value warnings. You forgot a ) after the regex in the $stripped_html... line (I am mentioning this just to make sure you didn't leave it out for a special reason). I believe the problem is that for some reason the data is not being written to my two arrays, but I can't understand why that is.
        Use $website_links[0][0] instead of $website_links[0]. Why? Because $website_links[0] is an array reference.

        Hang on, better still, you just need to change one line of code to make it work...

        From
        my @website_links = $webcrawler->links;
        to
        my @website_links = map { $_->[0] } $webcrawler->links;

        Cheers

Re: Loop will not save into the array
by lidden (Curate) on Aug 17, 2005 at 11:44 UTC
    I added a print "@website_links\n"; as the first thing in the while loop, and I don't think @website_links contains what you expect.
Re: Loop will not save into the array
by GrandFather (Saint) on Aug 17, 2005 at 11:23 UTC

    What do you use as your test URL? I can't reproduce your results with either http://www.dcs.shef.ac.uk/ or http://perlmonks.org/?parent=484363;node_id=3333.

    The second URL generates a text dump for this node, but then goes away for as long as I am prepared to wait before killing it.


    Perl is Huffman encoded by design.
      I use http://www.dcs.shef.ac.uk/ and it works fine for me. I also use http://www.perlmonks.org/ and that works as well. About the question of printing the elements of the array: it doesn't print those because the links are WWW::Mechanize::Link objects, so they will not print as you would expect. I am pretty sure there is a node on this website that explains how to print them.
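      As a sketch of what those objects look like: WWW::Mechanize::Link provides accessors such as ->url, ->url_abs and ->text, so the objects returned by $webcrawler->links can be turned into plain strings with those. The link below is built by hand (with a made-up URL) purely so the snippet needs no network access:

```perl
use strict;
use warnings;
use WWW::Mechanize::Link;

# A hypothetical link, constructed directly instead of coming from a fetched page
my $link = WWW::Mechanize::Link->new({
    url  => "/teaching/index.html",
    text => "Teaching",
    base => "http://www.dcs.shef.ac.uk/",
});

print $link->url, "\n";                 # /teaching/index.html
print $link->url_abs->as_string, "\n";  # http://www.dcs.shef.ac.uk/teaching/index.html
print $link->text, "\n";                # Teaching
```

      In the OP's script, something like  my @website_links = map { $_->url_abs->as_string } $webcrawler->links;  would give an array of plain URL strings rather than objects.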