lampros21_7 has asked for the wisdom of the Perl Monks concerning the following question:

Hi to the fellow monks, I am making a script that has two goals. Given an input URL, the script has to fetch its content, find the links that exist in there and save them in an array, and also strip the HTML from the source and save only the text in an array. That is, save one URL's text content into one element of the array. The code is below:

use WWW::Mechanize;
use URI;
use HTML::TokeParser;

print "WEB CRAWLER AND HTML EXTRACTOR \n";
print "Please input the URL of the site to be searched \n";
print "Please use a full URL (eg. http://www.dcs.shef.ac.uk/) \n";

# Create an instance of the webcrawler
my $webcrawler = WWW::Mechanize->new();
my $url_name = <STDIN>;        # The user inputs the URL to be searched
my $uri = URI->new($url_name); # Process the URL and make it a URI

# Grab the contents of the URL given by the user
$webcrawler->get($uri);

# Use the HTML::TokeParser module to extract the contents from the website
my @stripped_html;
my $x = 0;
my $content = $webcrawler->content;
my $parser  = HTML::TokeParser->new(\$content);
while ($parser->get_tag) {
    $stripped_html[0] .= $parser->get_trimmed_text() . "\n";
}
$stripped_html[0] = join(" ", split " ", $stripped_html[0]); # If we have more than one whitespace in a row leave only 1
print $stripped_html[0] . "\n";

my @website_links = $webcrawler->links; # Put the links that exist in the HTML of the URL given by the user in an array
$x = $x + 1;

# The initial URL is stored in an array which will be checked against
# the array of URLs to see if a website has been visited before
my @visited_urls = ($uri);
my @new_uri;

while (@website_links) { # While the array still has elements (URLs), check the content for links and strip the HTML
    if ((grep { $_ eq $website_links[0] } @visited_urls) > 0) { # If the URL has been visited, don't visit it again
        shift @website_links; # Remove the URL currently being processed from the list of URLs to visit
    }
    else {
        # If the URL hasn't been visited, find the links in its content,
        # add them to the array of URLs to visit, extract its contents
        # into a string, and remove the URL from the array of URLs to visit

        # The next lines initialize the current URL and save the links
        # it has in our array for later processing
        $new_uri[0] = URI->new($website_links[0]);
        $webcrawler->get($new_uri[0]);
        my @links = $webcrawler->links($new_uri[0]);
        push(@website_links, @links);      # The URLs that were put in the links array are added to the website_links array
        splice(@links, 0, scalar(@links)); # Delete all the elements of the links array
        shift(@new_uri);

        # The following extracts the HTML from the contents and leaves
        # only the text, in the same way as done above
        $content = $webcrawler->content;
        $parser  = HTML::TokeParser->new(\$content);
        while ($parser->get_tag) {
            $stripped_html[$x] .= $parser->get_trimmed_text() . "\n";
        }
        $stripped_html[$x] = join(" ", split " ", $stripped_html[$x]);
        push(@visited_urls, $new_uri[0]); # Add the link to the list of already visited URLs
        $x = $x + 1;
        shift @website_links; # Remove the URL that has just been processed, putting the next one in the queue ready for processing
        print $stripped_html[$x];
        sleep(10);
    }
}
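For reference, the text-stripping part of the script can be exercised on its own against a hard-coded HTML string (the markup below is made up purely for illustration, standing in for $webcrawler->content):

```perl
use strict;
use warnings;
use HTML::TokeParser;

# A made-up snippet of HTML standing in for a fetched page
my $html = '<p>Hello   <b>world</b></p>';

my $parser = HTML::TokeParser->new(\$html);
my $text   = '';
while ($parser->get_tag) {
    # get_trimmed_text returns the text up to the next tag,
    # with runs of whitespace collapsed to single spaces
    $text .= $parser->get_trimmed_text() . "\n";
}
$text = join(" ", split " ", $text);  # collapse the newlines added above
print $text, "\n";                    # Hello world
```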

As you can see, the code first processes the inputted URL and does what it's supposed to do. I have added a print statement so that I could check whether the code really works, and it works fine; that is, it prints the contents of the website (it also saves the URLs in the website_links array, one URL per element).

The problem now is that when I run this script it prints the contents of the inputted URL, but when it enters the while loop it does nothing. The while loop is supposed to do the same thing with all the URLs saved in that array, but it doesn't seem to. After printing the contents it gives me "Use of uninitialized value" warnings until I terminate the program.

Use of uninitialized value in split at NewWebC.pl line 83, <STDIN> line 1.
Use of uninitialized value in print at NewWebC.pl line 88, <STDIN> line 1.
Use of uninitialized value in string eq at NewWebC.pl line 63, <STDIN> line 1.
Use of uninitialized value in split at NewWebC.pl line 83, <STDIN> line 1.
Use of uninitialized value in print at NewWebC.pl line 88, <STDIN> line 1.
And on, and on....

I know this error message crops up when nothing has been written to the variables in question, so it looks like nothing is being written to the website_links or the stripped_html arrays.

Any help would be greatly appreciated. I have been working on this for ages but still can't fix it. Thanks

Hi monks, I still haven't managed to get this working. I understand that what Roger is saying is correct, that I am probably calling an array reference. The thing is, I haven't got a clue what to do to rectify it, and I've been playing around with it for a while. The map Roger suggested doesn't really work, and I'm getting quite desperate with this. Any help would be immensely appreciated.

Re: Loop will not save into the array
by Roger (Parson) on Aug 17, 2005 at 11:21 UTC
    I fancy that converting your code from
    $stripped_html[$x] = join(" ",split " ",$stripped_html[$x]);
    to
    $stripped_html[$x] = join(" ",split(/\s+/,$stripped_html[$x]));
    will do the trick. The reason being that if you have more than one white spaces, say, n, the split will generate n-1 undef's in the array, thus the warning.

    Well, I would probably rewrite this using regular expression:
    $stripped_html[$x] =~ s/\s+/ /g;
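    To illustrate the substitution approach on a made-up string:

```perl
use strict;
use warnings;

my $text = "This  has\t\truns   of\n\nwhitespace";
$text =~ s/\s+/ /g;  # every run of whitespace becomes a single space
print $text, "\n";   # This has runs of whitespace
```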


      Roger: If you read far enough down in "perldoc -f split", you'll find that  split " ",$string and  split /\s+/,$string are nearly equivalent (except in how they treat initial whitespace in $string).

      Using a single quoted space character as the split pattern is one of those nifty little bits of perl magic that is meant to save unnecessary typing, giving the same behavior as a bare "split" call with no regex pattern. (But  split / /,$string -- not using quotes -- really does split on every individual space character.)

      But I think you're right to suggest that the OP should add the extra parens, to make sure there's no confusion about which args are supposed to go to "split" as opposed to "join".
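      The difference shows up on a string with leading and repeated spaces (hypothetical example):

```perl
use strict;
use warnings;

my $s = "  a  b c";
my @magic  = split " ",   $s;  # ("a", "b", "c")             - leading whitespace skipped
my @regex  = split /\s+/, $s;  # ("", "a", "b", "c")         - leading empty field kept
my @single = split / /,   $s;  # ("", "", "a", "", "b", "c") - splits on every single space
print scalar(@magic), " ", scalar(@regex), " ", scalar(@single), "\n";  # 3 4 6
```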

      Thanks for your suggestion, Roger. Unfortunately, it hasn't done the trick; it will still print the contents of the initial URL and then start giving the uninitialized value warnings. You forgot a ) after the regex in the $stripped_html... line (I am mentioning this just to make sure you didn't leave it out for a special reason). I believe the problem is that for some reason the data is not being written to my two arrays, but I can't understand why that is.
        Use $website_links[0][0] instead of $website_links[0]. Why? Because $website_links[0] is an array reference.

        Hang on, better still, you just need to change one line of code to make it work...

        From
        my @website_links = $webcrawler->links;
        to
        my @website_links = map { $_->[0] } $webcrawler->links;

        Cheers

Re: Loop will not save into the array
by lidden (Curate) on Aug 17, 2005 at 11:44 UTC
    I added a print "@website_links\n"; as the first thing in the while loop, and I don't think @website_links contains what you expect.
Re: Loop will not save into the array
by GrandFather (Saint) on Aug 17, 2005 at 11:23 UTC

    What do you use as your test URL? I can't reproduce your results with either http://www.dcs.shef.ac.uk/ or http://perlmonks.org/?parent=484363;node_id=3333.

    The second URL generates a text dump for this node, but then goes away for as long as I am prepared to wait before killing it.


    Perl is Huffman encoded by design.
      I use http://www.dcs.shef.ac.uk/ and it works fine for me. I also use http://www.perlmonks.org/ and that works as well. About the question of printing the elements of the array: it doesn't print those because the links are WWW::Mechanize::Link objects, so they will not print as you would expect. I am pretty sure there is a node on this website that explains how to print them.
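      As a sketch of what those objects look like: WWW::Mechanize::Link provides accessors such as ->url, ->url_abs and ->text, so the objects returned by $webcrawler->links can be turned into plain strings with those. The link below is built by hand (with a made-up URL) purely so the snippet needs no network access:

```perl
use strict;
use warnings;
use WWW::Mechanize::Link;

# A hypothetical link, constructed directly instead of coming from a fetched page
my $link = WWW::Mechanize::Link->new({
    url  => "/teaching/index.html",
    text => "Teaching",
    base => "http://www.dcs.shef.ac.uk/",
});

print $link->url, "\n";                 # /teaching/index.html
print $link->url_abs->as_string, "\n";  # http://www.dcs.shef.ac.uk/teaching/index.html
print $link->text, "\n";                # Teaching
```

      In the OP's script, something like  my @website_links = map { $_->url_abs->as_string } $webcrawler->links;  would give an array of plain URL strings rather than objects.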