mkurtis has asked for the wisdom of the Perl Monks concerning the following question:

Here's my web crawler:
#!/usr/bin/perl -w
use LWP::Simple;
use HTML::SimpleLinkExtor;
use Data::Dumper;
use LWP::RobotUA;
use HTTP::Response;

open(LINKS, ">>/home/baelnorn/urls.txt");
while (<LINKS>) {
    chomp $_;
    my $ua = LWP::RobotUA->new("theusefulbot", "akurtis3 at yahoo.com");
    $ua->delay(.10/60);
    my $content = $ua->get("$_");
    my $extor   = HTML::SimpleLinkExtor->new();
    $extor->parse($content);
    my @links = $extor->a;
    foreach $links (@links) {
        if ($links =~ m/^\// and $_ =~ m/\/$/) {
            substr($links, 0, 1);
            my $address = "$_ $links";
            print LINKS "$address\n";
        }
        else {
            if ($links =~ m/^http:\/\/|^www./) {
                print LINKS "$links\n";
            }
            if ($links != ~m/^\// and $_ =~ m/\/$/) {
                my $address = "$_ $links";
                print LINKS "$address\n";
            }
            if ($links != ~m/^\// and $_ != ~m/\/$/) {
                my $address = "$_ \ $links";
                print LINKS "$address\n";
            }
        }
    }
    print $content;
}
When I run it, it says "Useless use of substr in void context at newcrawler.pl line 23." I think it is getting trapped in my if statements, which are meant to deal with links like /foo/foo.htm and foo.htm. Any ideas? Thanks,

20040220 Edit by BazB: Changed title from 'Web Crawler'

Re: Useless use of substr in void context
by jweed (Chaplain) on Feb 20, 2004 at 04:25 UTC
    Well, no one would get "trapped" in your if statements if you indented your code. That would help both you and the raving lunatic who will have to maintain your creation. ;)

    Here, the warning is actually helpful. The thing to note is that substr doesn't work like splice: the position you highlight by OFFSET and LENGTH (see the docs) isn't removed, it is just identified. So with this substr, you tell perl to go to the beginning of the string $links and look at one character. But you neither store that character anywhere nor change it, which makes it void context. You don't change the string, and you don't store what you've found, so perl thinks something's fishy.

    If you want to remove that first character try:
    substr($links, 0, 1) = undef; or even substr($links, 0, 1, undef);
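
    To see the difference, here is a small, untested illustration with a made-up string (an empty string works just as well as undef for the replacement):

        my $links = "/foo/foo.htm";

        # In void context the one-character slice is located but never used,
        # so under -w perl warns "Useless use of substr in void context":
        substr($links, 0, 1);

        # Either of these actually removes the leading slash:
        substr($links, 0, 1) = '';      # lvalue form
        # substr($links, 0, 1, '');     # or the four-argument replacement form

        print "$links\n";               # prints "foo/foo.htm"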


    HTH.



    Code is (almost) always untested.
    http://www.justicepoetic.net/
      Yep, you got it jweed, thanks. However, this code doesn't work: it doesn't print $content or append anything to urls.txt. My Perl is terrible, so I don't understand why. I just cross my fingers and run it. Thanks
        It doesn't print any content because the while at the top exits immediately.

        The while at the top exits immediately because you opened the file for appending (>>) instead of for reading (<).

        I think I see what you intend to do: you want to use the file as a queue, where you append new URLs to the end, and read unprocessed ones from the start (middle).

        A file is the wrong kind of data structure for that.

        You could use an array, where you push new URLs in at one end, and shift them out at the other.

        Of course, in a real webcrawler, that array is going to get rather large.
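
        A rough, untested sketch of that approach, with a made-up seed URL and the relative-link fixup from your original code left out:

            #!/usr/bin/perl -w
            use strict;
            use LWP::RobotUA;
            use HTML::SimpleLinkExtor;

            my $ua = LWP::RobotUA->new("theusefulbot", "akurtis3 at yahoo.com");
            $ua->delay(10/60);

            my @queue = ("http://www.example.com/");    # hypothetical seed URL
            my %seen;

            while (my $url = shift @queue) {    # pull URLs off the front...
                next if $seen{$url}++;          # ...skipping any already crawled
                my $response = $ua->get($url);
                next unless $response->is_success;

                my $extor = HTML::SimpleLinkExtor->new();
                $extor->parse($response->content);

                push @queue, $extor->a;         # push new links onto the back
            }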

Re: Useless use of substr in void context
by atcroft (Abbot) on Feb 20, 2004 at 04:18 UTC

    Line 23 is: substr($links, 0, 1);

    I suspect the warning you are getting is because you are using the three-argument form of substr(), which returns a portion of the string. Because you are not assigning that value to anything, or using it in an expression, you get the "Useless use of substr in void context" message. If you intend to replace a portion of the value, you may wish to look at the four-argument form; if you do not, you may wish to look closer to determine whether the line is needed at all.

    Hope that helps.

Re: Useless use of substr in void context
by chimni (Pilgrim) on Feb 20, 2004 at 04:21 UTC

    Why use substr($links, 0, 1) if all you're going to do is throw away the value and not use it anywhere in your code? Instead, do something like
    my $sub_extract = substr($links, 0, 1);
    and then use $sub_extract for whatever you want.
    rgds
Re: Useless use of substr in void context
by knowmad (Monk) on Feb 20, 2004 at 04:22 UTC

    Try assigning the result of the substr operation to a variable, e.g.:

    my $sub = substr($links, 0, 1);

    William

Re: Useless use of substr in void context
by tilly (Archbishop) on Feb 21, 2004 at 02:17 UTC
    Why aren't you using LWP::RobotUA so that your bot can politely pay attention to robots.txt like it is supposed to?

    You might also want to run perltidy on your code. Until you do, it is hard to understand why consistent indentation matters. Once you have learned from experience how much more readable it makes your code, it is painful dealing with people who have not yet acquired the habit.

      I am using LWP::RobotUA. Anyway, here is the new "tidy" code:
      #!/usr/bin/perl -w
      use LWP::Simple;
      use HTML::SimpleLinkExtor;
      use Data::Dumper;
      use LWP::RobotUA;
      use HTTP::Response;

      open(LINKS, ">>/home/baelnorn/urls.txt") || die "$!";
      while (<LINKS>) {
          print "hello";
          chomp $_;
          my $ua = LWP::RobotUA->new("theusefulbot", "akurtis3 at yahoo.com");
          $ua->delay(10/60);
          my $content = $ua->get($_);
          my $extor   = HTML::SimpleLinkExtor->new();
          $extor->parse($content);
          my @links = $extor->a;
          print "start";
          foreach $links (@links) {
              if ($links =~ m/^\// and $_ =~ m/\/$/) {
                  substr($links, 0, 1) = undef;
                  print "1";
                  my $address = "$_ $links";
                  print LINKS "$address\n";
              }
              else {
                  if ($links =~ m/^http:\/\/|^www./) {
                      print LINKS "$links\n";
                  }
                  if ($links != ~m/^\// and $_ =~ m/\/$/) {
                      my $address = "$_ $links";
                      print LINKS "$address\n";
                  }
                  if ($links != ~m/^\// and $_ != ~m/\/$/) {
                      my $address = "$_ \ $links";
                      print LINKS "$address\n";
                  }
              }
          }
          print $content;
      }
      close(LINKS);
      That array idea sounds good, but how much memory would that take up? Does anyone know how I could do it using files? Obviously this isn't going to be a huge webcrawling expedition, but it could end up with 100,000 sites in its array if it hits a very linkish site a couple of times. Thanks
        100,000 strings in an array is no big deal these days -- every system has enough virtual memory to cover this and much more. In any case, you cannot open a file for append access and then read from it. No.

        Even if you actually use some operator other than ">>" in the open statement (e.g. "+<" -- check out "perldoc -f open"), you can't alternate between appending to the file and also reading from the beginning or the middle. At least, if that's the sort of file access you really think you want, then you'd have to do something like this (check "perldoc -f seek" and "perldoc -f tell"):

        open( LINKS, "+<", "filename" );
        while (<LINKS>) {
            my $lastread = tell LINKS;   # keep track of where you are
            seek LINKS, 0, 2;            # jump to end-of-file
            ...                          # do stuff with current value of $_, including:
            print LINKS "another line of data\n";
            ...                          # when all done adding updates at the end...
            seek LINKS, $lastread, 0;    # jump back for next read
        }

        That's one way that might work -- I haven't tested it, but it seems kind of ugly (not to mention badly suited to the task), and I do not recommend it.

        I'd suggest just keeping an array -- in fact, a hash, keyed by url. (Assign a value of "undef" to each hash key as you come across new urls -- it'll use less memory.) That way, you have the ability to skip fetching a page that you've already seen, with something like:

        ...
        next if exists( $fetched{$url} );
        ...

        Use disk storage for the url list only if you discover that an in-memory hash really doesn't fit in memory. And in that case (which is very unlikely), you have a choice between a simple list like you've been trying to do so far, or a tied hash file, like GDBM_File or DB_File (see "perldoc AnyDBM_File"). When I've used DBMs in the past, I've sometimes had trouble writing new content to a DBM file that's already large -- it can take too long; and I'm not sure how these files play with empty values for the hash keys... hopefully it's a simple matter.
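
        If it does come to that, tying the hash to a DBM file is only a few extra lines. A rough, untested sketch (the file name is made up, and it stores 1 rather than undef to sidestep the empty-value question above):

            use AnyDBM_File;    # uses whichever DBM library is available
            use Fcntl;          # for the O_* flags

            my %fetched;
            tie %fetched, 'AnyDBM_File', '/home/baelnorn/fetched', O_CREAT | O_RDWR, 0644
                or die "Cannot tie DBM file: $!";

            # then, inside the crawl loop:
            #   next if exists $fetched{$url};
            #   $fetched{$url} = 1;

            untie %fetched;     # flush and close when finished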

        If a DBM file doesn't work for you, you could still use a list, but I'd suggest an approach like the following (which hasn't been tested either) -- this assumes that you start with a file called "list1", which you have already created by some other means, and which contains some initial list of urls to scrape:

        while (-s "list1") { open( INP, "<list1" ) or die $!; open( OUT, ">list2" ) or die $!; while (<INP>) { chomp; (my $pagefile = $_ ) =~ s{[ \\/?&#!;]+}{_}g; # transform url int +o a safe file name # (update: added backslash to the character set) next if ( -e $pagefile ); # skip if we already have it ... # fetch the page, extract links if any ... foreach $link ( @links ) { ... # work out the full url ... print OUT $url, "\n"; } open PAGE, ">$pagefile"; # save current page to its own file print PAGE $content; close PAGE; } close INP; close OUT; system( "sort -u list2 > list1" ); # rewrite list1 with only the un +ique strings in list2 } print "list1 is empty now. We must be done.\n";
        That will entail a lot of file-system activity: a file-existence check for every url, and an open/write/close for every new page you hit, in addition to rewriting your url list on every iteration. But if you're scanning so many pages that their urls don't all fit in memory, you're going to be spending a lot of run-time no matter what, and doing the extra file i/o to eliminate duplicate fetches might be a win overall (unless you're on a really fast internet connection with a really slow disk).