in reply to Useless use of substr in void context

Why aren't you using LWP::RobotUA so that your bot can politely pay attention to robots.txt like it is supposed to?
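For what it's worth, getting a RobotUA going only takes a few lines -- an untested sketch, where the contact address and url are placeholders (the bot name is the one from your own code):

  use strict;
  use LWP::RobotUA;

  # placeholder contact address -- use your own
  my $ua = LWP::RobotUA->new( 'theusefulbot', 'you@example.com' );
  $ua->delay( 1 );    # wait at least one minute between requests to the same host

  my $response = $ua->get( 'http://www.example.com/' );
  if ( $response->is_success ) {
      print $response->content;    # robots.txt was consulted for you before the fetch
  }
  else {
      warn $response->status_line, "\n";
  }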

You might also want to run perltidy on your code. Until you have done so, it is hard to appreciate why consistent indentation matters; once you have learned from experience how much more readable it makes your code, it becomes painful to deal with people who have not yet acquired the habit.
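(Normally you would just run the perltidy command on your script, but the same formatting is also available from inside Perl via the Perl::Tidy module -- an untested sketch:)

  use Perl::Tidy;

  # read untidy code from STDIN, print a tidied version to STDOUT
  my $messy = do { local $/; <STDIN> };
  my $tidied;
  Perl::Tidy::perltidy(
      source      => \$messy,
      destination => \$tidied,
  );
  print $tidied;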


Re: Re: Useless use of substr in void context
by mkurtis (Scribe) on Feb 21, 2004 at 16:29 UTC
    I am using LWP::RobotUA. Anyway, here is the new "tidy" code:
    #!/usr/bin/perl -w
    use LWP::Simple;
    use HTML::SimpleLinkExtor;
    use Data::Dumper;
    use LWP::RobotUA;
    use HTTP::Response;

    open(LINKS, ">>/home/baelnorn/urls.txt") || die "$!";
    while (<LINKS>) {
        print "hello";
        chomp $_;
        my $ua = LWP::RobotUA->new( "theusefulbot", "akurtis3 at yahoo.com" );
        $ua->delay( 10 / 60 );
        my $content = $ua->get($_);
        my $extor   = HTML::SimpleLinkExtor->new();
        $extor->parse($content);
        my @links = $extor->a;
        print "start";
        foreach $links (@links) {
            if ( $links =~ m/^\// and $_ =~ m/\/$/ ) {
                substr( $links, 0, 1 ) = undef;
                print "1";
                my $address = "$_ $links";
                print LINKS "$address\n";
            }
            else {
                if ( $links =~ m/^http:\/\/|^www./ ) {
                    print LINKS "$links\n";
                }
                if ( $links != ~m/^\// and $_ =~ m/\/$/ ) {
                    my $address = "$_ $links";
                    print LINKS "$address\n";
                }
                if ( $links != ~m/^\// and $_ != ~m/\/$/ ) {
                    my $address = "$_ \ $links";
                    print LINKS "$address\n";
                }
            }
        }
        print $content;
    }
    close(LINKS);
    That array idea sounds good, but how much memory would that take up? Does anyone know how I could do it using files? Obviously this isn't going to be a huge webcrawling expedition, but it could end up with 100,000 sites in its array if it hits a very linkish site a couple of times. Thanks
      100,000 strings in an array is no big deal these days -- every system has enough virtual memory to cover this and much more. In any case, you cannot open a file for append access and then read from it. No.

      Even if you actually use some operator other than ">>" in the open statement (e.g. "+<" -- check out "perldoc -f open"), you can't alternate between appending to the file and also reading from the beginning or the middle. At least, if that's the sort of file access you really think you want, then you'd have to do something like this (check "perldoc -f seek" and "perldoc -f tell"):

      open( LINKS, "+<", "filename" ); while (<LINKS>) { my $lastread = tell LINKS; # keep track of where you are seek LINKS, 0, 2; # jump to end-of-file ... # do stuff with current value of $_, including: print LINKS "another line of data\n"; ... # when all done adding updates at the end... seek LINKS, $lastread, 0; # jump back for next read: }

      That's one way that might work -- I haven't tested it, but it seems kind of ugly (not to mention badly suited to the task), and I do not recommend it.

      I'd suggest just keeping an array -- in fact, a hash, keyed by url. (Assign a value of "undef" to each hash key as you come across new urls -- it'll use less memory.) That way, you have the ability to skip fetching a page that you've already seen, with something like:

      ...
      next if exists( $fetched{$url} );
      ...
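      Fleshed out a little (untested; @starting_urls and get_links() are just stand-in names for however you seed the crawl and fetch/extract links):

      my %fetched;                          # keys are urls we have already requested
      my @queue = @starting_urls;           # however you seed the crawl

      while ( my $url = shift @queue ) {
          next if exists $fetched{$url};    # skip anything we have seen before
          $fetched{$url} = undef;           # mark it seen; undef values keep memory use down
          push @queue, get_links($url);     # stand-in for your fetch-and-extract step
      }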

      Use disk storage for the url list only if you discover that an in-memory hash really doesn't fit in memory. And in that case (which is very unlikely), you have a choice between a simple list like you've been trying to do so far, or a tied hash file, like GDBM_File or DB_File (see "perldoc AnyDBM_File"). When I've used DBM's in the past, I've had trouble sometimes when writing new content to a DBM file that's already large -- it can take too long; and I'm not sure how these files play with empty values for the hash keys... hopefully it's a simple matter.
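      For the record, tying the hash to a DBM file only takes a couple of extra lines -- something like this (untested here; the file name is arbitrary):

      use Fcntl;         # for the O_CREAT / O_RDWR flags
      use AnyDBM_File;   # uses whichever DBM flavor is installed locally

      my %fetched;
      tie( %fetched, 'AnyDBM_File', 'fetched_urls', O_CREAT|O_RDWR, 0644 )
          or die "can't tie DBM file: $!";

      # after this, %fetched works like an ordinary hash, but lives on disk
      $fetched{'http://www.example.com/'} = undef;
      print "seen it\n" if exists $fetched{'http://www.example.com/'};

      untie %fetched;    # flush and close the file when done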

      If a DBM file doesn't work for you, you could still use a list, but I'd suggest an approach like the following (which hasn't been tested either) -- this assumes that you start with a file called "list1", which you have already created by some other means, and which contains some initial list of urls to scrape:

      while (-s "list1") { open( INP, "<list1" ) or die $!; open( OUT, ">list2" ) or die $!; while (<INP>) { chomp; (my $pagefile = $_ ) =~ s{[ \\/?&#!;]+}{_}g; # transform url int +o a safe file name # (update: added backslash to the character set) next if ( -e $pagefile ); # skip if we already have it ... # fetch the page, extract links if any ... foreach $link ( @links ) { ... # work out the full url ... print OUT $url, "\n"; } open PAGE, ">$pagefile"; # save current page to its own file print PAGE $content; close PAGE; } close INP; close OUT; system( "sort -u list2 > list1" ); # rewrite list1 with only the un +ique strings in list2 } print "list1 is empty now. We must be done.\n";
      That will entail a lot of file-system activity: check for file existence on every url, and open/write/close a file for every new page you hit, in addition to rewriting your url list on every iteration. But if you're scanning so many pages that their urls don't all fit in memory, you're going to be spending a lot of run-time no matter what, and doing the extra file i/o to eliminate duplicate fetches might be a win overall (unless you're using really fast internet access and really slow disk).
        Wow, thanks graff. Just one question: how do I "work out the full url"? Is there a module that would convert links from /foo/foo.htm to foo.com/foo/foo.htm just by knowing what the current website is?