in reply to Re: Re: Useless use of substr in void context
in thread Useless use of substr in void context

100,000 strings in an array is no big deal these days -- every system has enough virtual memory to cover this and much more. In any case, you cannot open a file for append access and then read from it. No.

Even if you use some operator other than ">>" in the open statement (e.g. "+<" -- check out "perldoc -f open"), you can't freely alternate between appending to the file and reading from the beginning or the middle. If that really is the sort of file access you want, you'd have to do something like this (check "perldoc -f seek" and "perldoc -f tell"):

open( LINKS, "+<", "filename" ) or die $!;
while (<LINKS>) {
    my $lastread = tell LINKS;    # keep track of where you are
    seek LINKS, 0, 2;             # jump to end-of-file
    ...                           # do stuff with current value of $_, including:
    print LINKS "another line of data\n";
    ...                           # when all done adding updates at the end...
    seek LINKS, $lastread, 0;     # jump back for next read
}

That's one way that might work -- I haven't tested it, but it seems kind of ugly (not to mention badly suited to the task), and I do not recommend it.

I'd suggest just keeping an array -- in fact, a hash, keyed by url. (Assign a value of "undef" to each hash key as you come across new urls -- it'll use less memory.) That way, you have the ability to skip fetching a page that you've already seen, with something like:

...
next if exists( $fetched{$url} );
...
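For instance (untested, with hypothetical names -- @queue holds the urls still to visit, %fetched records every url seen so far), the whole bookkeeping amounts to:

use strict;
use warnings;

my @queue = ('http://example.com/');   # seeded by whatever means you start with
my %fetched;

while ( my $url = shift @queue ) {
    next if exists $fetched{$url};   # already seen -- skip the fetch
    $fetched{$url} = undef;          # the key alone is the record; undef keeps memory low
    # fetch $url here, and push any newly extracted links onto @queue
}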

Use disk storage for the url list only if you discover that an in-memory hash really doesn't fit in memory. And in that case (which is very unlikely), you have a choice between a simple list like you've been trying to do so far, or a tied hash file, like GDBM_File or DB_File (see "perldoc AnyDBM_File"). When I've used DBM's in the past, I've had trouble sometimes when writing new content to a DBM file that's already large -- it can take too long; and I'm not sure how these files play with empty values for the hash keys... hopefully it's a simple matter.
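If it does come to a tied hash, the change is small. Here's an untested sketch assuming DB_File is installed (the filename and permissions are arbitrary); note that it stores 1 rather than undef, to sidestep the empty-value question above:

use strict;
use warnings;
use DB_File;
use Fcntl;   # supplies the O_CREAT and O_RDWR flags

# %fetched now lives on disk; the lookups stay the same as before:
tie my %fetched, 'DB_File', 'fetched.db', O_CREAT | O_RDWR, 0644, $DB_HASH
    or die "Cannot tie fetched.db: $!";

$fetched{'http://example.com/'} = 1;   # a defined value, not undef
untie %fetched;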

If a DBM file doesn't work for you, you could still use a list, but I'd suggest an approach like the following (which hasn't been tested either) -- this assumes that you start with a file called "list1", which you have already created by some other means, and which contains some initial list of urls to scrape:

while (-s "list1") { open( INP, "<list1" ) or die $!; open( OUT, ">list2" ) or die $!; while (<INP>) { chomp; (my $pagefile = $_ ) =~ s{[ \\/?&#!;]+}{_}g; # transform url int +o a safe file name # (update: added backslash to the character set) next if ( -e $pagefile ); # skip if we already have it ... # fetch the page, extract links if any ... foreach $link ( @links ) { ... # work out the full url ... print OUT $url, "\n"; } open PAGE, ">$pagefile"; # save current page to its own file print PAGE $content; close PAGE; } close INP; close OUT; system( "sort -u list2 > list1" ); # rewrite list1 with only the un +ique strings in list2 } print "list1 is empty now. We must be done.\n";
That will entail a lot of file-system activity: a file-existence check for every url, and an open/write/close for every new page you hit, in addition to rewriting your url list on every iteration. But if you're scanning so many pages that their urls don't all fit in memory, you're going to spend a lot of run-time no matter what, and the extra file i/o that eliminates duplicate fetches might be a win overall (unless you have really fast internet access and a really slow disk).

Re: Re: Re: Re: Useless use of substr in void context
by mkurtis (Scribe) on Feb 23, 2004 at 01:57 UTC
    Wow, thanks graff. Just one question: how do I "work out the full url"? Is there a module that would convert links from /foo/foo.htm to foo.com/foo/foo.htm just by knowing what the current website is?
      how do I "work out the full url"?

      update: You only have to do this if HTML::SimpleLinkExtor doesn't do it for you -- and it apparently will do it for you if there is a <BASE HREF> tag in the page that you fetch. For that matter, when you happen to fetch a page that has no such tag, you could just put one in before passing the page to LinkExtor:

      s{(</head>)}{<base href="$current_docroot">$1}i;
      See below about determining the doc-root from a url. (end of update)
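      Putting the update together, something like this might do (untested, and assuming HTML::SimpleLinkExtor's parse and links methods -- check its docs; the values of $content and $current_docroot here are stand-ins for what your script already has):

      use strict;
      use warnings;
      use HTML::SimpleLinkExtor;

      # stand-in sample values:
      my $current_docroot = 'http://example.com/';
      my $content = '<html><head></head><body><a href="/foo/foo.htm">foo</a></body></html>';

      unless ( $content =~ /<base\s/i ) {   # add a BASE tag only if the page lacks one
          $content =~ s{(</head>)}{<base href="$current_docroot">$1}i;
      }
      my $extor = HTML::SimpleLinkExtor->new();
      $extor->parse($content);
      my @links = $extor->links;   # with the BASE tag present, these should come back absolute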

      I kinda thought you were already doing that (or trying to) in the code you posted -- but maybe you haven't had a chance to test that part yet (or you did test it, and it didn't work :P).

      I'll admit that it's not something I've tried myself, but I think my first impulse would be to do something similar to what you have (this is also not tested):

      # $link_target and $url_just_fetched are known...
      my $full_target;
      if ( $link_target =~ m{^/} ) {
          # initial "/" means "relative to doc-root":
          my $docroot = $url_just_fetched;
          $docroot =~ s{(?<=[^/]/)[^/].*}{};   # delete everything after the first single slash
          $docroot =~ s{/$}{};                 # drop any trailing slash -- $link_target supplies one
          $full_target = $docroot . $link_target;
      }
      elsif ( $link_target !~ /^http/ ) {
          # it's presumably relative to the url just fetched, so:
          if ( $url_just_fetched =~ m{/$} ) {
              $full_target = $url_just_fetched . $link_target;
          }
          # this is the tricky part (no ":" in the class -- it would match the "http:" scheme in every url):
          elsif ( $url_just_fetched =~ /[?&=;]|\.htm/ )   # probably not a directory name...
          {
              my $last_slash = rindex( $url_just_fetched, "/" ) + 1;
              $full_target = substr( $url_just_fetched, 0, $last_slash ) . $link_target;
          }
          else   # assume it's a directory name
          {
              $full_target = join "/", $url_just_fetched, $link_target;
          }
      }
      # (if $link_target does start with "http", then it's probably complete already)
      # last step:
      $full_target =~ s/\#.*//;   # in case the link target is a named anchor within a page
      (I did do a quick "super search" for SoPW nodes that discuss "resolve relative href", and only found chromatic saying "there is no such module" -- maybe he was thinking of the question in a different context...)

      No doubt there'll be some tweaking to do -- especially the part that tries to guess if the last chunk of a url looks like a directory -- but that should get you started. You probably want to make that a separate subroutine (maybe someday it'll be a module).

        Thank you. I also have a question about using a hash: what do you mean by "use undef for the key"? Wouldn't this make all keys the same, and where do I put the url? Thanks for all the help!