in reply to Re: Re: Useless use of substr in void context
in thread Useless use of substr in void context
Even if you use some operator other than ">>" in the open statement (e.g. "+<" -- check out "perldoc -f open"), you can't simply alternate between appending to the file and reading from the beginning or the middle. If that really is the sort of file access you want, you'd have to manage the file position yourself, with something like this (check "perldoc -f seek" and "perldoc -f tell"):
    open( LINKS, "+<", "filename" );
    while (<LINKS>) {
        my $lastread = tell LINKS;  # keep track of where you are
        seek LINKS, 0, 2;           # jump to end-of-file
        ... # do stuff with current value of $_, including:
        print LINKS "another line of data\n";
        ... # when all done adding updates at the end...
        seek LINKS, $lastread, 0;   # jump back for next read
    }
That's one way that might work -- I haven't tested it, but it seems kind of ugly (not to mention badly suited to the task), and I do not recommend it.
I'd suggest just keeping an array -- in fact, a hash, keyed by url. (Assign a value of "undef" to each hash key as you come across new urls -- it'll use less memory.) That way, you can skip fetching a page you've already seen, with something like:
    ...
    next if exists( $fetched{$url} );
    ...
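To put that in context, here's a minimal untested sketch; the %fetched and @queue names are just placeholders for whatever your script actually uses:

    my %fetched;                          # urls we have already handled
    my @queue = ('http://example.com/');  # seed with your starting urls

    while ( my $url = shift @queue ) {
        next if exists $fetched{$url};  # skip a page we've already seen
        $fetched{$url} = undef;         # undef as the value keeps the hash cheap
        # fetch $url here, then push any newly found links onto @queue
    }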
Use disk storage for the url list only if you discover that the hash really doesn't fit in memory. In that (very unlikely) case, you have a choice between a simple list file, like you've been trying so far, and a tied hash file, such as GDBM_File or DB_File (see "perldoc AnyDBM_File"). When I've used DBMs in the past, I've sometimes had trouble writing new content to a DBM file that's already large -- it can take too long; and I'm not sure how these files handle empty values for the hash keys... hopefully it's a simple matter.
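If you do go the tied-hash route, the setup is only a few lines. This is an untested sketch using the stock AnyDBM_File interface; the file name "seen_urls" is arbitrary, and I store 1 rather than undef as the value, since some DBM back-ends are happier with a true value:

    use AnyDBM_File;  # binds to whichever DBM library is installed
    use Fcntl;        # provides the O_CREAT and O_RDWR flags

    tie( my %fetched, 'AnyDBM_File', 'seen_urls', O_CREAT|O_RDWR, 0644 )
        or die "can't tie seen_urls: $!";

    my $url = 'http://example.com/';
    $fetched{$url} = 1;    # mark this url as seen
    print "already seen\n" if exists $fetched{$url};

    untie %fetched;        # flushes the hash back to disk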
If a DBM file doesn't work for you, you could still use a list, but I'd suggest an approach like the following (which hasn't been tested either) -- this assumes that you start with a file called "list1", which you have already created by some other means, and which contains some initial list of urls to scrape:
    while (-s "list1") {
        open( INP, "<list1" ) or die $!;
        open( OUT, ">list2" ) or die $!;
        while (<INP>) {
            chomp;
            (my $pagefile = $_) =~ s{[ \\/?&#!;]+}{_}g;  # transform url into a safe file name
                                                         # (update: added backslash to the character set)
            next if ( -e $pagefile );  # skip if we already have it
            ... # fetch the page, extract links if any
            ...
            foreach $link ( @links ) {
                ... # work out the full url
                ...
                print OUT $url, "\n";
            }
            open PAGE, ">$pagefile";  # save current page to its own file
            print PAGE $content;
            close PAGE;
        }
        close INP;
        close OUT;
        system( "sort -u list2 > list1" );  # rewrite list1 with only the unique strings in list2
    }
    print "list1 is empty now. We must be done.\n";

That will entail a lot of file-system activity: check for file existence on every url, and open/write/close a file for every new page you hit, in addition to rewriting your url list on every iteration. But if you're scanning so many pages that their urls don't all fit in memory, you're going to be spending a lot of run-time no matter what, and doing the extra file i/o to eliminate duplicate fetches might be a win overall (unless you have really fast internet access and a really slow disk).
Replies are listed 'Best First'.
Re: Re: Re: Re: Useless use of substr in void context
by mkurtis (Scribe) on Feb 23, 2004 at 01:57 UTC
by graff (Chancellor) on Feb 23, 2004 at 07:52 UTC
by mkurtis (Scribe) on Feb 25, 2004 at 02:59 UTC
by leira (Monk) on Feb 25, 2004 at 03:38 UTC