in reply to Cutting Out Previously Visited Web Pages in A Web Spider

Have you looked at the FAQ?

Have you used the Search?

The answers are there, probably under the keyword duplicate, probably in reference to arrays and hashes.

Come back after you have thoroughly searched those, and ask again if there are elements you do not understand.

--
TTTATCGGTCGTTATATAGATGTTTGCA


Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider
by mkurtis (Scribe) on Mar 11, 2004 at 02:58 UTC
    Searched for and found "How do I avoid inserting duplicate numbers into an Access table?". Read perlfaq4, which is the same as perldoc -q duplicate. I guess what I don't understand is why my way doesn't work. I am trying to take all the links that I have visited and, if any of them are the same as $links, not print them to the file that I pull the URLs to be visited from. I also have no clue how to do it differently. I don't think a database approach will work; I already tried. I also read perldoc -f splice, but I don't see how I will know what position the element is at.

    Thanks

      If you are saving info on each page you find to a file, then couldn't you just check whether the file already exists before writing to it?

      I didn't really understand your code, but you could save each URL in a hash. Then just check whether the URL already exists in your hash before reading the page again (see the sketch at the end of this thread). The hash would only get as big as the number of sites you spider.


      ___________
      Eric Hodges
        Well, I can't check whether the file exists because the files are numbered, not given identifying names. I'm sure a hash would work, but I don't know how. Is there any reason why the current setup doesn't work?

        Thanks
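
A minimal sketch of the hash-based check suggested above (the queue file name, the sample links, and the surrounding loop are illustrative assumptions, not the original poster's code):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Track every URL that has already been queued or visited.
    my %seen;

    # Hypothetical links extracted from a fetched page; in the real
    # spider these would come from parsing the HTML.
    my @links = (
        'http://example.com/a.html',
        'http://example.com/b.html',
        'http://example.com/a.html',   # duplicate, should be skipped
    );

    open my $queue, '>>', 'urls_to_visit.txt' or die "Cannot open queue: $!";

    for my $url (@links) {
        next if $seen{$url}++;       # skip anything seen before
        print {$queue} "$url\n";     # only new URLs reach the queue file
    }

    close $queue or die "Cannot close queue: $!";

This is the same %seen idiom that the perlfaq4 entry on removing duplicates is built around; keeping the hash in memory avoids having to re-scan the output file for every new link.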