in reply to Re: Pushing unique items onto an array
in thread Pushing unique items onto an array

So I should just test with if (exists $links{$url}) {...} instead?

Let's say I start with one link, the one I fetch first (the parent url). I push that into a hash, with the url itself (?) as the key, and a value of '1' for unique (along with other values, like the HTML content, a status code, and other stuff).

I then grab the links from that url after fetching, and come back with an array of, say, 500 other links. I can filter those links for uniqueness (grep { !$saw{$_}++ } map { s/#.*$//; $_ } @thing;), but how do I determine if any of those links match the leading one I used to start the fetch, and subsequent links found on pages followed from there?

If I blindly just push the new elements onto the hash, I'll overwrite any existing keys (and values) which have the same key, which might not be good for efficiency. Thanks for your help, sauoq.

Re: Re: Re: Pushing unique items onto an array
by sauoq (Abbot) on May 30, 2003 at 17:27 UTC
    how do I determine if any of those links match the leading one I used to start the fetch, and subsequent links found on pages followed from there?

    Strip any anchors off immediately upon capturing the link. Keep a single hash with all of the seen URLs. When you start, add your root URL to the hash.
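
A minimal sketch of that advice. The normalize_url helper and the example.com links are made up for illustration and are not from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: drop the fragment ("#anchor") from a URL.
sub normalize_url {
    my ($url) = @_;
    $url =~ s/#.*$//;
    return $url;
}

my %seen;
my $root_url = 'http://url.example.com/';
$seen{ normalize_url($root_url) } = 1;    # the root URL goes in first

# Made-up links, standing in for whatever the fetch returned.
my @found = (
    'http://url.example.com/#top',
    'http://url.example.com/a.html',
    'http://url.example.com/a.html#section2',
    'http://url.example.com/b.html',
);

# Keep only links not seen before; the post-increment marks each one as seen.
my @new = grep { !$seen{$_}++ } map { normalize_url($_) } @found;

print "$_\n" for @new;
```

With these inputs only a.html and b.html come out as new: the first link collapses to the root URL once its anchor is stripped, and the second a.html link is a duplicate.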

    If I blindly just push the new elements onto the hash, I'll overwrite any existing keys (and values) which have the same key, which might not be good for efficiency.

    You won't overwrite the keys. Perl will see that the key exists and simply change the value. You can use exists() if you want to, but it isn't really necessary and I wouldn't do it.
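
A small sketch of what happens on a repeated store. The %links layout (a hash of records with status and content fields) is an assumption based on the question, not code from the thread:

```perl
use strict;
use warnings;

my %links;
my $url = 'http://url.example.com/';

# First store: creates the key.
$links{$url} = { status => 200, content => '<html>...</html>' };

# Second store to the same key: the key is reused, the value is replaced.
$links{$url} = { status => 304 };

print scalar(keys %links), "\n";    # still a single key
print exists( $links{$url}{content} ) ? "kept\n" : "replaced\n";

# If the stored record must survive, assign only when the slot is empty.
$links{$url} ||= { status => 500 };
print $links{$url}{status}, "\n";   # not changed by the ||= line
```

For a plain counter such as $seen{$url}++ nothing is lost by storing again, which is why the exists() check is unnecessary there. The ||= guard is only one possible option for the case where the value holds data worth keeping.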

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Re: Re: Pushing unique items onto an array
by sauoq (Abbot) on May 30, 2003 at 17:43 UTC

    If I'm guessing right about what you are trying to do, here's some code that might help. You'll have to provide the fetch_urls() sub and you should also think about limiting the depth and/or breadth of your search.

    Untested...

    my %seen;
    my @urls;
    my $root_url = 'http://url.example.com/';

    $seen{$root_url}++;
    push @urls, $root_url;

    while ($root_url = shift @urls) {
        for my $url ( fetch_urls($root_url) ) {
            $url =~ s/#.*$//;                     # strip any anchor
            push @urls, $url unless $seen{$url};  # queue only unseen URLs
            $seen{$url}++;
        }
    }
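
To try the loop above without a network connection, here is a sketch with a stubbed fetch_urls() and a depth limit. The %fake_web link graph, the $max_depth setting, and the [url, depth] queue entries are all assumptions made up for illustration; a real fetch_urls() would download the page and extract its links.

```perl
use strict;
use warnings;

# Made-up link graph standing in for real pages (assumption).
my %fake_web = (
    'http://url.example.com/'  => [ 'http://url.example.com/a', 'http://url.example.com/#top' ],
    'http://url.example.com/a' => [ 'http://url.example.com/b#x', 'http://url.example.com/' ],
    'http://url.example.com/b' => [ 'http://url.example.com/c' ],
    'http://url.example.com/c' => [],
);

# Stub: return the links "found" on a page.
sub fetch_urls {
    my ($url) = @_;
    return @{ $fake_web{$url} || [] };
}

my $max_depth = 2;                       # hypothetical limit
my $root_url  = 'http://url.example.com/';
my %seen      = ( $root_url => 1 );
my @queue     = ( [ $root_url, 0 ] );    # entries are [url, depth]
my @visited;

while ( my $item = shift @queue ) {
    my ( $page, $depth ) = @$item;
    push @visited, $page;
    next if $depth >= $max_depth;        # do not follow links any deeper

    for my $link ( fetch_urls($page) ) {
        ( my $url = $link ) =~ s/#.*$//; # strip any anchor from a copy
        next if $seen{$url}++;           # skip anything already queued
        push @queue, [ $url, $depth + 1 ];
    }
}

print "$_\n" for @visited;
```

Carrying the depth alongside each queued URL is one simple way to bound the search. With this fake graph and a limit of 2, the page at /c is never visited, because /b is reached at the limit and its links are not followed.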
    -sauoq
    "My two cents aren't worth a dime.";