in reply to Pushing unique items onto an array

Using an array for that doesn't scale, because every uniqueness check is a linear search through the array. The better way to do it is to use your links as hash keys, so each lookup takes roughly constant time.

If you absolutely must keep an array, keep both an array and a hash. Something like this will do the trick.

push @array, $item unless $seen{$item};
$seen{$item}++;
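
For illustration, here is a minimal, self-contained sketch of that array-plus-hash idea. The sample links in @incoming are made up, and the two statements above are folded into one by relying on the post-increment returning the old count; treat it as one possible way to write it, not the only one.

```perl
use strict;
use warnings;

my @array;   # keeps the items in first-seen order
my %seen;    # keys are the items already pushed

# Hypothetical sample data; two of the links repeat.
my @incoming = qw(
    http://a.example.com/
    http://b.example.com/
    http://a.example.com/
    http://c.example.com/
    http://b.example.com/
);

for my $item (@incoming) {
    # $seen{$item}++ returns the old count, so it is false only the first time.
    push @array, $item unless $seen{$item}++;
}

print "$_\n" for @array;
```

The hash answers "have I seen this?" and the array preserves the order in which new items arrived.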

-sauoq
"My two cents aren't worth a dime.";

Re: Re: Pushing unique items onto an array
by Anonymous Monk on May 30, 2003 at 17:16 UTC
    So I should just test with if (exists $links{$url}) {...} instead?

    Let's say I start with one link, the one I fetch first (the parent url). I push that into a hash, with the url itself (?) as the key, and a value of '1' for unique (along with other values, like the HTML content, a status code, and other stuff).

    I then grab the links from that url after fetching, and come back with an array of say, 500 other links. I can filter those links for uniqueness (grep { !$saw{$_}++ } map { s/#.*$//; $_ } @thing;), but how do I determine if any of those links match the leading one I used to start the fetch, and subsequent links found on pages followed from there?

    If I blindly just push the new elements onto the hash, I'll overwrite any existing keys (and values) which have the same key, which might not be good for efficiency. Thanks for your help sauoq.

      how do I determine if any of those links match the leading one I used to start the fetch, and subsequent links found on pages followed from there?

      Strip any anchors off immediately upon capturing the link. Keep a single hash with all of the seen URLs. When you start, add your root URL to the hash.
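
As a rough sketch of that advice (the URLs and the @found batch are invented for the example), seeding one %seen hash with the root URL means a freshly scraped batch can be checked against everything met so far, the root included:

```perl
use strict;
use warnings;

my $root_url = 'http://url.example.com/';   # hypothetical start page
my %seen     = ( $root_url => 1 );          # seed with the root before fetching

# Hypothetical batch of links scraped from the root page.
my @found = (
    'http://url.example.com/#top',
    'http://url.example.com/about',
    'http://url.example.com/about#team',
    'http://url.example.com/contact',
);

# Strip anchors first, then keep only URLs the single %seen hash has not met.
my @new = grep { !$seen{$_}++ } map { ( my $u = $_ ) =~ s/#.*$//; $u } @found;

print "$_\n" for @new;
```

Here the link back to the root and the second "about" link drop out, because both reduce to keys that are already in %seen once the anchor is stripped.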

      If I blindly just push the new elements onto the hash, I'll overwrite any existing keys (and values) which have the same key, which might not be good for efficiency.

      You won't overwrite the keys. Perl will see that the key exists and simply change the value. You can use exists() if you want to, but it isn't really necessary and I wouldn't do it.
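
A small sketch of what storing under an existing key does, assuming a hypothetical %links hash that holds a record per URL. For a plain seen-counter the behavior is harmless; it only matters if the value carries data you want to keep, such as page content.

```perl
use strict;
use warnings;

my %links;
my $url = 'http://url.example.com/';   # hypothetical URL

# First store: creates the key.
$links{$url} = { status => 200, content => '<html>first</html>' };

# Second store under the same key: still one key, but the old value is replaced.
$links{$url} = { status => 404 };

print scalar( keys %links ), "\n";                                  # prints 1
print defined( $links{$url}{content} ) ? "kept\n" : "replaced\n";   # prints replaced

# A guard is needed only when the stored record must survive.
$links{$url} = { status => 200 } unless exists $links{$url};
print $links{$url}{status}, "\n";                                   # prints 404
```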

      -sauoq
      "My two cents aren't worth a dime.";
      

      If I'm guessing right about what you are trying to do, here's some code that might help. You'll have to provide the fetch_urls() sub and you should also think about limiting the depth and/or breadth of your search.

      Untested...

      my %seen;
      my @urls;
      my $root_url = 'http://url.example.com/';

      $seen{$root_url}++;
      push @urls, $root_url;

      while ($root_url = shift @urls) {
          for my $url ( fetch_urls($root_url) ) {
              $url =~ s/#.*$//;
              push @urls, $url unless $seen{$url};
              $seen{$url}++;
          }
      }
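
To try the same loop without a network, here is a hedged sketch that swaps in a stub fetch_urls() reading from a made-up in-memory %site table, and adds one possible depth limit. The $max_depth value, the [url, depth] queue entries, and the @visited list are all assumptions for the example, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical in-memory "site" standing in for real HTTP fetches.
my %site = (
    'http://url.example.com/'  => [ 'http://url.example.com/a', 'http://url.example.com/b#top' ],
    'http://url.example.com/a' => [ 'http://url.example.com/',  'http://url.example.com/b' ],
    'http://url.example.com/b' => [ 'http://url.example.com/c#x', 'http://url.example.com/a#y' ],
    'http://url.example.com/c' => [ 'http://url.example.com/d' ],
);

# Stub: return the links found on a page (empty list if the page is unknown).
sub fetch_urls {
    my ($url) = @_;
    return @{ $site{$url} || [] };
}

my $max_depth = 2;                        # assumed limit; the root is depth 0
my $root_url  = 'http://url.example.com/';
my %seen      = ( $root_url => 1 );
my @queue     = ( [ $root_url, 0 ] );     # entries are [url, depth] pairs
my @visited;

while ( my $entry = shift @queue ) {
    my ( $page, $depth ) = @$entry;
    push @visited, $page;
    next if $depth >= $max_depth;         # do not follow links any deeper

    for my $url ( fetch_urls($page) ) {
        $url =~ s/#.*$//;                 # strip the anchor
        push @queue, [ $url, $depth + 1 ] unless $seen{$url}++;
    }
}

print "$_\n" for @visited;
```

Because the queue is processed first-in, first-out, pages are visited breadth-first, and carrying the depth alongside each URL makes the cutoff a single comparison.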
      -sauoq
      "My two cents aren't worth a dime.";