Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a process which is gathering links from a webpage for later conversion/processing, and I want to push unique links onto an array, so I can process them later.

If I grab slashdot.org, for example and extract the links from it, it will link to freshmeat.net. If I then grab freshmeat.net, and grab the links from it, there is a link (back) to slashdot.org. I would like to push the unique links found onto an array I can then manipulate later.

I know I can compute the intersection of two arrays using code similar to the snippet below, but how do I check at push() time, if the element I'm going to be pushing onto the array, is in fact unique, before I push it on?

my @fruit = ('apple', 'pear', 'lemon', 'celery', 'pear'); my %seen = (); foreach (@fruit) { $seen{$_}=1; } my @unique = keys %seen;

Replies are listed 'Best First'.
Re: Pushing unique items onto an array
by sauoq (Abbot) on May 30, 2003 at 17:05 UTC

    Using an array for that isn't scalable as your search time will be linear. The better way to do it is to use your links as hash keys.

    If you absolutely must keep an array, keep both an array and a hash. Something like this will do the trick.

    push @array, $item unless $seen{$item}; $seen{$item}++;

    -sauoq
    "My two cents aren't worth a dime.";
    
      So just test for if (exists $links{url}) {...} instead?

      Let's say I start with one link, the one I fetch first (the parent url). I push that into a hash, with the url itself (?) as the key, and a value of '1' for unique (along with other values, like the HTML content, a status code, and other stuff).

      I then grab the links from that url after fetching, and come back with an array of say, 500 other links. I can sort those links for uniqueness (grep !$saw{$_}++, map {s/#.*$//;$_ } $thing;), but how do I determine if any of those links match the leading one I used to start the fetch, and subsequent links found on pages followed from there?

      If I blindly just push the new elements onto the hash, I'll overwrite any existing keys (and values) which have the same key, which might not be good for efficiency. Thanks for your help sauoq.

        how do I determine if any of those links match the leading one I used to start the fetch, and subsequent links found on pages followed from there?

        Strip any anchors off immediately upon capturing the link. Keep a single hash with all of the seen URLs. When you start, add your root URL to the hash.

        If I blindly just push the new elements onto the hash, I'll overwrite any existing keys (and values) which have the same key, which might not be good for efficiency.

        You won't overwrite the keys. Perl will see that the key exists and simply change the value. You can use exists() if you want to, but it isn't really necessary and I wouldn't do it.

        -sauoq
        "My two cents aren't worth a dime.";
        

        If I'm guessing right about what you are trying to do, here's some code that might help. You'll have to provide the fetch_urls() sub and you should also think about limiting the depth and/or breadth of your search.

        Untested...

        my %seen; my @urls; my $root_url = 'http://url.example.com/'; $seen{$root_url}++; push @urls, $root_url; while ($root_url = shift @urls) { for my $url ( fetch_urls($root_url) ) { $url =~ s/#.*$//; push @urls, $url unless $seen{$url}; $seen{$url}++; } }
        -sauoq
        "My two cents aren't worth a dime.";
        
Re: Pushing unique items onto an array
by cciulla (Friar) on May 30, 2003 at 17:09 UTC

    Try "exists".

    my @fruit = ('apple', 'pear', 'lemon', 'celery', 'pear'); my %seen = (); foreach (@fruit) { $seen{$_} = 1 if not exists $seen{$_}; } my @unique = keys %seen;

    .oO(exists $fruit{'celery'}?)

Re: Pushing unique items onto an array
by DrManhattan (Chaplain) on May 30, 2003 at 17:09 UTC
    Use a hash instead of an array to store your links. E.g.
    $links{'http://www.slashdot.org'} = 1;
    Then just use keys %links to get the list back.

    -Matt

Re: Pushing unique items onto an array
by Cody Pendant (Prior) on May 31, 2003 at 00:16 UTC
    I had a script like this and made a subroutine to filter link arrays:
    @links = filter(@links); sub filter { #pseudocode strip off any hashes because "foo.com/bar.htm#quux" is the same as "foo.com/bar.htm" strip off "index.html" and so on because "foo.com/index.html" is the same as "foo.com/" various other things ... NOW feed them through a hash to guarantee uniqueness NOW grep them against the %seen hash return whatever's left }

    --
    “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.”
    M-J D
Re: Pushing unique items onto an array
by Limbic~Region (Chancellor) on May 31, 2003 at 01:04 UTC
    Anonymous Monk,
    You already have great solutions. I am going to offer up the module Array::Unique. I have been playing around with closures, references, objects, etc and have been looking for a simple use for tie - and I was about to code this up when I thought better of it and checked CPAN. I haven't looked under the hood, but I am betting it is much better written than I could.

    I wouldn't use the module unless you really have a reason not to use the hash approach. Sometimes it is just nice knowing it is in your toolbox.

    Cheers - L~R