MAC25 has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I'm using the following WWW::Mechanize call to retrieve all the links on an HTML page:

my @link = $mech->find_all_links(url_regex => qr/=\w\d{2}:\d{2}:\d{2}:\d{2}/);

The page references the same link several times, and I'd like to remove the duplicate links from the array before fetching each one with:

foreach my $page (@link) {
    $mech->get($page);
}

I tried using the following, which seems to be the standard way:

my %saw;
@saw{@link} = ();
my @out = sort keys %saw;

WWW::Mechanize complains, though, when it tries to retrieve the resulting links; I'm not sure if this is because they came out of a hash.

The error reported is:

Error GETing WWW::Mechanize::Link=ARRAY(0x1e4c14c): Protocol scheme 'www' is not supported at G:\plan_resident_1.pl line 67

Line 67 is the $mech->get($page); call.

Any help or suggestions appreciated.

Re: WWW::Mechanize link array de-dup
by Fletch (Bishop) on May 06, 2008 at 20:52 UTC

    You're trying to use a list of WWW::Mechanize::Link instances as hash keys, but they're not getting stringified into anything sane (you're getting Perl's default reference stringification instead). You'll need to do your uniquification differently.
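
    For illustration, a minimal sketch of what those hash keys end up looking like (the URL and base here are made up):

    use strict;
    use warnings;
    use WWW::Mechanize::Link;

    # a hypothetical link, like the ones find_all_links() returns
    my $link = WWW::Mechanize::Link->new({
        url  => 'foo.html',
        base => 'http://example.com/',
    });

    # interpolating the object gives Perl's default reference
    # stringification, not the URL
    print "$link\n";    # e.g. WWW::Mechanize::Link=ARRAY(0x1e4c14c)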

    Something like this instead:

    my %seen;
    my @uniq_urls = grep { !$seen{ $_->url }++ } @link;
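
    That keeps the WWW::Mechanize::Link objects themselves, which (as noted below) get() accepts directly, so the original loop works unchanged; a sketch, assuming $mech and @link are set up as in the question:

    for my $page (@uniq_urls) {
        $mech->get($page);    # still a WWW::Mechanize::Link object
    }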


      I'd also recommend using a different link-object method, url_abs, or else you might get relative/absolute duplicates. You'll also get what you expect instead of lots of apparently fragmentary URIs.

      my %seen;
      my @uniq_urls = grep { !$seen{ $_->url_abs }++ } @link;
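
      A small sketch of the difference between the two methods (URL and base made up):

      use WWW::Mechanize::Link;

      my $link = WWW::Mechanize::Link->new({
          url  => 'details.html',
          base => 'http://example.com/plans/',
      });

      print $link->url,     "\n";    # details.html
      print $link->url_abs, "\n";    # http://example.com/plans/details.html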

        Presuming he uses the W::M::Link instances, his $mech->get( $foo ) calls should Do The Right Thing™ for relative URLs, but yup, very good point about absolutifying before comparing. And since the objects url_abs returns stringify sanely when used as hash keys, if you're not going to subsequently pass them on to a get call, you could use a variation of the OP's original code: my %seen; @seen{ map $_->url_abs, @link } = (); my @out = keys %seen;
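
        Spelled out as a runnable sketch (assuming @link holds the Link objects from find_all_links; the keys here end up as plain absolute-URL strings, not Link objects):

        my %seen;
        @seen{ map { $_->url_abs } @link } = ();
        my @out = keys %seen;    # order is arbitrary; sort if needed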


Re: WWW::Mechanize link array de-dup
by moritz (Cardinal) on May 06, 2008 at 20:44 UTC
    Most likely $page is missing a leading http://. Try printing it before retrieving the URL.

    The removal of duplicates seems to work:

    use strict;
    use warnings;
    use Data::Dumper;

    my @list = qw(a b c d a a b e);
    my %seen;
    @seen{@list} = ();
    print Dumper [ sort keys %seen ];
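
    With that input, the Dumper output is the five unique values a through e, in sorted order.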