lazybowel has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I posted a question earlier about WWW::Mechanize and the $mech->links() method. From what I read, it produces a hash reference to all the links on a page, and to get the actual URLs you call $url = $links->url(), which works fine. Now I have a different problem: I'm trying to scrape all the links off a page that contain 6 digits somewhere in the URL. However, I keep getting duplicates when I print what is in the array. How can I get rid of the duplicates and save the result to another array for later use? This is the code that I have so far:
$i=1;
$mech->get("http://www.somesite.com/");
@links = $agent->links();
foreach (@links) {
    if ($links[$i]->url() =~ m!(-[0-9][0-9][0-9][0-9][0-9][0-9]*)!) {
        $links[$i]->url() =~ m!([0-9][0-9][0-9][0-9][0-9][0-9]*)!;
        $art[$i] = $1;
        print "$art[$i]\n";
    }
    $i++;
}

Replies are listed 'Best First'.
Re: Help needed with mechanize
by kyle (Abbot) on May 09, 2007 at 02:30 UTC

    Have a look at How can I extract just the unique elements of an array? (FAQ).

    In your case, you could use one of those methods to strip out the duplicates before the foreach loop, or you could integrate a method into the loop itself sort of like this:

    my %unique_urls;
    foreach (@links) {
        if ($links[$i]->url() =~ m!(-[0-9][0-9][0-9][0-9][0-9][0-9]*)!) {
            $links[$i]->url() =~ m!([0-9][0-9][0-9][0-9][0-9][0-9]*)!;
            my $url = $1;
            $art[$i] = $url;
            print "$url\n" if ( ! $unique_urls{$url}++ );
        }
        $i++;
    }

    I recommend something like this:

    my %unique_urls;
    my @art = grep { m{-(\d{5,})} && !$unique_urls{$1}++ }
              map  { $_->url() } $agent->links();
    undef %unique_urls;
    print map { "$_\n" } @art;

    That may be diverging too much from your original intention, though. It's hard to recommend with confidence without reading the context.
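The "seen hash" idea above can be tried without a network connection. This is only a sketch: the URLs are made-up stand-ins for what `$link->url()` would return, and it assumes the article number is a hyphen followed by six or more digits. Unlike the `grep` version, it stores the captured digits rather than the whole URL.

```perl
use strict;
use warnings;

# Hypothetical sample URLs standing in for the values returned by $link->url().
my @urls = (
    'http://www.somesite.com/story-123456.html',
    'http://www.somesite.com/about.html',
    'http://www.somesite.com/story-654321.html',
    'http://www.somesite.com/print/story-123456.html',
);

my %seen;
my @art;
for my $url (@urls) {
    # Capture a hyphen-prefixed run of six or more digits; skip URLs without one.
    next unless $url =~ /-(\d{6,})/;
    # Post-increment returns the old count, so the push happens only the first time.
    push @art, $1 unless $seen{$1}++;
}

print "$_\n" for @art;
```

Because the check happens while the list is built, `@art` keeps the order in which each number first appeared.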

Re: Help needed with mechanize
by chrism01 (Friar) on May 09, 2007 at 01:06 UTC
    A few things spring to mind:

    1. please use <code> </code> tags
    2. if you want the count of items in the array, use scalar(@array)
    3. you appear to be counting from 1, but Perl arrays start at 0 (zero)
    4. generally, the easy way to avoid duplicates is to store the data in a hash eg
    $my_hash{$my_6digits} = 1;

    Cheers
    Chris
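A small sketch of points 2 to 4, using a made-up list of article numbers (`@found` is a hypothetical name): storing each value as a hash key collapses the repeats, `scalar(@array)` gives the count, and the first element lives at index 0.

```perl
use strict;
use warnings;

# Hypothetical article numbers, with one repeat.
my @found = ('123456', '654321', '123456');

# Assigning to the same key twice leaves a single entry.
my %my_hash;
$my_hash{$_} = 1 for @found;

# Hash keys come back in no particular order, so sort them for stable output.
my @unique = sort keys %my_hash;

# scalar(@array) gives the element count; indexes run from 0 to $#array.
printf "%d found, %d unique\n", scalar(@found), scalar(@unique);
print "first unique: $unique[0]\n";
```

The trade-off is that a plain hash forgets the original order of the links; if the order matters, the duplicates have to be filtered while the list is being walked instead.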

Re: Help needed with mechanize
by naikonta (Curate) on May 09, 2007 at 05:16 UTC
    WWW::Mechanize also provides a url_regex option that you might be interested in using:
    # untested!
    $agent->find_all_links(url_regex => qr/\d{6}/);
    Well, it will also find links with more than 6 digits, but I think you get the picture.
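If exactly six digits are wanted, one possible approach is to add lookarounds so that a longer run of digits does not match. The sketch below applies both patterns to made-up URLs with plain `grep`; the same `qr//` pattern could presumably be passed as `url_regex`, but that is not tested here.

```perl
use strict;
use warnings;

# Hypothetical URLs: one with exactly six digits, one with eight.
my @urls = (
    'http://www.somesite.com/story-123456.html',
    'http://www.somesite.com/story-12345678.html',
);

# \d{6} matches any URL containing at least six consecutive digits.
my @loose = grep { /\d{6}/ } @urls;

# The lookarounds reject a match that has a digit directly before or after it.
my @exact = grep { /(?<!\d)\d{6}(?!\d)/ } @urls;

print scalar(@loose), " loose, ", scalar(@exact), " exact\n";
```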

    Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

Re: Help needed with mechanize
by akho (Hermit) on May 09, 2007 at 09:16 UTC
    I wonder what $agent is.

    Otherwise,

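    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get("http://www.somesite.com/");
    my %art;
    # find_all_links returns every matching link; find_link returns only the first
    for my $link ($mech->find_all_links( url_regex => qr/\d{6}/ )) {
        $art{$1} = 1 if $link->url() =~ /(\d{6,})/;
    }
    print "$_\n" for keys %art;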

    should work.

    You're trying to write a C-style loop; Perl can do better.
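To illustrate the difference, here is a sketch with plain strings standing in for the link objects (the values are invented). The first loop mirrors the counter-from-1 pattern in the original post; the second iterates over the elements directly.

```perl
use strict;
use warnings;

# Hypothetical list standing in for the link objects.
my @links = ('a-111111', 'b-222222', 'c-333333');

# Index loop starting at 1, as in the original post: element 0 is never read,
# and the last pass reads one past the end of the array.
my @by_index;
my $i = 1;
foreach (@links) {
    push @by_index, defined $links[$i] ? $links[$i] : '(undef)';
    $i++;
}

# Iterating over the elements needs no counter and visits each one once.
my @by_element;
for my $link (@links) {
    push @by_element, $link;
}

print "index loop:   @by_index\n";
print "element loop: @by_element\n";
```

With real link objects, the read past the end would be a method call on an undefined value, which is a fatal error in Perl.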

      Thank you for your help, guys. I got it to work now!