Help needed with mechanize

lazybowel has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I posted a question earlier about Mechanize and the $mech->links() method. From what I read, it produces a hash reference to all the links on a page, and to get the actual URLs you call $url = $links->url(); which works fine. Now I have a different problem. I'm trying to scrape all the links off a page that contain 6 digits somewhere in the URL. However, I keep getting duplicates when I print what is in the array. How can I get rid of the duplicates and save the result to another array for later use? This is the code that I have so far:

    $i=1;
    $mech->get("http://www.somesite.com/");
    @links = $agent->links();

    foreach (@links) {
        if ($links[$i]->url() =~ m!(-[0-9][0-9][0-9][0-9][0-9][0-9]*)!) {
            $links[$i]->url() =~ m!([0-9][0-9][0-9][0-9][0-9][0-9]*)!;
            $art[$i] = $1;
            print "$art[$i]\n";
        }
        $i++;
    }


Replies are listed 'Best First'.
Re: Help needed with mechanize by kyle (Abbot) on May 09, 2007 at 02:30 UTC
Have a look at How can I extract just the unique elements of an array? (FAQ). In your case, you could use one of those methods to strip out the duplicates before the `foreach` loop, or you could integrate a method into the loop itself, sort of like this:

    my %unique_urls;
    foreach (@links) {
        if ($links[$i]->url() =~ m!(-[0-9][0-9][0-9][0-9][0-9][0-9])!) {
            $links[$i]->url() =~ m!([0-9][0-9][0-9][0-9][0-9][0-9])!;
            my $url = $1;
            $art[$i] = $url;
            print "$url\n" if ( ! $unique_urls{$url}++ );
        }
        $i++;
    }

I recommend something like this:

    my %unique_urls;
    my @art = grep { m{-(\d{5,})} && !$unique_urls{$1}++ }
              map  { $_->url() } $agent->links();
    undef %unique_urls;
    print map { "$_\n" } @art;

That may be diverging too much from your original intention, though. It's hard to recommend with confidence without reading the context.
Re: Help needed with mechanize by chrism01 (Friar) on May 09, 2007 at 01:06 UTC
A few things spring to mind:
1. Please use <code> </code> tags.
2. If you want the count of items in the array, use scalar(@array).
3. You appear to be counting from 1, but Perl arrays start at 0 (zero).
4. Generally, the easy way to avoid duplicates is to store the data in a hash, e.g. $my_hash{$my_6digits} = 1;
Cheers
Chris
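Point 4 above can be sketched as follows. This is only an illustration, not the original poster's final code: the URLs are made-up stand-ins for what `$mech->links()` would return, and the names `%seen` and `@art` are chosen for the example. It assumes the intent is "six or more digits after a hyphen", which is slightly stricter than the pattern in the question.

```perl
use strict;
use warnings;

# Hypothetical sample URLs standing in for the link URLs scraped from a page.
my @urls = (
    'http://www.somesite.com/article-123456.html',
    'http://www.somesite.com/about.html',
    'http://www.somesite.com/article-123456.html?page=2',
    'http://www.somesite.com/article-654321.html',
);

my %seen;   # digit string => number of times encountered so far
my @art;    # unique digit strings, in first-seen order

for my $url (@urls) {
    # Skip URLs without a hyphen followed by six or more digits.
    next unless $url =~ /-(\d{6,})/;

    # Post-increment returns the old count, so this is true only the first time.
    push @art, $1 unless $seen{$1}++;
}

print "$_\n" for @art;
```

Pushing onto `@art` rather than assigning to `$art[$i]` avoids the holes that appear when the index advances on links that do not match.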
Re: Help needed with mechanize by naikonta (Curate) on May 09, 2007 at 05:16 UTC
WWW::Mechanize provides a `url_regex` option that you might be interested in using: `# untested! $agent->find_all_links(url_regex => qr/\d{6}/);` Well, it will also find links with more than 6 digits, but I think you get the picture.
Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!
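If runs of exactly six digits are wanted, one possible approach is to add lookarounds so that no digit may sit directly before or after the match. The sketch below is plain Perl with made-up URLs; the variable name `$six_digits` is just for illustration, and whether exact matching is actually wanted here is an assumption.

```perl
use strict;
use warnings;

# Exactly six digits: no digit immediately before or after the run.
my $six_digits = qr/(?<!\d)(\d{6})(?!\d)/;

# Hypothetical URLs for illustration only.
my @urls = (
    'http://www.somesite.com/story-123456.html',
    'http://www.somesite.com/story-1234567.html',
    'http://www.somesite.com/story-12345.html',
);

for my $url (@urls) {
    if ($url =~ $six_digits) {
        print "$url: matched $1\n";
    }
    else {
        print "$url: no match\n";
    }
}
```

Since `url_regex` takes a compiled regex, a pattern like this could be passed in place of `qr/\d{6}/`.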
Re: Help needed with mechanize by akho (Hermit) on May 09, 2007 at 09:16 UTC
I wonder what `$agent` is. Otherwise,

    use WWW::Mechanize;
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get("http://www.somesite.com/");
    my %art;
    # find_all_links returns every matching link; find_link returns only the first
    for my $link ($mech->find_all_links( url_regex => qr/\d{6}/ )) {
        $link->url() =~ /(\d{6,})/;
        $art{$1} = 1;
    }
    print "$_\n" for keys %art;

should work. You're trying to write a C-style loop; Perl can do better.
Re^2: Help needed with mechanize by lazybowel (Acolyte) on May 09, 2007 at 21:15 UTC
Thank you for your help, guys. I got it to work now!!