in reply to Page Scraping

As far as I understand, you don't actually need to log in to see the pages you need. In any case, this seems to do what you want (or something close to it):

use strict;
use warnings;
use Fatal qw/ open close /;
use WWW::Mechanize;
use Carp;

my $mech = WWW::Mechanize->new( autocheck => 1 );
open my $pro_list, '>>', 'profiles.out';

for my $curr_page (1..3) {
    $mech->get("http://indiecharts.com/indie_Music_Artists.asp?Keyword=&Page=$curr_page&butname=");
    my @artist_links = $mech->find_all_links(url_regex => qr/\d{9}/);
    print scalar @artist_links, " matching links on page $curr_page\n";
    for my $artist_link (@artist_links) {
        print $pro_list $artist_link->text(), "\n";
    }
}

Replies are listed 'Best First'.
Re^2: Page Scraping
by 80degreez (Initiate) on May 01, 2007 at 21:14 UTC
    Akho, that kinda worked, but it retrieved the artist names instead of the butname= id #
      You could work this out yourself, but

      $artist_link->url() =~ /(\d{9})/; print $pro_list $1, "\n";

      should help.

      Ask if you don't understand some parts of my script.

        That doesn't handle the case where the regular expression fails to match: $1 then keeps the value from the last successful capturing match, so you'd silently print a stale ID.
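        A minimal demonstration of the stale-$1 pitfall (the strings here are invented for illustration):

```perl
use strict;
use warnings;

"id 123456789" =~ /(\d{9})/;     # match succeeds; $1 is "123456789"
print "$1\n";                    # prints 123456789

"no digits here" =~ /(\d{9})/;   # match FAILS, but $1 is left untouched
print "$1\n";                    # still prints 123456789 -- a stale capture
```

        Guarding the print with `if $url =~ /(\d{9})/` avoids ever reading $1 after a failed match.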

        I humbly suggest either:

        print {$pro_list} $1, "\n" if $artist_link->url() =~ /(\d{9})/;
        or
        if ($artist_link->url() =~ /(\d{9})/) { print {$pro_list} $1, "\n"; }

        The curly braces around {$pro_list} disambiguate its use as the filehandle being printed to.
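        The braces matter even more when the filehandle comes from an expression rather than a plain scalar; a small sketch (the %fh hash and in-memory file are invented for the demo):

```perl
use strict;
use warnings;

my %fh;
open $fh{out}, '>', \my $buffer or die $!;   # in-memory file, just for the demo

# print $fh{out} "hello\n";    # syntax error: Perl can't parse this form
print {$fh{out}} "hello\n";    # braces make the filehandle explicit

close $fh{out};
print $buffer;                 # prints "hello"
```

        With a simple scalar like $pro_list both forms work, so the braces are purely a readability and safety habit there.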