kereekerra has asked for the wisdom of the Perl Monks concerning the following question:
I'm trying to do some screen scraping. I'm using this regex to capture the variable portion of the URLs. However, in my output I only have the first match. My error messages also indicate that I'm only capturing one entry when I should be capturing about forty.
    @game_array = ($gamepage =~ m/onclick="document\.location\.href='([.]+)'"/g);
Update: The data I'm trying to scrape is the URLs of the pages for specific SC2 replays on the site http://www.sc2rep.com/. I'm trying to scrape the individual game pages and output them to a file for use with a DBI script. On closer inspection, the match I'm getting is incorrect. Here is a more complete script. Sorry about the incomplete information.
    #!/usr/bin/perl -w
    use strict;
    use DBI;
    use Data::Dumper;

    my $page = "http://sc2rep.com/"; # URL of page with list of game pages to scrape, must be from sc2rep.com
    my $index = 0;
    my $gamepage = `curl $page`;
    my $game_data;
    my @game_array;
    my $counter;

    @game_array = ($gamepage =~ m/onclick="document\.location\.href='([.]+)'"/g);

    open (OUT, ">sc2data") or die $!;
    for($index<40)
    {
        $game_data = `curl "http://sc2rep.com$game_array[$index]"`;
        print OUT "$game_data" . "\n end_replay \n";
        print "looped successfully.\n";
        $index++;
    }
    close OUT;
    exit;
The fix by stevieb made the code work. Removing the bracket made it work like a charm. Thank you all for your help and time.
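For anyone landing here later, this is a minimal sketch of why the bracketed pattern failed, not a copy of stevieb's reply. Inside a character class, `.` loses its wildcard meaning, so `[.]+` only matches a run of literal dots. The sample markup and the `/replays/show/N` paths are made up for illustration; the real sc2rep.com HTML may look different. The `[^']+` variant is my own substitution, shown next to the plain `.+` that removing the brackets presumably produced.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample markup standing in for the downloaded page;
# the real page structure is an assumption.
my $gamepage = <<'HTML';
<tr onclick="document.location.href='/replays/show/1'"><td>Game 1</td></tr>
<tr onclick="document.location.href='/replays/show/2'"><td>Game 2</td></tr>
<tr onclick="document.location.href='/replays/show/3'"><td>Game 3</td></tr>
HTML

# Original pattern: [.]+ matches only literal dots, so none of the
# sample paths are captured.
my @broken = $gamepage =~ m/onclick="document\.location\.href='([.]+)'"/g;

# Brackets removed: .+ matches any run of non-newline characters.
# It is greedy, so it can over-match when one line holds several links.
my @plain = $gamepage =~ m/onclick="document\.location\.href='(.+)'"/g;

# Stricter alternative: capture everything up to the closing single quote.
my @game_array = $gamepage =~ m/onclick="document\.location\.href='([^']+)'"/g;

printf "[.]+  captured %d paths\n", scalar @broken;
printf ".+    captured %d paths\n", scalar @plain;
printf "[^']+ captured %d paths\n", scalar @game_array;

# Walk the captured list directly instead of counting to a fixed 40.
for my $path (@game_array) {
    print "would fetch http://sc2rep.com$path\n";
}
```

The final loop iterates over the captured list rather than using `for($index<40)`. In Perl that form is a foreach over a one-element list, so its body runs only once; a `while ($index < 40)` or a loop over `@game_array` visits every entry.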