lv211 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to scrape some information off this site, but when I fetch the page I get an error. When I open the link in my browser, it looks like the page is being redirected to another page. I checked the headers but couldn't figure out what is going on. Any suggestions?
    #!/usr/bin/perl
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new;
    my $url = "http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/jets-@-dolphins.cfm/date/9-07-08/";
    $mech->get($url) or die "Can't get url";
    my $data = $mech->content;
    print $data;

Update - Using single quotes works. That worked for a moment, but then I started getting a 500 error.

Update II - When I set the timeout to 60 I am more likely to get the page. I also put the fetch in a subroutine and ran the same process outside the subroutine. It does not work when it's in the subroutine, which makes me think it has something to do with how I'm passing the variable into the subroutine.

Update III - I got it working with the subroutine. What you want to do is use URI for the URLs.
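
The final code isn't shown in the update, but a minimal sketch of the URI approach (using the URL from the original post; the details are an assumption, not the poster's actual fix) might look like this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use URI;
    use WWW::Mechanize;

    # Build the URL as a URI object so the '@' is never exposed to
    # double-quote interpolation.
    my $uri = URI->new('http://www.vegasinsider.com');
    $uri->path('/nfl/odds/las-vegas/line-movement/jets-@-dolphins.cfm/date/9-07-08/');

    my $mech = WWW::Mechanize->new;
    $mech->timeout(60);
    $mech->get($uri);
    print $mech->content if $mech->success;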

Replies are listed 'Best First'.
Re: Anyone know why I can't scrape this page?
by Lawliet (Curate) on Sep 06, 2008 at 20:09 UTC
    #!/usr/bin/perl -w
    use strict;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new;
    my $url = "http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/jets-@-dolphins.cfm/date/9-07-08/";
    $mech->get($url) or die "Can't get url";
    my $data = $mech->content();
    print $data;

    Updated: It runs for me just fine, but I get a file not found page. It seems to work properly when using single quotes, though. The @ sign in the link is being interpreted as an array and therefore interpolated.
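
    A quick demo of that interpolation (@- is Perl's special match-offset array, empty before any match, so it interpolates to an empty string):

        #!/usr/bin/perl
        use strict;
        use warnings;

        print "jets-@-dolphins.cfm\n";     # double quotes: prints jets-dolphins.cfm
        print 'jets-@-dolphins.cfm', "\n"; # single quotes: prints jets-@-dolphins.cfm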

    I'm so adjective, I verb nouns!

    chomp; # nom nom nom

Re: Anyone know why I can't scrape this page?
by jettero (Monsignor) on Sep 06, 2008 at 20:10 UTC
    It definitely works for me also. What if you did something like this (from LWP::UserAgent):
    $mech->get($url);
    if ($mech->response->is_success) {
        print $mech->response->content; # or whatever
    } else {
        die $mech->response->status_line;
    }
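
    Worth noting alongside that: $mech->get returns an HTTP::Response object, which is always true, so the get($url) or die ... in the original never fires. A sketch of an explicit check (autocheck => 0 in case your Mechanize version dies on errors by itself):

        my $mech = WWW::Mechanize->new( autocheck => 0 );
        $mech->get($url);
        die $mech->response->status_line unless $mech->success;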

    -Paul

Re: Anyone know why I can't scrape this page?
by lv211 (Beadle) on Sep 06, 2008 at 22:19 UTC

    I get a file not found page. Are you actually getting the same page that appears in a browser or are you getting the file not found page as well? Can someone print the results?

    I'm going to try it later on tonight when I get home.

    I wonder if the server was too busy when I tried the first few times. I was running the script when the college football games were going on. Accessing the page with a browser took a while too.

      "I get a file not found page."

      Upon closer review, I also get that error. The file that cannot be found is /nfl/odds/las-vegas/line-movement/jets-dolphins.cfm, which I find odd, seeing as the URL in the script is /nfl/odds/las-vegas/line-movement/jets-@-dolphins.cfm. (Notice the @ sign between the NFL teams.)

      Either escape the alleged array or use single quotes.
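
      For example (the URL is the one from the thread; the variable names are just for illustration):

          # Escape the @ so double quotes leave it alone...
          my $escaped = "http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/jets-\@-dolphins.cfm/date/9-07-08/";

          # ...or use single quotes, which never interpolate:
          my $single = 'http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/jets-@-dolphins.cfm/date/9-07-08/';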

      I'm so adjective, I verb nouns!

      chomp; # nom nom nom

        The '@' here is followed by a '-', which is not allowed in a valid identifier.

        So it can't be an array. But to be safe, I would use single quotes, too.

        Update: struck that rubbish; see my answer below.
Re: Anyone know why I can't scrape this page?
by Lawliet (Curate) on Sep 07, 2008 at 16:45 UTC

    Regarding the first update: Thought you meant a 500 Internal Server error at first (very confusing :P). From the Mechanize FAQ:

    "My Mech program gets these 500 errors."

    A 500 error from the web server says that the program on the server side died. Probably the web server program was expecting certain inputs that you didn't supply, and instead of handling it nicely, the program died. Whatever the cause of the 500 error, if it works in the browser, but not in your Mech program, you're not acting like the browser.

    It takes a helluva long time to connect to the site using a normal browser, so I do not think the problem is your script. Either that, or the javascript on the page is severely interfering with Mech. If it is the latter, I suggest WWW::Selenium (I heard it knows how to interpret javascript; take the suggestion with a grain of sugar).
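
    One cheap way to act more like the browser is to send a browser-like User-Agent; a sketch using Mechanize's built-in aliases (the alias name is just an example):

        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new;
        $mech->agent_alias('Windows Mozilla'); # pretend to be a browser
        $mech->timeout(60);                    # the site is slow, per the thread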

    Regarding the second update: Post thine code.

    I'm so adjective, I verb nouns!

    chomp; # nom nom nom

      Here is the code. I know the problem happens when I try to pass the variable from the array to the subroutine. Do you know how I can pass the variable from the array into the subroutine with single quotes?

      I was thinking it would look something like my @game = qr(@_); but I couldn't get that to work or find documentation that could answer my question.

      #!/usr/bin/perl
      use WWW::Mechanize;
      #use strict;

      ### Create the Bot and set the Variables
      my $mech = WWW::Mechanize->new;
      my $url = 'http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/bengals-@-ravens.cfm/date/9-07-08/time/1300#J';
      save_file($url);

      ####
      sub save_file {
          my $mech = WWW::Mechanize->new;
          $mech->timeout(60);
          my @game = @_;
          foreach (@game) {
              print "$_\n";
              # NB: @- interpolates inside m{...} as well, so this pattern
              # never matches the literal '-@-' in the URL.
              $_ =~ m{http://www.vegasinsider.com/(.*?)/odds/(.*?)/line-movement/(.*?)-@-(.*?).cfm/date/(.*?)/time/};
              print "$1 $2 $3 $4 $5\n";
              my $filename = 'C:\Documents and Settings\Owner\Desktop\VI Data\sub.html';
              print "Getting $filename\n";
              $mech->get( "$_", ":content_file" => $filename ) or die "Can't get url";
              print $mech->status;
              my $data = $mech->content;
              print " ", -s $filename, " bytes\n";
              print $data;
          }
      }

      ##
      my $file = 'C:\Documents and Settings\Owner\Desktop\VI Data\new.html';
      $mech->timeout(60);
      $mech->get($url, ":content_file" => $file) or die "Can't get url";
      print $mech->status;
      my $data = $mech->content;
      #print " ", -s $filename, " bytes\n";
      print $data;

        At first glance, let me suggest that you uncomment use strict; and also make sure you use warnings; (You can also use warnings by placing -w at the end of your hashbang line: #!/usr/bin/perl -w).

        I know the problem happens when I try to pass the variable from the array to the subroutine.

        You are passing a scalar to the subroutine and then assigning that scalar to an array. You are not passing an array to the subroutine. I'll try to explain:

        my $url = 'http://www.vegasinsider.com/nfl/odds/.../1300#J'; # Assigning that URL to the scalar $url
        save_file($url); # Calling the subroutine while passing the scalar $url

        sub save_file {    # Initiating sub
            my @game = @_; # Populating an array with all the contents of the arguments
                           # passed to the subroutine. In this case, just one: $url

        I assume you are going to pass multiple URLs to the subroutine eventually; a sketch of that is below. Anyway, continuing on.
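
        A sketch of passing several URLs at once (the URLs are illustrative, single-quoted so the @ survives):

            my @urls = (
                'http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/jets-@-dolphins.cfm/date/9-07-08/time/1300',
                'http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/bengals-@-ravens.cfm/date/9-07-08/time/1300',
            );
            save_file(@urls); # save_file() receives them all in @_ and loops over @game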

        Do you know how I can pass the variable from the array into the subroutine with single quotes?

        See my above explanation. I am confused as to what you mean here. Could you outline what you are trying to do?

        I'm so adjective, I verb nouns!

        chomp; # nom nom nom

Re: Anyone know why I can't scrape this page?
by linuxer (Curate) on Sep 07, 2008 at 01:04 UTC

    Check your URL, please.

    I copied it to my browser and got a "page not found", too.
    Then I saw the URL itself:

    http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/jets-@-dolphins.%C2%ADcfm/date/9-07-08/

    Looks as if there are a few bytes too many...

    When I removed the '%C2%AD' (a percent-encoded UTF-8 soft hyphen) from the URL, the page was found and I saw a table with game stats!
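
    If such characters sneak into a URL string, they can be stripped before fetching; a small sketch (\x{00AD} is the soft hyphen code point):

        $url =~ s/\x{00AD}//g; # strip soft-hyphen characters (decoded strings)
        $url =~ s/\xC2\xAD//g; # strip the raw UTF-8 byte pair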

    update:

    I tested the above with Firefox 2.0.0.16.
    The same test with Konqueror 3.5.9 is a little different, because Konqueror already shows a red '-' before the 'cfm' when displaying the page.
    Opera 9.52 seems to remove the strange characters silently.

    Update:

    Stepped into a bad trap with the auto-wrap option... whatever, I still don't completely understand what went wrong.