lv211 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to scrape some information off this site, but when I fetch the page I get an error. When I open the link in my browser, it looks like the page is being redirected to another page. I checked the headers but couldn't figure out what is going on. Any suggestions?
    #!/usr/bin/perl
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new;
    my $url = "http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/jets-@-dolphins.cfm/date/9-07-08/";
    $mech->get($url) or die "Can't get url";
    my $data = $mech->content;
    print $data;

Update - Using single quotes works. That worked for a moment, but then I started getting a 500 error.

Update II - When I set the timeout to 60 I am more likely to get the page. I also put the fetch in a subroutine and ran the same process outside the subroutine. It does not work when it's in the subroutine, which makes me think it has something to do with how I'm passing the variable into the subroutine.

Update III - I got it working with the subroutine. What you want to do is use URI for the URLs.
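
The final code isn't shown in the update, but a minimal sketch of the URI approach (using the URL from the original post; the details are an assumption, not the poster's actual fix) might look like this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use URI;
    use WWW::Mechanize;

    # Build the URL as a URI object so the '@' is never exposed to
    # double-quote interpolation.
    my $uri = URI->new('http://www.vegasinsider.com');
    $uri->path('/nfl/odds/las-vegas/line-movement/jets-@-dolphins.cfm/date/9-07-08/');

    my $mech = WWW::Mechanize->new;
    $mech->timeout(60);
    $mech->get($uri);
    print $mech->content if $mech->success;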

Replies are listed 'Best First'.
Re: Anyone know why I can't scrape this page?
by Lawliet (Curate) on Sep 06, 2008 at 20:09 UTC
    #!/usr/bin/perl -w
    use strict;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new;
    my $url = "http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/jets-@-dolphins.cfm/date/9-07-08/";
    $mech->get($url) or die "Can't get url";
    my $data = $mech->content();
    print $data;

    Updated: It runs for me just fine, but I get a file not found page. It seems to work properly when using single quotes, though. The @ sign in the link is being interpreted as an array and therefore interpolated.
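
    A quick demo of that interpolation (@- is Perl's special match-offset array, empty before any match, so it interpolates to an empty string):

        #!/usr/bin/perl
        use strict;
        use warnings;

        print "jets-@-dolphins.cfm\n";     # double quotes: prints jets-dolphins.cfm
        print 'jets-@-dolphins.cfm', "\n"; # single quotes: prints jets-@-dolphins.cfm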

    I'm so adjective, I verb nouns!

    chomp; # nom nom nom

Re: Anyone know why I can't scrape this page?
by jettero (Monsignor) on Sep 06, 2008 at 20:10 UTC
    It definitely works for me also. What if you did something like this (from LWP::UserAgent):
    $mech->get($url);
    if ($mech->response->is_success) {
        print $mech->response->content; # or whatever
    } else {
        die $mech->response->status_line;
    }
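
    Worth noting alongside that: $mech->get returns an HTTP::Response object, which is always true, so the get($url) or die ... in the original never fires. A sketch of an explicit check (autocheck => 0 in case your Mechanize version dies on errors by itself):

        my $mech = WWW::Mechanize->new( autocheck => 0 );
        $mech->get($url);
        die $mech->response->status_line unless $mech->success;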

    -Paul

Re: Anyone know why I can't scrape this page?
by lv211 (Beadle) on Sep 06, 2008 at 22:19 UTC

    I get a file not found page. Are you actually getting the same page that appears in a browser or are you getting the file not found page as well? Can someone print the results?

    I'm going to try it later on tonight when I get home.

    I wonder if the server was too busy when I tried the first few times. I was running the script when the college football games were going on. Accessing the page with a browser took a while too.

      "I get a file not found page."

      Upon closer review, I also get that error. The file that cannot be found is /nfl/odds/las-vegas/line-movement/jets-dolphins.cfm, which I find odd, seeing as the URL in the script is /nfl/odds/las-vegas/line-movement/jets-@-dolphins.cfm. (Notice the @ sign between the NFL teams.)

      Either escape the alleged array or use single quotes.
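
      For example (the URL is the one from the thread; the variable names are just for illustration):

          # Escape the @ so double quotes leave it alone...
          my $escaped = "http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/jets-\@-dolphins.cfm/date/9-07-08/";

          # ...or use single quotes, which never interpolate:
          my $single = 'http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/jets-@-dolphins.cfm/date/9-07-08/';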

      I'm so adjective, I verb nouns!

      chomp; # nom nom nom

        The '@' here is followed by a '-', which is not allowed in a valid identifier.

        So it can't be an array. But to be safe, I would use single quotes, too.

        Update: struck that rubbish; see my answer below.
Re: Anyone know why I can't scrape this page?
by Lawliet (Curate) on Sep 07, 2008 at 16:45 UTC

    Regarding the first update: Thought you meant a 500 Internal Server error at first (very confusing :P). From the Mechanize FAQ:

    "My Mech program gets these 500 errors."

    A 500 error from the web server says that the program on the server side died. Probably the web server program was expecting certain inputs that you didn't supply, and instead of handling it nicely, the program died. Whatever the cause of the 500 error, if it works in the browser, but not in your Mech program, you're not acting like the browser.

    It takes a helluva long time to connect to the site using a normal browser, so I do not think the problem is your script. Either that, or the javascript on the page is severely interfering with Mech. If it is the latter, I suggest WWW::Selenium (I heard it knows how to interpret javascript; take the suggestion with a grain of sugar).
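
    One cheap way to act more like the browser is to send a browser-like User-Agent; a sketch using Mechanize's built-in aliases (the alias name is just an example):

        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new;
        $mech->agent_alias('Windows Mozilla'); # pretend to be a browser
        $mech->timeout(60);                    # the site is slow, per the thread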

    Regarding the second update: Post thine code.

    I'm so adjective, I verb nouns!

    chomp; # nom nom nom

      Here is the code. I know the problem happens when I try to pass the variable from the array to the subroutine. Do you know how I can pass the variable from the array into the subroutine with single quotes?

      I was thinking it would look something like my @game = qr(@_); but I couldn't get that to work or find documentation that could answer my question.

      #!/usr/bin/perl
      use WWW::Mechanize;
      #use strict;

      ### Create the Bot and set the Variables
      my $mech = WWW::Mechanize->new;
      my $url = 'http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/bengals-@-ravens.cfm/date/9-07-08/time/1300#J';
      save_file($url);

      ####
      sub save_file {
          my $mech = WWW::Mechanize->new;
          $mech->timeout(60);
          my @game = @_;
          foreach (@game) {
              print "$_\n";
              # NB: @- interpolates inside m{...} as well, so this pattern
              # never matches the literal '-@-' in the URL.
              $_ =~ m{http://www.vegasinsider.com/(.*?)/odds/(.*?)/line-movement/(.*?)-@-(.*?).cfm/date/(.*?)/time/};
              print "$1 $2 $3 $4 $5\n";
              my $filename = 'C:\Documents and Settings\Owner\Desktop\VI Data\sub.html';
              print "Getting $filename\n";
              $mech->get( "$_", ":content_file" => $filename ) or die "Can't get url";
              print $mech->status;
              my $data = $mech->content;
              print " ", -s $filename, " bytes\n";
              print $data;
          }
      }

      ##
      my $file = 'C:\Documents and Settings\Owner\Desktop\VI Data\new.html';
      $mech->timeout(60);
      $mech->get($url, ":content_file" => $file) or die "Can't get url";
      print $mech->status;
      my $data = $mech->content;
      #print " ", -s $filename, " bytes\n";
      print $data;

        At first glance, let me suggest that you uncomment use strict; and also make sure you use warnings; (You can also use warnings by placing -w at the end of your hashbang line: #!/usr/bin/perl -w).

        I know the problem happens when I try to pass the variable from the array to the subroutine.

        You are passing a scalar to the subroutine and then assigning that scalar to an array. You are not passing an array to the subroutine. I'll try to explain:

        my $url = 'http://www.vegasinsider.com/nfl/odds/.../1300#J'; # Assigning that URL to the scalar $url
        save_file($url); # Calling the subroutine while passing the scalar $url

        sub save_file {    # Initiating sub
            my @game = @_; # Populating an array with all the contents of the arguments
                           # passed to the subroutine. In this case, just one: $url

        I assume you are going to pass multiple URLs to the subroutine eventually; a sketch of that is below. Anyway, continuing on.
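
        A sketch of passing several URLs at once (the URLs are illustrative, single-quoted so the @ survives):

            my @urls = (
                'http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/jets-@-dolphins.cfm/date/9-07-08/time/1300',
                'http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/bengals-@-ravens.cfm/date/9-07-08/time/1300',
            );
            save_file(@urls); # save_file() receives them all in @_ and loops over @game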

        Do you know how I can pass the variable from the array into the subroutine with single quotes?

        See my above explanation. I am confused as to what you mean here. Could you outline what you are trying to do?

        I'm so adjective, I verb nouns!

        chomp; # nom nom nom

Re: Anyone know why I can't scrape this page?
by linuxer (Curate) on Sep 07, 2008 at 01:04 UTC

    Check your URL, please.

    I copied it to my browser and got a "page not found", too.
    Then I saw the URL itself:

    http://www.vegasinsider.com/nfl/odds/las-vegas/line-movement/jets-@-dolphins.%C2%ADcfm/date/9-07-08/

    Looks as if there are a few bytes too many...

    When I removed the '%C2%AD' (a percent-encoded UTF-8 soft hyphen) from the URL, the page was found and I saw a table with game stats!
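
    If such characters sneak into a URL string, they can be stripped before fetching; a small sketch (\x{00AD} is the soft hyphen code point):

        $url =~ s/\x{00AD}//g; # strip soft-hyphen characters (decoded strings)
        $url =~ s/\xC2\xAD//g; # strip the raw UTF-8 byte pair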

    update:

    I tested the above with Firefox 2.0.0.16.
    The same test with Konqueror 3.5.9 is a little different, because Konqueror already shows a red '-' before the 'cfm' when displaying the page.
    Opera 9.52 seems to remove the strange characters silently.

    Update:

    Stepped into a bad trap with the auto-wrap option... whatever, I still don't completely understand what went wrong.