PerlMonks  

retrieving html

by limner (Novice)
on Feb 18, 2014 at 19:09 UTC ( [id://1075369]=perlquestion )

limner has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm creating a small Perl program to fetch the HTML source of pages so that I can check them. I wrote this:

#!/usr/local/bin/perl
use HTTP::Cookies;
use LWP::UserAgent;

$ua = LWP::UserAgent->new;
$ua->cookie_jar(HTTP::Cookies->new(file => "cookies.txt", autosave => 1));
#$ua->agent("Mozilla 5.0");

$url = " site address";
$v1  = "2014-03-02";
$v2  = "2014-03-03";

$req = HTTP::Request->new(GET => $url, $checkin => $v1, $checkout => $v2);
$req->header('Accept' => 'text/html');
$res = $ua->request($req);

if ($res->is_success) {
    $filename = "dati_html_test.txt";
    open MYFILE, ">:utf8", $filename;
    print MYFILE $res->decoded_content; # or whatever
    close(MYFILE);
}
else {
    print "Error: " . $res->status_line . "\n";
}


The problems I have are these:
1) It seems that the website thinks I'm a bot: I'm unable to set a proper user agent.
2) It seems that the URL parameters are not understood by the web server, because it always answers with an HTML page as if I hadn't submitted any parameters.
Any help?

Replies are listed 'Best First'.
Re: retrieving html
by davido (Cardinal) on Feb 18, 2014 at 19:27 UTC

    $checkin and $checkout are uninitialized scalar variables. What are they supposed to contain?


    Dave
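A possible way to address the second problem, offered as a minimal sketch rather than a confirmed fix: the arguments that HTTP::Request->new accepts after the URL are a header object and the content, not query parameters, so the dates never reach the server as part of the URL. One common approach is to build the query string with the URI module and pass the resulting URI to the request. The address and the parameter names checkin and checkout below are assumptions, since the real site and its expected parameters are not given in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;
use HTTP::Request;

# Hypothetical address: the real site is not named in the question.
my $url = 'http://www.example.com/search';

# Build the query string from literal key names. "checkin" and "checkout"
# are assumed names; use whatever the target site actually expects.
my $uri = URI->new($url);
$uri->query_form(
    checkin  => '2014-03-02',
    checkout => '2014-03-03',
);

# The request is constructed but not sent here; pass it to
# $ua->request($req) as in the original script.
my $req = HTTP::Request->new(GET => $uri);
$req->header(Accept => 'text/html');

print $req->uri, "\n";
```

Printing the URI before sending the request makes it easy to confirm that the parameters are actually present in the address being fetched.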

Re: retrieving html
by zentara (Archbishop) on Feb 18, 2014 at 21:07 UTC
    Try something like this old trick; you may want to update the version values. Also look up your browser's user agent string (search for "what's my user agent") and just copy in what you see.
    my $a  = int rand(9);
    my $a1 = int rand(9);
    my $agent = "Mozilla/1.$a.$a1 (compatible; MSIE; NT 6.0 )";
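For the first problem, a minimal sketch of setting the agent string on the LWP::UserAgent object, which is what the commented-out `$ua->agent(...)` line in the question was attempting. The string below is only an example of a browser-style value and is an assumption; copy the one your own browser reports.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Example value only (an assumption); replace it with the string
# your own browser sends.
$ua->agent('Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0');

# Every request sent through $ua now carries this User-Agent header.
print $ua->agent, "\n";
```

Without this call LWP identifies itself with a libwww-perl agent string, which some servers treat differently from a browser.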

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: retrieving html
by Kenosis (Priest) on Feb 18, 2014 at 22:24 UTC
Re: retrieving html
by Anonymous Monk on Feb 19, 2014 at 10:51 UTC
    "it seems that the website thinks that I'm a bot: I'm unable to set a proper user agent" - Because you are a bot, and they probably don't want you scraping their site.
