I have successfully used LWP::UserAgent in the past to crawl web sites over http and https and download data, but I just can't make it work with this new server. I suspect the server may be deliberately refusing non-browser user agents. I read some posts in this thread that suggested impersonating a Firefox request, and I tried my best to do so, all to no avail.

More specifically, I'm trying to use LWP::UserAgent to automate the collection of data from a site, but the request always gets refused with this message:

500 Can't connect to tutorialregistration.uws.edu.au:443 (SSL connect attempt failed because of handshake problemserror:00000000:lib(0):func(0):reason(0))

I have narrowed it down to accessing just the URL https://tutorialregistration.uws.edu.au/aplus/admin/adminLogin.do, which I can open directly in a browser but which fails with the above message when using LWP::UserAgent.

This can be shown via

perl -MLWP::Simple -e "getprint 'https://tutorialregistration.uws.edu.au/aplus/admin/adminLogin.do'"

or

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => 'https://tutorialregistration.uws.edu.au/aplus/admin/adminLogin.do');

# Impersonate a Firefox browser
$ua->agent("Mozilla/5.0 (Windows NT 6.1; rv:29.0) Gecko/20100101 Firefox/29.0");
$req->header(
    'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' => 'en-US,en;q=0.5',
    'Accept-Encoding' => 'gzip, deflate',
    'Cookie'          => '',
    'Referer'         => 'https://www.uws.edu.au/',
    'Connection'      => 'keep-alive',
);

# Send the request and print whatever comes back (the error text on failure)
my $res = $ua->request($req);
print "content-type:text/html\n\n";
print $res->content;

In both cases, if I replace the URL with another https page (inside or outside the intranet), the request works fine. I really can't figure out what has gone wrong here. Please help. Many thanks.
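
Since the error text mentions the SSL handshake rather than an HTTP status, a minimal diagnostic sketch may help narrow this down. It assumes IO::Socket::SSL is the backend LWP uses on this machine, and the list of protocol settings it tries is only a guess at what might differ between this server and the ones that work; nothing above confirms that the protocol version is the cause. The helper names module_versions and try_handshake are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the installed version of each named module, or 'not installed'.
sub module_versions {
    my @modules = @_;
    my %version;
    for my $module (@modules) {
        if (eval "require $module; 1") {
            $version{$module} = $module->VERSION // 'unknown';
        }
        else {
            $version{$module} = 'not installed';
        }
    }
    return %version;
}

# Attempt one TLS handshake with a fixed SSL_version and describe the result.
sub try_handshake {
    my ($host, $ssl_version) = @_;
    my $sock = IO::Socket::SSL->new(
        PeerHost    => $host,
        PeerPort    => 443,
        SSL_version => $ssl_version,
        Timeout     => 10,
    );
    return 'failed: ' . (IO::Socket::SSL->errstr || $! || 'unknown error')
        unless $sock;
    my $cipher = $sock->get_cipher;
    $sock->close;
    return "ok, cipher $cipher";
}

# Report the versions of the modules involved in an https request.
my %version = module_versions(
    qw(LWP::UserAgent LWP::Protocol::https IO::Socket::SSL Net::SSLeay)
);
print "$_: $version{$_}\n" for sort keys %version;

# Only touch the network when a host name is given on the command line.
my $host = shift @ARGV;
if (defined $host && $version{'IO::Socket::SSL'} ne 'not installed') {
    # Assumed list of settings to compare; adjust as needed.
    for my $ssl_version ('SSLv23', 'TLSv1', 'SSLv3') {
        print "$ssl_version: ", try_handshake($host, $ssl_version), "\n";
    }
}
```

Run it as `perl diag.pl tutorialregistration.uws.edu.au` (the file name is arbitrary). If one setting connects where the others fail, the same value can be passed to LWP with `$ua->ssl_opts(SSL_version => ...)`, provided the installed LWP is recent enough to support ssl_opts.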

David

In reply to Request by LWP Useragent refused by the web server but not by others by epoch4life
