rfg has asked for the wisdom of the Perl Monks concerning the following question:

It appears to me to be the case that at least some web servers out there react badly (i.e. give errors or wrong content) if a full URL is present within, for example, an HTTP/1.1 request, preferring instead to see requests that have this general form:
    GET abspath
    Host: hostname
Assuming that I am correct that some web servers really do demand this form (in at least some cases), please consider the following simple Perl code:
    #!/usr/local/bin/perl -w

    use strict;
    use HTTP::Request;
    use LWP::UserAgent;

    my $orig_url = shift @ARGV;
    my ($host, $tail);

    if ($orig_url =~ m%^http://([^/]+)(.*)%) {
        ($host, $tail) = ($1, $2);
    } else {
        die ("Invalid URL: $orig_url\n");
    }

    my $http_request = HTTP::Request->new('GET', $tail, [ 'Host', $host ]);
    $http_request->protocol('HTTP/1.1');
    print $http_request->as_string;

    my $ua = LWP::UserAgent->new ();
    my $http_response = $ua->request ($http_request);
    my $http_response_code = $http_response->code;
    print "HTTP response code: $http_response_code\n";
    print $http_response->decoded_content;
When the above is executed with a full URL as the single command line argument, for example
http://www.tristatelogic.com/index.html
the result is as follows:
    GET /index.html HTTP/1.1
    Host: www.tristatelogic.com

    HTTP response code: 400
    400 URL must be absolute
I could be wrong, but my impression after having attempted to research this, is that the LWP::UserAgent->request method is being rather entirely unhelpful and unfriendly here, i.e. by refusing to get what it needs... in particular the host name to which the request must be sent... out of the Host: header, rather than out of the GET request line itself.

Would you all say that I am correct that, at the very least, this could be characterized as a "non-feature" of the LWP::UserAgent->request method? (I hesitate to call it a bug, even though, at the moment, it does feel like one to me.)

P.S. At the moment, the only work-around for this issue/problem appears to be for the user (i.e. me) to get down and manually use socket programming to send the request to the server in the exact form I need it to be in. If I am missing some other solution, please edify me. Thanks.

Replies are listed 'Best First'.
Re: LWP::UserAgent non-feature?
by Corion (Patriarch) on Jan 13, 2015 at 08:38 UTC

    What LWP::UserAgent sends is perfectly valid and also perfectly common.

    Personally, I would investigate what the server thinks it is receiving. See RFC 7230, section 5.3 on how the line could be formatted.

    Also, in your case, I would investigate whether you're connecting to an HTTP proxy (which needs the full HTTP URL) instead of an HTTP server. In that case, maybe you did not set $ENV{HTTP_PROXY}, or LWP::UserAgent picked up the wrong variables for that.
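One way to check the proxy question is to ask the user agent what it has configured. This is only a minimal sketch: it assumes that calling proxy() with just a scheme returns the current setting, and the output depends entirely on the local environment.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Load proxy settings from environment variables such as http_proxy.
$ua->env_proxy;

# Assumption: with only a scheme argument, proxy() acts as a getter.
my $proxy = $ua->proxy('http');

print defined $proxy
    ? "http requests go through proxy: $proxy\n"
    : "no http proxy configured\n";
```

If a proxy shows up here unexpectedly, that would explain a request line carrying the full URL.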

Re: LWP::UserAgent non-feature? (docs)
by tye (Sage) on Jan 13, 2015 at 15:40 UTC
    I could be wrong, but my impression after having attempted to research this, is that the LWP::UserAgent->request method is being rather entirely unhelpful and unfriendly here

    It looks like LWP::UserAgent is not the one splitting up the URL. It is *your* code that is doing that.

    I'm guessing that you are under the impression that this is required of you for some reason. And yet, one of the very first lines of the documentation for one of the modules you use is:

    $request = HTTP::Request->new(GET => 'http://www.example.com/');

    So, clearly, the module expects to get a full URL. Were you assuming that the module is too stupid to do any processing on that URL and would just send "GET http://www.example.com/ HTTP/1.1"?

    Perhaps you got this impression from looking at what as_string() returns? But the documentation says:

    $r->as_string
    $r->as_string( $eol )

    Method returning a textual representation of the request.

    It doesn't say "returns the string that will be sent to the HTTP server".

    If you really need to see the code that constructs the actual HTTP request, then read the request() method in LWP::Protocol::http.

    Change your code to stop doing naive splitting of the URL and use the modules as documented and you will probably get better results.
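As a rough sketch of what "use the modules as documented" could look like: pass the absolute URL to HTTP::Request and let the URI object hold the pieces. The URL below is a made-up example, and the accessors shown are only the parts a client would plausibly use for the request line and the Host header; the real logic lives in LWP::Protocol::http.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request;

# Hypothetical URL for illustration.
my $url = 'http://www.example.com/index.html?x=1';

# Pass the absolute URL: no manual splitting, no hand-made Host header.
my $request = HTTP::Request->new(GET => $url);

# as_string() is a textual representation, not necessarily the wire format.
print $request->as_string;

# The URI object already exposes host and path separately.
my $uri = $request->uri;
print "scheme:     ", $uri->scheme,     "\n";
print "host:       ", $uri->host,       "\n";
print "path_query: ", $uri->path_query, "\n";
```

Handing $request to LWP::UserAgent->request would then work without the "URL must be absolute" error, since the scheme and host are available from the URL itself.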

    - tye        

      Thanks for the replies folks. At this juncture, I should make a few brief points, and then just leave this "non-feature" be.

      1) After posting last night, I traced the problem I was having with the server sending back "incorrect" results to my own programmatic faux pas. My original test code (which I did not post) was inadvertently lowercasing the entire URL, and that was the real problem. Once fixed, everything else worked beautifully. I am suitably humble that my own mistake was the primary cause of my posting here.

      2) Regardless of the above, the output generated by the as_string method of HTTP::Request is inherently prone to causing confusion, I think, given that the string generated by that method is not, apparently, what actually gets sent to the server in at least some cases.

      3) There is no compelling reason that I can see why the request method of LWP::UserAgent either cannot be or should not be smart enough and/or helpful enough to be able to obtain the server hostname from a Host: header which is present within the HTTP::Request object it is given. But it does appear to be the case that it does not do so, preferring instead to yield a simulated 400 error response from the actual server, which is itself arguably problematic, i.e. in that this faked server response might cause... and apparently has caused... some users to believe that this response is actually coming from the real server. (Wouldn't it perhaps be better to throw an exception, or some such other thing, rather than internally generating a confusing faked server response?)
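On point 1, the lowercasing mistake is easy to reproduce. The sketch below uses a made-up mixed-case URL and contrasts lc() on the whole string with the canonical() method of the URI module, which normalizes the scheme and host but leaves the path as it is.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical mixed-case URL for illustration.
my $url = 'HTTP://WWW.Example.COM/Index.HTML';

# lc() on the whole string also rewrites the path, which many servers
# treat as case-sensitive.
my $naive = lc $url;

# canonical() lowercases only the scheme and the host.
my $canonical = URI->new($url)->canonical->as_string;

print "naive:     $naive\n";
print "canonical: $canonical\n";
```

A server with a file named Index.HTML may well return an error or different content for the naive form.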

        Hmm, whenever LWP "fakes" a response, it identifies it as such with a "Client-Warning" header set to the value "Internal response".
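A minimal sketch of checking for that header, reusing the relative-URL request from the original script with a placeholder host name. It assumes the request is rejected inside LWP before any network traffic, which matches the 400 response shown in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request;
use LWP::UserAgent;

# Relative URL plus Host header, as in the original script.
# The host name is a placeholder.
my $request = HTTP::Request->new(
    GET => '/index.html',
    [ Host => 'www.example.com' ],
);

my $ua       = LWP::UserAgent->new;
my $response = $ua->request($request);

# LWP marks responses it generated itself with this header.
my $warning = $response->header('Client-Warning');

if (defined $warning && $warning eq 'Internal response') {
    print "Generated locally by LWP: ", $response->status_line, "\n";
} else {
    print "Received from the server: ", $response->status_line, "\n";
}
```

Checking this header before trusting a status code is one way to tell a client-side failure from a real server reply.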
Re: LWP::UserAgent non-feature?
by Anonymous Monk on Jan 13, 2015 at 08:41 UTC
    What does the program actually send?