aspen has asked for the wisdom of the Perl Monks concerning the following question:

I am resubmitting this LWP question. Thanks to some help from fokat I can now provide a short, complete program to recreate the problem.

With some websites, LWP appears not to return the leading spaces at the start of many lines. I need to retrieve those spaces when using LWP.

Note that LWP does return the leading spaces for many (most?) webpages; groups.yahoo.com is just one well-known site where it seems to be losing them.

To replicate this issue perform these steps:

  1. Using Mozilla (or perhaps other browsers) go to http://groups.yahoo.com, View Source and save it as groups.yahoo.com.html.
  2. Run the following program, which will:
    • Use LWP to grab the same web page
    • Change some non-printable characters by:
      • Substituting spaces with dots (i.e. s/ /./g)
      • Inserting "<<LF>>" before each linefeed character
    • Print out the first 500 characters of the result
    • Read in the Mozilla-saved source, perform the same substitutions and print the first 500 characters of the result.
  3. Compare the results. You will see the missing leading spaces that I am struggling to keep with LWP!

Any help diagnosing this will be greatly appreciated.

The code that performs the above (after saving Mozilla's View Source output to the current directory) is:

#!/usr/bin/perl
use LWP;
use strict;
use warnings;

my $ua = LWP::UserAgent->new;

print "\n\nUsing LWP to grap http://groups.yahoo.com.\n".
      "Printing first 500 characters.\n";
my $r = $ua->get("http://groups.yahoo.com");
${$r->content_ref} =~ s/ /./g;              # make spaces visible
${$r->content_ref} =~ s/\cJ/<<LF>>\cJ/g;    # mark each linefeed
print substr($r->content,0,500), "\n";

print "\n\nUsing previously-saved Mozilla groups.yahoo.com source.\n".
      "Print first 500 characters.\n";
open FH, "groups.yahoo.com.html"
    or die "Cannot open groups.yahoo.com.html: $!";
undef $/;                                   # slurp the whole file
my $s = <FH>;
$s =~ s/ /./g;
$s =~ s/\cJ/<<LF>>\cJ/g;
print substr($s,0,500), "\n";

For those that just want to see the results, this is what is printed when I run the above program:

Using LWP to grap http://groups.yahoo.com.
Printing first 500 characters.
<<LF>>
<HTML><<LF>>
<HEAD><<LF>>
<META.http-equiv="PICS-Label".content='(PICS-1.1."http://www.icra.org/ratingsv02.html".l.gen.true.for."http://groups.yahoo.com".r.(nz.0.vz.0.lz.0.oz.0.ca.1))'><<LF>>
<META.content="free.email.groups,.mailing.lists,.communities,.majordomo,.e-mail,.bounce.handling,.mlm.software,.listserv,.Yahoo!.Groups,.newletters,.announcement,.email.lists,.list.hosting".name=keywords><<LF>>
<META.content="Yahoo!.Groups.-.Free,.easy.email.groups".name=description><<LF>>
<TITLE><<L

Using previously-saved Mozilla groups.yahoo.com source.
Print first 500 characters.
<<LF>>
<<LF>>
<HTML><<LF>>
<<LF>>
<HEAD><<LF>>
<<LF>>
........<<LF>>
........<META.http-equiv="PICS-Label".content='(PICS-1.1."http://www.icra.org/ratingsv02.html".l.gen.true.for."http://groups.yahoo.com".r.(nz.0.vz.0.lz.0.oz.0.ca.1))'><<LF>>
..<<LF>>
......<META.content="free.email.groups,.mailing.lists,.communities,.majordomo,.e-mail,.bounce.handling,.mlm.software,.listserv,.Yahoo!.Groups,.newletters,.announcement,.email.lists,.list.hosting".name=keywords><<LF>>
....<META.content="Yahoo!.Group
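
One way to pin down which lines differ is to compare the number of leading spaces line by line. A minimal sketch, assuming both versions are already in memory as strings; the sub name leading_spaces and the two sample strings are made up for illustration and are not taken from the real page:

```perl
use strict;
use warnings;

# Hypothetical helper: return the number of leading spaces on each
# non-blank line of a document.
sub leading_spaces {
    my ($text) = @_;
    my @counts;
    for my $line (split /\cJ/, $text) {
        next if $line =~ /^\s*$/;          # skip blank lines
        my ($indent) = $line =~ /^( *)/;   # capture the run of leading spaces
        push @counts, length $indent;
    }
    return @counts;
}

# Made-up samples standing in for the LWP and Mozilla versions.
my $from_lwp     = "<HTML>\n<HEAD>\n<META a=b>\n";
my $from_mozilla = "\n<HTML>\n\n<HEAD>\n\n        <META a=b>\n";

my @lwp = leading_spaces($from_lwp);
my @moz = leading_spaces($from_mozilla);
for my $i (0 .. $#lwp) {
    print "line $i: LWP=$lwp[$i] Mozilla=$moz[$i]\n" if $lwp[$i] != $moz[$i];
}
```

Blank lines are skipped so that extra empty lines in one version do not throw the line-by-line comparison out of step.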

Very strange.

Andy

@_="the journeyman larry disciple keeps learning\n"=~/(.)/gs, print(map$_[$_-77],unpack(q=c*=,q@QSdM[]uRMNV^[ni_\[N]eki^y@))

Replies are listed 'Best First'.
Re: LWP not returning leading spaces in web page (ver 2)
by Anonymous Monk on Feb 01, 2003 at 19:33 UTC
    Using the fake web server code from the Perl cookbook (p691) I compared the request headers for the 2 methods. Here are the results
    GET /index.htm HTTP/1.1
    Connection: TE, close
    Host: localhost:8989
    TE: deflate,gzip;q=0.3
    User-Agent: libwww-perl/5.69

    GET /index.htm HTTP/1.1
    Cache-Control: no-cache
    Pragma: no-cache
    Accept: text/*, image/jpeg, image/png, image/*, */*
    Accept-Charset: iso-8859-1, utf-8;q=0.5, *;q=0.5
    Accept-Encoding: x-gzip, gzip, identity
    Accept-Language: en
    Host: localhost:8989
    User-Agent: Mozilla/5.0 (compatible; Konqueror/3; Linux)
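
    To see at a glance which headers differ between the two requests, the captured text can be reduced to header names and compared. A rough sketch using only the header lines shown above; header_names is a made-up helper, not part of LWP:

```perl
use strict;
use warnings;

# Hypothetical helper: collect the lower-cased header names from the raw
# text of one request. The request line has no "Name:" prefix, so it is
# ignored automatically.
sub header_names {
    my ($raw) = @_;
    my %seen;
    for my $line (split /\r?\n/, $raw) {
        $seen{lc $1} = 1 if $line =~ /^([\w-]+):/;
    }
    return %seen;
}

my %lwp = header_names(join "\n",
    'GET /index.htm HTTP/1.1',
    'Connection: TE, close',
    'Host: localhost:8989',
    'TE: deflate,gzip;q=0.3',
    'User-Agent: libwww-perl/5.69');

my %browser = header_names(join "\n",
    'GET /index.htm HTTP/1.1',
    'Cache-Control: no-cache',
    'Pragma: no-cache',
    'Accept: text/*, image/jpeg, image/png, image/*, */*',
    'Accept-Charset: iso-8859-1, utf-8;q=0.5, *;q=0.5',
    'Accept-Encoding: x-gzip, gzip, identity',
    'Accept-Language: en',
    'Host: localhost:8989',
    'User-Agent: Mozilla/5.0 (compatible; Konqueror/3; Linux)');

# Names present on one side only.
my @only_browser = sort grep { !$lwp{$_} }     keys %browser;
my @only_lwp     = sort grep { !$browser{$_} } keys %lwp;
print "Only sent by the browser: @only_browser\n";
print "Only sent by LWP: @only_lwp\n";
```

    Accept-Encoding shows up among the headers that only the browser sends, and that is the one that affects how the body is encoded.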
    I'm no expert, but I suspect it has something to do with the Transfer-Encoding and/or output filtering caused by the way Apache is configured. I noticed www.yahoo.com has no leading spaces even in the browser.
    poj

      Interesting. I've tried modifying my LWP request to include all the headers you've shown but am still not receiving the leading spaces.

      I'm not sure where the LWP headers you show as "TE:" and "Connection:" are being inserted. I didn't think LWP would add these headers, but I could be wrong. I need to try out the fake web server code and experiment more.

      Bottom line right now is, if anyone has other ideas I'd appreciate them.

      Andy

        Here is the server code I used
        #!/usr/bin/perl -w
        use strict;
        use HTTP::Daemon;

        my $server = HTTP::Daemon->new(Timeout => 10, LocalPort => 8989);

        while (my $client = $server->accept) {
          CONNECTION:
            while (my $answer = $client->get_request) {
                # Show the incoming request and append it to a log file
                print $answer->as_string;
                open LOG, ">>weblog.txt" or die "Cannot open weblog.txt: $!";
                print LOG $answer->as_string;
                close LOG;
                $client->autoflush;
              RESPONSE:
                while (<STDIN>) {
                    # "." ends this response; ".." ends the connection
                    last RESPONSE   if $_ eq ".\n";
                    last CONNECTION if $_ eq "..\n";
                    print $client $_;
                }
                print "\nEOF\n";
            }
            print "CLOSE: ", $client->reason, "\n";
            $client->close;
            undef $client;
        }
        poj
Re: LWP not returning leading spaces in web page (ver 2)
by adrianh (Chancellor) on Feb 01, 2003 at 20:50 UTC

    I think it's the save-as from Moz that's making the changes rather than LWP.

    If I dump the source using lynx -source it matches the LWP dump. Ditto for using wget.

    Trust Perl. Perl is your friend ;-)

      adrianh, Yes, I should (and do) trust Perl. I'm not sure that Perl or Perl modules are at fault here. However, I do believe the spaces are in the original HTML. The reason I believe that is two-fold:

      1. Other pages I am retrieving do have lines with leading spaces that Mozilla is definitely not inserting by itself (since the text and spaces have their own syntax that Mozilla knows nothing about.)
      2. LWP can retrieve other pages that have leading spaces. It seems to be specific websites and webpages where LWP is not receiving the leading spaces. HTTP://groups.yahoo.com was one such page that I knew everyone could access.

      I wonder if lynx and wget are likewise receiving the pages from the server w/o leading spaces, while Mozilla is getting those spaces. It was suggested that it might have to do with the particular Content Encoding headers inserted by Mozilla and the associated behavior of Apache. But I haven't had time to experiment more to look at those differences. I will try lynx and wget against the suggested Perl Cookbook "fake web server".

      I do have to figure out some way to retrieve my pages with the leading spaces, since they are syntactically important to parse the returned text.

      Andy


        Hmmm... on further investigation it looks like yahoo is returning different HTML to Moz when it requests gzipped content.

        Running

        use LWP::UserAgent;

        my $ua = LWP::UserAgent->new;
        my $request = HTTP::Request->new('GET', 'http://groups.yahoo.com/');
        # Send the same Accept-Encoding and User-Agent headers the browser sent
        $request->headers->header(
            accept_encoding => 'x-gzip, gzip, identity',
            user_agent      => 'Mozilla/5.0 (compatible; Konqueror/3; Linux)',
        );
        my $r = $ua->request($request);
        print $r->content;

        will print the gzipped content that you're seeing in Moz. So it's not that LWP is dropping anything, but that Moz is being given different content :-)
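
        If the body does come back gzip-compressed when that Accept-Encoding header is sent, it has to be decompressed before it can be compared with the browser's View Source. A minimal sketch of that step using the IO::Compress::Gzip and IO::Uncompress::Gunzip modules; the sample body is made up and compressed locally so the example runs without network access. Newer releases of libwww-perl also provide $r->decoded_content, which handles the Content-Encoding for you.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Made-up stand-in for a response body with leading spaces.
my $original = "<HTML>\n        <META name=keywords>\n";

# Simulate what a server does for Content-Encoding: gzip.
my $compressed;
gzip(\$original => \$compressed) or die "gzip failed: $GzipError";

# What the client must do before inspecting the body.
my $decoded;
gunzip(\$compressed => \$decoded) or die "gunzip failed: $GunzipError";

# Make spaces visible, as in the original test program.
(my $visible = $decoded) =~ s/ /./g;
print $visible;
```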

Re: LWP not returning leading spaces in web page (ver 2)
by nothingmuch (Priest) on Feb 01, 2003 at 20:55 UTC
    The only thing I can see that might have to do with it is that between the end of the header and the start of the text (not the content entity) there are 3 line breaks.

    Where HTTP/1.0 would supply the text right away, HTTP/1.1 adds a chunk length, in hex, spanning one line.

    In groups.yahoo.com i get:
    header
    122a
    <HTML>

    with HTTP/1.1, and with HTTP/1.0 i get
    header
    <HTML>


    As you can see, there really is only one line break. My theory is Mozilla includes the line break succeeding 122a, the chunk length, in the source.

    But it's just a theory.
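
    For reference, the chunk length mentioned above is part of HTTP/1.1 chunked transfer encoding: each chunk is a hex length, CRLF, that many bytes of data, then CRLF, and a zero-length chunk ends the body. A simplified decoder sketch (it skips chunk extensions and ignores trailers; dechunk and the sample body are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: strip chunked framing and return the plain body.
sub dechunk {
    my ($raw) = @_;
    my $body = '';
    # Each pass consumes one "hex-length CRLF" line from the front of $raw.
    while ($raw =~ s/^([0-9a-fA-F]+)[^\r\n]*\r\n//) {
        my $len = hex $1;
        last if $len == 0;                    # final zero-length chunk
        $body .= substr($raw, 0, $len, '');   # take exactly $len bytes
        $raw =~ s/^\r\n//;                    # CRLF that terminates the chunk
    }
    return $body;
}

# Made-up body sent as two chunks of 7 and 8 bytes.
my $chunked = "7\r\n\n<HTML>\r\n" . "8\r\n\n<HEAD>\n\r\n" . "0\r\n\r\n";
print dechunk($chunked);
```

    The CRLF after the hex length belongs to the framing rather than to the document, so a client that decodes chunks this way would not include it in the source.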

    -nuffin
    zz zZ Z Z #!perl