in reply to LWP not returning leading spaces in web page (ver 2)

I think it's the save-as from Moz that's making the changes rather than LWP.

If I dump the source using lynx -source it matches the LWP dump. Ditto for using wget.

Trust Perl. Perl is your friend ;-)

Re: Re: LWP not returning leading spaces in web page (ver 2)
by aspen (Sexton) on Feb 01, 2003 at 22:52 UTC

    adrianh, Yes, I should (and do) trust Perl. I'm not sure that Perl or Perl modules are at fault here. However, I do believe the spaces are in the original HTML. The reason I believe that is two-fold:

    1. Other pages I am retrieving do have lines with leading spaces that Mozilla is definitely not inserting by itself (since the text and spaces have their own syntax that Mozilla knows nothing about.)
    2. LWP can retrieve other pages that have leading spaces. It seems to be specific websites and webpages where LWP is not receiving the leading spaces. HTTP://groups.yahoo.com was one such page that I knew everyone could access.

    I wonder if lynx and wget are likewise receiving the pages from the server without leading spaces, while Mozilla is getting those spaces. It was suggested that it might have to do with the particular Accept-Encoding headers sent by Mozilla and the associated behavior of Apache. But I haven't had time to experiment more to look at those differences. I will try lynx and wget against the suggested Perl Cookbook "fake web server".

    I do have to figure out some way to retrieve my pages with the leading spaces, since they are syntactically important to parse the returned text.

    Andy

    @_="the journeyman larry disciple keeps learning\n"=~/(.)/gs, print(map$_[$_-77],unpack(q=c*=,q@QSdM[]uRMNV^[ni_\[N]eki^y@))

      Hmmm... on further investigation it looks like Yahoo is returning different HTML to Moz when it requests gzipped content.

      Running

      use LWP::UserAgent;
      use HTTP::Request;

      my $ua = LWP::UserAgent->new;
      my $request = HTTP::Request->new('GET', 'http://groups.yahoo.com/');
      # Ask for gzipped content and identify as a browser, the way Moz does
      $request->headers->header(
          accept_encoding => 'x-gzip, gzip, identity',
          user_agent      => 'Mozilla/5.0 (compatible; Konqueror/3; Linux)',
      );
      my $r = $ua->request($request);
      print $r->content;

      will print the gzipped content that you're seeing in Moz. So it's not that LWP is dropping anything, but that Moz is being given different content :-)

        Yes, exactly.

        So, how does one unzip the resulting content? I've tried using Compress::Zlib but without success. Looking around, there is some information about Apache adding a 10-byte header. I have tried stripping off the first 10 bytes, but in every case Zlib's inflate gives me an error code of -3, "unknown compression method".

        Do you know how to go about inflating an Apache-compressed page?

        Andy

        @_="the journeyman larry disciple keeps learning\n"=~/(.)/gs, print(map$_[$_-77],unpack(q=c*=,q@QSdM[]uRMNV^[ni_\[N]eki^y@))
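
For what it's worth, here is a minimal sketch of two ways to inflate a gzip-encoded body with Compress::Zlib. It runs offline: `$gzipped` is made with `memGzip` as a stand-in for `$r->content`, and the sample text and variable names are made up for illustration. The likely cause of the -3 is that the "10-byte header" is a gzip header, and what follows it is a raw deflate stream with no zlib header. A default `inflateInit` expects a zlib header, so it has to be told to expect raw deflate with a negative `WindowBits`.

```perl
use strict;
use warnings;
use Compress::Zlib;   # exports inflateInit, MAX_WBITS, Z_OK, Z_STREAM_END

# Stand-in for $r->content from a gzip-encoded response (assumption:
# real code would take the body from the HTTP::Response instead).
my $original = "    indented line\n  another indented line\n";
my $gzipped  = Compress::Zlib::memGzip($original);

# Option 1: let memGunzip parse the gzip header and trailer itself.
# It consumes its input buffer, so hand it a copy.
my $copy  = $gzipped;
my $plain = Compress::Zlib::memGunzip($copy);
die "memGunzip failed\n" unless defined $plain;

# Option 2: strip the header by hand, then inflate as raw deflate.
# Negative WindowBits tells zlib not to look for a zlib header.
my ($inflater, $init_status) = inflateInit(-WindowBits => -MAX_WBITS);
die "inflateInit failed\n" unless $init_status == Z_OK;

# memGzip writes the minimal 10-byte header; a real server may send a
# longer one (optional fields), which is why Option 1 is more robust.
my $body = substr($gzipped, 10);
my ($raw, $status) = $inflater->inflate($body);
die "inflate failed: $status\n" unless $status == Z_STREAM_END;

print $plain;
```

Both `$plain` and `$raw` should come back with the leading spaces intact. In the LWP case, `memGunzip` on a copy of `$r->content` is probably the simpler route, since it does not assume the header is exactly 10 bytes.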