adrianh, yes, I should (and do) trust Perl, and I'm not sure that Perl or its modules are at fault here. However, I do believe the spaces are in the original HTML, for two reasons:

  1. Other pages I am retrieving do have lines with leading spaces that Mozilla is definitely not inserting by itself (the text and spaces have their own syntax that Mozilla knows nothing about).
  2. LWP can retrieve other pages that have leading spaces. It seems to be specific websites and webpages where LWP is not receiving the leading spaces. HTTP://groups.yahoo.com was one such page that I knew everyone could access.

I wonder if lynx and wget are likewise receiving the pages from the server without leading spaces, while Mozilla is getting those spaces. It was suggested that it might have to do with the particular encoding-related headers (such as Accept-Encoding) sent by Mozilla and the associated behavior of Apache, but I haven't had time to experiment further with those differences. I will try lynx and wget against the suggested Perl Cookbook "fake web server".

I do have to figure out some way to retrieve my pages with the leading spaces intact, since they are syntactically significant when parsing the returned text.

Andy

@_="the journeyman larry disciple keeps learning\n"=~/(.)/gs, print(map$_[$_-77],unpack(q=c*=,q@QSdM[]uRMNV^[ni_\[N]eki^y@))

Re^3: LWP not returning leading spaces in web page (ver 2)
by adrianh (Chancellor) on Feb 02, 2003 at 12:29 UTC

    Hmmm... on further investigation it looks like Yahoo returns different HTML to Moz when it requests gzipped content.

    Running

    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my $request = HTTP::Request->new('GET', 'http://groups.yahoo.com/');

    # Ask for gzipped content and identify as a browser, as Moz does
    $request->headers->header(
        accept_encoding => 'x-gzip, gzip, identity',
        user_agent      => 'Mozilla/5.0 (compatible; Konqueror/3; Linux)',
    );

    my $r = $ua->request($request);
    print $r->content;

    will print the gzipped content that you're seeing in Moz. So it's not that LWP is dropping anything, but that Moz is being given different content :-)
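
    If you want to confirm which variant a given response body is without eyeballing it, one option is to check for the gzip magic bytes. This is only a sketch: `looks_gzipped` is a name I made up, and it assumes the buffer holds the raw, undecoded bytes (what `$r->content` returns). The sample data is generated locally with `Compress::Zlib::memGzip`, so it runs without a network connection.

```perl
use strict;
use warnings;
use Compress::Zlib;

# Hypothetical helper: true if the buffer starts with the gzip magic
# bytes 0x1f 0x8b (RFC 1952), i.e. it is probably gzip-compressed.
sub looks_gzipped {
    my ($bytes) = @_;
    return defined $bytes
        && length($bytes) >= 2
        && substr($bytes, 0, 2) eq "\x1f\x8b";
}

# Stand-ins for the two kinds of body a server might send.
my $plain   = "  <html>\n    <body>hello</body>\n  </html>\n";
my $gzipped = Compress::Zlib::memGzip($plain);

print "plain:   ", (looks_gzipped($plain)   ? "gzip" : "not gzip"), "\n";
print "gzipped: ", (looks_gzipped($gzipped) ? "gzip" : "not gzip"), "\n";
```

    The same check could be applied to `$r->content` from the request above to see whether the server honoured the Accept-Encoding header.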

      Yes, exactly.

      So, how does one unzip the resulting content? I've tried using Compress::Zlib, but without success. Looking around, there is some information about Apache adding a 10-byte header. I have tried stripping off the first 10 bytes, but in every case Zlib's inflate gives me an error code of -3, "unknown compression method".

      Do you know how to go about inflating an Apache-compressed page?

      Andy

        You want the gzip-related methods of Compress::Zlib rather than inflate. The most direct method would be:

        use LWP;
        use Compress::Zlib;

        my $ua = LWP::UserAgent->new;
        my $request = HTTP::Request->new('GET', 'http://groups.yahoo.com/');
        $request->headers->header(
            accept_encoding => 'x-gzip, gzip, identity',
            user_agent      => 'Mozilla/5.0 (compatible; Konqueror/3; Linux)',
        );

        my $r = $ua->request($request);
        my $gzipped_content = $r->content;

        # memGunzip handles the gzip header and trailer for us
        print Compress::Zlib::memGunzip($gzipped_content);
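
        As for the -3 you were getting: that is zlib's Z_DATA_ERROR, and my understanding is that a default `inflateInit()` expects a zlib (RFC 1950) header, which neither gzip data nor the raw deflate stream left after stripping the gzip header has. The sketch below reproduces this offline using data from `memGzip` rather than a real server response. It assumes a minimal 10-byte gzip header, which is what `memGzip` writes. A real server may add optional header fields, so stripping a fixed 10 bytes is fragile and `memGunzip` remains the safer route.

```perl
use strict;
use warnings;
use Compress::Zlib;

# Leading spaces matter here, so keep some in the sample text.
my $text = "  indented line\n    more indented\n";
my $gz   = Compress::Zlib::memGzip($text);

# 1. A default inflate stream expects a zlib (RFC 1950) header,
#    so feeding it gzip data fails with Z_DATA_ERROR.
my $zlib_stream = inflateInit();
my $copy = $gz;                      # inflate() consumes its input buffer
my ($out, $status) = $zlib_stream->inflate($copy);
print "plain inflate status: ", $status + 0, "\n";

# 2. memGunzip understands the gzip header and trailer.
my $unzipped = Compress::Zlib::memGunzip($gz);
print "memGunzip round trip ok\n"
    if defined $unzipped && $unzipped eq $text;

# 3. Raw deflate (RFC 1951): negative WindowBits tells zlib not to
#    look for a header, so this works once the 10 bytes are removed.
my $raw_stream = inflateInit(-WindowBits => -MAX_WBITS());
my $body = substr($gz, 10);          # assumes a minimal gzip header
my ($raw_out, $raw_status) = $raw_stream->inflate($body);
print "raw inflate round trip ok\n"
    if $raw_status == Z_STREAM_END && $raw_out eq $text;
```

        Either route gives back the text with its leading spaces intact.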