Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I'm using LWP::UserAgent on Windows XP to download some web pages. I'm able to download the content, but most of the whitespace gets removed so that most of the HTML appears on one line. I've tried this using ActiveState Perl and using Perl on cygwin with the same results in both cases. However, when I run the script under OS X Leopard the whitespace is preserved. Is there some basic setting that I'm missing in order to preserve the whitespace under Windows? A sample of some test code is below.

#!/usr/bin/perl
use LWP::UserAgent;

my $url = "http://www.somesite.com/";
my $ua = LWP::UserAgent->new;
$ua->timeout(20);
my $res = $ua->request(HTTP::Request->new(GET => $url));
print $res->content;

Any help would be greatly appreciated.

Thanks,
Jim

Replies are listed 'Best First'.
Re: LWP on Windows: whitespace removed from HTML
by roboticus (Chancellor) on May 05, 2008 at 21:03 UTC

    It may be that whitespace is being preserved. Unix programs use "\n" as the end-of-line character, while DOS/Windows programs more frequently use "\r\n". When you "type" a file in a DOS/Windows command window, all the text will appear on the same line. If it looks fine when you open the document in WordPad, that's likely to be your problem.
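One way to check the line-ending theory without trusting any editor is to count the end-of-line sequences directly and print the invisible characters as visible escapes. This is only a rough sketch; the helper names count_line_endings and show_whitespace are made up for illustration and are not part of LWP.

```perl
use strict;
use warnings;

# Count each style of line ending found in a string.
sub count_line_endings {
    my ($text) = @_;
    my %count = (crlf => 0, lf => 0, cr => 0);
    $count{crlf}++ while $text =~ /\r\n/g;        # DOS/Windows style
    $count{lf}++   while $text =~ /(?<!\r)\n/g;   # bare LF (Unix style)
    $count{cr}++   while $text =~ /\r(?!\n)/g;    # bare CR
    return \%count;
}

# Replace CR, LF and TAB with visible escapes so that leading padding
# and line endings can be inspected in any terminal or editor.
sub show_whitespace {
    my ($text) = @_;
    $text =~ s/\r/\\r/g;
    $text =~ s/\n/\\n\n/g;   # keep a real newline after the marker
    $text =~ s/\t/\\t/g;
    return $text;
}
```

If the content is saved to a file on Windows and read back, calling binmode on the filehandle first keeps Perl's text-mode layer from translating "\r\n" into "\n" before the count is taken.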

    ...roboticus

    Update: Fixed grammar.

      Thanks, but I've tried viewing the output in text editors that recognize the UNIX end-of-line character, like TextPad and WordPad, and the result is the same - the HTML appears mainly on one line. Also, it's not just end-of-line characters that are removed; whitespace used for padding at the beginning of lines is removed as well.

        I assume you did a View|Source on both the DOS and Leopard browsers and that they both show HTML formatted as you expect. The last arrows in my quiver[1] are:

        1) Change the agent identity LWP uses? Perhaps the server system just happens to serve up a different version of the document for agents other than the browsers you tried.

        2) Try using a network scanner (Ethereal or some such) to view the packets as they come across the network to verify that the software stack is munching your whitespace.

        ...roboticus

        [1]: I don't do web/HTML stuff, so my quiver is rather sparse.
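Both suggestions can be approximated from Perl itself, short of running a packet sniffer. The sketch below sets a browser-like agent string and reports the response headers next to some body statistics. The sub names make_ua and fetch_report are hypothetical, and the presence of a Via header is only a hint that an intermediary touched the response, not proof either way.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent that identifies itself with the given string.
sub make_ua {
    my ($agent) = @_;
    return LWP::UserAgent->new(agent => $agent, timeout => 20);
}

# Fetch a URL and summarise what actually came back.
sub fetch_report {
    my ($ua, $url) = @_;
    my $res  = $ua->get($url);
    my $body = $res->content;            # raw bytes, no charset decoding
    return {
        status   => $res->status_line,
        headers  => $res->headers->as_string,
        via      => scalar $res->header('Via'),   # undef if absent
        bytes    => length $body,
        newlines => ($body =~ tr/\n//),           # count of LF characters
    };
}

# Example use (performs a real request, so it is left commented out):
# my $ua = make_ua('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
# my $r  = fetch_report($ua, 'http://www.somesite.com/');
# print "$r->{status}\n$r->{headers}\nbytes=$r->{bytes} newlines=$r->{newlines}\n";
```

Running the same report on two machines, or with two agent strings, makes it easy to see whether the newline count changes while everything else stays the same.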

        My guess, then, is that there aren't any carriage returns in the HTML - i.e. the server isn't generating them. If you give us an example url we can verify that.
Re: LWP on Windows: whitespace removed from HTML
by ikegami (Patriarch) on May 05, 2008 at 22:45 UTC

    Just a quick confirmation: LWP doesn't do any transformation. It doesn't even know about HTML.

Re: LWP on Windows: whitespace removed from HTML
by Cody Pendant (Prior) on May 06, 2008 at 02:09 UTC
    >Change the agent identity LWP uses?

    Seconding this. I've been bitten in the past by assuming that the source sent to my GUI browser is the same as the source sent to LWP.

    Also, if you tell us the actual site, we can test for you.



    Nobody says perl looks like line-noise any more
    kids today don't know what line-noise IS ...
Re: LWP on Windows: whitespace removed from HTML
by Anonymous Monk on May 06, 2008 at 03:13 UTC
    I don't know the answer to your question but you can use HTML::TreeBuilder and the as_HTML method to re-print the HTML with whitespace...
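A minimal sketch of that workaround, assuming HTML::TreeBuilder is installed (the method is spelled as_HTML in HTML::Element). The sub name reindent_html is made up for illustration. Note that this re-serialises the parsed tree, so the output is normalised markup with fresh indentation and not the server's original whitespace.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Parse an HTML string and print it back with two-space indentation.
sub reindent_html {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    # Arguments: entities to encode (undef = default set),
    # indent string, and a hash of optional end tags ({} = emit them all).
    my $out = $tree->as_HTML(undef, '  ', {});
    $tree->delete;    # free the tree explicitly
    return $out;
}

# Example use:
# print reindent_html('<html><body><p>one</p><p>two</p></body></html>');
```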
Re: LWP on Windows: whitespace removed from HTML
by Lu. (Hermit) on May 06, 2008 at 08:41 UTC
Re: LWP on Windows: whitespace removed from HTML
by Anonymous Monk on May 06, 2008 at 18:16 UTC
    Thanks for all of the suggestions, folks. Here is some additional information, as requested:

    1) I have done a "View Source" from within Firefox to verify that there are line breaks within the source HTML, although I don't know if those are "\r\n" or just "\n". However, my cygwin environment is set up to use UNIX line endings, and it still experiences the problem.
    2) Some of the sites I have tried include www.cnn.com and www.cdc.gov. In general, some of the line breaks within <script> elements seem to get preserved, but outside of those most of the HTML appears on a single line.
    3) I have tried changing the agent to "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", but I got the same results. I haven't tried using any other agent settings other than the default yet.
    4) I have tried using the content returned by LWP::UserAgent with HTML::TreeParser, but it didn't seem to help, although I may need to experiment more with that.

    Some additional info - I tried running my test script on another machine running Windows XP Home, and it worked fine there (i.e. whitespace was preserved), so the issue appears to be isolated to my work laptop running XP Pro. I've also tried using the Socket module to download the HTML on that machine, and I experience the same issue of whitespace being removed. Based on that, I suspect this isn't an issue specifically with LWP, but I'm not quite sure what to check next.
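For anyone who wants to repeat the socket test, here is a rough sketch using the core IO::Socket::INET module so that LWP is not involved at all. The sub names build_request and raw_fetch are hypothetical. It speaks plain HTTP/1.0 on port 80 only, so it will not follow redirects or handle HTTPS.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Build a minimal HTTP/1.0 GET request with CRLF line endings.
sub build_request {
    my ($host, $path) = @_;
    return join "\r\n",
        "GET $path HTTP/1.0",
        "Host: $host",
        "Connection: close",
        "", "";               # blank line terminates the header block
}

# Send the request over a raw TCP socket and return (headers, body).
sub raw_fetch {
    my ($host, $path) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
        Timeout  => 20,
    ) or die "connect to $host failed: $!";
    binmode $sock;            # no CRLF translation on Windows
    print {$sock} build_request($host, $path);
    my $response = do { local $/; <$sock> };   # slurp until the server closes
    close $sock;
    $response = '' unless defined $response;
    my ($head, $body) = split /\r\n\r\n/, $response, 2;
    return ($head, $body);
}

# Example use (performs a real request, so it is left commented out):
# my ($head, $body) = raw_fetch('www.somesite.com', '/');
# print $head, "\n\n", length($body // ''), " body bytes\n";
```

If the body fetched this way is also missing its whitespace, the change is happening somewhere between the server and the socket, not inside LWP.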

    Thanks again for all of the help.
      This sounds very much like a http proxy that strips whitespace from the pages before delivering them to you. Is there any proxy in your way?
        Not that I'm aware of. The laptop is using a cellular air-card, so it's not sitting behind a proxy server on the company network. The laptop does see line breaks if I view source from the browser; it's just when I use Perl to download the content that I experience the lack-of-whitespace issues.
      I meant HTML::TreeBuilder in the comment above, not HTML::TreeParser. Sorry for any confusion.
Re: LWP on Windows: whitespace removed from HTML
by Anonymous Monk on May 12, 2008 at 14:39 UTC
    Resolution: It turns out that I missed a setting for the aircard that enables data compression and acceleration. It's turned on by default. Once I turned that off, the content downloaded with the whitespace intact.

    Thanks again for all of the suggestions. Sorry for the false alarm.
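To confirm this kind of diagnosis, it can help to save one copy of a page with the compression setting on and one with it off, then check whether the two differ only in whitespace. The sketch below is a crude comparison under that assumption. The sub names are made up, and stripping all whitespace also ignores spaces inside text, so treat a positive result as a strong hint and not as proof.

```perl
use strict;
use warnings;

# Remove every whitespace character from a string.
sub strip_ws {
    my ($s) = @_;
    $s =~ s/\s+//g;
    return $s;
}

# True when the two strings are not identical but become identical
# once all whitespace is removed.
sub differs_only_in_whitespace {
    my ($x, $y) = @_;
    return ($x ne $y && strip_ws($x) eq strip_ws($y)) ? 1 : 0;
}

# Example use with two saved copies of the same page:
# my $with    = do { local (@ARGV, $/) = ('with_compression.html');    <> };
# my $without = do { local (@ARGV, $/) = ('without_compression.html'); <> };
# print "only whitespace differs\n" if differs_only_in_whitespace($with, $without);
```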