I am resubmitting this LWP question. Thanks to some help from fokat I can now provide a short, complete program to recreate the problem.

With some websites, LWP appears not to return the leading spaces at the start of many lines. I need to retrieve those spaces when using LWP.

Note that LWP does return the leading spaces for many (most?) webpages; groups.yahoo.com is just one well-known site where it seems to lose them.

To replicate this issue perform these steps:

  1. Using Mozilla (or perhaps other browsers) go to http://groups.yahoo.com, View Source and save it as groups.yahoo.com.html.
  2. Run the following program, which will:
    • Use LWP to grab the same web page
    • Make some whitespace characters visible by:
      • Substituting spaces with dots (i.e. s/ /./g)
      • Inserting "<<LF>>" before each linefeed character
    • Print out the first 500 characters of the result
    • Read in the Mozilla-saved source, perform the same substitutions and print the first 500 characters of the result.
  3. Compare the results. You will see the missing leading spaces that I am struggling to keep with LWP!
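The two substitutions in step 2 can be wrapped in one small routine so that both captures are treated identically. This is only a minimal sketch: `show_whitespace` is a hypothetical helper name, not part of the program in this post, and the sample string is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the original program):
# returns a copy of $text with spaces shown as dots and "<<LF>>" inserted
# before each linefeed, truncated to at most $limit characters.
sub show_whitespace {
    my ($text, $limit) = @_;
    $text =~ s/ /./g;
    $text =~ s/\cJ/<<LF>>\cJ/g;
    return substr($text, 0, $limit);
}

# Made-up sample with leading spaces on the second line.
my $sample = "<HTML>\n    <HEAD>\n";
print show_whitespace($sample, 500), "\n";
```

Working on a copy of the text leaves the original content untouched, so it can still be inspected or saved afterwards.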

Any help diagnosing this will be greatly appreciated.

The code that performs the above (after saving Mozilla's View Source output to the current directory) is:

#!/usr/bin/perl
use LWP;
use strict;
use warnings;

my $ua = new LWP::UserAgent;

print "\n\nUsing LWP to grap http://groups.yahoo.com.\n".
      "Printing first 500 characters.\n";
my $r = $ua->get("http://groups.yahoo.com");
${$r->content_ref} =~ s/ /./g;
${$r->content_ref} =~ s/\cJ/<<LF>>\cJ/g;
print substr($r->content,0,500), "\n";

print "\n\nUsing previously-saved Mozilla groups.yahoo.com source.\n".
      "Print first 500 characters.\n";
open FH, "groups.yahoo.com.html";
undef $/;
my $s = <FH>;
$s =~ s/ /./g;
$s =~ s/\cJ/<<LF>>\cJ/g;
print substr($s,0,500), "\n";

For those that just want to see the results, this is what is printed when I run the above program:

Using LWP to grap http://groups.yahoo.com.
Printing first 500 characters.
<<LF>>
<HTML><<LF>>
<HEAD><<LF>>
<META.http-equiv="PICS-Label".content='(PICS-1.1."http://www.icra.org/ratingsv02.html".l.gen.true.for."http://groups.yahoo.com".r.(nz.0.vz.0.lz.0.oz.0.ca.1))'><<LF>>
<META.content="free.email.groups,.mailing.lists,.communities,.majordomo,.e-mail,.bounce.handling,.mlm.software,.listserv,.Yahoo!.Groups,.newletters,.announcement,.email.lists,.list.hosting".name=keywords><<LF>>
<META.content="Yahoo!.Groups.-.Free,.easy.email.groups".name=description><<LF>>
<TITLE><<L

Using previously-saved Mozilla groups.yahoo.com source.
Print first 500 characters.
<<LF>>
<<LF>>
<HTML><<LF>>
<<LF>>
<HEAD><<LF>>
<<LF>>
........<<LF>>
........<META.http-equiv="PICS-Label".content='(PICS-1.1."http://www.icra.org/ratingsv02.html".l.gen.true.for."http://groups.yahoo.com".r.(nz.0.vz.0.lz.0.oz.0.ca.1))'><<LF>>
..<<LF>>
......<META.content="free.email.groups,.mailing.lists,.communities,.majordomo,.e-mail,.bounce.handling,.mlm.software,.listserv,.Yahoo!.Groups,.newletters,.announcement,.email.lists,.list.hosting".name=keywords><<LF>>
....<META.content="Yahoo!.Group
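Comparing the two dumps by eye is error-prone, so a rough sketch like the following could count how many lines start with a space in each capture. `count_indented_lines` is a hypothetical helper, and the two sample strings are made up to resemble the output above; they are not real responses from groups.yahoo.com.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the original program):
# counts the lines in $text that begin with at least one space.
sub count_indented_lines {
    my ($text) = @_;
    # With /m, ^ matches at the start of every line; the empty-list
    # assignment forces list context so that all matches are counted.
    my $count = () = $text =~ /^ +/mg;
    return $count;
}

# Made-up stand-ins for the LWP capture and the browser-saved source.
my $from_lwp     = "\n<HTML>\n<HEAD>\n<META>\n";
my $from_browser = "\n\n<HTML>\n\n<HEAD>\n\n        \n        <META>\n";

printf "LWP capture:     %d indented line(s)\n", count_indented_lines($from_lwp);
printf "Browser capture: %d indented line(s)\n", count_indented_lines($from_browser);
```

Running the same count on the unmodified content from both sources (before the dot substitution) would give a quick numeric check of whether leading spaces are really absent from one of them.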

Very strange.

Andy

@_="the journeyman larry disciple keeps learning\n"=~/(.)/gs, print(map$_[$_-77],unpack(q=c*=,q@QSdM[]uRMNV^[ni_\[N]eki^y@))

In reply to LWP not returning leading spaces in web page (ver 2) by aspen
