coltman has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I ran into a weird case when I tried to download the webpage "http://securities.stanford.edu/1008/UTIIQ96/" using LWP::Simple::get() and save $content to a txt file.

The weird thing is that if I open the txt file in an editor (e.g., UltraEdit), it looks perfectly normal:

<HTML><HEAD><TITLE>Unitech Industries, Inc. - Securities Class Action</TITLE>

However, if I use "print $content" during the download, the log shows something different:

< H T M L > < H E A D > < T I T L E > U n i t e c h I n d u s t r i e s , I n c . - S e c u r i t i e s C l a s s A c t i o n < / T I T L E >

It just adds a space after every character.

When I try to use a regex to extract information, the space issue haunts me every time, because Perl always reads the txt file as if it had the extra spaces!

I would appreciate it if someone could give me a hint on the cause of the problem and how to solve it.

Thank you!

Replies are listed 'Best First'.
Re: LWP problem
by kyle (Abbot) on Sep 08, 2008 at 15:50 UTC

    My browser thinks that page is "UTF-16 (Little endian)" encoded. I did this:

    use LWP::Simple;
    use Encode;

    my $p = get( 'http://securities.stanford.edu/1008/UTIIQ96/' );
    my $d = decode( 'UTF-16LE', substr( $p, 2 ), 1 );

    After that, $d comes out without all the null characters. I'm using a substr of $p so as to skip over the Byte-order mark.
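
As a minimal sketch of the same idea, the byte order could be read from the byte-order mark instead of being hard-coded. The helper name decode_utf16_by_bom and the sample bytes are made up for illustration; the sample stands in for what get() returns so the snippet runs without network access.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the bytes returned by LWP::Simple::get():
# a little-endian byte-order mark followed by UTF-16LE text.
my $sample = '<HTML><HEAD><TITLE>Unitech</TITLE>';
my $raw    = "\xFF\xFE" . encode('UTF-16LE', $sample);

# Hypothetical helper: choose the byte order from the first two bytes,
# then decode everything after the byte-order mark.
sub decode_utf16_by_bom {
    my ($bytes) = @_;
    my $bom = substr($bytes, 0, 2);
    return decode('UTF-16LE', substr($bytes, 2)) if $bom eq "\xFF\xFE";
    return decode('UTF-16BE', substr($bytes, 2)) if $bom eq "\xFE\xFF";
    return;    # no byte-order mark; the caller has to pick an encoding
}

my $text = decode_utf16_by_bom($raw);
print "$text\n";
```

If the helper returns undef, the data did not start with a UTF-16 byte-order mark and is probably in some other encoding.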

Re: LWP problem
by betterworld (Curate) on Sep 08, 2008 at 15:40 UTC

    It seems like that document is in UTF-16 encoding. Try this:

    use LWP::Simple;
    use Encode;

    my $x = get("http://securities.stanford.edu/1008/UTIIQ96/");
    print length $x, "\n";    # prints 59810
    $x = decode("utf-16", $x);
    print length $x, "\n";    # prints 29904
    print $x;                 # prints the document
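
The two lengths above fit the pattern of two bytes per character plus a two-byte byte-order mark (59810 = 2 * 29904 + 2). Here is a small offline sketch of that relationship, which also shows why a regex fails on the raw bytes. The sample string is a made-up stand-in for the page, not its real content.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the downloaded bytes: BOM plus UTF-16LE text.
my $sample = '<TITLE>Unitech Industries, Inc.</TITLE>';
my $raw    = "\xFF\xFE" . encode('UTF-16LE', $sample);

# Decoding with "UTF-16" reads the byte-order mark and drops it.
my $decoded = decode('UTF-16', $raw);

# Each character took two bytes, and the BOM took two more.
printf "raw bytes: %d, decoded characters: %d\n",
    length $raw, length $decoded;

# A NUL byte sits between the letters of the raw data, so the pattern
# only matches after decoding.
my $raw_hit     = ($raw     =~ /<TITLE>/) ? 1 : 0;
my $decoded_hit = ($decoded =~ /<TITLE>/) ? 1 : 0;
print "match on raw: $raw_hit, match on decoded: $decoded_hit\n";
```

The "extra spaces" in the original question are most likely these NUL bytes, which is why running the regexes on the decoded string should fix the extraction.
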
Re: LWP problem
by deus.lemmus (Initiate) on Sep 10, 2008 at 13:31 UTC

    That looks like you're seeing the text as UTF-16 in the second case. You could try running it through decode() from the Encode module and then re-encoding it as UTF-8 (or some other encoding) if you need to.
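
A rough sketch of that conversion, assuming the data starts with a byte-order mark as reported in the other replies. The sample bytes are a made-up stand-in for the download; in a real script the UTF-8 bytes would be written to the txt file instead of being checked here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the downloaded bytes: BOM plus UTF-16LE text.
my $sample = 'Unitech Industries, Inc. - Securities Class Action';
my $raw    = "\xFF\xFE" . encode('UTF-16LE', $sample);

# Step 1: bytes to characters (the byte-order mark is consumed here).
my $text = decode('UTF-16', $raw);

# Step 2: characters back to bytes, this time as UTF-8 for the txt file.
my $utf8 = encode('UTF-8', $text);

# For printing to a log, let the output layer do the encoding instead.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

Opening the output file with a layer such as '>:encoding(UTF-8)' and printing $text to it would be an alternative to calling encode() by hand.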