BeneSphinx has asked for the wisdom of the Perl Monks concerning the following question:
I have a script that reads a UTF-8 encoded webpage (actually just a text file on a website). It is in UTF-8 with the byte-order-mark (BOM) sequence, although the Content Type header is just text/plain. I get it with code like:
    use LWP::UserAgent;
    use XML::Simple;
    use Encode qw(encode decode);
    use warnings;
    use strict;
    ...
    my $content;
    my $ua = LWP::UserAgent->new;
    my $request = HTTP::Request->new("GET", $url);
    unless ($skipAuth) {
        $request->authorization_basic($user, $pass);
    }
    $ua->prepare_request($request);
    my $response = $ua->send_request($request);
    if ($pageFormat eq "xml") {
        $content = XMLin($response->decoded_content((charset => "utf8")));
    } else { #txt
        $content = $response->decoded_content((charset => "utf8"));
        #Prints UTF-8 as expected.
        print "CONTENT CHARSET " . $response->content_charset() . "\n\n";
        #All of the below statements print the BOM as literal characters to the Windows CMD screen
        print "CONTENT: " . $response->content() . "\n";
        print "DECODED CONTENT: " . $response->decoded_content() . "\n";
        print "DECODED CONTENT WITH UTF-8 SPECIFIED: " . $response->decoded_content((charset => "utf8")) . "\n";
        print "MANUALLY DECODED: " . decode("UTF-8", $response->content());
    }
Whenever I run this in the Windows CMD prompt, the BOM is always printed to the screen as literal characters. I've seen lots of suggestions, from switching the Windows code page to UTF-8 ("chcp 65001") to decoding or encoding at various stages, but nothing works.
When I print to file, however, I get a file that both Notepad and Notepad++ can read without the BOM. I think they both detect it as UTF-8 and hide the BOM:

    open(RESULT, ">result.txt");
    print RESULT $content;
    close(RESULT);

When I run "type result.txt" from the cmd prompt, it spits out the file contents with the BOM showing again.
So, it seems that throughout the process, Perl, Notepad, and Notepad++ correctly and consistently treat the text as UTF-8. What's odd is that the CMD prompt doesn't, and always shows those marks, even after I change the code page to 65001.
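One way to check whether the BOM is really in result.txt, instead of trusting what an editor shows or hides, is to look at the first three bytes directly. The following is only a sketch: `file_has_utf8_bom` is a hypothetical helper name, and the temporary file stands in for result.txt so the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: true if the file starts with the UTF-8 BOM bytes EF BB BF.
sub file_has_utf8_bom {
    my ($path) = @_;
    # :raw avoids any newline or encoding translation, so we see the bytes on disk.
    open(my $fh, "<:raw", $path) or die "Cannot open $path: $!";
    my $read = read($fh, my $head, 3);
    close($fh);
    return defined $read && $read == 3 && $head eq "\xEF\xBB\xBF";
}

# Demo on a throwaway file that stands in for result.txt (assumption for illustration).
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} "\xEF\xBB\xBFhello\n";
close($tmp_fh);

print file_has_utf8_bom($tmp_path) ? "BOM present\n" : "no BOM\n";
```

If the check reports a BOM for the real file, the editors are most likely hiding it rather than Perl having removed it.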
My first question is why the CMD prompt isn't handling the BOM correctly, even after being told to use the UTF-8 code page. My second question is why Perl insists on keeping the BOM and printing it later. I would have expected it to be stripped during the initial read of the text file, since it's just packaging, and omitted in Perl's internal character representation.
Overall, though, I'd like to learn where to fix the problem. Do I configure Windows differently? Do I read the text file differently in Perl? Or do I just print things differently in Perl? Any insights or suggestions will be greatly appreciated.
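For the "read it differently in Perl" option, a common approach is to drop the BOM by hand right after decoding, since decoding UTF-8 with Encode leaves it in the string as the character U+FEFF. This is a minimal sketch, not a definitive fix: `strip_bom` is a hypothetical helper, and `$raw_bytes` is a stand-in for what `$response->content()` would return.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: remove a single leading U+FEFF from an already-decoded string.
sub strip_bom {
    my ($text) = @_;
    $text =~ s/\A\x{FEFF}//;
    return $text;
}

# Stand-in for $response->content(): raw UTF-8 bytes starting with the BOM (EF BB BF).
my $raw_bytes = "\xEF\xBB\xBFhello\n";

# After decoding, the three BOM bytes become the one character U+FEFF.
my $decoded = decode("UTF-8", $raw_bytes);
my $clean   = strip_bom($decoded);

# Encode explicitly on output so Perl writes UTF-8 bytes instead of guessing.
binmode(STDOUT, ":encoding(UTF-8)");
print $clean;
```

Whether the console then renders non-ASCII text correctly still depends on the active code page and font, so this only addresses the stray BOM itself.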
Replies are listed 'Best First'.

Re: can't get rid of BOM from UTF-8 webpage
by Anonymous Monk on May 20, 2012 at 08:15 UTC
by BeneSphinx (Sexton) on May 20, 2012 at 20:25 UTC

Re: can't get rid of BOM from UTF-8 webpage
by mbethke (Hermit) on May 20, 2012 at 15:34 UTC