can't get rid of BOM from UTF-8 webpage

BeneSphinx has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that reads a UTF-8 encoded webpage (actually just a text file on a website). It is in UTF-8 with the byte-order-mark (BOM) sequence, although the Content-Type header is just text/plain. I get it with code like:

use LWP::UserAgent;
use XML::Simple;
use Encode qw(encode decode);
use warnings;
use strict;

...

my $content;
my $ua = LWP::UserAgent->new;
my $request = HTTP::Request->new("GET",$url);
unless ($skipAuth){
  $request->authorization_basic($user, $pass);
}
$ua->prepare_request($request);
my $response = $ua->send_request($request);
if ($pageFormat eq "xml"){
  $content = XMLin($response->decoded_content((charset => "utf8")));
}
else { #txt
  $content = $response->decoded_content((charset => "utf8"));  

  #Prints UTF-8 as expected.
  print "CONTENT CHARSET " . $response->content_charset() . "\n\n";
  
  #All of the below statements print the BOM as literal characters to the Windows CMD screen
  print "CONTENT: " . $response->content() . "\n";
  print "DECODED CONTENT: " . $response->decoded_content() . "\n";
  print "DECODED CONTENT WITH UTF-8 SPECIFIED: " . $response->decoded_content((charset => "utf8")) . "\n";
  print "MANUALLY DECODED: " . decode("UTF-8", $response->content());
}

Whenever I run this in Windows CMD prompt, I always get the BOM marks (∩╗┐) printed to screen. I've seen lots of suggestions, from switching the Windows code page to UTF-8 ("chcp 65001"), to decoding or encoding at various stages, but nothing works.

When I print to file, however, I get a file that both Notepad and Notepad++ can read without the BOM. I think they both detect it as UTF-8 and hide the BOM:

open(RESULT, ">result.txt");
print RESULT $content;
close(RESULT);

When I run "type result.txt" from the cmd prompt, it spits out the file contents with the BOM showing again.

So, it seems that throughout the process, Perl, Notepad, and Notepad++ correctly and consistently treat the text as UTF-8. What's odd is that the CMD prompt doesn't, and always shows those marks, even after I change the code page to 65001.

My first question is why the CMD prompt isn't handling the BOM correctly, even after being told to use the UTF-8 code page. My second question is why Perl insists on keeping the BOM and printing it later. I would have expected it to be stripped during the initial read of the text file, since it's just packaging, and omitted in Perl's internal character representation.

Overall, though, I'd like to learn where to fix the problem. Do I configure Windows differently? Do I read the text file differently in Perl? Or do I just print things differently in Perl? Any insights or suggestions will be greatly appreciated.


Re: can't get rid of BOM from UTF-8 webpage by Anonymous Monk on May 20, 2012 at 08:15 UTC
Hi :)

"My second question is why Perl insists on keeping the BOM and printing it later"

Because it would be insane to throw it away without being told to throw it away.

"I would have expected it to be stripped during the initial read of the text file, since it's just packaging, and omitted in Perl's internal character representation."

It isn't mere packaging, and it isn't "omitted"; your expectation is wrong.

"Overall, though, I'd like to learn where to fix the problem. Do I configure Windows differently? Do I read the text file differently in Perl? Or do I just print things differently in Perl? Any insights or suggestions will be greatly appreciated."

For cmd.exe, change fonts; I read that fonts are responsible for not showing the BOM.

Or try PowerShell; I hear that thing is Unicode by default, so it ought to come with fonts that know to hide the BOM.

Or, from Perl, strip the BOM, say by using `:encoding(UTF-8):via(File::BOM)`, and/or skip printing the BOM when `-t Filehandle is opened to a tty` (tty means console, i.e. cmd.exe).

"I've seen lots of suggestions ..."

Next time, include those links in your post :)

FWIW, Content-type is not charset.

FWIW, utf8 is not UTF-8; the difference could be important.

BUT, FWIW, you shouldn't specify a charset (utf8 or UTF-8) to decoded_content; that is the webserver's job, and it should just work already.

"My first question is why the CMD prompt isn't handling the BOM correctly,"

Seems to me something on MSDN would answer that :p
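The "strip the BOM from Perl" suggestion can be sketched with core modules only. This is a minimal illustration, not the File::BOM approach named above: `strip_bom` is a hypothetical helper, and `$raw_bytes` is a stand-in for what `$response->content()` would return for a page that starts with a BOM.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): remove a single leading
# U+FEFF from a string that has already been decoded to characters.
sub strip_bom {
    my ($text) = @_;
    $text =~ s/\A\x{FEFF}//;
    return $text;
}

# Stand-in for $response->content(): raw UTF-8 bytes with a BOM in front.
my $raw_bytes = "\xEF\xBB\xBFhello\n";

# Decoding as UTF-8 keeps the BOM as the character U+FEFF.
my $decoded = decode('UTF-8', $raw_bytes);
my $clean   = strip_bom($decoded);

# Only drop the BOM when STDOUT is a terminal (-t); keep it otherwise.
my $for_display = -t STDOUT ? $clean : $decoded;

# Encode on output so the handle receives bytes, not wide characters.
binmode(STDOUT, ':encoding(UTF-8)');
print $for_display;
```

The `-t STDOUT` check is one way to follow the "skip printing the BOM on a tty" idea; whether a redirected file should keep the BOM is a choice for the script's author.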
Re^2: can't get rid of BOM from UTF-8 webpage by BeneSphinx (Sexton) on May 20, 2012 at 20:25 UTC
Thanks to both of you for your responses; that helps clarify things. Looks like it's a CMD issue which can be solved in Perl with a little manual work. Fortunately the display isn't critical to my project.

Since you asked, I found the suggestion for changing the code page to 65001 here: http://stackoverflow.com/questions/379240/is-there-a-windows-command-shell-that-will-display-unicode-characters

This was also a helpful walkthrough: http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using

For the encoding/decoding suggestions, my main source was just the Perl Unicode Tutorial.
Re: can't get rid of BOM from UTF-8 webpage by mbethke (Hermit) on May 20, 2012 at 15:34 UTC
The codepage doesn't have anything to do with it: either the shell strips the BOM internally or it doesn't. However, if it implemented Unicode correctly, it should render the BOM as an invisible character. Of course a BOM is completely superfluous[1] in UTF-8 (Notepad, BTW, is notorious for writing one anyway), and I agree it could well be discarded upon reading. As it isn't, just strip it out as suggested in the post above.

[1] OK, it could serve to identify UTF-8 with just the first couple of bytes if it were consistently applied, but as it's not recommended by the standard, hardly anyone does it, so identification of unknown files always has to rely on larger data chunks anyway.
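For anyone wondering where the three characters in the original post come from, here is a small sketch. It shows that the BOM is three bytes in UTF-8, and that those bytes, read one at a time as code page 437, map to the box-drawing glyphs quoted in the question. That the poster's console was using code page 437 is an assumption; the thread does not confirm it.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
my $bom_bytes = encode('UTF-8', "\x{FEFF}");
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bom_bytes;

# Reinterpret the same bytes as code page 437 (assumed console code page).
my $as_cp437   = decode('cp437', $bom_bytes);
my $codepoints = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $as_cp437;

print "UTF-8 bytes of the BOM:   $hex\n";
print "Same bytes read as cp437: $codepoints\n";
```

U+2229, U+2557 and U+2510 are the characters ∩, ╗ and ┐.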