special HTML Characters

chuck_norris has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: special HTML Characters by ikegami (Patriarch) on Apr 08, 2008 at 11:24 UTC
`&nbsp;`, also known as `&#160;`, is U+00A0: NO-BREAK SPACE. `&sim;` is U+223C: TILDE OPERATOR. HTML::Entities handles both just fine:

`>perl -e"use HTML::Entities qw( decode_entities ); printf('U+%04X', ord(decode_entities($ARGV[0])))" "&nbsp;"`
`U+00A0`
`>perl -e"use HTML::Entities qw( decode_entities ); printf('U+%04X', ord(decode_entities($ARGV[0])))" "&sim;"`
`U+223C`

I suspect you have a bug in your output code. You probably forgot to encode the text string returned by `decode_entities` into a binary string appropriate for your terminal or for the file to which you are writing the string. This can be done by adding the `:encoding(...)` layer on `open`, by adding the `:encoding(...)` layer using `binmode`, or by explicitly encoding using Encode's `encode` function.
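For illustration, the three ways of encoding mentioned above might look roughly like this. It is only a sketch: the sample input string is made up, UTF-8 is assumed to be the encoding the terminal or file expects, and an in-memory buffer stands in for a real output file.

```perl
use strict;
use warnings;
use HTML::Entities qw( decode_entities );
use Encode qw( encode );

# decode_entities returns a string of characters (code points), not bytes.
# The input string here is a made-up sample.
my $text = decode_entities('a&nbsp;b &sim; c');

# Option 1: encode explicitly with Encode's encode function.
# UTF-8 is an assumption; use whatever your terminal or file expects.
my $bytes = encode('UTF-8', $text);
printf "%d characters, %d bytes\n", length($text), length($bytes);

# Option 2: add an :encoding layer to an already open handle with binmode.
binmode(STDOUT, ':encoding(UTF-8)');
print "$text\n";

# Option 3: add the :encoding layer when opening the handle.
# An in-memory buffer is used here; a real path would go in its place.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer)
    or die "Cannot open in-memory file: $!";
print {$out} $text;
close($out) or die "Cannot close in-memory file: $!";
```

Options 1 and 3 end up with the same bytes; the layer simply does the `encode` call for you on every `print`.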
Re: special HTML Characters by graff (Chancellor) on Apr 08, 2008 at 23:14 UTC
As ikegami pointed out above, `&nbsp;` is the Unicode non-breaking space (not a "superscript"). If you are seeing Â, it's because the original HTML entity is being correctly converted to UTF-8, turning it into the two-byte sequence `0xc2 0xa0`, and then this is being incorrectly displayed as if it were a string using a single-byte encoding (i.e. 0xc2 is the code point for Â and 0xa0 is "nbsp" in single-byte Latin-1 code pages like cp1252 and iso-8859-1). That's why ikegami mentions that you need to pay attention to how the data are being handed off to your display (i.e. use a UTF-8-based display, or else encode the text into whatever character set you need for the display tool that you have).
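The effect described above can be reproduced with the core Encode module alone. This is a small sketch; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# U+00A0 NO-BREAK SPACE as a one-character string.
my $nbsp = "\x{A0}";

# Encoding it as UTF-8 produces two bytes.
my $utf8_bytes = encode('UTF-8', $nbsp);
my $hex = join ' ', map { sprintf('0x%02x', ord($_)) } split //, $utf8_bytes;

# Misreading those two bytes as ISO-8859-1 produces two characters:
# U+00C2 (capital A with circumflex) followed by U+00A0.
my $misread = decode('ISO-8859-1', $utf8_bytes);
my $code_points = join ' ', map { sprintf('U+%04X', ord($_)) } split //, $misread;

print "UTF-8 bytes:        $hex\n";
print "Read as ISO-8859-1: $code_points\n";
```

So the stray Â is the first byte of a correctly encoded non-breaking space, shown by something that assumes one byte per character.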
Re: special HTML Characters by Anonymous Monk on Apr 08, 2008 at 09:21 UTC
use Encoding ...
Re: special HTML Characters by CountZero (Bishop) on Apr 09, 2008 at 10:11 UTC
The "capital A with Tilde" is just the way your display shows you the "`&#x00A0;`" character. Probably your display device/driver/program is not set up to show the Unicode character set.

CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James