chuck_norris has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I'm trying to get an HTML response from a certain website, parse the results, and send them to another program. The response contains non-English characters, so I'm using the HTML::Entities module and calling decode_entities($response);. However, there are some characters which I cannot handle, e.g. &nbsp;, &sim;, etc.
HTML::Entities translates &nbsp; into Â instead of its real value (it's a superscript). Do you have any idea how to handle these characters?

Thanks,
Chuck

Re: special HTML Characters
by ikegami (Patriarch) on Apr 08, 2008 at 11:24 UTC

    &nbsp;, also known as &#160;, is U+00A0: NO-BREAK SPACE.
    &sim; is U+223C: TILDE OPERATOR.

    HTML::Entities properly handles both just fine:

    >perl -e"use HTML::Entities qw( decode_entities ); printf('U+%04X', ord(decode_entities($ARGV[0])))" "&nbsp;"
    U+00A0

    >perl -e"use HTML::Entities qw( decode_entities ); printf('U+%04X', ord(decode_entities($ARGV[0])))" "&sim;"
    U+223C

    I suspect you have a bug in your output code. You probably forgot to encode the text string returned by decode_entities into a binary string appropriate for your terminal or for the file to which you are writing the string.

    This can be done by adding the :encoding(...) layer on open, by adding the :encoding(...) layer using binmode, or by explicitly encoding using Encode's encode function.
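
    The three options above can be sketched roughly as follows. This is a minimal illustration, not code from the thread: it assumes UTF-8 is the encoding the destination expects, the string literal stands in for what decode_entities would return for "&nbsp;&sim;", and the variable and helper names are made up. In-memory files are used only so the sketch runs without touching the disk.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Stand-in for the text string returned by decode_entities('&nbsp;&sim;').
my $text = "\x{A0}\x{223C}";

# 1. :encoding(...) layer given to open (in-memory file used for the demo;
#    a real script would pass a file name instead of \$via_open).
my $via_open = '';
open(my $fh1, '>:encoding(UTF-8)', \$via_open) or die "open: $!";
print {$fh1} $text;
close($fh1);

# 2. :encoding(...) layer added afterwards with binmode.
my $via_binmode = '';
open(my $fh2, '>', \$via_binmode) or die "open: $!";
binmode($fh2, ':encoding(UTF-8)');
print {$fh2} $text;
close($fh2);

# 3. Explicit encoding with Encode's encode function.
my $via_encode = encode('UTF-8', $text);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

printf "%-8s %s\n", 'open',    hex_bytes($via_open);
printf "%-8s %s\n", 'binmode', hex_bytes($via_binmode);
printf "%-8s %s\n", 'encode',  hex_bytes($via_encode);
```

    All three approaches should yield the same bytes, C2 A0 E2 88 BC. For terminal output, the binmode form would normally be applied to STDOUT instead of a scalar-backed handle.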

Re: special HTML Characters
by graff (Chancellor) on Apr 08, 2008 at 23:14 UTC
    As ikegami pointed out above, &nbsp; is the Unicode non-breaking space (not a "superscript"). If you are seeing Â, it's because the original HTML entity is being correctly converted to utf8, which turns it into the two-byte sequence 0xc2 0xa0, and that sequence is then being incorrectly displayed as if it were a string in a single-byte encoding (i.e. 0xc2 is the code point for Â and 0xa0 is "nbsp" in single-byte Latin-1 code pages like cp1252 and iso-8859-1).
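
    The misreading can be reproduced with a short sketch. It assumes the output side is UTF-8 and the display side is ISO-8859-1; the variable names are illustrative and not from the original code.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

my $nbsp  = "\x{A0}";                # what decode_entities returns for &nbsp;
my $bytes = encode('UTF-8', $nbsp);  # correctly encoded output: two bytes

# A Latin-1 display treats each byte as a separate character.
my $misread = decode('ISO-8859-1', $bytes);

my $byte_dump = join ' ', map { sprintf('%02X',    ord($_)) } split //, $bytes;
my $char_dump = join ' ', map { sprintf('U+%04X', ord($_)) } split //, $misread;

print "UTF-8 bytes: $byte_dump\n";   # C2 A0
print "misread as:  $char_dump\n";   # U+00C2 U+00A0
```

    U+00C2 is LATIN CAPITAL LETTER A WITH CIRCUMFLEX, which is the stray Â; the U+00A0 that follows it renders as an ordinary-looking space.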

    That's why ikegami mentions that you need to pay attention to how the data are being handed off to your display (i.e. use a utf8-based display, or else encode the text into whatever character set you need for the display tool that you have).

Re: special HTML Characters
by Anonymous Monk on Apr 08, 2008 at 09:21 UTC
    use Encoding ...
Re: special HTML Characters
by CountZero (Bishop) on Apr 09, 2008 at 10:11 UTC
    The "capital A with circumflex" is just the way your display shows you the "&#x00A0;" character. Your display device, driver, or program is probably not set up to show the Unicode character set.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James