in reply to Re: Handling HTML special characters correctly
in thread Handling HTML special characters correctly

Just want to point out that you don't need to convert the code-point \xA3 to &pound; when outputting it. If you are only using latin-1 characters, you shouldn't have to use encode_entities on anything but the special HTML characters: <, >, &, and ".

The code-point \xA3 is directly representable in latin-1 and utf-8 (and any other reasonable encoding you would use for your web page). You only have to use encode_entities on those code-points which are not directly representable by the character set (encoding) used for your page.
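
A minimal sketch of what that looks like, assuming HTML::Entities is installed (the sample string is made up for illustration). The optional second argument to encode_entities limits escaping to the characters you list, so \xA3 is left alone:

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical sample text containing a latin-1 pound sign and HTML specials.
my $text = qq{Price: \xA3 5 <b>"sale"</b> & more};

# Escape only <, >, & and "; every other character passes through untouched.
my $escaped = encode_entities($text, '<>&"');

# The pound sign is still the single code-point \xA3 after escaping.
print "pound sign left alone\n" if $escaped =~ /\xA3/;
```

Whether the browser then shows the pound sign correctly still depends on the bytes you send matching the charset your page declares.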

Re^3: Handling HTML special characters correctly
by monarch (Priest) on Jul 02, 2008 at 22:11 UTC
    ..although "\x{A3}" in UTF-8 would be encoded as two bytes ("\x{C2}\x{A3}"). See UTF-8 encoding table.

    Update: removed superfluous parenthesis.
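
If you want to check the byte counts yourself, here is a small sketch using the core Encode module. The helper name hex_bytes is made up for this example:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $pound  = "\x{A3}";                                   # one code-point
my $latin1 = hex_bytes(encode('ISO-8859-1', $pound));    # one byte
my $utf8   = hex_bytes(encode('UTF-8', $pound));         # two bytes

print "latin-1: $latin1\n";
print "utf-8:   $utf8\n";
```

So the same code-point is one byte (A3) in latin-1 and two bytes (C2 A3) in UTF-8.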

      £ never used to cause me a problem on the old RH9. But these days most web servers seem to be set to en_US.UTF-8, where outputting £ gives you a nasty ? in the browser, so it needs to be &pound; instead.

      On a side note just noticed something annoying about HTML::Entities, if your input is already encoded, such as &pound;, you'll get &amp;pound;, thought it would have checked for encoded characters and skipped them?


      Lyle
        On a side note just noticed something annoying about HTML::Entities, if your input is already encoded, such as &pound;, you'll get &amp;pound;,
        Well, how would you encode it? Given an input like:
        If you write &amp; in HTML, it turns out as &
        The expected output of such a text after encoding would be:
        If you write &amp;amp; in HTML, it turns out as &amp;
        Now you're saying, only the last ampersand should be escaped? Because the first one is already escaped? No, you never know if a text is already escaped.
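
A rough sketch of that round trip, assuming HTML::Entities is installed. The sample sentence is the one from above; the variable names are just for illustration:

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities decode_entities);

# Text that talks *about* an entity, so it already contains "&amp;".
my $text = 'If you write &amp; in HTML, it turns out as &';

# Called in non-void context, encode_entities returns an encoded copy.
# Every ampersand is escaped, including the one that starts "&amp;".
my $once = encode_entities($text);

# Decoding exactly once gives back the original text.
my $back = decode_entities($once);

print "$once\n";
print "round trip ok\n" if $back eq $text;
```

Had the encoder skipped the first ampersand because it looked escaped already, decoding would not return the original text.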

        thought it would have checked for encoded characters and skipped them?

        That would be BAD! Decoding and encoding your post would change "such as &pound;" to "such as £".

        On a side note just noticed something annoying about HTML::Entities, if your input is already encoded

        Duh?

        The same thing happens if you try to encode characters twice.
        The same thing happens if you try to encode URL characters twice.
        The same thing happens if you try to zip a string twice.
        etc