in reply to Re^2: HTML::Entities and Unicode quotes
in thread HTML::Entities and Unicode quotes

Thanks for pointing out Encode::is_utf8($str), as I'd been wondering if there was something like this.

Ack! Please don't use that. It does NOT indicate whether something has been decoded or not. You have been misinformed.
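
A minimal sketch of why the flag tells you nothing about decoding (the variable names are mine, and it uses only Encode::is_utf8 and the built-in utf8::upgrade). The decoded string reports false and the undecoded one reports true:

```perl
use strict;
use warnings;
use Encode ();

# A decoded character string: one character, U+00E9.
# Perl happens to store it as a single byte, so the flag is off.
my $text = "\x{E9}";

# An undecoded byte string: the two UTF-8 octets for U+00E9.
# utf8::upgrade changes only the internal storage, not the content,
# so it is still two "characters" (0xC3, 0xA9), but the flag is now on.
my $octets = "\xC3\xA9";
utf8::upgrade($octets);

printf "text:   length %d, is_utf8 %s\n",
    length($text), Encode::is_utf8($text) ? 'yes' : 'no';
printf "octets: length %d, is_utf8 %s\n",
    length($octets), Encode::is_utf8($octets) ? 'yes' : 'no';
```

Whether a string has been decoded is something your program has to keep track of itself; Perl does not record it.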

A couple of things are still puzzling me, though. One is, the \xe2\x80\x9d sequence is in an encoding. What's it called?

It's the UTF-8 encoding of U+201D RIGHT DOUBLE QUOTATION MARK.
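
For illustration, a short sketch that decodes those three octets with the core Encode module and checks that the result is the single character U+201D (variable names are mine):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three octets from the question.
my $octets      = "\xe2\x80\x9d";
my $octet_count = length($octets);

# Decoding turns the octets into one character.
my $char = decode('UTF-8', $octets);

# Encoding the character gives the same three octets back.
my $round_trip = encode('UTF-8', $char);

printf "%d octets decode to %d character: U+%04X\n",
    $octet_count, length($char), ord($char);
```

Once the input has been decoded like this, modules that work on characters see one quotation mark rather than three unrelated bytes.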

Is there something that can be set at the start of a script to have all Perl IO default to ":encoding(UTF-8)"?

There is the open pragma. It's not perfect, but it'll do a lot. It can handle STDIN, STDOUT and STDERR, and it can set the default layers for open.
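
A rough sketch of what that looks like, assuming a temporary file created with File::Temp (the file and the variable names are only for the demonstration). Handles opened without explicit layers pick up the default, while an explicit layer such as :raw still overrides it:

```perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';   # STDIN/STDOUT/STDERR plus the default for open
use File::Temp qw(tempfile);

# Only the file name is needed; the handles below are opened
# with plain open so that the pragma's defaults apply to them.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# No layer given: the default :encoding(UTF-8) encodes on output.
open my $out, '>', $path or die "open for write: $!";
print {$out} "\x{201D}";
close $out or die "close: $!";

# Explicit :raw overrides the default, so this reads the octets on disk.
open my $raw, '<:raw', $path or die "open raw: $!";
my $octets = do { local $/; <$raw> };
close $raw;

# No layer given: the default decodes on input.
open my $in, '<', $path or die "open for read: $!";
my $char = do { local $/; <$in> };
close $in;

printf "%d octets on disk, %d character read back\n",
    length($octets), length($char);
```

The pragma is lexically scoped, so it only affects opens in the file or block where it appears; handles opened by other modules are not covered.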

Re^4: HTML::Entities and Unicode quotes
by tod222 (Pilgrim) on Aug 23, 2011 at 03:46 UTC
    Ack! Please don't use that. It does NOT indicate whether something has been decoded or not. You have been misinformed.
    Yes, I saw that after I posted. Got distracted mid-post and when I got back didn't revisit the node to see the new replies.
    It's the UTF-8 encoding of U+201D RIGHT DOUBLE QUOTATION MARK.
    is_utf8 is even misleadingly named. I'd have called it "needs_utf8". But I won't use it.

    Regarding

    use open ':encoding(utf8)';
    in my earlier traversal of perlunifaq I saw
    Using :utf8 for input can sometimes result in security breaches, so please use :encoding(UTF-8) instead.
    in the answer to What is the difference between :encoding and :utf8? Is ':encoding(utf8)' the same as ':encoding(UTF-8)'?

      is_utf8 is even misleadingly named. I'd have called it "needs_utf8".

      As in needs to be encoded using UTF-8? No, it doesn't indicate a need for the string to be encoded, using UTF-8 or otherwise.

      It is actually accurately named, but refers to how the string is stored internally, not the content of the string.

      Is ':encoding(utf8)' the same as ':encoding(UTF-8)'?

      The encoding is called UTF-8, so use "UTF-8" (case doesn't matter). I don't know how :encoding(utf8) is different, but I don't see any reason for figuring it out.
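
For anyone who does want to check, the Encode documentation describes "UTF-8" as the strict, standard encoding and "utf8" as Perl's own laxer variant. A small sketch that asks Encode which encoding object each name resolves to (it shows only the names, not the behavioural differences):

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up the canonical name Encode uses for each spelling.
my %canonical = map { $_ => find_encoding($_)->name } qw(UTF-8 utf-8 utf8);

printf "%-6s => %s\n", $_, $canonical{$_} for qw(UTF-8 utf-8 utf8);
```

"UTF-8" and "utf-8" both resolve to utf-8-strict, while "utf8" resolves to a separate encoding named utf8, which is consistent with perlunifaq's advice to prefer :encoding(UTF-8).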