morgon has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I run CGI::Application::Plugin::TT under mod_perl and need to send either UTF-8- or ISO-8859-1-encoded XML (generated by templates) back to the client.

I have asked about this before but it seems that changing the encoding of STDOUT via binmode does not work under mod_perl.

I have now found a solution that works for me, but it is not entirely clear to me why it works.

What I do is let TT generate the XML in UTF-8 and then change the encoding in a cgiapp_postrun method like this:

use Encode;

sub cgiapp_postrun {
    my ($this, $response_ref) = @_;

    my $encoding = ...; # figure out which encoding to use

    if ($encoding ne "UTF-8") {
        my $response_utf8 = decode('UTF-8', $$response_ref);
        $$response_ref    = encode('ISO-8859-1', $response_utf8);
    }
}
This works, but what I don't understand is why it is necessary to call decode first (it does not work without it).

I always thought that a Perl string would by default be utf8, so I had assumed that I could call encode straight away, but this is evidently wrong...

So can someone please explain what is going on?

Many thanks!

Replies are listed 'Best First'.
Re: decode/encode - can someone explain this please
by Anonyrnous Monk (Hermit) on Jan 07, 2011 at 23:02 UTC
    why it is neccessary to first call decode

    Because the string needs to have the utf8 flag turned on for Perl to treat it as a string of characters. And decoding the input (even if it's already in UTF-8, which very closely resembles what Perl uses internally) is the safest way to turn on the utf8 flag.

    For most practical purposes, it's best to think of Perl's internal Unicode encoding as some opaque format that's not your business, and simply decode your inputs and encode your outputs.

    (What is considered "input" depends on the context. From what you say, we can infer that TT encoded its output as UTF-8 (turning off the utf8 flag), so you have to treat it as if it were any other external UTF-8-encoded input, even if it's just being passed around program-internally.)
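
    For illustration, a minimal sketch of that decode-the-input / encode-the-output discipline; the file names here are made up, but a socket, a CGI parameter or TT's output would be handled the same way:

        use strict;
        use warnings;
        use Encode qw(decode encode);

        # hypothetical byte source and sink
        open my $in,  '<:raw', 'input.xml'  or die $!;
        open my $out, '>:raw', 'output.xml' or die $!;

        my $bytes = do { local $/; <$in> };        # slurp raw bytes

        my $text = decode('UTF-8', $bytes);        # input bytes -> characters
        # ... work on $text as characters (regexes, length, uc, ...) ...
        print {$out} encode('ISO-8859-1', $text);  # characters -> output bytes
                                                   # (unmappable chars become '?')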

Re: decode/encode - can someone explain this please
by ikegami (Patriarch) on Jan 08, 2011 at 00:04 UTC

    The internal arrangement of a scalar is of no relevance to this conversation. It's the string in the scalar that matters.

    A string can hold text, UTF-8-encoded data, ISO-Latin-1-encoded data, a packed integer, and so on. In fact, it's supposed to be able to hold any sequence of 32-bit or 64-bit integers (depending on your build).

    Your string apparently contains text encoded into bytes using UTF-8 (UTF-8 text), but encode requires decoded Unicode text. You can get what you need from what you have using decode.
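
    A tiny self-contained demonstration of the difference; the "café" string is made up for illustration and stands in for whatever non-ASCII data the templates produce:

        use strict;
        use warnings;
        use Encode qw(decode encode);

        # The kind of thing TT hands back: "café" as raw UTF-8 bytes
        # (the "é" is stored as the two bytes 0xC3 0xA9).
        my $utf8_bytes = "caf\xC3\xA9";

        # Without decode(): encode() sees five characters (c a f 0xC3 0xA9)
        # and maps each one to a Latin-1 byte, so the result is still the
        # UTF-8 byte sequence -- a Latin-1 client renders it as "cafÃ©".
        my $wrong = encode('ISO-8859-1', $utf8_bytes);
        printf "wrong: %vx\n", $wrong;   # 63.61.66.c3.a9

        # With decode(): turn the UTF-8 bytes into a 4-character string
        # first, then encode that as Latin-1.
        my $chars = decode('UTF-8', $utf8_bytes);
        my $right = encode('ISO-8859-1', $chars);
        printf "right: %vx\n", $right;   # 63.61.66.e9 ("café" in Latin-1)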

    Update: Removed a distracting paragraph.