mobiusinversion has asked for the wisdom of the Perl Monks concerning the following question:

Good Day Monks,

I have been breathing too deeply from the fumes of the Perl Unicode documentation.
I find myself dizzy and in the need of guidance.

The following is from
http://perldoc.perl.org/perluniintro.html
and in particular the subsection entitled: Displaying Unicode As Text:

Sometimes you might want to display Perl scalars containing Unicode as simple ASCII (or EBCDIC) text. The following subroutine converts its argument so that Unicode characters with code points greater than 255 are displayed as \x{...} , control characters (like \n ) are displayed as \x.. , and the rest of the characters as themselves:
sub nice_string {
    join("",
        map { $_ > 255
              ? sprintf("\\x{%04X}", $_)
              : chr($_) =~ /[[:cntrl:]]/
                ? sprintf("\\x%02X", $_)
                : quotemeta(chr($_))
        } unpack("W*", $_[0])
    )
}

My question is this: How do you undo that operation to recover the original scalar?

Dave

Re: Escaping Wide Characters
by ikegami (Patriarch) on Mar 05, 2008 at 18:08 UTC
    That code (unpack 'W' specifically) only works in 5.10. Below is a version that works in 5.8+ (when Unicode support was added), and a reverse function for 5.8+ that is safe for use on untrusted strings.
    sub escape_5_10 {
        join '',
            map { $_ > 255
                  ? sprintf('\\x{%04X}', $_)
                  : chr() =~ /[[:cntrl:]]/
                    ? sprintf('\\x%02X', $_)
                    : quotemeta(chr())
            } unpack('W*', @_ ? $_[0] : $_)
    }

    sub escape {
        join '',
            map { ord() > 255
                  ? sprintf('\\x{%04X}', ord())
                  : /[[:cntrl:]]/
                    ? sprintf('\\x%02X', ord())
                    : quotemeta()
            } map /./gs, @_ ? $_[0] : $_
    }

    sub unescape {
        my $s = @_ ? $_[0] : $_;
        $s =~ s/
            \G
            (?: \\x \{ ([0-9a-fA-F]+) \}
            |   \\x ([0-9a-fA-F]{1,2})
            |   \\(.)
            |   # No escapes
            )
            ([^\\]*)
        /
            (
                defined($1) ? chr(hex($1))
              : defined($2) ? chr(hex($2))
              : defined($3) ? $3
              : ''
            ) . $4
        /xesg;
        $s
    }

    # XXX Assumes. Good enough. Avoids warn.
    binmode STDOUT, ':encoding(UTF-8)';

    my $s = '<3'          # \W and \w
          . chr(0x04D2)   # wide char
          . "\cC";        # ctrl char
    print("$s\n");
    $s = escape($s);
    print("$s\n");
    $s = unescape($s);
    print("$s\n");

    Update: Functions now default to using $_ if no argument was supplied.

      That code (unpack 'W' specifically) only works in 5.10.

      The 5.8 docs had unpack("U*", ...) in the nice_string() snippet (instead of unpack("W*", ...) ) — which works fine, AFAICT.

        I have some weird problems with these mappings: On an English Windows XP, the unpack("U*"...) works fine, even with my Perl 5.6.1.

        But neither the unpack("U*"...) nor the "map /./gs,..." approach works if I run exactly the same scripts on an English Windows 2003 Server platform, or on a Japanese Windows XP platform. Do you know of any general problems with perl's Unicode handling on these platforms, and do you have an idea how I could solve that?

      all groovy.

      that was awesome, thanks!
Re: Escaping Wide Characters
by almut (Canon) on Mar 05, 2008 at 17:12 UTC

    eval-ing the resulting string should be sufficient (\x{....} and \x.. are how you specify characters in a literal double-quoted string, where the dots stand for hex digits).

    Update: maybe I should add that, depending on security context (e.g. if someone malicious could have modified the string returned from nice_string() in the meantime), you might not want to blindly eval an arbitrary string... In that case, you could extract the \x{....} sequences etc. using a regex, and convert the chars individually back to Unicode with pack("U", hex(...)).
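A minimal sketch of that regex-based route, for anyone who wants to avoid eval. The sub name unescape_nice_string is made up for illustration, and it assumes the input has exactly the shape nice_string() produces: \x{...} for wide characters, two-digit \x.. for control characters, and a single backslash in front of anything quotemeta protected.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the Perl docs): undoes nice_string()-style
# escaping with a substitution instead of eval.
sub unescape_nice_string {
    my ($escaped) = @_;
    $escaped =~ s/
          \\x\{([0-9A-Fa-f]+)\}    # braces form, used for wide characters
        | \\x([0-9A-Fa-f]{2})      # two hex digits, used for control characters
        | \\(.)                    # any other character protected by quotemeta
    /
          defined $1 ? pack('U', hex $1)
        : defined $2 ? pack('U', hex $2)
        :              $3
    /gsex;
    return $escaped;
}

# The escaped form of  '<3' . chr(0x04D2) . "\cC"
my $back = unescape_nice_string('\<3\x{04D2}\x03');
```

Because every backslash in the escaped text starts one of the three alternatives, scanning left to right with /g never misreads an escaped backslash as the start of a \x sequence.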

      oh and this too:
      my($x,$y) = ('\x{263a}',undef); eval "$y = $x";
      thanks for the reply. how do you eval it? for example, this does not seem to work:
      my($x,$y) = ('\x{263a}',undef); eval '$y = $x';
      and this does not compile:
      my($x,$y) = ('\x{263a}',undef); $y = eval($x);
      what am i doing wrong?

        You need to put double quotes around the string, e.g.

        my $x = '\x{263a}';
        my $y = eval '"'.$x.'"';
        # or
        my $y = eval "\"$x\"";
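To spell out the difference, here is a small sketch comparing the single-quoted attempt with the quoted form. The variable names $copy and $char are only for illustration. With eval '$y = $x' the code that runs is a plain assignment, so the escape text is copied unchanged. With eval "$y = $x" the variables are interpolated before eval sees anything, so eval receives " = \x{263a}", which is not valid Perl. Only when the text is wrapped in double quotes does eval see a string literal whose escape gets interpreted.

```perl
use strict;
use warnings;

my $x = '\x{263a}';    # eight literal characters: \ x { 2 6 3 a }

# Single-quoted eval text: the code run is  $copy = $x , a plain copy.
my $copy;
eval '$copy = $x';
die $@ if $@;
# $copy still holds the eight-character escape text.

# Wrapping the text in double quotes turns it into the literal "\x{263a}".
my $char = eval '"' . $x . '"';
die $@ if $@;
# $char is now the single character U+263A.
```

As noted above, this is only reasonable when the string comes from a trusted source.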
Re: Escaping Wide Characters
by vdrab (Initiate) on Mar 06, 2008 at 05:17 UTC
    Data::Dumper is your friend. I usually just convert the scalar to Unicode using Encode, and then dump it. To reverse the process, eval the dumped output.
    #!/usr/bin/perl -wl
    $encoding = shift @ARGV;
    $word = shift @ARGV;
    defined $encoding and defined $word
        or die 'give me an encoding and a word to convert';
    use Encode;
    use Data::Dumper q(Dumper);
    print Dumper( decode $encoding, $word );
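A rough sketch of the full round trip with Data::Dumper, under a few assumptions: the dump is read back with a string eval (Data::Dumper output is Perl source), $Data::Dumper::Terse is set so that only the value is emitted, and $Data::Dumper::Useqq is set so control characters come out as escapes too. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Data::Dumper;

my $original = '<3' . chr(0x04D2) . "\cC";

my $dumped = do {
    local $Data::Dumper::Terse = 1;    # emit just the value, no "$VAR1 = "
    local $Data::Dumper::Useqq = 1;    # double-quoted output with escapes
    Dumper($original);
};

# Reading it back runs the dump as Perl code, so the same caveat about
# untrusted input applies here as for any other string eval.
my $restored = eval $dumped;
die "eval failed: $@" if $@;
```

The escaped form differs from what nice_string() prints, so this fits best when the text only has to survive storage or transport rather than match that exact format.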