Escaping Wide Characters

mobiusinversion has asked for the wisdom of the Perl Monks concerning the following question:

Good Day Monks,

I have been breathing too deeply from the fumes of the Perl Unicode documentation.
I find myself dizzy and in the need of guidance.

The following is from

       http://perldoc.perl.org/perluniintro.html

and in particular the subsection entitled: Displaying Unicode As Text:

Sometimes you might want to display Perl scalars containing Unicode as simple ASCII (or EBCDIC) text. The following subroutine converts its argument so that Unicode characters with code points greater than 255 are displayed as \x{...}, control characters (like \n) are displayed as \x.., and the rest of the characters as themselves:

    sub nice_string {
        join("",
            map { $_ > 255 ? 
                sprintf("\\x{%04X}", $_) :  
                chr($_) =~ /[[:cntrl:]]/ ? 
                sprintf("\\x%02X", $_) :    
                quotemeta(chr($_))         
            } unpack("W*", $_[0])
         )           
     }

My question is this: How do you undo that operation to recover the original scalar?

Dave


Replies are listed 'Best First'.
Re: Escaping Wide Characters by ikegami (Patriarch) on Mar 05, 2008 at 18:08 UTC
That code (`unpack 'W'` specifically) only works in 5.10. Below is a version that works in 5.8+ (when Unicode support was added), and a reverse function for 5.8+ that is safe for use on untrusted strings.

    sub escape_5_10 {
        join '',
            map { $_ > 255 ?
                sprintf('\\x{%04X}', $_) :
                chr() =~ /[[:cntrl:]]/ ?
                sprintf('\\x%02X', $_) :
                quotemeta(chr())
            } unpack('W*', @_ ? $_[0] : $_)
    }

    sub escape {
        join '',
            map { ord() > 255 ?
                sprintf('\\x{%04X}', ord()) :
                /[[:cntrl:]]/ ?
                sprintf('\\x%02X', ord()) :
                quotemeta()
            } map /./gs, @_ ? $_[0] : $_
    }

    sub unescape {
        my $s = @_ ? $_[0] : $_;
        $s =~ s/
            \G
            (?: \\x \{ ([0-9a-fA-F]+) \}
            |   \\x ([0-9a-fA-F]{1,2})
            |   \\(.)
            |   # No escapes
            )
            ([^\\]*)
        /
            (
                defined($1) ? chr(hex($1)) :
                defined($2) ? chr(hex($2)) :
                defined($3) ? $3 :
                ''
            ) . $4
        /xesg;
        $s
    }

    # XXX Assumes. Good enough. Avoids warn.
    binmode STDOUT, ':encoding(UTF-8)';

    my $s = '<3'            # \W and \w
          . chr(0x04D2)     # wide char
          . "\cC";          # ctrl char
    print("$s\n");
    $s = escape($s);
    print("$s\n");
    $s = unescape($s);
    print("$s\n");

Update: Functions now default to using `$_` if no arg was supplied.
Re^2: Escaping Wide Characters by almut (Canon) on Mar 05, 2008 at 18:30 UTC
"That code (unpack 'W' specifically) only works in 5.10."

The 5.8 docs had `unpack("U*", ...)` in the `nice_string()` snippet (instead of `unpack("W*", ...)`), which works fine, AFAICT.
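A minimal sketch of that 5.8-style variant, for comparison with the snippet in the question. The only intended difference is the `"U*"` template; the name `nice_string_u` and the sample input are made up here for illustration and do not come from the docs.

```perl
use strict;
use warnings;

# Same logic as the perluniintro snippet, but with the "U*" template
# in place of "W*". The sub name is hypothetical.
sub nice_string_u {
    my ($text) = @_;
    return join "",
        map {
            $_ > 255                    ? sprintf("\\x{%04X}", $_)   # wide character
          : chr($_) =~ /[[:cntrl:]]/    ? sprintf("\\x%02X", $_)     # control character
          :                               quotemeta(chr($_))         # everything else
        } unpack("U*", $text);
}

# Sample input (an assumption): a letter, a newline, and U+263A.
my $escaped = nice_string_u("a\n" . chr(0x263A));
```

With this input the newline becomes `\x0A` and the wide character becomes `\x{263A}`, while the plain letter passes through unchanged.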
Re^3: Escaping Wide Characters by Anonymous Monk on Mar 18, 2008 at 14:31 UTC
I have some weird problems with these mappings: On an English Windows XP, the unpack("U"...) works fine, even with my Perl 5.6.1. But neither the unpack("U"...) nor the "map /./gs,..." approach works if I run exactly the same scripts on an English Windows 2003 Server platform, or on a Japanese Windows XP platform. Do you know of any general problems with perl's Unicode handling on these platforms, and do you have an idea how I could solve that?
Re^4: Escaping Wide Characters by ikegami (Patriarch) on Mar 19, 2008 at 05:48 UTC
Re^2: Escaping Wide Characters by mobiusinversion (Beadle) on Mar 05, 2008 at 18:15 UTC
all groovy. that was awesome, thanks!
Re: Escaping Wide Characters by almut (Canon) on Mar 05, 2008 at 17:12 UTC
`eval`-ing the resulting string should be sufficient: `\x{....}` and `\x..` are how you specify characters in a literal double-quoted string (where `..` are hex digits).

Update: maybe I should add that, depending on security context (e.g. if someone malicious could have modified the string returned from `nice_string()` in the meantime), you might not want to blindly `eval` an arbitrary string... In that case, you could extract the `\x{....}` sequences etc. using a regex, and convert the chars individually back to Unicode with `pack("U", hex(...))`.
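The regex route suggested in the update could look roughly like this. It is a sketch, not a tested library routine: the name `unescape_nice_string` is hypothetical, and it assumes the input has exactly the three escape forms that `nice_string()` emits (`\x{HHHH}`, two-digit `\xHH`, and a backslash before a `quotemeta`'d character).

```perl
use strict;
use warnings;

# Hypothetical helper: undo nice_string() without eval.
sub unescape_nice_string {
    my ($s) = @_;
    # Alternatives, in order: wide char, control char, quotemeta'd char.
    $s =~ s/
          \\x \{ ([0-9A-Fa-f]+) \}
        | \\x ([0-9A-Fa-f]{2})
        | \\ (.)
    /
          defined $1 ? pack("U", hex $1)
        : defined $2 ? pack("U", hex $2)
        :              $3
    /gsex;
    return $s;
}

# Single quotes keep the backslashes literal; '\\' is one backslash.
my $escaped  = 'a\\ b\\x0A\\x{263A}';
my $restored = unescape_nice_string($escaped);
```

Because nothing is compiled as Perl code, a tampered input can at worst produce unexpected characters.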
Re^2: Escaping Wide Characters by mobiusinversion (Beadle) on Mar 05, 2008 at 18:03 UTC
oh and this too:

    my($x,$y) = ('\x{263a}',undef);
    eval "$y = $x";
Re^2: Escaping Wide Characters by Anonymous Monk on Mar 05, 2008 at 17:59 UTC
thanks for the reply. how do you eval it? for example, this does not seem to work:

    my($x,$y) = ('\x{263a}',undef);
    eval '$y = $x';

and this does not compile:

    my($x,$y) = ('\x{263a}',undef);
    $y = eval($x);

what am i doing wrong?
Re^3: Escaping Wide Characters by almut (Canon) on Mar 05, 2008 at 18:10 UTC
You need to put double quotes around the string, e.g.

    my $x = '\x{263a}';
    my $y = eval '"'.$x.'"';
    # or
    my $y = eval "\"$x\"";
Re^3: Escaping Wide Characters by ikegami (Patriarch) on Mar 05, 2008 at 18:10 UTC
    my $y = eval qq{"$x"};

but see my other post for a safer alternative.
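To make the trade-off concrete, here is a small sketch of the `eval qq{"$x"}` form, followed by a harmless stand-in for a tampered string. The variable names and the `6 * 7` payload are illustrative assumptions. The output of `nice_string()` itself escapes `@` and `$` via `quotemeta`, so the concern applies only to strings that may have been altered afterwards.

```perl
use strict;
use warnings;

my $escaped = '\x{263a}';               # eight literal characters
my $char    = eval qq{"$escaped"};      # eval sees the source text "\x{263a}"
defined $char or die "eval failed: $@";

# Why a tampered string is risky: double-quoted strings interpolate,
# so anything placed inside an array-deref block runs as Perl code.
my $tampered = '@{[ 6 * 7 ]}';          # harmless stand-in for hostile input
my $result   = eval qq{"$tampered"};    # the multiplication is executed
```

Here `$char` ends up as the single character U+263A, while `$result` holds the product computed during the eval rather than the literal text.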
Re: Escaping Wide Characters by vdrab (Initiate) on Mar 06, 2008 at 05:17 UTC
Data::Dumper is your friend. I usually just decode the scalar to Unicode using Encode, and then dump it. `eval` the dumped text to reverse the process.

    #!/usr/bin/perl -wl
    $encoding = shift @ARGV;
    $word = shift @ARGV;
    defined $encoding and defined $word
        or die 'give me an encoding and a word to convert';
    use Encode;
    use Data::Dumper q(Dumper);
    print Dumper( decode $encoding, $word );
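A rough sketch of a full round trip with Data::Dumper, assuming the dumped text is something you produced yourself (reading it back with `eval` carries the same risks discussed earlier in the thread). `Useqq` and `Terse` are real Data::Dumper settings; the sample string is made up.

```perl
use strict;
use warnings;
use Data::Dumper;

# Sample data (an assumption): ASCII text, a newline, and U+2603.
my $original = "snow\n" . chr(0x2603);

my $dumped = do {
    local $Data::Dumper::Useqq = 1;   # double-quoted output with escapes
    local $Data::Dumper::Terse = 1;   # just the value, no "$VAR1 = " prefix
    Dumper($original);
};

my $restored = eval $dumped;          # only for text you produced yourself
defined $restored or die "eval failed: $@";
```

`Terse` is set so that the dumped text is a bare expression; without it the output assigns to `$VAR1`, which does not compile under `use strict` unless that variable is declared.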