merlinX has asked for the wisdom of the Perl Monks concerning the following question:

I want to convert <U+xxxx> literals into wide hexadecimal values. How do I do that? Probably trivial, but I don't see it now...

    use Encode;
    use encoding 'utf8';

    my $test = "<U+010C>";
    $test=~s/\<U\+(.*)\>/\\x\{$1\}/g;
    print "$test\n";                         # prints \x{010C}
    my $probe1=encode("utf8","\x{010C}");    # works
    my $probe2=encode("utf8","$test");       # does not work?

Replies are listed 'Best First'.
Re: convert scalar to wide hexadecimal value? how?
by JavaFan (Canon) on Jan 14, 2009 at 11:16 UTC
    \x{010C} is only equal to Č if it appears in a string literal. Otherwise, you may want to use 'chr'. Two ways of converting "<U+010C>":
    $text =~ s/<U\+([0-9A-Fa-f]+)>/qq!"\\x{$1}"!/ee;
    $text =~ s/<U\+([0-9A-Fa-f]+)>/chr hex $1/e;
      Indeed, thanks ... regexes are not my strong point, hence a little additional question: suppose I have multiple <U+xxxx> codes in the string, for instance "TEST <U+010C> <U+0158> <U+0147>"... what does the regexp look like then?
        Just found it ... I guess I have to add the g (global) modifier, right?
        $text =~ s/<U\+([0-9A-Fa-f]+)>/qq!"\\x{$1}"!/gee;
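        For reference, a minimal sketch of the whole round trip using the chr/hex variant with the g modifier, which avoids the double eval. The variable names ($chars, $bytes) are only illustrative, and it assumes the goal is UTF-8 encoded output:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "TEST <U+010C> <U+0158> <U+0147>";

# Copy $text, then replace every <U+XXXX> marker with the character
# at that code point. /g handles all markers, /e evaluates "chr hex $1".
(my $chars = $text) =~ s/<U\+([0-9A-Fa-f]+)>/chr hex $1/ge;

# $chars is a character string; encode it to UTF-8 bytes before output.
my $bytes = encode("UTF-8", $chars);

binmode STDOUT;    # we are printing already-encoded bytes
print $bytes, "\n";
```

        The \x{...} escape is only interpreted when Perl compiles a string literal, which is why building the text "\x{010C}" at run time and passing it to encode() just encodes those eight literal characters.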
Re: convert scalar to wide hexadecimal value? how?
by moritz (Cardinal) on Jan 14, 2009 at 12:07 UTC
    Try s/<U\+([a-fA-F\d]+)>/chr hex $1/eg. (In Perl regexes < has no special meaning, so you can omit the backslash before it. If you want a word boundary, use \b instead.)
      The pattern of your suggestion is almost the same as my second suggestion - except that you're using \d instead of 0-9. As a result, your solution is going to try to change: "<U+٣٢>"

      But the result isn't what you hope for. Unfortunately, \d matches hundreds of characters that functions and operators dealing with (hex) numbers cannot handle.
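      A small sketch of the difference, using the two ARABIC-INDIC digits from the example above written as \x{...} escapes so the script itself stays ASCII. The /a modifier shown at the end is an assumption about your perl version (it needs 5.14 or later); the explicit [0-9A-Fa-f] class works everywhere:

```perl
use strict;
use warnings;

# ARABIC-INDIC DIGIT THREE and TWO, as in "<U+٣٢>"
my $digits = "\x{0663}\x{0662}";
my $input  = "<U+$digits>";

# \d follows Unicode rules on this string, so it accepts these digits.
my $d_matches     = $digits =~ /^\d+$/    ? 1 : 0;
# An explicit ASCII class does not.
my $ascii_matches = $digits =~ /^[0-9]+$/ ? 1 : 0;
# /a restricts \d to ASCII 0-9 (perl 5.14+).
my $a_matches     = $digits =~ /^\d+$/a   ? 1 : 0;

# With the explicit class the marker is simply left untouched,
# so hex() is never handed digits it cannot parse.
(my $output = $input) =~ s/<U\+([0-9A-Fa-f]+)>/chr hex $1/ge;
my $unchanged = $output eq $input ? 1 : 0;

print "\\d: $d_matches, [0-9]: $ascii_matches, \\d with /a: $a_matches, ",
      "unchanged: $unchanged\n";
```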