http://qs1969.pair.com?node_id=11137936

wyt248er has asked for the wisdom of the Perl Monks concerning the following question:

I would like to convert between Unicode code points and UTF-8 character codes.

For example, the Unicode code point for the GREEK SMALL LETTER PI is U+03C0, and its UTF-8 character code is 0xCF80. So, if the string "U+03C0" (or "0x03C0") is entered, then I want the string "0xCF80" to be printed (without quotes). If the string "0xCF80" is entered, then I want the string "U+03C0" (or "0x03C0") to be printed (without quotes). Note that the desired output is NOT a character itself but a string showing the character code.

By the way, if your terminal is configured to display Unicode wide characters, then the following commands will show you the GREEK SMALL LETTER PI.

perl -l12e 'print(chr(0x03c0))' -C
perl -l12e 'print(pack("U0W*", 0xCF, 0x80))' -C

Thank you in advance.


Replies are listed 'Best First'.
Re: How to convert between Unicode codepoint and UTF8 character code on Perl?
by hippo (Bishop) on Oct 24, 2021 at 11:13 UTC

    TIMTOWTDI but here's an illustrative test. See also How to ask better questions using Test::More and sample data

    use strict;
    use warnings;
    use Encode 'encode';
    use Unicode::Char;
    use Test::More tests => 2;

    is to_bytes ('0x03C0'), 'CF80', 'Code point to bytes (0x format)';
    is to_bytes ('U+03C0'), 'CF80', 'Code point to bytes (U+ format)';

    sub to_bytes {
        my $u = Unicode::Char->new;
        (my $in = shift) =~ s/^(?:0x|U\+)//;
        return uc unpack 'H*', encode 'UTF-8', $u->u ($in);
    }

    Translating bytes to code points is left as an exercise.


    🦛

      @hippo

      The version of my perl is above 5.8, and hence it can surely handle Unicode. However, it does not have the Unicode module. So, `use Unicode::Char;` returns an error "Can't locate Unicode/Char.pm in @INC". Also `use Unicode;` returns an error "Can't locate Unicode.pm in @INC".

      Can you modify your subroutine `to_bytes` so that it will not use the Unicode module?

      Thank you.

        Can you modify your subroutine `to_bytes` so that it will not use the Unicode module?

        Indeed I can, but there is no reason why I should, since Unicode::Char is publicly available for you to download and install. See Installing Modules if you do not know how to go about that.


        🦛
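
For readers who cannot install Unicode::Char, the following is a minimal sketch of a core-only `to_bytes`. It assumes that `$u->u($in)` in the test above does nothing more than return the character for a hex code point, so `chr(hex($in))` can stand in for it. Encode has shipped with perl since 5.8, so no extra installation should be needed.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Assumed drop-in for to_bytes without Unicode::Char:
# strip the "0x" or "U+" prefix, build the character with chr(hex(...)),
# encode it as UTF-8, and return the bytes as upper-case hex.
sub to_bytes {
    (my $in = shift) =~ s/^(?:0x|U\+)//;
    return uc(unpack('H*', encode('UTF-8', chr(hex($in)))));
}

print to_bytes('U+03C0'), "\n";   # CF80
print to_bytes('0x03C0'), "\n";   # CF80
```

The two `is` checks from the original test should pass unchanged against this version.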

Re: How to convert between Unicode codepoint and UTF8 character code on Perl?
by ikegami (Patriarch) on Oct 25, 2021 at 15:40 UTC

    if the string "U+03C0" (or "0x03C0") is entered, then I want the string "0xCF80" to be printed (without quotes).

    use Encode qw( encode_utf8 );

    s{ ^ U\+ ( [0-9a-fA-F]+ ) \z }{
        "0x" . uc(unpack("H*", encode_utf8(chr(hex($1)))))
    }xe;

    1. chr(hex($1)) gets a character with the specified value.
    2. encode_utf8(...) encodes the code point.
    3. uc(unpack("H*", ...)) converts the byte sequence to hex.

    If the string "0xCF80" is entered, then I want the string "U+03C0" (or "0x03C0") to be printed (without quotes).

    use Encode qw( decode_utf8 );

    s{ ^ 0x ( (?: [0-9a-fA-F]{2} ){1,4} ) \z }{
        sprintf("U+%04X", ord(decode_utf8(pack("H*", $1))))
    }xe;

    1. pack("H*", $1) gets a string of bytes with the specified values.
    2. decode_utf8(...) decodes to a code point.
    3. sprintf("U+%04X", ord(...)) converts the character's code point to hex, zero-padded to at least four digits so that the result is "U+03C0" rather than "U+3C0".
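
One thing to watch in the bytes-to-code-point direction: by default `decode_utf8` substitutes U+FFFD for malformed input instead of failing. The sketch below wraps the same pack/decode/ord steps in a subroutine that rejects bad input. The name `utf8_hex_to_code_point` is hypothetical, and passing `Encode::FB_CROAK` to `decode` is one possible way to get strict behaviour, not the only one.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper (not part of Encode): "0xCF80" -> "U+03C0".
sub utf8_hex_to_code_point {
    my ($in) = @_;
    my ($hex) = $in =~ /\A 0x ( (?: [0-9a-fA-F]{2} ){1,4} ) \z/x
        or die "Not a hex byte sequence: $in\n";

    my $bytes = pack('H*', $hex);

    # FB_CROAK makes decode() die on malformed UTF-8 instead of
    # substituting U+FFFD.
    my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK);
    die "Expected exactly one character: $in\n" if length($chars) != 1;

    return sprintf('U+%04X', ord($chars));
}

print utf8_hex_to_code_point('0xCF80'), "\n";   # U+03C0

# A truncated sequence is rejected rather than decoded.
my $ok = eval { utf8_hex_to_code_point('0xCF'); 1 };
print "0xCF rejected\n" if !$ok;
```

The length check is there because up to four hex byte pairs can also spell several short characters (for example "0x4142"), which `ord` alone would silently truncate to the first one.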
Re: How to convert between Unicode codepoint and UTF8 character code on Perl?
by Anonymous Monk on Oct 24, 2021 at 06:16 UTC
    Use Encode to translate between wide characters and their byte representations. Once you have the bytes, there are many ways to obtain their hex representations, for example, unpack "H*", $bytes.
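
As a small illustration of "many ways", here is a sketch of three hex renderings of the same encoded bytes. The variable names are only illustrative, and which format is preferable depends on what will read the output.

```perl
use strict;
use warnings;
use Encode qw( encode );

# GREEK SMALL LETTER PI encoded as UTF-8 bytes.
my $bytes = encode('UTF-8', "\x{3C0}");

# One contiguous hex string.
my $compact = uc(unpack('H*', $bytes));

# One two-digit group per byte, separated by spaces.
my $spaced = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;

# Perl-style escapes, using ord on each one-byte character.
my $escaped = join '', map { sprintf '\\x%02X', ord } split //, $bytes;

print "$compact\n";   # CF80
print "$spaced\n";    # CF 80
print "$escaped\n";   # \xCF\x80
```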