Isn't the unpack 'A*' redundant? (And wouldn't 'a*' be better, since 'A*' strips trailing whitespace and nulls?)

I can see the logic that what pack produces should really be unpacked before being used. Indeed, it occurred to me that unpack 'a*',... might do something clever with UTF-8, which set me on a small quest to discover how to convert UTF-8 octets written as hex digits into a Perl utf8 (character) string.


The following:

  use strict ;
  use warnings ;
  use Encode qw(_utf8_on) ;

  for my $r ("\xC2\xAB \x61\x68\x61 \xC2\xBB", "\xC2\x7E \x61\x68\x61 \x80\xC0") {
    for my $utf (0..1) {
      _utf8_on($r) if $utf ;
      printf "'%s', %d/%d %s\n", raw(unpack('a*', $r)) ;
    } ;
  } ;

  # Returns: printable form of the internal bytes, length in characters,
  # length in bytes, and whether the utf8 flag is set.
  sub raw {
    my ($s) = @_ ;
    my ($b, $q) ;
    { use bytes ;
      $b = length($s) ;
      $q = join '', map { ($_ >= 0x20) && ($_ <= 0x7E) ? chr($_) : sprintf('\\x%02X', $_) }
                    unpack('C*', $s) ;
    } ;
    return ($q, length($s), $b, utf8::is_utf8($s) ? 'utf8' : 'not utf8') ;
  } ;

gives:
  '\xC2\xAB aha \xC2\xBB', 9/9 not utf8
  '\xC2\xAB aha \xC2\xBB', 7/9 utf8
  '\xC2~ aha \x80\xC0', 9/9 not utf8
  Malformed UTF-8 string in unpack at ...
showing that if the string being unpacked is utf8, the result is utf8 (or an error, if the string is not valid UTF-8).

I found, however, that pack 'H*',... returns a byte (not utf8) string, no matter what the input(s). This seems, on the whole, reasonable.
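
A minimal sketch of that check, assuming a hex string whose utf8 flag has been forced on with utf8::upgrade() (the variable names are illustrative only). It reports the flag on the input and on what pack 'H*' returns:

```perl
use strict ;
use warnings ;

my $hex = "C2AB2061686120C2BB" ;
utf8::upgrade($hex) ;                       # make the input a utf8-flagged string
my $packed = pack('H*', $hex) ;             # one byte per pair of hex digits

printf "input %s, output %s, %d characters\n",
  utf8::is_utf8($hex)    ? 'utf8' : 'not utf8',
  utf8::is_utf8($packed) ? 'utf8' : 'not utf8',
  length($packed) ;
```

If the result really is a byte string, the character count equals the byte count, as in the 9/9 figures shown elsewhere in this post.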

I tried a number of things to get unpack('a*', pack('H*', $foo)) to return utf8:

  my $s = "C2AB2061686120C2BB" ;
  _utf8_on($s) ;
  for my $unp ('a*', 'U0a*', 'C0a*') {
    my ($q, $l, $b, $u) = raw(unpack($unp, pack('H*', $s))) ;   # $l is characters, $b is bytes
    print "unpack('$unp', pack('H*', \$s)) -> '$q', $l/$b $u\n" ;
  } ;

but to no avail:
  unpack('a*', pack('H*', $s)) -> '\xC2\xAB aha \xC2\xBB', 9/9 not utf8
  unpack('U0a*', pack('H*', $s)) -> '\xC3\x82\xC2\xAB aha \xC3\x82\xC2\xBB', 13/13 not utf8
  unpack('C0a*', pack('H*', $s)) -> '\xC2\xAB aha \xC2\xBB', 9/9 not utf8
but note that unpack 'U0a*' is "upgrading" (as in utf8::upgrade()) the bytes to UTF-8.
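
The same doubling can be reproduced without unpack. This is only a sketch with made-up variable names: utf8::upgrade() keeps the two characters U+00C2 and U+00AB but stores them internally as UTF-8, and utf8::encode() then exposes those octets as a plain byte string.

```perl
use strict ;
use warnings ;

my $bytes = "\xC2\xAB" ;        # the UTF-8 octets for U+00AB, held as two bytes
my $chars = $bytes ;
utf8::upgrade($chars) ;         # still two characters (U+00C2, U+00AB), now flagged utf8
my $octets = $chars ;
utf8::encode($octets) ;         # byte string holding the UTF-8 encoding of those characters
my $hex = uc unpack('H*', $octets) ;
print "$hex\n" ;                # C382C2AB
```

Compare the \xC3\x82\xC2\xAB at the start of the 'U0a*' result above.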

I found that the trick is to tell pack to return utf8, thus:

  my $s = "C2AB2061686120C2BB" ;
  for my $unp ('a*', 'U0a*', 'C0a*') {
    printf "unpack('$unp', pack('U0H*', $s)) -> '%s', %d/%d %s\n",
           raw(unpack($unp, pack('U0H*', $s))) ;
  } ;

giving:
  unpack('a*', pack('U0H*', C2AB2061686120C2BB)) -> '\xC2\xAB aha \xC2\xBB', 7/9 utf8
  unpack('U0a*', pack('U0H*', C2AB2061686120C2BB)) -> '\xC2\xAB aha \xC2\xBB', 9/9 not utf8
  unpack('C0a*', pack('U0H*', C2AB2061686120C2BB)) -> '\xC2\xAB aha \xC2\xBB', 7/9 utf8
noting that unpack 'U0a*' is treating its input as bytes.

The unpack is still optional, though invalid UTF-8 is treated differently if it's left out, thus:

  for my $s ("C2AB2041686120C2BB", "C27E204168612080C0") {
    printf "pack('U0H*', $s) -> '%s', %d/%d %s\n",
           raw(pack('U0H*', $s)) ;
    printf "unpack('a*', pack('U0H*', $s)) -> '%s', %d/%d %s\n",
           raw(unpack('a*', pack('U0H*', $s))) ;
  } ;

gives:
  pack('U0H*', C2AB2041686120C2BB) -> '\xC2\xAB Aha \xC2\xBB', 7/9 utf8
  unpack('a*', pack('U0H*', C2AB2041686120C2BB)) -> '\xC2\xAB Aha \xC2\xBB', 7/9 utf8
  Malformed UTF-8 character (unexpected end of string) in length at ../hex-utf.pl line 23.
  pack('U0H*', C27E204168612080C0) -> '\xC2~ Aha \x80\xC0', 7/9 utf8
  Malformed UTF-8 string in unpack at ../hex-utf.pl line 48.
so pack is not checking for valid UTF-8, leaving it as a puzzle for others -- and in this case length() is throwing a warning. On the other hand, unpack is deeply unhappy about invalid UTF-8, and throws an error.

None of this was entirely obvious to me. Hopefully somebody can benefit from my little quest.


Returning to the topic of the OP, if I wanted to decode the hex as UTF-8, I think what I would do is:

  sub dehex {
    my ($s) = @_ ;
    $s =~ s/0[xX]((?:[0-9A-Fa-f]{2})+)/pack('U0H*', $1)/eg ;
    return $s if utf8::valid($s) ;
    # ... worry ... (the result is not valid UTF-8; return undef ??)
    return undef ;
  } ;
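
For comparison, here is a sketch of a different technique, not the pack 'U0H*' approach above: pack the hex into plain bytes with 'H*' and let Encode's decode() do the validation. The name dehex_strict is hypothetical, and returning undef on failure is just one possible policy.

```perl
use strict ;
use warnings ;
use Encode qw(decode FB_CROAK) ;

# dehex_strict is a hypothetical name: like dehex above, but it decodes
# each hex run with Encode and returns undef if any run is not valid UTF-8.
sub dehex_strict {
  my ($s) = @_ ;
  my $ok = eval {
    $s =~ s{0[xX]((?:[0-9A-Fa-f]{2})+)}{
      my $octets = pack('H*', $1) ;          # plain byte string
      decode('UTF-8', $octets, FB_CROAK) ;   # dies on malformed UTF-8
    }eg ;
    1 ;
  } ;
  return $ok ? $s : undef ;
}

for my $in ("0xC2AB aha 0xC2BB", "0xC27E aha 0x80C0") {
  my $out = dehex_strict($in) ;
  printf "%s -> %s\n", $in,
         defined $out ? 'decoded, ' . length($out) . ' characters' : 'undef' ;
}
```

The eval catches the die from FB_CROAK, so a malformed run never leaves a half-converted string in the caller's hands.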


In reply to Re^2: convert several two digit hex characters to ascii by gone2015
in thread convert several two digit hex characters to ascii by Anonymous Monk
