in reply to convert several two digit hex characters to ascii
perl -pe"s/0x((?:[0-9a-fA-F]{2})+)/pack 'H*', $1/ge" file
The above results in code equivalent to
while (<>) { s/0x((?:[0-9a-fA-F]{2})+)/pack 'H*', $1/ge; print; }
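To see the substitution at work without a file, here is a minimal sketch that wraps it in a sub (the name hex_to_ascii and the sample string are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the one-liner:
# each 0x-prefixed run of hex digit pairs is replaced by the bytes it encodes.
sub hex_to_ascii {
    my ($line) = @_;
    $line =~ s/0x((?:[0-9a-fA-F]{2})+)/pack 'H*', $1/ge;
    return $line;
}

print hex_to_ascii("greeting: 0x48656c6c6f 0x776f726c64\n");
# prints "greeting: Hello world"
```

The /e modifier evaluates the replacement as Perl code, so pack runs once per match. The /g modifier repeats that for every 0x run on the line.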
Update: Removed redundant and/or incorrect unpack 'A*'
Re^2: convert several two digit hex characters to ascii
by gone2015 (Deacon) on Dec 02, 2008 at 16:46 UTC

Isn't the unpack 'A*' redundant? (But wouldn't 'a*' be better?) I can see the logic that what pack produces should really be unpacked before being used. Indeed, it occurred to me that unpack 'a*', ... might do something bright with UTF-8, which set me on a small quest to discover how to convert UTF-8 in hex characters to utf8 characters.

A quick test gives:
'\xC2\xAB aha \xC2\xBB', 9/9 not utf8
'\xC2\xAB aha \xC2\xBB', 7/9 utf8
'\xC2~ aha \x80\xC0', 9/9 not utf8
Malformed UTF-8 string in unpack at ...
showing that if the string being unpacked is utf8, the result is utf8 (or an error, if it is not valid utf8).

I found, however, that pack 'H*', ... returns a byte (not utf8) string, no matter what the input(s). This seems, on the whole, reasonable. I tried a number of things to get unpack('a*', pack('H*', $foo)) to return utf8, but to no avail:
unpack('a*', pack('H*', $s)) -> '\xC2\xAB aha \xC2\xBB', 9/9 not utf8
unpack('U0a*', pack('H*', $s)) -> '\xC3\x82\xC2\xAB aha \xC3\x82\xC2\xBB', 13/13 not utf8
unpack('C0a*', pack('H*', $s)) -> '\xC2\xAB aha \xC2\xBB', 9/9 not utf8
but note that unpack 'U0a*' is "upgrading" (as in utf8::upgrade()) the bytes to UTF-8.
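The "chars/bytes flag" figures are easy to reproduce. This is a sketch, not the script from the post: the helper name describe is hypothetical, and `use bytes` is used only to peek at the internal buffer length.

```perl
use strict;
use warnings;

# Hypothetical helper: "chars/bytes flag", the same shape as the output above.
sub describe {
    my ($s) = @_;
    my $chars = length $s;                    # length in characters
    my $bytes = do { use bytes; length $s };  # length of the internal buffer
    my $flag  = utf8::is_utf8($s) ? 'utf8' : 'not utf8';
    return "$chars/$bytes $flag";
}

my $packed = pack 'H*', 'C2AB2061686120C2BB';
print describe($packed), "\n";    # 9/9 not utf8

# utf8::upgrade changes only the internal representation: still 9 characters,
# but each of the four high bytes now occupies two bytes internally.
my $upgraded = $packed;
utf8::upgrade($upgraded);
print describe($upgraded), "\n";  # 9/13 utf8
```

The 13 here matches the 13 bytes that unpack 'U0a*' hands back above: it is the upgraded internal form seen as raw bytes.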
I found that the trick is to tell pack to return utf8, using 'U0H*' in place of 'H*', giving:
unpack('a*', pack('U0H*', C2AB2061686120C2BB)) -> '\xC2\xAB aha \xC2\xBB', 7/9 utf8
unpack('U0a*', pack('U0H*', C2AB2061686120C2BB)) -> '\xC2\xAB aha \xC2\xBB', 9/9 not utf8
unpack('C0a*', pack('U0H*', C2AB2061686120C2BB)) -> '\xC2\xAB aha \xC2\xBB', 7/9 utf8
noting that unpack 'U0a*' is treating its input as bytes.
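A minimal side-by-side of the two pack templates, assuming a Perl of 5.10 or later (the variable names are illustrative):

```perl
use strict;
use warnings;

my $hex = 'C2AB2061686120C2BB';

my $bytes = pack 'H*',   $hex;   # plain byte string
my $chars = pack 'U0H*', $hex;   # same bytes, taken as UTF-8 and flagged utf8

printf "H*:   %d chars, %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'utf8' : 'not utf8';
printf "U0H*: %d chars, %s\n",
    length($chars), utf8::is_utf8($chars) ? 'utf8' : 'not utf8';
```

With 'U0H*' the two-byte sequences C2 AB and C2 BB each count as one character (U+00AB and U+00BB), which is where the 7 in "7/9" comes from.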
The unpack is still optional, though invalid UTF-8 is treated differently if it's left out, as this output shows:
pack('U0H*', C2AB2041686120C2BB) -> '\xC2\xAB Aha \xC2\xBB', 7/9 utf8
unpack('a*', pack('U0H*', C2AB2041686120C2BB)) -> '\xC2\xAB Aha \xC2\xBB', 7/9 utf8
Malformed UTF-8 character (unexpected end of string) in length at ../hex-utf.pl line 23.
pack('U0H*', C27E204168612080C0) -> '\xC2~ Aha \x80\xC0', 7/9 utf8
Malformed UTF-8 string in unpack at ../hex-utf.pl line 48.
so pack is not checking for valid UTF-8, leaving it as a puzzle for others -- and in this case length() is throwing a warning. On the other hand, unpack is deeply unhappy about invalid UTF-8, and throws an error.
None of this was entirely obvious to me. Hopefully somebody can benefit from my little quest. Returning to the topic of the OP, if I wanted to decode the hex as UTF-8, I think what I would do is:
by ikegami (Patriarch) on Dec 02, 2008 at 19:50 UTC

Depending on what you want, the following tools are probably more appropriate:

Both are documented in utf8, but it's not necessary to say use utf8; in order to call them. In fact, that pragma means something different (it tells Perl that the source code itself is written in UTF-8).
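As a hedged illustration, utf8::decode is one function documented in utf8 that fits the job of turning hex-encoded UTF-8 into characters. Whether it is one of the tools meant above is an assumption, and the sub name decode_hex_utf8 is made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a string of hex digit pairs into the characters
# that its bytes encode in UTF-8, or return undef if they are not valid UTF-8.
sub decode_hex_utf8 {
    my ($hex) = @_;
    my $text = pack 'H*', $hex;              # raw bytes, not flagged utf8
    return utf8::decode($text) ? $text : undef;   # decodes in place
}

my $ok = decode_hex_utf8('C2AB2061686120C2BB');
printf "%d characters\n", length $ok;        # 7 characters

my $bad = decode_hex_utf8('C27E204168612080C0');
print "second string is not valid UTF-8\n" unless defined $bad;
```

Note that no use utf8; appears anywhere: the utf8:: functions are always available. Unlike the pack 'U0H*' route, utf8::decode reports malformed input through its return value instead of leaving a broken string behind.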