Why aren't spaces from a Unicode file converting to Hexadecimal value 20?

Isolder has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file encoded in UTF16-LE, generated from another script. Reading this text file in, I convert the characters to their hexadecimal value. For some reason spaces and newlines are always showing up as 00 when I convert them to hexadecimal. Why is this not working? Thanks for any help!

#!/usr/local/ActivePerl-5.14/bin/perl -w
use feature 'unicode_strings';
use utf8;
use warnings;
use strict;

my ($file) = @ARGV;
open DATAFILE, "<:encoding(UTF16-LE)", "$file";

while (<DATAFILE>) {
    my $string = $_;

    my @list = unpack( 'A' x length($string), $string );

    print $string;
    foreach (@list) {
        my $char = $_;
        my $ordi = ord($char);
        binmode STDOUT, ":utf8";
        print "Character:\t" 
          . $char . "\t"
          . sprintf( '%2.2x', unpack( 'U0U*', $char ) ) . "\n";

    }

}

Hexadecimal for first few lines of text file. In the hex the spaces are "20 00":

FF FE 42 00 52 00 49 00 47 00 41 00 4E 00 44 00
0A 00 44 00 61 00 6D 00 6E 00 2C 00 20 00 74 00
68 00 69 00 73 00 20 00 77 00 65 00 61 00 74 00
68 00 65 00 72 00 20 00 63 00 75 00 74 00 73 00
20 00 72 00 69 00 67 00 68 00 74 00 20 00 74 00
6F 00 20 00 74 00 68 00 65 00 20 00 62 00 6F 00
6E 00 65 00 2E 00 0A 00 57 00 68 00 61 00 74 00
20 00 74 00 68 00 65 00 2E 00 2E 00 2E 00 20 00
3F 00


Replies are listed 'Best First'.

Re: Why aren't spaces from a Unicode file converting to Hexadecimal value 20?
by ikegami (Patriarch) on Jun 29, 2011 at 23:44 UTC

First, you're not getting 00, you're getting

Missing argument in sprintf at a.pl line 19, <DATAFILE> line 1.
Character:              00
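
A minimal sketch of how that warning and the 00 can arise, assuming the character handed to sprintf is an empty string. The helper name hex_of is hypothetical and only wraps the expression from the original script so that warnings can be counted:

```perl
use strict;
use warnings;

# Hypothetical helper: formats one "character" the way the original script
# does and counts any warnings raised while doing so.
sub hex_of {
    my ($char) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    # unpack returns an empty list for an empty string, which leaves
    # sprintf with no argument for its %2.2x conversion.
    my $hex = sprintf( '%2.2x', unpack( 'U0U*', $char ) );
    return ( $hex, scalar @warnings );
}

my ( $hex_empty, $warned_empty ) = hex_of('');
my ( $hex_b,     $warned_b )     = hex_of('B');

print "empty string: $hex_empty (warnings: $warned_empty)\n";
print "letter B:     $hex_b (warnings: $warned_b)\n";
```

With an empty string there is nothing to format, so the conversion falls back to zero and a warning is raised; with a real character the code point comes through.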

The root of the problem is

my @list = unpack( 'A' x length($string), $string );

"A" trims trailing whitespace, and that includes newlines. You want "a".

my @list = unpack( 'a' x length($string), $string );

or better yet

my @list = unpack( '(a)*', $string );
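
To see the difference between the two templates side by side, here is a small sketch on a made-up sample line. The helper hex_codes and the sample string are illustrative and not part of the original script:

```perl
use strict;
use warnings;

# Illustrative helper: space-separated hex code points of the given strings.
# ord('') is 0, so an empty field shows up as 00.
sub hex_codes {
    return join ' ', map { sprintf( '%02x', ord($_) ) } @_;
}

my $line = "a b\n";

# 'A' strips trailing whitespace from each extracted field, so a field that
# holds only a space or a newline comes back as the empty string.
my @stripped = unpack( 'A' x length($line), $line );

# 'a' returns each field untouched; '(a)*' repeats the one-character group
# for as long as there is input.
my @verbatim = unpack( '(a)*', $line );

print 'A: ', hex_codes(@stripped), "\n";    # A: 61 00 62 00
print 'a: ', hex_codes(@verbatim), "\n";    # a: 61 20 62 0a
```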

So,

while (<DATAFILE>) {
    for my $ch (unpack('(a)*', $_)) {
       printf "Character:\t%s\t%2.2x\n", $ch, ord($ch);
    }
}

or

while (<DATAFILE>) {
    for my $ord (unpack('C*', $_)) {
       printf "Character:\t%s\t%2.2x\n", chr($ord), $ord;
    }
}

PS - The name of the encoding is UTF-16LE, not UTF16-LE.
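
As a rough check of what decoding under that name should deliver, the sketch below decodes a few bytes taken from the hex dump in the question (the "to the" run) with the core Encode module instead of a file handle. The variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes for the text "to the" as they appear in the hex dump:
# two bytes per character, low byte first.
my $bytes = pack 'H*', '74006f002000740068006500';

# Decode using the canonical encoding name.
my $text = decode( 'UTF-16LE', $bytes );

# After decoding, each element is one character, so the space is 0x20
# rather than the byte pair 20 00.
my @codes = map { sprintf( '%02x', ord($_) ) } split //, $text;
print "@codes\n";    # 74 6f 20 74 68 65
```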