Isolder has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file encoded in UTF16-LE, generated from another script. Reading this text file in, I convert the characters to their hexadecimal value. For some reason spaces and newlines are always showing up as 00 when I convert them to hexadecimal. Why is this not working? Thanks for any help!


#!/usr/local/ActivePerl-5.14/bin/perl -w
use feature 'unicode_strings';
use utf8;
use warnings;
use strict;

my ($file) = @ARGV;
open DATAFILE, "<:encoding(UTF16-LE)", "$file";
while (<DATAFILE>) {
    my $string = $_;
    my @list = unpack( 'A' x length($string), $string );
    print $string;
    foreach (@list) {
        my $char = $_;
        my $ordi = ord($char);
        binmode STDOUT, ":utf8";
        print "Character:\t" . $char . "\t" . sprintf( '%2.2x', unpack( 'U0U*', $char ) ) . "\n";
    }
}

Hex dump of the first few lines of the text file. In the dump, each space appears as "20 00":
FF FE 42 00 52 00 49 00 47 00 41 00 4E 00 44 00 0A 00 44 00 61 00 6D 00 6E 00 2C 00 20 00 74 00 68 00 69 00 73 00 20 00 77 00 65 00 61 00 74 00 68 00 65 00 72 00 20 00 63 00 75 00 74 00 73 00 20 00 72 00 69 00 67 00 68 00 74 00 20 00 74 00 6F 00 20 00 74 00 68 00 65 00 20 00 62 00 6F 00 6E 00 65 00 2E 00 0A 00 57 00 68 00 61 00 74 00 20 00 74 00 68 00 65 00 2E 00 2E 00 2E 00 20 00 3F 00

Re: Why aren't spaces from a Unicode file converting to Hexadecimal value 20?
by ikegami (Patriarch) on Jun 29, 2011 at 23:44 UTC

    First, you're not getting 00, you're getting

    Missing argument in sprintf at a.pl line 19, <DATAFILE> line 1.
    Character: 00

    The root of the problem is

    my @list = unpack( 'A' x length($string), $string );

    "A" trims trailing whitespace, and that includes newlines. You want "a".

    my @list = unpack( 'a' x length($string), $string );

    or better yet

    my @list = unpack( '(a)*', $string );
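
    To make the difference between the two templates concrete, here is a small sketch. The helpers split_chars and hex_codes and the sample string are illustrative assumptions, not part of the original post. It relies on the documented behaviour that "A" strips trailing whitespace and nulls when unpacking, while "a" returns the data unchanged, and on ord('') being 0.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): split a string into
# one-character fields using a single-character template, 'A' or 'a'.
sub split_chars {
    my ($template, $string) = @_;
    return unpack("($template)*", $string);
}

# Format each field's code point as two hex digits.
# An empty field yields 00 because ord('') is 0.
sub hex_codes {
    return join ' ', map { sprintf '%02x', ord($_) } @_;
}

my $sample = "a b\n";

# 'A' turns whitespace-only fields into empty strings.
print 'A: ', hex_codes(split_chars('A', $sample)), "\n";   # A: 61 00 62 00

# 'a' keeps the space (20) and the newline (0a).
print 'a: ', hex_codes(split_chars('a', $sample)), "\n";   # a: 61 20 62 0a
```

    With "A", the space and the newline come back as empty strings, which is why they end up as 00 instead of 20 and 0a.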

    So,

    while (<DATAFILE>) {
        for my $ch (unpack('(a)*', $_)) {
            printf "Character:\t%s\t%2.2x\n", $ch, ord($ch);
        }
    }

    or

    while (<DATAFILE>) {
        for my $ord (unpack('C*', $_)) {
            printf "Character:\t%s\t%2.2x\n", chr($ord), $ord;
        }
    }

    PS - The name of the encoding is UTF-16LE, not UTF16-LE.
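
    Putting the pieces together, here is a minimal self-contained sketch that uses the corrected layer name. Several things in it are assumptions for illustration and not part of the original post: the helper name code_points_per_line, the in-memory filehandle (used so the example needs no file on disk), and the sample bytes, which are copied from the start of the hex dump in the question with the byte-order mark written as FF FE. It uses the "W" format, which perlfunc documents as accepting values greater than 255, and it strips a leading U+FEFF itself, because the explicit LE/BE encodings treat a BOM as an ordinary character.

```perl
use strict;
use warnings;

# Sample input: BOM, "BRIGAND\n", then "Damn, " encoded as UTF-16LE.
my $hex = 'FFFE 4200 5200 4900 4700 4100 4E00 4400 0A00 '
        . '4400 6100 6D00 6E00 2C00 2000';
(my $compact = $hex) =~ s/\s+//g;
my $bytes = pack 'H*', $compact;

# Hypothetical helper: decode UTF-16LE bytes line by line and return,
# for each line, an array ref of two-digit (or longer) hex code points.
sub code_points_per_line {
    my ($raw) = @_;

    # Read from an in-memory "file"; a real script would pass a file name.
    open my $fh, '<:encoding(UTF-16LE)', \$raw
        or die "Cannot open in-memory file: $!";

    my @lines;
    while (my $line = <$fh>) {
        $line =~ s/\A\x{FEFF}//;    # drop a leading byte-order mark, if any
        push @lines, [ map { sprintf '%02x', $_ } unpack 'W*', $line ];
    }
    close $fh;
    return @lines;
}

print join(' ', @$_), "\n" for code_points_per_line($bytes);
```

    For the sample bytes, the space at the end of the second line is reported as 20 and the newline at the end of the first line as 0a.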