in reply to Re: unreadable hash keys
in thread unreadable hash keys

Thanks for all the replies.

The following is the code I used to build and print the hash table.

my $_token = {};

sub gen_tokens{
    open (CID,'name.txt') or die "Cannot open name file\n";
    while(<CID>){
        my @words = split(/\s+/,$_);
        foreach my $key (@words){
            $_token->{$key}++;
        }
    }
}

sub print_all_tokens{
    my $tokens = shift;
    local *FH = shift;
    while ( my ($key, $value) = each(%$tokens) ) {
        print FH "$key => $value\n";
    }
}

I used 'hexdump' to look at my output file. I found a '00' byte in front of each letter.

It looks like this:

00000000 00 4c 00 69 00 74 00 68 00 69 00 6f 00 6e 00 61 |.L.i.t.h.i.o.n.a|
00000010 00 74 00 65 00 20 3d 3e 20 32 0a 00 44 00 45 00 |.t.e. => 2..D.E.|
00000020 53 00 4f 00 58 00 59 00 20 3d 3e 20 34 0a 00 56 |S.O.X.Y. => 4..V|
00000030 00 61 00 70 00 6f 00 6e 00 61 00 20 3d 3e 20 33 |.a.p.o.n.a. => 3|
00000040 0a 00 64 00 69 00 6d 00 65 00 74 00 68 00 6f 00 |..d.i.m.e.t.h.o.|
00000050 78 00 79 00 62 00 65 00 6e 00 7a 00 65 00 6e 00 |x.y.b.e.n.z.e.n.|
00000060 65 00 6d 00 65 00 74 00 68 00 61 00 6e 00 65 00 |e.m.e.t.h.a.n.e.|
00000070 73 00 75 00 6c 00 66 00 6f 00 6e 00 69 00 63 00 |s.u.l.f.o.n.i.c.|
00000080 20 3d 3e 20 32 0a 00 49 00 50 00 4d 00 20 3d 3e | => 2..I.P.M. =>|
00000090 20 31 32 0a 00 62 00 65 00 6e 00 7a 00 6f 00 62 | 12..b.e.n.z.o.b|

Re^3: unreadable hash keys
by graff (Chancellor) on Jun 06, 2008 at 22:18 UTC
    Please learn to put <code> and </code> around your code snippets when you post code; this is described in Writeup Formatting Tips, which is easy to find (a link is provided on every post-entry form at PM).

    Have you tried using your hex dump tool on your input data file ("name.txt")? If so, do you find that this file actually contains UTF-16BE data? In that case, you should open the file like this:

    open (CID,'<:encoding(UTF-16BE)', 'name.txt') or die "name.txt: $!";
    If you don't do that, split /\s+/ will not work as intended. In the raw bytes, every whitespace character stands alone, preceded by a null byte (its own high byte) and followed by the high byte of the next UTF-16 character. As a result, any two adjacent UTF-16 whitespace characters produce an extra element consisting of a single null byte in the list returned by split, and every other element keeps its null bytes, which is why the keys in your dump are interleaved with 00 bytes.
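    To see the effect in isolation, here is a small sketch that needs no input file. The sample string is made up for illustration (two words from the dump, separated by two spaces); it is encoded to UTF-16BE with the core Encode module and then split both ways.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample: two words separated by two spaces, as UTF-16BE bytes.
my $bytes = encode('UTF-16BE', "DESOXY  IPM");

# Splitting the raw bytes: only the 0x20 bytes match \s, so the null
# high bytes stay attached to the words, and the null byte between the
# two spaces becomes an element of its own.
my @raw = split /\s+/, $bytes;

# Decoding first, then splitting, operates on characters.
my @decoded = split /\s+/, decode('UTF-16BE', $bytes);

printf "raw tokens: %d, decoded tokens: %d\n", scalar @raw, scalar @decoded;
printf "length of first raw token: %d bytes\n", length $raw[0];
```

    The raw split gives one more token than the decoded split, and the first raw token is longer than the six letters of the word because of the embedded null bytes.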

    By opening the file with the correct encoding layer, as suggested in the line of code given above, perl will read the character data correctly, and character-based operations will work as expected.
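    Putting that together, a minimal sketch of the two subs might look like the following. It assumes the input really is UTF-16BE without a byte-order mark (if there is a BOM, the 'UTF-16' encoding name would be the one to try), and it departs from the posted code in a few ways that are my choices, not requirements: lexical filehandles, a returned hash ref instead of a file-scoped variable, sorted keys for stable output, and a UTF-8 layer on the output handle.

```perl
use strict;
use warnings;

# Count whitespace-separated tokens in a UTF-16BE text file.
sub gen_tokens {
    my ($path) = @_;
    my %count;
    open(my $in, '<:encoding(UTF-16BE)', $path) or die "$path: $!";
    while (my $line = <$in>) {
        # split ' ' also skips leading whitespace, so no empty first token
        $count{$_}++ for split ' ', $line;
    }
    close $in;
    return \%count;
}

# Print "key => value" lines to an already-open filehandle.
sub print_all_tokens {
    my ($tokens, $fh) = @_;
    for my $key (sort keys %$tokens) {
        print {$fh} "$key => $tokens->{$key}\n";
    }
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    binmode(STDOUT, ':encoding(UTF-8)');
    print_all_tokens(gen_tokens($ARGV[0]), \*STDOUT);
}
```

    Saved under a name of your choosing, it would be run with name.txt as its only argument; a hex dump of the output should then show one byte per ASCII letter.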