in reply to unreadable hash keys

I would agree with GrandFather's opinion: most likely your hash keys are being saved as UTF-16BE (big-endian), though I have no idea why that is happening. I would expect that this affects the hash values as well (and the line-termination, whether it's LF or CRLF).

You don't see the null bytes when using "more", because your terminal display ignores them.

If you show some code that demonstrates how this file was written, we could show you ways to avoid the problem. In the meantime, you might try passing that data file through a one-liner, like this:

perl -pe 'BEGIN{binmode STDIN,":encoding(UTF-16BE)"; binmode STDOUT,":utf8"}' < file.in > file.out
But first you should follow the other advice given above: inspect the file more carefully with a hexdump tool, to confirm whether or not UTF-16 is being used regularly for every character in the file (not just the hash keys). If my guess is wrong, and only the hash keys are 16-bit characters, the script above would screw things up badly, and a slightly more complicated script would be needed to fix the file.
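
As a rough complement to a hexdump tool, something like the following sketch could summarize where the null bytes fall. The helper name null_byte_profile and the script name are made up for illustration. It assumes the text is mostly ASCII-range, so that genuine UTF-16BE data would show a null at nearly every even offset:

```perl
use strict;
use warnings;

# Count null bytes at even and odd offsets in a byte string.
# In ASCII-range text stored as UTF-16BE, every even-offset byte is 0x00.
sub null_byte_profile {
    my ($bytes) = @_;
    my %count = (even => 0, odd => 0, total => length $bytes);
    for my $i (0 .. length($bytes) - 1) {
        next unless substr($bytes, $i, 1) eq "\0";
        $count{ $i % 2 ? 'odd' : 'even' }++;
    }
    return \%count;
}

# Usage: perl nullcheck.pl file.in   (the script name is hypothetical)
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "$ARGV[0]: $!";
    my $bytes = do { local $/; <$fh> };   # slurp the whole file as raw bytes
    my $p = null_byte_profile($bytes);
    print "total=$p->{total} even_nulls=$p->{even} odd_nulls=$p->{odd}\n";
}
```

If the even-offset nulls come to about half of the total byte count, the file is probably UTF-16BE throughout. A clearly smaller share, or nulls at both even and odd offsets, suggests mixed content, in which case the one-liner above should not be used as-is.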

Re^2: unreadable hash keys
by nujgnahz (Initiate) on Jun 06, 2008 at 17:05 UTC
    Thanks for all the replies.

    The following is the code I used to store the hash table.

    my $_token = {};

    sub gen_tokens{
        open (CID,'name.txt') or die "Cannot open name file\n";
        while(<CID>){
            my @words = split(/\s+/,$_);
            foreach my $key (@words){
                $_token->{$key}++;
            }
        }
    }

    sub print_all_tokens{
        my $tokens = shift;
        local *FH = shift;
        while ( my ($key, $value) = each(%$tokens) ) {
            print FH "$key => $value\n";
        }
    }

    I used 'hexdump' to look at my file. I found that there is a '00' byte ahead of each letter.

    It looks like this:

    00000000 00 4c 00 69 00 74 00 68 00 69 00 6f 00 6e 00 61 |.L.i.t.h.i.o.n.a|
    00000010 00 74 00 65 00 20 3d 3e 20 32 0a 00 44 00 45 00 |.t.e. => 2..D.E.|
    00000020 53 00 4f 00 58 00 59 00 20 3d 3e 20 34 0a 00 56 |S.O.X.Y. => 4..V|
    00000030 00 61 00 70 00 6f 00 6e 00 61 00 20 3d 3e 20 33 |.a.p.o.n.a. => 3|
    00000040 0a 00 64 00 69 00 6d 00 65 00 74 00 68 00 6f 00 |..d.i.m.e.t.h.o.|
    00000050 78 00 79 00 62 00 65 00 6e 00 7a 00 65 00 6e 00 |x.y.b.e.n.z.e.n.|
    00000060 65 00 6d 00 65 00 74 00 68 00 61 00 6e 00 65 00 |e.m.e.t.h.a.n.e.|
    00000070 73 00 75 00 6c 00 66 00 6f 00 6e 00 69 00 63 00 |s.u.l.f.o.n.i.c.|
    00000080 20 3d 3e 20 32 0a 00 49 00 50 00 4d 00 20 3d 3e | => 2..I.P.M. =>|
    00000090 20 31 32 0a 00 62 00 65 00 6e 00 7a 00 6f 00 62 | 12..b.e.n.z.o.b|

      Please learn to put <code> and </code> around your code snippets when you post code; this is described in Writeup Formatting Tips, which is easy to find (a link is provided on every post-entry form at PM).

      Have you tried using your hex dump tool on your input data file ("name.txt")? If so, do you find that this file actually contains UTF-16BE data? In that case, you should open the file like this:

      open (CID,'<:encoding(UTF-16BE)', 'name.txt') or die "name.txt: $!";
      If you don't do that, your use of split /\s+/ will not work as intended. Each UTF-16 whitespace character is read as two separate bytes: a null byte followed by the whitespace byte. Only the second byte matches \s, so the null byte stays attached to the end of the preceding field, and the high byte of the next UTF-16 character is left at the start of the following field. Any occurrence of two adjacent UTF-16 whitespace characters will also create an extra element in the list returned by split, consisting of a single null byte.

      By opening the file with the correct encoding layer, as suggested in the line of code given above, perl will read the character data correctly, and character-based operations will work as expected.
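
      The difference can be reproduced without any input file. The following is only a sketch: the sample string is made up, and Encode's encode() and decode() stand in for what the :encoding(UTF-16BE) layer would do when reading name.txt.

      ```perl
      use strict;
      use warnings;
      use Encode qw(encode decode);

      # Hypothetical sample: two words separated by two spaces, as UTF-16BE bytes.
      my $raw = encode('UTF-16BE', "IPM  DESOXY");

      # Splitting the undecoded bytes: only the 0x20 bytes match \s, so the
      # null bytes around them are left in the neighbouring fields.
      my @raw_fields = split /\s+/, $raw;

      # Decoding first yields real characters, so split sees ordinary whitespace.
      my @decoded_fields = split /\s+/, decode('UTF-16BE', $raw);

      printf "raw:     %d fields\n", scalar @raw_fields;
      printf "decoded: %d fields\n", scalar @decoded_fields;
      ```

      On the raw bytes, split returns three fields, the middle one being a lone null byte, and the first field ends with a stray null. On the decoded string it returns the two words.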