Re: unreadable hash keys
by GrandFather (Saint) on Jun 05, 2008 at 22:11 UTC
^@ is almost certainly a null byte, and very likely your file has been written as a Unicode file without a BOM.
What code did you use to write the file?
Perl is environmentally friendly - it saves trees
Re: unreadable hash keys
by FunkyMonk (Bishop) on Jun 05, 2008 at 21:49 UTC
What code did you use to store your hash table? We can't fix code we can't see!
Re: unreadable hash keys
by moritz (Cardinal) on Jun 05, 2008 at 21:58 UTC
vi shows even non-printable characters. I suspect it's a null byte or something along these lines. Try
hexdump -C $file
to find out what it really is.
Re: unreadable hash keys
by mhearse (Chaplain) on Jun 05, 2008 at 22:25 UTC
I'm guessing you used DB_File or Storable. When writing hashes or data structures to disk, these modules write the file in a database format. This is done for efficiency. Such files can't easily be read with an editor; just read them back using the appropriate methods in the module. In the case of Storable, retrieve().
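For what it's worth, a minimal round-trip with Storable might look like this (the filename and sample data are made up for illustration, not taken from the OP's script):

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# A small hash to persist (illustrative data only).
my %tokens = ( Lithionate => 2, DESOXY => 4, Vapona => 3 );

# store() writes the hash in Storable's binary format -- this is why
# such a file looks unreadable in a text editor.
store \%tokens, 'tokens.sto';

# retrieve() reads it back as a hash reference.
my $restored = retrieve('tokens.sto');
print "$_ => $restored->{$_}\n" for sort keys %$restored;

unlink 'tokens.sto';    # clean up the demo file
```

Note that, as it turned out below, the OP wasn't using Storable at all; this is just what the round-trip would look like if he were.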
Re: unreadable hash keys
by graff (Chancellor) on Jun 06, 2008 at 00:31 UTC
I would agree with GrandFather's opinion: most likely your hash keys are being saved as UTF-16BE (big-endian), though I have no idea why that is happening. I would expect that this affects the hash values as well (and the line-termination, whether it's LF or CRLF).
You don't see the null bytes when using "more", because your terminal display ignores them.
If you show some code that demonstrates how this file was written, we could show you ways to avoid the problem. In the meantime, you might try passing that data file through a one-liner, like this:
perl -pe 'BEGIN{ binmode STDIN, ":encoding(UTF-16BE)"; binmode STDOUT, ":utf8" }' < file.in > file.out
But first you should follow the other advice given above: inspect the file more carefully with a hexdump tool, to confirm whether or not UTF-16 is being used regularly for every character in the file (not just the hash keys). If my guess is wrong, and only the hash keys are 16-bit characters, the script above would screw things up badly, and a slightly more complicated script would be needed to fix the file.
Thanks for all replies.
Following is the code I used to store the hash table.
my $_token = {};

sub gen_tokens {
    open(CID, 'name.txt') or die "Cannot open name file\n";
    while (<CID>) {
        my @words = split(/\s+/, $_);
        foreach my $key (@words) {
            $_token->{$key}++;
        }
    }
}

sub print_all_tokens {
    my $tokens = shift;
    local *FH = shift;
    while (my ($key, $value) = each %$tokens) {
        print FH "$key => $value\n";
    }
}
I used 'hexdump' to look at my file. I found there is a null byte (hex 00) before each letter. It looks like this:
00000000 00 4c 00 69 00 74 00 68 00 69 00 6f 00 6e 00 61 |.L.i.t.h.i.o.n.a|
00000010 00 74 00 65 00 20 3d 3e 20 32 0a 00 44 00 45 00 |.t.e. => 2..D.E.|
00000020 53 00 4f 00 58 00 59 00 20 3d 3e 20 34 0a 00 56 |S.O.X.Y. => 4..V|
00000030 00 61 00 70 00 6f 00 6e 00 61 00 20 3d 3e 20 33 |.a.p.o.n.a. => 3|
00000040 0a 00 64 00 69 00 6d 00 65 00 74 00 68 00 6f 00 |..d.i.m.e.t.h.o.|
00000050 78 00 79 00 62 00 65 00 6e 00 7a 00 65 00 6e 00 |x.y.b.e.n.z.e.n.|
00000060 65 00 6d 00 65 00 74 00 68 00 61 00 6e 00 65 00 |e.m.e.t.h.a.n.e.|
00000070 73 00 75 00 6c 00 66 00 6f 00 6e 00 69 00 63 00 |s.u.l.f.o.n.i.c.|
00000080 20 3d 3e 20 32 0a 00 49 00 50 00 4d 00 20 3d 3e | => 2..I.P.M. =>|
00000090 20 31 32 0a 00 62 00 65 00 6e 00 7a 00 6f 00 62 | 12..b.e.n.z.o.b|
open (CID,'<:encoding(UTF-16BE)', 'name.txt') or die "name.txt: $!";
If you don't do that, your use of split /\s+/ will not work as expected/intended, because every "whitespace" character will be treated as occurring in isolation: each is preceded by a null byte (and followed by whatever the high byte happens to be for the next UTF-16 character), so any occurrence of two adjacent UTF-16 whitespace characters will create an extra null-byte element in the list returned by split.
By opening the file with the correct encoding layer, as suggested in the line of code given above, perl will read the character data correctly, and character-based operations will work as expected.
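Folding that suggestion back into the original gen_tokens might look like this (a sketch, not the OP's actual code: the lexical filehandle, the returned hashref, and the split ' ' idiom, which also discards leading-whitespace fields, are my changes):

```perl
use strict;
use warnings;

# Count token frequencies from a UTF-16BE text file.
# The :encoding layer decodes each 16-bit character, so split sees
# real whitespace instead of null bytes.
sub gen_tokens {
    my ($file) = @_;
    my %token;
    open my $cid, '<:encoding(UTF-16BE)', $file
        or die "Cannot open $file: $!";
    while (my $line = <$cid>) {
        # split ' ' (unlike split /\s+/) drops any leading empty field
        for my $key (split ' ', $line) {
            $token{$key}++;
        }
    }
    close $cid;
    return \%token;
}
```

With the layer in place, keys come out as plain strings ("Lithionate", not "\0L\0i...") and the counts behave as intended.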
Re: unreadable hash keys
by CountZero (Bishop) on Jun 06, 2008 at 05:15 UTC
Don't re-invent the wheel; look at modules such as YAML or JSON to save (and load) your hashes (and other data) in a human-readable format.
CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
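As a sketch of that suggestion using the core JSON::PP module (the filename and sample data here are mine, chosen only for the demo):

```perl
use strict;
use warnings;
use JSON::PP qw(encode_json decode_json);    # in core since Perl 5.14

my %tokens = ( IPM => 12, Vapona => 3 );

# Encode the hash to a JSON string and write it out -- the resulting
# file is ordinary readable text, e.g. {"IPM":12,"Vapona":3}.
open my $out, '>', 'tokens.json' or die "tokens.json: $!";
print $out encode_json(\%tokens);
close $out;

# Slurp the file and decode it back into the same hash structure.
open my $in, '<', 'tokens.json' or die "tokens.json: $!";
my $restored = decode_json(do { local $/; <$in> });
close $in;

print "$_ => $restored->{$_}\n" for sort keys %$restored;
unlink 'tokens.json';    # clean up the demo file
```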