Decoding UTF-8 - "Cannot decode string with wide characters"

kiz has asked for the wisdom of the Perl Monks concerning the following question:

I've a nice simple piece of code that pulls some information out of a pdf file:

#!/path/bin/perl
use Data::Dumper;
use PDF::API2;
use Encode;
my $dir = '/tmp/';
my $file = $dir.'test.pdf';

my $pdf = PDF::API2->open($file);
my %info = $pdf->info;
print Dumper(%info);
foreach my $key (sort keys %info) {
  print "$key ->";
  print decode("utf8", $info{$key});
  print "\n";
}

Everything is installed, and the code runs. Data Dumper prints out:

$VAR1 = 'CreationDate';
$VAR2 = 'D:20060817180621+01\'00\'';
$VAR3 = 'Producer';
$VAR4 = "\x{4f00}\x{7000}\x{6500}\x{6e00}\x{4f00}\x{6600}\x{6600}\x{6900}\x{6300}\x{6500}\x{2e00}\x{6f00}\x{7200}\x{6700}\x{2000}\x{3200}\x{2e00}\x{3000}";
$VAR5 = 'Creator';
$VAR6 = "\x{5700}\x{7200}\x{6900}\x{7400}\x{6500}\x{7200}";
$VAR7 = 'Author';
$VAR8 = "\x{4200}\x{6f00}\x{6200}\x{2000}\x{5700}\x{6500}\x{6200}\x{7300}\x{7400}\x{6500}\x{7200}";
$VAR9 = 'Title';
$VAR10 = "\x{4300}\x{4f00}\x{4d00}\x{5000}\x{4500}\x{5400}\x{4900}\x{5400}\x{4900}\x{5600}\x{4500}\x{2000}\x{5300}\x{4100}\x{4600}\x{4100}\x{5200}\x{4900}";

but decode barfs with "Cannot decode string with wide characters"

(The pdf file was created by exporting an OpenOffice file {.doc format on disk} to pdf.)

I have tested this under Solaris and Fedora Core 5.
Solaris is running with Perl 5.8.3 and Encode 1.99
FC5 is running Perl 5.8.8 and Encode 2.09 (downgraded from Encode 2.18)

Has anyone got a setup that will handle the UTF-8 (as present in pdfs... all pdfs..?)

-- Ian Stuart
A man depriving some poor village, somewhere, of a first-class idiot.

Re: Decoding UTF-8 - "Cannot decode string with wide characters" by shmem (Chancellor) on Aug 24, 2006 at 17:29 UTC
Ah, those wide chars... Google and Super Search only bring up dashed hopes.

<update2>
This problem arises from PDF::API2 using the wrong conversion routine for the hex chars that represent a string in the PDF. In the case at hand, each printable character is followed by a NULL byte. Using

s/(..)/chr(hex($1))/ge; # convert 0x77 0x00 -> w^@

correctly translates them into a sequence of ASCII chars, each followed by a NULL byte, while

s/(....)/chr(hex($1))/ge # convert 0x7700 -> ç\234\200

leads to a UTF-8 string - `chr()` works for UTF-8 too - but `0x77 0x00` isn't the same as `0x7700`. Its internal representation is "\347\234\200" (in octal notation) - three bytes. The only way to get its original value back is using ord - which also works for UTF-8 :-)
</update2>

Try this silly sub

# stolen from Data::Dumper and tweaked.. ;-)
sub narrow_char {
    # for each character: take its number as hex, strip a trailing "00",
    # and turn what is left back into a character
    return join('', map {chr(hex $_)}
        map{ (my $s = sprintf("%x",ord($_)))=~s/00$//; $s; }
        split//,$_[0]
    );
}

my %info = (
  'CreationDate',
  'D:20060817180621+01\'00\'',
  'Producer',
  "\x{4f00}\x{7000}\x{6500}\x{6e00}\x{4f00}\x{6600}\x{6600}\x{6900}\x{6300}\x{6500}\x{2e00}\x{6f00}\x{7200}\x{6700}\x{2000}\x{3200}\x{2e00}\x{3000}",
  'Creator',
  "\x{5700}\x{7200}\x{6900}\x{7400}\x{6500}\x{7200}",
  'Author',
  "\x{4200}\x{6f00}\x{6200}\x{2000}\x{5700}\x{6500}\x{6200}\x{7300}\x{7400}\x{6500}\x{7200}",
  'Title',
  "\x{4300}\x{4f00}\x{4d00}\x{5000}\x{4500}\x{5400}\x{4900}\x{5400}\x{4900}\x{5600}\x{4500}\x{2000}\x{5300}\x{4100}\x{4600}\x{4100}\x{5200}\x{4900}",
);

foreach my $key (sort keys %info) {
  print "$key -> ";
  print narrow_char($info{$key});
  print "\n";
}
__END__
# output:
Author -> Bob Webster
CreationDate -> D:20060817180621+01'00'
Creator -> Writer
Producer -> OpenOffice.org 2.0
Title -> COMPETITIVE SAFARI

whenever decode barfs...

update: Another way (less silly?) (regexp by mtve):

sub narrow_char {
    # only touch the string if every character takes three bytes internally
    $_[0] =~ s/(.)/chr(ord($1)>>8)/eg
        if (length($_[0]) * 3 == do { use bytes; length $_[0] } );
    $_[0];
}

<update2>
Caveat: This routine modifies `$_[0]` in-place, so its value is changed in the caller as well.
</update2>

--shmem

update: populated solution with strings from OP
update2: added some explanation
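A possible alternative to shifting by hand, sketched under the assumption that the mangled values really are byte pairs (character, then NULL) that got read as one 16-bit number: pack each character's number back into two bytes and let Encode decode them as UTF-16LE. The sub name `repair_wide` is made up for this sketch, the UTF-16LE reading is an assumption rather than something confirmed in this thread, and it has only been checked against the strings from the original post.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of PDF::API2 or Encode):
# undo a "two bytes read as one 16-bit number" mix-up.
sub repair_wide {
    my ($str) = @_;
    my @cp = map { ord } split //, $str;

    # Leave the string alone unless every character lies in 0x100..0xFFFF,
    # i.e. looks like it could be such a mis-read byte pair.
    return $str if !@cp or grep { $_ < 0x100 || $_ > 0xFFFF } @cp;

    # 0x5700 -> bytes 0x57 0x00 ('n' packs 16 bits, high byte first),
    # which is "W" when decoded as UTF-16LE (assumed encoding).
    my $bytes = pack 'n*', @cp;
    return decode('UTF-16LE', $bytes);
}

my %info = (
    CreationDate => "D:20060817180621+01'00'",
    Creator      => "\x{5700}\x{7200}\x{6900}\x{7400}\x{6500}\x{7200}",
);
print "$_ -> ", repair_wide($info{$_}), "\n" for sort keys %info;
```

Unlike the in-place variant above, this returns a new string and leaves the caller's value untouched. It makes no attempt to handle strings that mix plain and mangled characters.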
Re: Decoding UTF-8 - "Cannot decode string with wide characters" by jeteve (Pilgrim) on Aug 25, 2006 at 12:57 UTC
Hi!

As you said, your strings seem to be in unicode already. They seem to be sequences of unicode characters and are certainly marked as unicode/utf8 within perl. You can check that with utf8::is_utf8($string) (that should be true in your case).

So there's no point decoding them again within perl. decode works on a sequence of bytes, not on a sequence of characters, and here you have a sequence of unicode characters.

I guess what you would like to do is print your strings in a UTF-8 terminal. To do so, you simply have to either:

print encode("utf-8", $string);

or set

binmode STDOUT, ":utf8";

to get automatic encoding in utf8 for STDOUT.

Hope it helps. unicode/utf8 are often quite confusing. The confusion comes from mixing unicode and utf8 concepts.

Unicode is a standard for universal character representation. You can view it as the format of internal strings in perl when an input string needs that kind of characters. So a string of unicode characters is a string of characters.

utf8 is a _way_ to encode unicode strings. A utf8 string is a string of bytes, and should be used exclusively for I/O.

encode_utf8($string) transforms an internal perl string (unicode or not) into a sequence of utf8 bytes.
decode_utf8($string) transforms a sequence of utf8 bytes into a unicode internal perl string.

-- Nice photos of naked perl sources here !
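To make the byte/character distinction concrete, here is a small sketch that uses only Encode's `encode` and `decode`. The sample character is made up for illustration. It also shows `decode` refusing a string that already holds a character above 0xFF, which is presumably what happens with the values in the original post.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string holding one character, U+263A (made-up sample data).
my $chars = "\x{263A}";

# encode: characters -> UTF-8 bytes (this character needs three bytes).
my $bytes = encode('UTF-8', $chars);
printf "characters: %d, bytes: %d\n", length($chars), length($bytes);

# decode: UTF-8 bytes -> characters; the round trip gives the original back.
my $again = decode('UTF-8', $bytes);
print "round trip ok\n" if $again eq $chars;

# decode on a string that is already made of wide characters dies,
# so the call is wrapped in eval to catch the error.
my $ok = eval { decode('UTF-8', "\x{4f00}"); 1 };
print "decode died: $@" unless $ok;
```

The exact wording of the error may differ between Encode versions, so the sketch only reports whatever message it caught.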
Re^2: Decoding UTF-8 - "Cannot decode string with wide characters" by shmem (Chancellor) on Aug 25, 2006 at 14:25 UTC
The problem is that those strings were represented as hex bytes which were read as unicode, but aren't. They are char+NULL sequences, e.g. `"p\0e\0r\0l\0"` for "perl". So they must be converted back to numbers using ord, and then (via hex and `s/00$//`, or `$_>>8`) forth into the right format with chr.

--shmem
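The difference between the two readings can be replayed on a small example. This sketch assumes the PDF stores the string as the hex digits `7000650072006c00`, that is, "perl" with a NULL after each character as described above. The variable names are made up.

```perl
use strict;
use warnings;

# Assumed hex form of "p\0e\0r\0l\0" as it might appear in the PDF.
my $hex = '7000650072006c00';

# Two hex digits per character: yields the char+NULL pairs.
(my $pairs = $hex) =~ s/(..)/chr(hex($1))/ge;

# Four hex digits per character: yields wide characters such as \x{7000}.
(my $wide = $hex) =~ s/(....)/chr(hex($1))/ge;

# Going back: take each wide character's number and drop the low (NULL) byte.
(my $fixed = $wide) =~ s/(.)/chr(ord($1) >> 8)/ge;

printf "pairs: %d chars, wide: %d chars, fixed: %s\n",
    length($pairs), length($wide), $fixed;
```

The shift only makes sense when every character came from such a pair, which is why the second narrow_char in the earlier reply checks the string before touching it.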