in reply to Decoding UTF-8 - "Cannot decode string with wide characters"
Google and Super Search only bring up dashed hopes.
<update2>
This problem arises from using the wrong conversion routine in PDF::API2 for hex chars representing a string in the PDF. In the case at hand, each printable character is followed by a NULL byte. Using

    s/(..)/chr(hex($1))/ge; # convert 0x77 0x00 -> w^@

correctly translates them into a sequence of ASCII chars, each followed by a NULL byte, while

    s/(....)/chr(hex($1))/ge # convert 0x7700 -> ç\234\200

leads to a UTF-8 string - chr() works for UTF-8 too - but 0x77 0x00 isn't the same as 0x7700. Its internal representation is "\347\234\200" (in octal notation) - three bytes. The only way to get its original value back is using ord - which also works for UTF-8 :-)
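To make the difference between the two substitutions visible, here is a minimal sketch. The sample input '7700' and the variable names are my own for illustration; nothing here is taken from PDF::API2 itself.

```perl
use strict;
use warnings;

# Assumed sample input: hex digits for "w" followed by a NUL byte.
my $hex = '7700';

# Two hex digits per character: yields the two bytes 0x77 0x00.
(my $per_byte = $hex) =~ s/(..)/chr(hex($1))/ge;

# Four hex digits per character: yields the single character U+7700.
(my $per_word = $hex) =~ s/(....)/chr(hex($1))/ge;

printf "per byte: %d chars, first is 0x%02x\n", length($per_byte), ord($per_byte);
printf "per word: %d char, code point 0x%04x\n", length($per_word), ord($per_word);

# U+7700 occupies three bytes once encoded as UTF-8.
my $utf8_bytes = do { my $copy = $per_word; utf8::encode($copy); $copy };
printf "UTF-8 bytes: %s\n",
    join ' ', map { sprintf '\\%03o', ord } split //, $utf8_bytes;

# Shifting the code point right by 8 bits recovers the intended character.
print "recovered: ", chr(ord($per_word) >> 8), "\n";
```

The octal dump should show the same three bytes quoted above, and the last line should print a plain "w".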
Try this silly sub
    # stolen from Data::Dumper and tweaked.. ;-)
    sub narrow_char {
        return join('',
            map { chr(hex $_) }
            map { (my $s = sprintf("%x", ord($_))) =~ s/00$//; $s; }
            split //, $_[0]
        );
    }

    my %info = (
        'CreationDate', 'D:20060817180621+01\'00\'',
        'Producer', "\x{4f00}\x{7000}\x{6500}\x{6e00}\x{4f00}\x{6600}\x{6600}\x{6900}\x{6300}\x{6500}\x{2e00}\x{6f00}\x{7200}\x{6700}\x{2000}\x{3200}\x{2e00}\x{3000}",
        'Creator', "\x{5700}\x{7200}\x{6900}\x{7400}\x{6500}\x{7200}",
        'Author', "\x{4200}\x{6f00}\x{6200}\x{2000}\x{5700}\x{6500}\x{6200}\x{7300}\x{7400}\x{6500}\x{7200}",
        'Title', "\x{4300}\x{4f00}\x{4d00}\x{5000}\x{4500}\x{5400}\x{4900}\x{5400}\x{4900}\x{5600}\x{4500}\x{2000}\x{5300}\x{4100}\x{4600}\x{4100}\x{5200}\x{4900}",
    );

    foreach my $key (sort keys %info) {
        print "$key -> ";
        print narrow_char($info{$key});
        print "\n";
    }

    __END__
    # output:
    Author -> Bob Webster
    CreationDate -> D:20060817180621+01'00'
    Creator -> Writer
    Producer -> OpenOffice.org 2.0
    Title -> COMPETITIVE SAFARI
whenever decode barfs...
update: Another way (less silly?) (regexp by mtve):
    sub narrow_char {
        # only touch strings in which every char takes exactly three bytes
        $_[0] =~ s/(.)/chr(ord($1)>>8)/eg
            if (length($_[0]) * 3 == do { use bytes; length $_[0] });
        $_[0];
    }
<update2>
Caveat: This routine modifies $_[0] in-place, so its value is changed in the caller as well.
</update2>
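If the in-place change is unwanted, one possible variant works on a copy instead. This is only a sketch: the name narrow_char_copy is made up for illustration, and it keeps the same "three bytes per char" test, so it assumes every character in an affected string lies in the range U+0800 to U+FFFF.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original post.
sub narrow_char_copy {
    my ($str) = @_;    # copy the argument so the caller's variable stays untouched

    # Byte length of the internal representation, as in the routine above.
    my $bytes = do { use bytes; length $str };

    # Only convert when every character occupies exactly three bytes.
    $str =~ s/(.)/chr(ord($1) >> 8)/eg if length($str) * 3 == $bytes;

    return $str;
}

my $creator = "\x{5700}\x{7200}\x{6900}\x{7400}\x{6500}\x{7200}";
my $fixed   = narrow_char_copy($creator);

print "$fixed\n";                                   # Writer
printf "original still starts with U+%04X\n", ord($creator);
```

Plain ASCII strings such as the CreationDate value fail the length test and come back unchanged.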
--shmem
update: populated solution with strings from OP
update2: added some explanation
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ /
/\_¯/(q /
---------------------------- \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}