perlpacktut "packing text" example and Unicode

glasswalk3r has asked for the wisdom of the Perl Monks concerning the following question:

Greetings monks,

I was trying to apply the concepts of Packing Text to the output of a program, and it works fine to split an entire line into a nice array... until I got some Unicode characters in the line.

The entire line is read from a UTF-8 file that was opened with open(my $in, '<:utf8', $file). I can see in the debugger that the Unicode strings were interpreted correctly (like execu\x{e7}\x{e3}o), but after using the unpack function I just get garbage.

The template I'm using is A12A19A41A10A14A13A19A13A13A13A13A21A13A11A14A14 and, since I have exactly 2 spaces separating the fields, the result is what I need, except that the Unicode characters come out "corrupted".

Is there any way to apply the same concept to UTF-8 characters? I have tried using the "U" mask, but without any results.

I'm using Windows XP Service Pack 3 with Active Perl 5.8.9. And yes, I'm using chcp 65000 to get UTF-8 characters in the terminal.

Thanks to all

Alceu Rodrigues de Freitas Junior
---------------------------------
"You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill


Replies are listed 'Best First'.
Re: perlpacktut "packing text" example and Unicode by ikegami (Patriarch) on Jul 13, 2011 at 09:02 UTC
"I'm using Windows XP Service Pack 3 with Active Perl 5.8.9."

Works fine at least as far back as ActivePerl 5.10.1 build 1007 (the oldest I have).

`>perl -E"$_ = qq{execu\x{e7}\x{e3}o}; utf8::downgrade($_); say $_ eq unpack('A*', $_) ?1:0"`
`1`
`>perl -E"$_ = qq{execu\x{e7}\x{e3}o}; utf8::upgrade($_); say $_ eq unpack('A*', $_) ?1:0"`
`1`

IIRC, lots of pack/unpack fixes concerning non-byte strings went into 5.10.
Re: perlpacktut "packing text" example and Unicode by 7stud (Deacon) on Jul 13, 2011 at 05:25 UTC
"The entire line is being read from a UTF-8 file that was correctly read by using open(my $in, '<:utf8', $file)"

By doing that, you have already "unpacked" the bytes (8-bit chunks, each holding an integer) into characters. unpack() is for reading "raw" bytes when you know ahead of time how those raw bytes are laid out in the file. A raw byte is one that has not undergone a translation, and the :utf8 layer applies a translation.

First, you need to realize that a file contains only integers. A computer can only store numbers, not characters, so characters are represented by integer codes. But if you encounter the integer 120 in a file, how do you know whether it is the id of a customer (i.e. the actual integer) or the ASCII code for the letter 'x'?

A byte is an 8-bit chunk of memory that is used to store an integer. unpack() allows you to tell perl exactly what each byte in a file should represent. For instance, you can tell perl that the first 4 bytes represent one integer, that the next byte is the ASCII code for a character, that it is followed by an undetermined number of bytes holding the UTF-8 encoding of a character, that the next 8 bytes after that represent one integer, and so on.

A file contains only integers, each occupying 1 byte (8 bits), and you can tell perl how to interpret the integers it encounters. If you want to read raw bytes from a file so that you can tell perl exactly how to interpret each byte, you can do that.
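To make the byte/character distinction above concrete, here is a minimal sketch, assuming Perl 5.10 or later and the core Encode module. The sample line and its two field widths (12 and 5) are invented for illustration and are not the template from the question.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumed sample data: "execução" padded to 12 characters, then "done " (5).
my $text       = sprintf('%-12s%-5s', "execu\x{e7}\x{e3}o", 'done');
my $line_bytes = encode('UTF-8', $text);        # what the file holds: 19 bytes
my $line_chars = decode('UTF-8', $line_bytes);  # what a decoding layer gives: 17 characters

# Widths counted in characters line up with the decoded string (Perl 5.10+).
my @from_chars = unpack('A12 A5', $line_chars); # ("execução", "done")

# The same widths applied to the raw bytes drift, because ç and ã take
# two bytes each in UTF-8, so the second field starts two bytes too early.
my @from_bytes = map { decode('UTF-8', $_) } unpack('A12 A5', $line_bytes);
                                                # ("execução", "  don")
```

So for columns that are fixed in characters, a byte-oriented split only stays aligned while every character is a single byte.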
Re^2: perlpacktut "packing text" example and Unicode by glasswalk3r (Friar) on Jul 14, 2011 at 17:03 UTC
Thank you for the explanation.

But assuming I already have the data "classified" as UTF-8, is it possible to use "A" patterns with `unpack`? Since the data has a fixed size, `unpack` looked like the most obvious choice to do that.

Should I stop reading the file as UTF-8 and use `unpack` on the "raw" data? Which pattern should I use with `unpack` in this case, and how do I put the data back as UTF-8 correctly?

I checked the documentation of Perl 5.10 and higher and there is a W pattern that should be used for Unicode. But that didn't help me either.
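One possible way around this, offered only as a sketch and not as the answer given in the thread: keep reading the file as decoded text and cut the fields with substr, which counts characters on a decoded string. The helper name split_fixed and the sample widths are hypothetical, and the trimming only approximates what the "A" template does (it removes trailing whitespace, not trailing nulls).

```perl
use strict;
use warnings;

# Hypothetical helper: split a decoded (character) string into
# fixed-width fields, trimming trailing whitespace from each field.
sub split_fixed {
    my ($line, @widths) = @_;
    my @fields;
    my $pos = 0;
    for my $width (@widths) {
        # Guard against short lines so substr never reads past the end.
        my $field = $pos < length($line) ? substr($line, $pos, $width) : '';
        $field =~ s/\s+\z//;
        push @fields, $field;
        $pos += $width;
    }
    return @fields;
}

# Assumed sample line: a 12-character field followed by a 7-character field.
my @fields = split_fixed("execu\x{e7}\x{e3}o    done   ", 12, 7);
# @fields is ("execução", "done")
```

The fields stay character strings, so writing them back through a handle opened with an encoding layer such as '>:encoding(UTF-8)' should produce UTF-8 again.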