glasswalk3r has asked for the wisdom of the Perl Monks concerning the following question:

Greetings monks,

I was trying to apply the concepts of "Packing Text" from perlpacktut to the output of a program, and it works fine to split an entire line into a nice array... until I got some Unicode characters in the line.

The entire line is being read from a UTF-8 file that was correctly opened with open(my $in, '<:utf8', $file). I can check in the debugger that the Unicode strings were correctly interpreted (like execu\x{e7}\x{e3}o), but after using the unpack function I just get garbage.

The template I'm using is A12A19A41A10A14A13A19A13A13A13A13A21A13A11A14A14, and since I have exactly 2 spaces separating the fields, the result is what I need, except for the "corrupted" Unicode characters.

Is there any way to apply the same concept to UTF-8 characters? I have tried using the "U" mask, but without any results.

I'm using Windows XP Service Pack 3 with Active Perl 5.8.9. And yes, I'm using chcp 65000 to get UTF-8 characters in the terminal.

Thanks to all

Alceu Rodrigues de Freitas Junior
---------------------------------
"You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill

Replies are listed 'Best First'.
Re: perlpacktut "packing text" example and Unicode
by ikegami (Patriarch) on Jul 13, 2011 at 09:02 UTC

    I'm using Windows XP Service Pack 3 with Active Perl 5.8.9.

    Works fine at least as far back as ActivePerl 5.10.1 build 1007 (the oldest I have).

    >perl -E"$_ = qq{execu\x{e7}\x{e3}o}; utf8::downgrade($_); say $_ eq unpack('A*', $_) ?1:0"
    1

    >perl -E"$_ = qq{execu\x{e7}\x{e3}o}; utf8::upgrade($_); say $_ eq unpack('A*', $_) ?1:0"
    1

    IIRC, lots of pack/unpack fixes concerning non-byte strings went into 5.10.
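
    As a rough sketch of the same check on a fixed-width line, assuming Perl 5.10 or later: the three-field record and the template 'A12 A8 A4' below are made up for illustration (they are not the original poster's data), and utf8::upgrade only mimics the internal state of a string read through the :utf8 layer.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical record: three fields padded to 10, 6 and 4 characters,
# separated by exactly two spaces (an assumption for illustration).
my $line = sprintf "%-10s  %-6s  %-4s", "execu\x{e7}\x{e3}o", "ok", "42";

# Mimic a string that was decoded by the :utf8 layer.
utf8::upgrade($line);

# Each 'A' width covers the field plus its two separator spaces;
# 'A' strips trailing spaces, so the separators disappear.
my @fields = unpack 'A12 A8 A4', $line;

binmode STDOUT, ':encoding(UTF-8)';
say join '|', @fields;
```

    On 5.10 and later the widths are counted in characters, so the accented letters should come back intact in the first field.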

Re: perlpacktut "packing text" example and Unicode
by 7stud (Deacon) on Jul 13, 2011 at 05:25 UTC
    The entire line is being read from a UTF-8 file that was correctly opened with open(my $in, '<:utf8', $file)

    By opening the file with :utf8, you've already "unpacked" the bytes (= 8-bit chunks), which contain integers, into Unicode characters.

    unpack() is for reading "raw" bytes when you know ahead of time how those raw bytes are laid out in the file. A raw byte is one that has not undergone a translation (:utf8 applies a translation). First, you need to realize that a file contains only integers. A computer can only store numbers, not characters, so characters are represented by integer codes. But if you encounter the integer 120 in a file, how do you know whether that should be the id of a customer (i.e. the actual integer) or the ASCII code for the letter 'x'?

    A byte is an 8-bit chunk of memory that is used to store an integer. unpack() allows you to tell perl exactly what each byte in a file should represent. For instance, you can tell perl that the first 4 bytes represent one integer, the next byte is an integer which represents the ASCII code for a character, followed by an undetermined number of bytes which hold the UTF-8 encoding of a character, the next 8 bytes after that represent one integer, etc.

    A file contains only integers, and each integer occupies 1 byte(=8 bits). Furthermore, you can tell perl how to interpret the integers it encounters. If you want to read raw bytes from a file so that you can tell perl exactly how to interpret each byte, you can do that.
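
    A minimal sketch of that idea, under an assumed layout that is not taken from any real file: a 4-byte big-endian integer, then one byte holding an ASCII code, then UTF-8 encoded text. The same value 120 is read once as a number and once as the letter 'x', purely because the template says so.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical layout (an assumption for illustration):
#   N   4 bytes, unsigned 32-bit big-endian integer (a customer id)
#   C   1 byte, the ASCII code of a single character
#   a*  the remaining bytes, UTF-8 encoded text
my $raw = pack 'N C a*', 120, 120, "execu\xc3\xa7\xc3\xa3o";

# The template tells perl how to interpret each run of raw bytes.
my ($id, $code, $utf8_bytes) = unpack 'N C a*', $raw;

# Only now are the text bytes translated into characters.
my $text = decode('UTF-8', $utf8_bytes);

printf "id=%d char=%s bytes=%d chars=%d\n",
    $id, chr($code), length($utf8_bytes), length($text);
# prints "id=120 char=x bytes=10 chars=8"
```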

      Thank you for the explanation. But assuming I already have the data "classified" as UTF-8, is it possible to use "A" patterns with unpack? Since the data has a fixed size, unpack looked like the most obvious choice to do that.

      Should I stop reading the file as UTF-8 and use unpack on the "raw" data? Which pattern should I use with unpack in this case, and how do I put the data back as UTF-8 correctly?

      I checked the documentation of Perl 5.10 and higher and there is a W pattern that should be used for Unicode. But that didn't help me either.

      Alceu Rodrigues de Freitas Junior