in reply to how to parse english-chinese fixed length data records in perl 5.6

Perl's handling of Unicode in 5.6.x can be a little troublesome. One problem I've encountered is that literal UTF-8 strings may be recognized fine, but UTF-8 strings read in from a file are always treated as sequences of bytes, not UTF-8 characters. (I believe this is the documented behavior; it's just probably not what you want.)

I've had partial success recreating UTF-8 strings from a series of bytes by using pack/unpack with the U template, though, if I remember correctly, I still ran into some glitches with this approach, especially under 5.6.0 (5.6.1 was a bit better).
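
To make the pack/unpack idea concrete, here is a minimal sketch of one common form of it, the "U0a*" idiom, which may not be exactly what I used at the time. The sample bytes and variable names are made up for illustration, and given the glitches mentioned above I wouldn't count on identical behavior under every 5.6.x release.

```perl
use strict;
use warnings;

# Raw UTF-8 bytes for two Chinese characters (U+4E2D, U+6587), standing in
# for data that was read from a file as plain bytes. (Illustrative sample.)
my $bytes = "\xe4\xb8\xad\xe6\x96\x87";

# "U0" switches pack into UTF-8 mode, so the bytes copied by "a*" are
# taken as an already-encoded UTF-8 string rather than six separate bytes.
# Note that this does not validate the input.
my $chars = pack("U0a*", $bytes);

# Unpacking with "U*" returns one code point per character.
my @code_points = unpack("U*", $chars);

printf "%d bytes, %d characters\n", length($bytes), length($chars);
print join(" ", map { sprintf "U+%04X", $_ } @code_points), "\n";
```

If the conversion worked, length() on the result counts characters rather than bytes, which is what you need before slicing out fields whose widths are given in characters.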

Perl 5.8 is supposed to have much improved Unicode support, and if that's an option for you it might be worth investigating. (Sorry, I don't have any firsthand experience with it yet.)
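
If 5.8 is available, the Encode module that ships with it is the usual way to turn bytes into characters. The sketch below is untested on my side and assumes a made-up record layout (a 5-character English field followed by a 2-character Chinese field, stored as UTF-8, with widths counted in characters rather than bytes); the real layout and encoding of your data may well differ.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical record: a 5-character English field followed by a
# 2-character Chinese field, stored on disk as UTF-8 bytes.
my $raw = "Hello" . "\xe4\xb8\xad\xe6\x96\x87";

# Decode the bytes into a character string; $raw itself is left unchanged.
my $record = decode("UTF-8", $raw);

# substr now counts characters, so the field widths line up.
my $english = substr($record, 0, 5);
my $chinese = substr($record, 5, 2);

printf "%s / %d characters, first is U+%04X\n",
    $english, length($chinese), ord($chinese);
```

If the field widths in your records are actually byte counts, slice the raw bytes first and decode each field afterwards instead.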
