Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse a file of fixed-length records containing Chinese and English data fields (a record looks like "M02 1731580851珠海 WBPX深圳龙发商业有限公司 999999321"), but substr starts giving incorrect output after the first Chinese field. This still happens with the 'use utf8' pragma in the script. Is there any way in Perl to read and unpack a record that mixes these characters?

Replies are listed 'Best First'.
Re: how to parse english-chinese fixed length data records in perl 5.6
by blakem (Monsignor) on Sep 27, 2002 at 19:41 UTC
    I'm no expert but this seems to split up the data correctly:
    use utf8;
    my $data = 'M02 1731580851珠海 WBPX深圳龙发商业有限公司 999999321';
    my @chars = $data =~ /(.)/sg;
    With the utf8 pragma I get 41 chars, without it I get 65.

    -Blake
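
The two counts above can be reproduced without a regex. This is only a sketch and assumes Perl 5.8 or later (not the 5.6 from the question) and a script saved as UTF-8; the variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;    # the string literal below is UTF-8 text in the source file

my $data = 'M02 1731580851珠海 WBPX深圳龙发商业有限公司 999999321';

# With 'use utf8' the literal is a character string, so length counts characters.
my $char_count = length $data;

# Encode a copy to UTF-8 bytes; length now counts bytes instead.
my $bytes = $data;
utf8::encode($bytes);
my $byte_count = length $bytes;

print "chars: $char_count, bytes: $byte_count\n";   # chars: 41, bytes: 65
```

Each of the 12 Chinese characters takes 3 bytes in UTF-8, which accounts for the 24 extra bytes on top of the 41 characters.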

Re: how to parse english-chinese fixed length data records in perl 5.6
by seattlejohn (Deacon) on Sep 27, 2002 at 20:18 UTC
    Perl's handling of Unicode in 5.6.x can be a little troublesome. One problem I've encountered is that literal UTF-8 strings may be recognized fine, but UTF-8 strings read in from a file are always treated as sequences of bytes, not UTF-8 characters. (I believe this is the documented behavior; it's just probably not what you want.)

    I've had partial success recreating UTF-8 strings from a series of bytes by using pack/unpack with the U template, though if I remember correctly there were still some glitches I encountered with this approach, especially under 5.6.0 (5.6.1 was a bit better).

    Perl 5.8 is supposed to have much improved Unicode support, and if that's an option for you it might be worth investigating. (Sorry, I don't have any firsthand experience with it yet.)
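
For anyone who can move to 5.8 or later, a minimal sketch of the bytes-versus-characters problem described above, using the core Encode module. It assumes the file is UTF-8; the `encode` call only stands in for a line read from a file without any I/O layer, and the offset 14 is simply where the first Chinese field starts in the sample record.

```perl
use strict;
use warnings;
use utf8;                          # only so the sample literal can be written directly
use Encode qw(decode encode);

# Stand-in for a line read from a file: plain UTF-8 bytes, not characters.
my $raw = encode('UTF-8', 'M02 1731580851珠海 WBPX深圳龙发商业有限公司 999999321');

# On raw bytes, substr counts bytes: this grabs 2 of the 3 bytes of the
# first Chinese character, which is the kind of wrong output described.
my $broken = substr($raw, 14, 2);

# After decoding, substr counts characters and returns the 2-character field.
my $text  = decode('UTF-8', $raw);
my $field = substr($text, 14, 2);

binmode STDOUT, ':encoding(UTF-8)';
print "field: $field\n";
```

When reading a real file, opening it with an encoding layer, as in `open(my $fh, '<:encoding(UTF-8)', $path)`, should make the decode step unnecessary; `$path` here is a placeholder.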

Re: how to parse english-chinese fixed length data records in perl 5.6
by fglock (Vicar) on Sep 27, 2002 at 19:36 UTC

    Does "fixed length" mean a fixed number of "chars" or a fixed number of "bytes"?

    They have different meanings in this context.
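
To make the difference concrete, here is a rough sketch that cuts the sample record both ways. The field widths are guesses read off the one sample line, not a known layout, and `split_fixed` is a hypothetical helper; it assumes Perl 5.8 or later with the Encode module.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

# Cut a string into consecutive fields of the given widths. The unit is
# whatever the string holds: characters if decoded, bytes if raw.
sub split_fixed {
    my ($string, @widths) = @_;
    my $offset = 0;
    my @fields;
    for my $width (@widths) {
        push @fields, substr($string, $offset, $width);
        $offset += $width;
    }
    return @fields;
}

my $text = 'M02 1731580851珠海 WBPX深圳龙发商业有限公司 999999321';
my $raw  = encode('UTF-8', $text);    # what a plain file read would return

# Hypothetical layouts guessed from the sample record.
my @char_widths = (4, 10, 3, 4, 11, 9);    # widths counted in characters
my @byte_widths = (4, 10, 7, 4, 31, 9);    # same fields counted in UTF-8 bytes

# Character widths: decode first, then cut.
my @by_chars = split_fixed($text, @char_widths);

# Byte widths: cut the raw bytes first, then decode each field.
my @by_bytes = map { decode('UTF-8', $_) } split_fixed($raw, @byte_widths);
```

For this record both routes give the same six fields, but the widths differ wherever a field holds Chinese text, so applying one set of widths to the other kind of string cuts fields in the wrong places.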

      fixed number of chars...