mfunke has asked for the wisdom of the Perl Monks concerning the following question:

Is there a way to confirm that a newline is a true newline in data that has Chinese/Japanese characters?

Replies are listed 'Best First'.
Re: Perl Encoding
by Zaxo (Archbishop) on Jul 23, 2003 at 17:54 UTC

    If your perl properly understands your encoding, try matching on the Unicode properties:

        # IsZl - line separator
        # IsZp - paragraph separator
        /\p{IsZl}$/ and print 'True newline in ', $_;

    After Compline,
    Zaxo

Re: Perl Encoding
by graff (Chancellor) on Jul 24, 2003 at 01:10 UTC
    In case you're wondering, Zaxo's reply assumes that:
    1. you're using perl 5.8, and
    2. the string has been read from a file containing utf8 data, or
    3. the string has been read in some other encoding (e.g. GB, Big5, shiftjis, etc.) and decoded into utf8

    But if those conditions are met (the first, plus either the second or the third), then it is sufficient to put "\n" into your regex, because only a "true" newline will match "\n" in utf8.
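
    A minimal sketch of the "decoded from some other encoding" case, assuming the Encode module that ships with perl 5.8. The sample string and the variable names are made up for illustration; in real code the bytes would come from your file.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Illustration only: build some Shift-JIS bytes from a known string
# (three kanji, a newline, one kanji, a newline).
my $text  = "\x{65E5}\x{672C}\x{8A9E}\n\x{4E0A}\n";
my $bytes = encode('shiftjis', $text);

# Decode the bytes into perl's internal character form, then match "\n".
my $chars    = decode('shiftjis', $bytes);
my $newlines = () = $chars =~ /\n/g;

print "newlines: $newlines\n";   # prints "newlines: 2"
```

    Once the data is decoded, "\n" is compared against whole characters, so it cannot match in the middle of a multi-byte character.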

    For that matter, the non-unicode CJK encodings that I'm aware of are all "variable-width": each "character" is either one byte (because it's plain ascii) or two bytes (because it's not ascii). Care has been taken in their various designs to make sure that "\n" (newline, 0x0a) is never ambiguous. That is, this byte value is never used as part of a two-byte character, and the same holds for the other ascii control codes (carriage return, null, tab, etc.).

    So, using "\n" -- in any version of perl, with utf8 or any non-unicode variable-width Asian encoding -- should be no problem.
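
    To check this on raw, undecoded bytes, here is a rough sketch that counts 0x0a bytes in a few encodings. It assumes the Encode module and its encoding names (shiftjis, euc-jp, big5-eten, euc-cn); the sample text is invented and uses only characters that exist in all four character sets.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Five common CJK characters with exactly two real newlines between them.
my $text = "\x{65E5}\x{672C}\n\x{4E2D}\x{6587}\n\x{4E0A}";

my %lf_count;
for my $enc (qw(shiftjis euc-jp big5-eten euc-cn)) {
    my $bytes = encode($enc, $text);

    # tr/// in counting mode: how many 0x0a bytes are in the encoded data?
    $lf_count{$enc} = ($bytes =~ tr/\x0A//);

    printf "%-10s 0x0a bytes: %d\n", $enc, $lf_count{$enc};
}
```

    Each encoding should report 2, the same as the number of newlines in the original text. This only demonstrates the point for the sample characters; it is not a proof for the whole character set.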

    The only encoding scheme where you might run into confusion by putting plain-old "\n" into a regex is UTF-16 (whether big-endian or little-endian), because there the byte 0x0a can be one half of a two-byte code unit. If you're dealing with UTF-16 data, convert it to utf8 first.
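
    A small sketch of the UTF-16 problem, again assuming the Encode module. The character U+4E0A is picked on purpose because its big-endian UTF-16 form is the byte pair 4E 0A; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One CJK character (U+4E0A) followed by one real newline.
my $text  = "\x{4E0A}\n";
my $bytes = encode('UTF-16BE', $text);   # bytes: 4E 0A 00 0A

# Looking at raw bytes, 0x0a shows up twice, but only one is a newline.
my $raw_lf = ($bytes =~ tr/\x0A//);

# After decoding, "\n" matches only the real newline.
my $chars   = decode('UTF-16BE', $bytes);
my $real_lf = () = $chars =~ /\n/g;

print "raw 0x0a bytes: $raw_lf, real newlines: $real_lf\n";
# prints "raw 0x0a bytes: 2, real newlines: 1"
```

    So with UTF-16 input, split or match on "\n" only after the data has been decoded.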

    (Just curious: did you have some specific notion in mind about what constitutes a "false newline"?)