Re: Perl Encoding

In case you're wondering, Zaxo's reply assumes that:

you're using perl 5.8 or later, and either
the string has been read from a file containing utf8 data, or
the string has been read in as some other encoding (e.g. GB, Big5, Shift-JIS, etc.) and decoded into utf8

But if those conditions are met, then it is sufficient to put "\n" into your regex: in utf8 the byte 0x0a never occurs as part of a multi-byte character, so only a "true" new-line will match "\n".
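Here is a minimal sketch of the "read in as some other encoding and decoded" case. The sample text, the variable names and the in-memory filehandle are my own stand-ins for a real file; the only real machinery is the standard Encode module and the :encoding() layer.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for a legacy file: two lines of text stored as Shift-JIS bytes.
# (Code points are spelled out so that this script itself stays plain ascii.)
my $file_bytes = encode('shiftjis', "\x{65E5}\x{672C}\n\x{6771}\x{4EAC}\n");

# Decode while reading; an in-memory handle keeps the sketch self-contained.
open my $fh, '<:encoding(shiftjis)', \$file_bytes
    or die "cannot open in-memory handle: $!";
my $text = do { local $/; <$fh> };   # slurp the decoded characters
close $fh;

# Once decoded, a plain "\n" in the regex splits on the real line breaks only.
my @lines = split /\n/, $text;
printf "%d lines, first line has %d characters\n",
    scalar(@lines), length($lines[0]);
```

With a real file you would pass its name to open() in place of the scalar reference, and swap in whatever encoding the file actually uses.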

For that matter, the non-unicode CJK encodings that I'm aware of are all "variable-width": each "character" is either one byte (because it's plain ascii) or two bytes, occasionally more (because it's not ascii). Care has been taken in their various designs to make sure that "\n" (new-line, 0x0a) is never ambiguous -- that is, this byte value is never used as part of a multi-byte character -- and likewise for the other ascii control codes (carriage-return, null, tab, etc).

So, using "\n" -- in any version of perl, with utf8 or any non-unicode variable-width Asian encoding -- should be no problem.
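If you want to convince yourself, something like the following would do as a quick check. It is only a spot check on one sample string of my own choosing, not a proof about every character in every encoding; the encoding names are the ones the Encode module uses.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two CJK characters split by one real new-line (code points keep this ascii).
my $text = "\x{65E5}\n\x{672C}";

my %newline_count;
for my $enc (qw(UTF-8 shiftjis euc-jp euc-cn big5-eten)) {
    my $bytes = encode($enc, $text);     # raw bytes, as they would sit in a file
    my $count = () = $bytes =~ /\n/g;    # count matches of a plain "\n"
    $newline_count{$enc} = $count;
    printf "%-10s %d\n", $enc, $count;
}
```

Each encoding should report exactly one match, i.e. the new-line that was really put there, even though the regex is being run on undecoded bytes.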

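The only encoding scheme where you might run into confusion by putting plain-old "\n" into a regex is UTF-16 (whether big-endian or little-endian): there, a new-line is the two-byte unit 0x000a, and the byte 0x0a can also turn up as half of some unrelated character. If you're dealing with UTF-16 data, convert it to utf8 first.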
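To show what the UTF-16 confusion looks like, here is a small sketch using the standard Encode module. The sample string is my own: U+010A happens to contain the byte 0x0a, which makes the false match easy to see.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) followed by one real new-line.
my $text  = "\x{010A}\n";
my $utf16 = encode('UTF-16BE', $text);      # bytes: 01 0A 00 0A

# Matching against the raw UTF-16 bytes also hits the 0x0a inside U+010A.
my $raw_hits = () = $utf16 =~ /\n/g;

# Decoding first turns the bytes back into characters, so "\n" is unambiguous.
my $chars     = decode('UTF-16BE', $utf16);
my $real_hits = () = $chars =~ /\n/g;

print "raw bytes: $raw_hits match(es), decoded: $real_hits match(es)\n";
```

The raw byte string gives two matches and the decoded string gives one, which is why decoding before you apply the regex is the safe order of operations.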

(Just curious: did you have some specific notion in mind about what constitutes a "false newline"?)
