In case you're wondering, Zaxo's reply assumes that:
  1. you're using perl 5.8, and
  2. the string is utf8, either because it was read from a file containing utf8 data, or because it was read in as some other encoding (e.g. GB, Big5, shiftjis, etc.) and decoded into utf8

But if those conditions are met, then it is sufficient to put "\n" into your regex: in utf8, the byte 0x0a never occurs inside a multi-byte character, so only a "true" new-line will match "\n".

For that matter, the non-unicode CJK encodings that I'm aware of are all "variable-width": each "character" is either one byte (because it's plain ascii) or two bytes (because it's not ascii), with a few encodings such as EUC-JP also allowing longer sequences. Care has been taken in their various designs to make sure that "\n" (new-line, 0x0a) is never ambiguous. That is, this byte value is never used as part of a multi-byte character, and likewise for the other ascii control codes (carriage-return, null, tab, etc.).

So, using "\n" -- in any version of perl, with utf8 or any non-unicode variable-width Asian encoding -- should be no problem.
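If you want to check this for yourself, here is a minimal sketch using the Encode module that ships with perl 5.8. The sample text and the variable names are my own choices for illustration, and one sample string is only a spot check, not a proof about the whole encoding:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Japanese text with exactly one real newline in the middle,
# written with \x{} escapes so the script itself stays plain ascii.
my $text = "\x{65E5}\x{672C}\x{8A9E}\n\x{30C6}\x{30B9}\x{30C8}";

my %newlines;
for my $enc (qw(shiftjis euc-jp UTF-8)) {
    my $bytes = encode($enc, $text);
    # Count the 0x0a bytes in the raw encoded data.
    my $count = () = $bytes =~ /\n/g;
    $newlines{$enc} = $count;
    printf "%-8s %d newline byte(s)\n", $enc, $count;
}
```

Each encoding should report a single 0x0a byte, the real new-line, since none of the two- or three-byte characters contains that value.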

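The only encoding scheme where you might run into confusion by putting plain-old "\n" into a regex is UTF16 (whether big-endian or little-endian), because there the byte 0x0a can be one half of a 16-bit unit that has nothing to do with new-line (U+010A, for example). If you're dealing with UTF16 data, convert it to utf8 first.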
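To make the UTF16 case concrete, here is a small sketch, again assuming the Encode module from perl 5.8. The test character U+010A and the variable names are just my picks for the demonstration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) is the byte pair
# 0x01 0x0A in UTF-16BE, so the raw data contains 0x0a even though
# the text has no new-line in it.
my $text  = "\x{010A}";
my $utf16 = encode('UTF-16BE', $text);
my $utf8  = encode('UTF-8',    $text);

my $raw16_hit = ($utf16 =~ /\n/) ? 1 : 0;    # regex on raw UTF16 bytes
my $raw8_hit  = ($utf8  =~ /\n/) ? 1 : 0;    # regex on raw utf8 bytes
my $decoded   = decode('UTF-16BE', $utf16);  # back to perl characters
my $char_hit  = ($decoded =~ /\n/) ? 1 : 0;  # regex on decoded text

print "raw UTF16: $raw16_hit, raw utf8: $raw8_hit, decoded: $char_hit\n";
# prints "raw UTF16: 1, raw utf8: 0, decoded: 0"
```

Only the match against the raw UTF16 bytes gives a false hit; once the data is decoded (or re-encoded as utf8), "\n" behaves as expected.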

(Just curious: did you have some specific notion in mind about what constitutes a "false newline"?)


In reply to Re: Perl Encoding by graff
in thread Perl Encoding by mfunke
