In a module, I was wondering whether to use utf8 or not, since it affects regular expressions. In 5.6, the user of the module would have to pass strings of the matching encoding discipline or it would not work right. But I read that in 5.8 the regex is polymorphic and will transparently accept either kind of string, so this is no longer an issue.

But the new perlunicode states:

"The regular expression compiler produces polymorphic opcodes. That is, the pattern adapts to the data and automatically switches to the Unicode character scheme when presented with Unicode data--or instead uses a traditional byte scheme when presented with byte data. use utf8 still needed to enable UTF-8/UTF-EBCDIC in scripts." {emph. in original}

So, does that mean I still need use utf8 in scope in order to generate this polymorphic code, or only if the regex uses Unicode features such as \x{} literals or the enhanced meaning of \w, or what? It seems to be saying two different things here.
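
One way to probe the first reading is a small experiment. The sketch below assumes Perl 5.8.1 or later (for utf8::is_utf8) and an ASCII-only source file with no use utf8 in scope; the describe helper and the sample strings are made up for illustration. If the same compiled pattern handles both inputs, that would suggest use utf8 is about the encoding of the source text rather than about how the regex treats its data.

```perl
use strict;
use warnings;

# Compiled once in an ASCII-only source file, with no "use utf8" in scope.
my $re = qr/^(\w+)-(\d+)$/;

# Hypothetical helper: report how the pattern treated a given string.
sub describe {
    my ($s) = @_;
    my $kind = utf8::is_utf8($s) ? 'character string' : 'byte string';
    return "$kind: no match" unless $s =~ $re;
    return sprintf '%s: word of %d chars, number %s', $kind, length($1), $2;
}

my $byte_str = "item-42";                    # plain bytes, no UTF8 flag
my $char_str = "\x{3b1}\x{3b2}\x{3b3}-42";   # Greek alpha, beta, gamma

print describe($_), "\n" for $byte_str, $char_str;
```

Note that the \x{...} escapes are plain ASCII in the source, so this does not test what happens when the pattern itself contains literal UTF-8 text.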

And that's not the only place. The documentation for encoding states: "The pragma is a per script, not a per block lexical. Only the last use encoding or no encoding matters, and it affects the whole script. ... the use of this pragma inside the module is strongly discouraged (because the influence of this pragma lasts not only for the module but the script that uses). But if you have to, make sure you say no encoding at the end of the module so you contain the influence of the pragma within the module."

So, if you put no encoding at the end of your module's pm file to "contain" it, doesn't that kill any use encoding at the top of the script, since only the last use or no has an effect?

And I would think it would affect the file (e.g. a module, or a required or do'ed file), not the whole script, since Perl would have to make two passes for the last (overall) use or no to affect the earlier-read files. And for a run-time require, that just does not compute.

If you're discouraged from using it inside a module, what good is it? A Greek can't write his reusable code in a Greek code page. And if he writes his main file that way, then it will mess up any modules (encoded as Latin-1) that he tries to use. That is so nuts that I can only suppose that the documentation is broken. What's the real story here?

Meanwhile, is use utf8 necessary for extended variable names? use encoding doesn't apply, but I wonder if Perl would take the normal G1 range as letters or (I suppose) as unknowns?
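
For reference, here is a minimal sketch of the case I have in mind. It assumes the file is saved as UTF-8; the variable name is only an illustration.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# A variable name containing non-ASCII letters.
my $größe = 42;

# With "use utf8" the name and the literal are read as characters, not bytes.
my $report = "größe = $größe";

binmode STDOUT, ':encoding(UTF-8)';   # encode characters on output
print "$report\n";
```

The question is what happens to the same identifier when use utf8 is left out and the file is in a single-byte code page instead.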

—John


In reply to Unicode, regex's, encodings, and all that (Perl 5.6 and 5.8) by John M. Dlugosz
