PerlMonks  

Unicode, regex's, encodings, and all that (Perl 5.6 and 5.8)

by John M. Dlugosz (Monsignor)
on Dec 24, 2002 at 04:08 UTC ( [id://222033]=perlquestion )

John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

In a module, I was wondering whether to use utf8 or not, since it affects regular expressions. In 5.6, the user of the module would have to pass strings of the matching encoding discipline or it would not work right. But I read that in 5.8 the regex is polymorphic and will transparently accept either kind of string, so this is no longer an issue.

But, the new perlunicode states,

The regular expression compiler produces polymorphic opcodes. That is, the pattern adapts to the data and automatically switches to the Unicode character scheme when presented with Unicode data--or instead uses a traditional byte scheme when presented with byte data. use utf8 still needed to enable UTF-8/UTF-EBCDIC in scripts. {emph. in original}
So does that mean I still need use utf8 in scope in order to generate this polymorphic code, or only if the regex uses Unicode features such as \x{} literals or the enhanced meaning of \w, or what? It seems to be saying two different things here.
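A minimal sketch of what "the pattern adapts to the data" looks like in practice, assuming Perl 5.8 or later with default settings (no locale and no unicode_strings feature in scope). There is deliberately no use utf8 anywhere. The helper name is_word_char and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# One pattern, compiled once, with no `use utf8` in scope.
my $word_char = qr/^\w$/;

# Hypothetical helper: does this one-character string match \w?
sub is_word_char {
    my ($s) = @_;
    return $s =~ $word_char ? 1 : 0;
}

my $byte_string = "\xE9";        # one byte, stored as a traditional byte string

my $char_string = "\xE9";
utf8::upgrade($char_string);     # same character, now stored internally as UTF-8

my $wide_string = "\x{3A3}";     # GREEK CAPITAL LETTER SIGMA, necessarily a character string

print is_word_char($byte_string), "\n";   # 0: byte rules, 0xE9 is not a word character
print is_word_char($char_string), "\n";   # 1: Unicode rules, U+00E9 is a letter
print is_word_char($wide_string), "\n";   # 1: Unicode rules, U+03A3 is a letter
```

If this behaves the same on your perl, it suggests the switch is driven by the string being matched, and that use utf8 matters only when the source text itself contains UTF-8 (literals or identifiers).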

And that's not the only place. In encoding, it states, "The pragma is a per script, not a per block lexical. Only the last use encoding or no encoding matters, and it affects the whole script. ... the use of this pragma inside the module is strongly discouraged (because the influence of this pragma lasts not only for the module but the script that uses). But if you have to, make sure you say no encoding at the end of the module so you contain the influence of the pragma within the module. "

So, if you put no encoding at the end of your module's pm file to "contain" it, doesn't that kill any use encoding at the top of the script, since only the last use or no has an effect?

And I would think it would affect the file (e.g. a module, or a file pulled in with require or do), not the whole script, since Perl would have to make two passes for the last (overall) setting to affect the earlier-read files. And for a run-time require, that just does not compute.

If you're discouraged from using it inside a module, what good is it? A Greek can't write his reusable code in a Greek code page. And if he writes his main file that way, then it will mess up any modules (encoded as Latin-1) that he tries to use. That is so nuts that I can only suppose that the documentation is broken. What's the real story here?

Meanwhile, is use utf8 necessary for extended variable names? use encoding doesn't apply, but I wonder if Perl would take the normal G1 range as letters or (I suppose) as unknowns?

—John

Replies are listed 'Best First'.
Re: Unicode, regex's, encodings, and all that (Perl 5.6 and 5.8)
by Zaxo (Archbishop) on Dec 24, 2002 at 04:30 UTC
    is use utf8 necessary for extended variable names?

    Yes, if utf8 names are used in Perl source, you must use utf8;. I gave an example of a utf8 named sub in [5.8.0 Note] A Sub Named Sigma. That code failed without the pragma. I wish I could answer the rest of your questions as easily.
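A small sketch along those lines (not the code from the linked node, just an illustration), assuming the file is saved as UTF-8. The sub name Σ and what it computes are made up for the example.

```perl
use strict;
use warnings;
use utf8;    # needed here: the sub name below is not ASCII

# Hypothetical sub with a Greek name; returns the sum of its arguments.
sub Σ {
    my $total = 0;
    $total += $_ for @_;
    return $total;
}

my $sum = Σ(1, 2, 3);
print "$sum\n";    # prints 6
```

Without the use utf8 line, perl sees the raw bytes of the name rather than a letter and should refuse to compile the file.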

    After Compline,
    Zaxo

Re: Unicode, regex's, encodings, and all that (Perl 5.6 and 5.8)
by pg (Canon) on Dec 24, 2002 at 07:57 UTC
    Yes, 'use encoding' and 'no encoding' are per script, not per lexical block. However, your interpretation is wrong. The following examples may help:
    1. In a package, if you say:
      ... (block 1)
      use encoding "greek";
      ... (block 2)
      no encoding;
      ... (block 3)
      
      The actual effect is encoding in block 2, but no encoding in blocks 1 and 3. It is not the case, as you expected, that the whole module has no encoding because 'no encoding' comes last.
    2. In a package, if you say:
      ... (block 1)
      use encoding "greek";
      ... (block 2)
      use encoding "greek";
      ... (block 3)
      no encoding;
      ... (block 4)
      
      The actual effect is encoding in blocks 2 and 3, but no encoding in blocks 1 and 4. Again, according to your interpretation, the whole module would have no encoding because 'no encoding' comes last in the physical sequence. And again, that interpretation is wrong.
    3. NOW WHAT DOES THAT SENTENCE IN "encoding" documentation MEAN? Let's get to the point:

      In packageA, you say:
      ...(block 1)
      use encoding "greek";
      ...(block 2)
      
      In a script, you say:
      use packageA;
      ...(block3)
      
      In this example, you should expect block 1 to have no encoding and block 2 to have encoding. Block 3? This is what the documentation means: block 3 has encoding, because the use encoding in packageA is not block lexical but per script. Your script now 'contains' packageA, so the use encoding you put in packageA affects the rest of the script, not just the code up to the end of packageA.
    4. Another example: In packageA, you say:
      ...(block 1)
      use encoding "greek";
      ...(block 2)
      
      In a script, you say:
      use packageA;
      ...(block3)
      no encoding;
      ...(block 4)
      
      It would be encoding in blocks 2 and 3, and no encoding in blocks 1 and 4. The reason block 3 has encoding is that the 'use encoding' you put in packageA is per script, so it even affects code outside the package, until it meets a 'no encoding' later in the script.
    5. Last example, and this is what you are supposed to do in the best practice.

      In packageA, you say:
      ...(block 1)
      use encoding "greek";
      ...(block 2)
      no encoding; # at the end of packageA
      
      In a script, you say:
      use packageA;
      ...(block 3)
      use encoding "greek";
      ...(block 4)
      
      Blocks 1 and 3 have no encoding, blocks 2 and 4 have encoding. That matches exactly what you see on the screen: what you see is what you get. So the best practice is to always say "no encoding" at the end of your module if you said "use encoding" anywhere earlier in it. This makes sure that your module's encoding does not affect any other module or script unexpectedly.
      Ah,
      "The pragma is a per script, not a per block lexical. Only the last use encoding or no encoding matters, and it affects the whole script."
      really should say something like
      "The pragma is not block-scoped like lexicals and strict. Rather, any use/no encoding statement has an immediate effect starting on the next statement in the file. It does not "pop" at the end of the block or current module, but continues on until another use or no encoding statement is encountered."
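For contrast, a minimal sketch of what block scoping means for a pragma like strict, which does "pop" at the closing brace. The sub and variable names are hypothetical and exist only for this illustration; the encoding pragma itself is not used here.

```perl
use strict;
use warnings;

our $greeting = "hello";
my $name = "greeting";

sub inside_block {
    no strict 'refs';            # relaxed only until this sub's closing brace
    return ${"main::$name"};     # symbolic reference is allowed here
}

sub outside_block {
    # strict 'refs' is back in force, so the symbolic reference dies at
    # run time; eval traps the error and yields undef in scalar context.
    return eval { ${"main::$name"} };
}

print inside_block(), "\n";                                         # hello
print defined(scalar outside_block()) ? "allowed\n" : "refused\n";  # refused
```

The no strict 'refs' ends with the block that contains it. As described above, use encoding does not end that way: it stays in force until another use or no encoding statement is reached.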

      FWIW, I don't think I'm "interpreting" the sentence "only the last one matters and affects the whole script" in an odd way; I think it is written incorrectly and says the wrong thing.

      —John

Re: Unicode, regex's, encodings, and all that (Perl 5.6 and 5.8)
by Courage (Parson) on Dec 24, 2002 at 07:35 UTC
    Answering your first question, about "polymorphic code": I see only one way to read this.
    Perl *always* produces polymorphic opcodes, and whether Unicode features are currently in effect for a given regular expression depends on the "utf8" switch.

    Courage, the Cowardly Dog
