in reply to Re^2: 'use' inside or outside of package declaration?
in thread 'use' inside or outside of package declaration?

My current semi-working boilerplate for new programs tends to look like this:
    #!/usr/bin/env perl

    use 5.012;      # want unicode strings!
    use utf8;
    use strict;
    use autodie;
    use warnings;   # defer FATAL till runtime
    use open qw< :std :utf8 >;
    use charnames qw< :full >;
    use File::Basename qw< basename >;
    use Carp qw< carp croak confess cluck >;

    $0 = basename($0);   # shorter messages
    $| = 1;

    binmode(DATA, ":utf8");

    # give a full stack dump on any untrapped exceptions
    # (the handler receives the message in @_; $@ is not set yet)
    $SIG{__DIE__} = sub {
        confess "Uncaught exception: @_" unless $^S;
    };

    # now promote run-time warnings into stackdumped exceptions
    # *unless* we're inside an eval block, in which
    # case just generate a clucking stackdump instead
    $SIG{__WARN__} = sub {
        if ($^S) { cluck   "Trapped warning: @_" }
        else     { confess "Deadly warning: @_"  }
    };
But that suffers from a couple of bugs. There is a bug in the implementation of autodie that screws up the layers imposed by use open. Witness:
    % perl -e 'use open qw(:std :utf8); open(F, ">/tmp/out"); print F "\xDF"'; wc /tmp/out
           0       1       2 /tmp/out
    % perl -e 'use autodie; use open qw(:std :utf8); open(F, ">/tmp/out"); print F "\xDF"' ; wc /tmp/out
           0       1       1 /tmp/out
    % perl -e ' use open qw(:std :utf8); use autodie; open(F, ">/tmp/out"); print F "\xDF"' ; wc /tmp/out
           0       1       1 /tmp/out
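One possible way around the lost layers, offered only as a sketch, is to name the encoding layer in the open call itself, so the result no longer depends on what use open set up. The helper name write_utf8 and the temporary file are assumptions made for illustration; they are not part of the boilerplate above.

```perl
use strict;
use warnings;
use autodie;
use File::Temp qw< tempfile >;

# Hypothetical helper (not from the boilerplate): write $text to $path
# with an explicit encoding layer, independent of any `use open` setting.
sub write_utf8 {
    my ($path, $text) = @_;
    open(my $fh, ">:encoding(UTF-8)", $path);   # autodie raises on failure
    print {$fh} $text;
    close($fh);
    return -s $path;                            # file size in bytes
}

# A throwaway file stands in for /tmp/out.
my (undef, $path) = tempfile(UNLINK => 1);

# U+00DF encodes to two bytes in UTF-8, even with autodie loaded.
my $bytes = write_utf8($path, "\xDF");
print "$bytes\n";
```

With the layer spelled out, the order of use autodie and use open should stop mattering for that handle.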
The other problem is that use utf8 doesn’t really work well on globals, because there are issues with how the package symbol tables are accessed as byte strings. There is also an issue of what to do about something like:
use Weather::El_Niño;
That has to map to the filesystem, and now what do you do? Just use the bytes as they are? Normalize to UTF‑8? Downgrade to Latin1 (which it might have already been)? Did you know that (for very good reasons) the Darwin HFS+ filesystem always converts filenames into NFD, their canonically decomposed form? So be careful when checking filenames!! You can’t just say:
@files = grep { /Niñ/ } glob("{El,La}_*");
Because if you input it in the normal way, your pattern is going to have a U+00F1 LATIN SMALL LETTER N WITH TILDE there (which is the NFC form) but the results from the filesystem will have an "n" followed by U+0303 COMBINING TILDE (the NFD version), which is suddenly two separate code points, not one.
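A minimal sketch of one way to cope, assuming the filenames have already been decoded into character strings: bring both the pattern and the names into the same normalization form before comparing. The helper grep_normalized and the sample names are made up for illustration; a real program would get the names from glob or readdir and decode them first.

```perl
use strict;
use warnings;
use Unicode::Normalize qw< NFD >;

# Hypothetical helper: keep the names that contain $needle once both
# sides have been put into NFD.
sub grep_normalized {
    my ($needle, @names) = @_;
    my $nfd_needle = NFD($needle);
    return grep { NFD($_) =~ /\Q$nfd_needle\E/ } @names;
}

# Sample names in NFD, the way an HFS+ volume would hand them back:
# "n" followed by U+0303 COMBINING TILDE.
my @names = ("El_Nin\x{303}o", "La_Nin\x{303}a", "El_Norte");

# The pattern typed the normal way uses U+00F1 (NFC).
my @naive = grep { /Ni\x{F1}/ } @names;
my @hits  = grep_normalized("Ni\x{F1}", @names);

printf "naive: %d, normalized: %d\n", scalar(@naive), scalar(@hits);
```

Normalizing to NFC on both sides would work just as well; the point is only that the two sides agree.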

There is a Google Summer of Code project for cleaning up Perl’s tokenizer vis‐à‐vis 8‑bit names, including for UTF‑8. I am convinced that this can and shall be fixed.


I see my stalker is back. Yawn!

Re^4: 'use' inside or outside of package declaration?
by choroba (Cardinal) on May 12, 2011 at 12:07 UTC
    Just a note: Darwin HFS+ uses NFD (with some deviations), not NFC. I've been bitten by it, you cannot even do
        touch á
        cat á
    in the shell.
      Yes, sorry, I wrote NFC but described NFD. I’ll fix it. However, your example of something that doesn’t work, does. Watch:
          % uniquote -v t
          echo foo > \N{LATIN SMALL LETTER A WITH ACUTE}
          cat \N{LATIN SMALL LETTER A WITH ACUTE}
          % sh /tmp/t
          foo
          % ls a? | uniquote -v
          a\N{COMBINING ACUTE ACCENT}
      It has to work that way, because the same NFD conversion takes place for all filenames passed to open. But I do know what you mean. It depends on the syscall. Apparently stat doesn’t do that, since there you have to do it yourself:
          % perl -le 'print -e "\xE1" ? "Yes" : "No"'
          No
          % perl -MUnicode::Normalize -le 'print -e NFC("\xE1") ? "Yes" : "No"'
          Yes



        OK, I don't have any HFS handy. Maybe it was touch and ls, which would be in accord with what you show about stat.
Re^4: 'use' inside or outside of package declaration?
by John M. Dlugosz (Monsignor) on May 12, 2011 at 12:34 UTC
    I'll study that some more later in the day.

    use strict is redundant since 5.12 includes that.

    As for file names, don't forget Windows uses UTF-16 in its API. The characters have well-defined meanings (not just bytes), but it still suffers from normalization issues.

    binmode(DATA, ":utf8"); That's not implied by the utf8 pragma?

      I put the strict there even when using 5.12 or above because it documents what’s going on, and because I alas find myself downgrading to 5.10.1 now and again, and don’t want to lose things.

      In general, I’m for several reasons opposed to non‐pragmas diddling their caller’s scope’s lexical hints in non‐obvious ways unrelated to that module’s purpose; that’s just too Acme:: for my tastes. But I do consider use 5.012 a pragma — that is, a compiler declaration that can alter the rules of engagement.

      And no, the necessary binmoding of DATA is triggered neither by use utf8 nor by use open ":utf8". Go figger. 😾

      I have as little to do with UTF‑16 as I possibly can. 🙈 🙉 🙊 Anything that makes me deal with individual code units is such a lose that I just want to kick the people who afflicted the world with this idiocy. Aren’t you glad we don’t have to count code units in Perl? 😹
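To make the code-unit point concrete, here is a small sketch that counts the same string three ways. The helper unit_counts is a made-up name for illustration; Encode's encode is the only library call it relies on.

```perl
use strict;
use warnings;
use Encode qw< encode >;

# Hypothetical helper: return (code points, UTF-16 code units, UTF-8 bytes)
# for a character string.
sub unit_counts {
    my ($str) = @_;
    return (
        length($str),                           # Perl counts code points
        length(encode("UTF-16LE", $str)) / 2,   # two bytes per code unit, no BOM
        length(encode("UTF-8",    $str)),       # bytes on the wire
    );
}

# U+1F63E POUTING CAT FACE lies outside the BMP, so UTF-16 needs a
# surrogate pair for it.
my @counts = unit_counts("\x{1F63E}");
print "@counts\n";   # prints "1 2 4"
```

In Perl, length sees one character; only the encoded forms expose code units.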



        The big question is: where do you get a font with a "pouting cat face"?
        Never mind, I didn’t have Symbola installed on this computer.

        What are you using to type? I see you have U+2010 HYPHEN instead of the ASCII hyphen-minus character.

        I find wide strings to be a pain in C++/Windows. I suppose it was a good idea at the time of development of NT 3, since UTF-8 was a footnote and not the force for good as we understand it today. Still, at some point they could have made the "code page" for UTF-8 (which is defined) work correctly for the 8-bit char API functions. For all I know it should: it just wraps the native calls around a call to convert, and the convert function handles that one, right?