But then a Unicode character comes up, and suddenly writing text to stdout produces garbage characters and Perl issues a warning about it.

Add use open qw/:std :utf8/; at the top of your code to open STDIN/OUT/ERR as UTF-8 (assuming your console is UTF-8).

I verify that strings are flagged as native or as UTF-8 at the places where they should be.

You should only be checking the UTF8 flag for debugging purposes when you find problems with your code, and not assuming what its state should be - it's an internal flag that can (and will) change across Perl versions.

Perl will then have a mixture of 'native' and 'UTF-8' strings to concatenate. How does that work? Even if there are no characters above 127, Perl will have to scan all 'native' strings, if only to issue a warning for high characters. Is that right? If all strings were flagged as UTF8, concatenation should be faster, shouldn't it?

I think you're worrying too much about the internals here. Perl generally does the right thing; you should only worry about it if you actually have problems with your code (write tests to check the input and output of your code), and you should only worry about speed if it becomes an issue for you.

In general, for the best Unicode support, use the latest version of Perl (5.26 is pretty good, but think about upgrading using e.g. perlbrew), encode your source files as UTF-8, and start them off like this:

#!/usr/bin/env perl use warnings; use 5.026; # or higher, enables "unicode_strings" features etc. use utf8; # source code is encoded with UTF-8 use open qw/:std :utf8/; # make UTF-8 default encoding, incl STDIN+OUT use warnings FATAL => 'utf8'; # optional

And make sure to always specify the correct encoding when opening files ("open" Best Practices). If you have problems, feel free to post them here, see also my advice on that here.

I think I will code the removal of trailing slashes with a regular expression, as that should respect the flag.

See the core module File::Spec for how to do operations on filenames in a portable way.

Update: Added "use warnings FATAL => 'utf8';"


In reply to Re^3: substr on UTF-8 strings by haukex
in thread substr on UTF-8 strings by rdiez

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.