I am finding Unicode support in Perl hard. Most of my strings are ASCII, so there usually is no trouble. But then a Unicode character comes up, and suddenly writing text to stdout produces garbage characters and Perl issues a warning about it.
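To make the symptom concrete: the warning in question is usually "Wide character in print", which, as far as I understand, shows up when a character above 255 is written to a filehandle that has no encoding layer. Below is a minimal sketch of the common remedy, assuming the output is meant to be UTF-8. The helper name encode_for_output is made up for this example; it writes to an in-memory buffer so the resulting bytes can be inspected.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text through a UTF-8 encoding layer into an
# in-memory buffer and return the bytes that were produced.
sub encode_for_output {
    my ($text) = @_;
    my $buffer = '';
    open(my $fh, '>:encoding(UTF-8)', \$buffer)
        or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh) or die "Cannot close in-memory handle: $!";
    return $buffer;
}

# For real output, the same layer can be pushed onto STDOUT once at startup
# (assumption: the terminal expects UTF-8).
binmode(STDOUT, ':encoding(UTF-8)') or die "binmode failed: $!";

my $bytes = encode_for_output("\x{263A}");    # U+263A WHITE SMILING FACE
printf "%d bytes written for 1 character\n", length($bytes);
```

With the layer in place, Perl encodes on output, so the internal flag of each string should not matter for what ends up on the terminal.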

So I have come up with an assert strategy: during development I enable my "UTF-8 asserts", which verify that strings are flagged as native or as UTF-8 at the places where I expect them to be. This has helped me catch errors early, and it is how I noticed that substr() behaves differently: it can return a string flagged as native even when the source string was flagged as UTF-8.
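A minimal sketch of what such asserts could look like, built on utf8::is_utf8. The names assert_utf8_flagged, assert_native and $ENABLE_UTF8_ASSERTS are hypothetical and only illustrate the idea; they are not taken from any module.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical switch: set to 0 to disable the checks outside development.
our $ENABLE_UTF8_ASSERTS = 1;

# Croak unless $str carries Perl's internal UTF-8 flag.
sub assert_utf8_flagged {
    my ($str, $what) = @_;
    return 1 unless $ENABLE_UTF8_ASSERTS;
    croak "$what: expected a UTF-8 flagged string" unless utf8::is_utf8($str);
    return 1;
}

# Croak if $str carries the flag, i.e. it is not a "native" string.
sub assert_native {
    my ($str, $what) = @_;
    return 1 unless $ENABLE_UTF8_ASSERTS;
    croak "$what: expected a native (unflagged) string" if utf8::is_utf8($str);
    return 1;
}

my $path = "/home/user/docs/";
utf8::upgrade($path);    # force the flag on, although every character is ASCII
assert_utf8_flagged($path, 'path');
```

Note that these checks look only at the internal flag, not at the characters: an all-ASCII string passes or fails purely depending on how it was produced.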

If I capture those trailing slashes with a regular expression, the (UTF-8/native) flag is preserved. I think I will code the removal of trailing slashes with a regular expression, as that should respect the flag.
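A minimal sketch of the regular-expression approach, with a made-up helper name strip_trailing_slashes. It prints the flag before and after rather than assuming what it will be, since that is exactly the point in question.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $path without trailing slashes.
# Note: a path consisting only of slashes becomes the empty string.
sub strip_trailing_slashes {
    my ($path) = @_;
    (my $stripped = $path) =~ s{/+\z}{};
    return $stripped;
}

my $dir = "/tmp/caf\x{e9}/";
utf8::upgrade($dir);    # make sure the input is flagged as UTF-8

my $clean = strip_trailing_slashes($dir);

# Report the flags instead of taking them for granted.
printf "flag before: %d, flag after: %d\n",
    (utf8::is_utf8($dir)   ? 1 : 0),
    (utf8::is_utf8($clean) ? 1 : 0);
```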

Say substr sees that all sliced characters are ASCII and sets the "native string" flag. Say my code slices some other path components, some of which do have Unicode characters, so that those strings remain flagged as UTF-8. Let's assume that all those strings are concatenated together afterwards.

Perl will then have a mixture of 'native' and 'UTF-8' strings to concatenate. How does that work? Even if there are no characters above 127, Perl will have to scan all 'native' strings, if only to issue a warning for high characters. Is that right? If all strings were flagged as UTF-8, concatenation should be faster, shouldn't it?
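To make the scenario concrete, here is a minimal sketch that joins a native and a UTF-8 flagged component and compares the result with joining two flagged components. The helper join_components is hypothetical. The sketch checks only that the characters come out the same either way; it prints the flag of the result instead of assuming it, and it says nothing about speed.

```perl
use strict;
use warnings;

# Hypothetical helper: join path components with "/".
sub join_components {
    return join '/', @_;
}

my $native = "usr";
utf8::downgrade($native);     # make sure the flag is off

my $flagged = "caf\x{e9}";
utf8::upgrade($flagged);      # make sure the flag is on

# Mixed case: one native operand, one flagged operand.
my $mixed = join_components($native, $flagged);

# Uniform case: upgrade a copy of the native operand first.
my $native_up = $native;
utf8::upgrade($native_up);
my $uniform = join_components($native_up, $flagged);

print "same characters: ", ($mixed eq $uniform ? "yes" : "no"), "\n";
print "length in characters: ", length($mixed), "\n";
print "result flagged: ", (utf8::is_utf8($mixed) ? "yes" : "no"), "\n";
```

Running this with warnings enabled also shows whether the mixed concatenation produces any warning on a given Perl version.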

In any case, is there a good reason why substr should take a 'UTF-8 string' and return a 'native string'? I have heard that other routines do respect the flag.


In reply to Re^2: substr on UTF-8 strings by rdiez
in thread substr on UTF-8 strings by rdiez
