in reply to Re: substr on UTF-8 strings (updated)
in thread substr on UTF-8 strings
I am finding Unicode support in Perl hard. Most of my strings are ASCII, so there is usually no trouble. But then a Unicode character comes up, and suddenly writing text to stdout produces garbage characters and Perl issues a warning about it.
So I have come up with an assert strategy: during development, I enable my "UTF-8 asserts", which verify that each string is flagged as native or as UTF-8 at the places where I expect it to be. This has helped me prevent errors, and it is how I realised that substr() behaves differently.
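For readers who want to try the same idea, here is a minimal sketch of such an assert. The names assert_utf8_flag and UTF8_ASSERTS are made up for this illustration and are not my actual code; the only built-ins used are utf8::is_utf8 and utf8::upgrade.

```perl
use strict;
use warnings;

# Hypothetical switch: set to 0 outside development to skip the checks.
use constant UTF8_ASSERTS => 1;

# Die unless the string's internal UTF8 flag is in the expected state.
# $expect_flagged is true for "UTF-8 string", false for "native string".
sub assert_utf8_flag {
    my ($str, $expect_flagged, $label) = @_;
    return unless UTF8_ASSERTS;
    my $flagged = utf8::is_utf8($str) ? 1 : 0;
    my $wanted  = $expect_flagged ? 1 : 0;
    die "$label: expected UTF8 flag " . ($wanted ? "on" : "off") . "\n"
        if $flagged != $wanted;
    return;
}

my $path = "dir/sub/";              # plain ASCII literal, flag is off
assert_utf8_flag($path, 0, 'fresh literal');

utf8::upgrade($path);               # same characters, flag now on
assert_utf8_flag($path, 1, 'after upgrade');
```

The flag only describes Perl's internal storage, so a check like this is a development aid and not a test of whether the text itself is correct.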
If I capture those trailing slashes with a regular expression, the (UTF-8/native) flag is preserved. I think I will code the removal of trailing slashes with a regular expression, as that should respect the flag.
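A rough sketch of what I have in mind follows. The path is an invented example, and utf8::upgrade is used only to get an all-ASCII string with the flag on. The script prints the flag before and after, so the behaviour can be checked on your own Perl rather than taken on trust.

```perl
use strict;
use warnings;

my $path = "dir/sub///";
utf8::upgrade($path);                 # force the UTF8 flag on an all-ASCII string

my $before = utf8::is_utf8($path) ? 1 : 0;

# Copy the path, then strip every trailing slash from the copy.
(my $trimmed = $path) =~ s{/+\z}{};

my $after = utf8::is_utf8($trimmed) ? 1 : 0;

printf "%s -> %s (flag before: %d, after: %d)\n",
    $path, $trimmed, $before, $after;
```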
Say substr sees that all sliced characters are ASCII and sets the "native string" flag. Say my code slices some other path components, some of which do have Unicode characters, so that those strings remain flagged as UTF-8. Let's assume that all those strings are concatenated together afterwards.
Perl will then have a mixture of 'native' and 'UTF-8' strings to concatenate. How does that work? Even if there are no characters above 127, Perl will have to scan all 'native' strings, if only to issue a warning for high characters. Is that right? If all strings were flagged as UTF-8, concatenation should be faster, shouldn't it?
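To make the question concrete, here is a small test of the mixed case, with invented strings. As far as I understand it, Perl upgrades the native operand when it concatenates it with a UTF-8 one, reading bytes 128 to 255 as the Latin-1 characters with the same code points, and it does so without a warning. The script only prints flags and lengths, so nothing with a wide character goes to stdout.

```perl
use strict;
use warnings;

my $native = "caf\xE9/";        # 5 characters, all below 256, UTF8 flag off
my $wide   = "\x{263A}.txt";    # 5 characters, one above 255, UTF8 flag on

my $joined = $native . $wide;

printf "native flagged: %d\n", utf8::is_utf8($native) ? 1 : 0;
printf "wide flagged:   %d\n", utf8::is_utf8($wide)   ? 1 : 0;
printf "joined flagged: %d\n", utf8::is_utf8($joined) ? 1 : 0;
printf "joined length:  %d characters\n", length $joined;

# The high character from the native string keeps its code point.
printf "4th character:  U+%04X\n", ord substr($joined, 3, 1);
```

This does not tell me anything about speed; a comparison of flagged against mixed concatenation would need a real benchmark.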
In any case, is there a good reason why substr should take a 'UTF-8 string' and return a 'native string'? I have heard that other routines do respect the flag.