I tried to use a UTF-8 non-breaking space (between the day and the name of the month) in the format argument of POSIX::strftime. With Perl v5.32.0 and a UTF-8-encoded script file, without any non-default encoding settings, I hit upon the following two oddities:

  1. A non-breaking space on its own comes out as something unprintable: according to Emacs it is Unicode 65533 (decimal), REPLACEMENT CHARACTER, and in hex mode it shows as the bytes EF BF BD (the UTF-8 encoding of that character).
  2. When other non-ASCII characters appear in the format, they come out correctly, and this seems "infectious": the non-breaking space then comes out correctly as well! In that case, however, a string concatenated to what strftime returns gets garbled (perhaps re-encoded to UTF-8 on the assumption that it is ISO Latin-1 when it is really UTF-8 already), which does not happen in case 1.

These behaviours can be demonstrated with the following script. (The comments refer to the space character in the format; the innocent-looking inner, i.e. non-syntactical, quotes in lines 3 and 4 are Unicode LEFT and RIGHT SINGLE QUOTATION MARK, the same as in $string.)

    use POSIX qw(strftime);

    $string = 'hailed an über ‘cab’ on ';
    @t = (0, 0, 0, 23, 5, 2020, 4);
    print $string . strftime( '%d/%b', @t), "\n";
    print $string . strftime( '%d %b', @t), "\n"; # UTF-8 nbsp
    print $string . strftime('‘%d %b’', @t), "\n"; # UTF-8 nbsp
    print $string . strftime('‘%d %b’', @t), "\n"; # ASCII space

This outputs (line numbers added):

    1 hailed an über ‘cab’ on 23/Jun
    2 hailed an über ‘cab’ on 23�Jun
    3 hailed an über âcabâ on ‘23 Jun’
    4 hailed an über âcabâ on ‘23 Jun’

Note that I have deleted the complaints about wide characters in print for lines 3 and 4, for brevity.

I am guessing, rather vaguely, that this is down to strftime essentially being the C function, which is not Unicode-aware, and perhaps also to the way Perl identifies how strings are encoded and then "upgrades" some of them to harmonise their encodings (in this case under a wrong assumption). But:

The behaviour with a non-breaking space alone versus with other non-ASCII characters as well definitely seems inconsistent. Why does the non-breaking space behave differently from typographical quotation marks, when both lie outside the ASCII block?
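The kind of "upgrade" guessed at above can at least be reproduced in isolation. The following is only a minimal sketch, not a diagnosis of what strftime actually does: the variable names are made up for illustration, and it assumes that $string holds raw UTF-8 octets while the strftime result is a UTF-8-flagged character string.

```perl
use strict;
use warnings;

# UTF-8 octets of ‘cab’, as a script read without "use utf8" would hold them
# (hypothetical stand-in for part of $string).
my $octets = "\xE2\x80\x98cab\xE2\x80\x99";

# A character string, standing in for a UTF-8-flagged strftime result
# (an assumption made for this illustration).
my $chars = "\x{2018}23\x{A0}Jun\x{2019}";

# On concatenation Perl upgrades $octets, reading every octet as one
# Latin-1 character, so the three octets of each quote become three characters.
my $joined = $octets . $chars;

printf "octets alone:    %d elements\n", length $octets;      # 9
printf "joined:          %d characters\n", length $joined;    # 9 + 8 = 17
printf "first character: U+%04X\n", ord $joined;              # U+00E2, i.e. "â"
```

If that assumption holds, the leading "â" in output lines 3 and 4 would be the first octet (E2) of the quotation mark, followed by two unprintable characters.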

Also, can anything be done about it? That is, is it possible to use non-breaking spaces in a format for strftime such that they come out correctly (without having to resort to inserting extra, and probably unwanted, non-ASCII characters), and is it possible to use any non-ASCII character in the format argument without confusing Perl? (Actually, I can think only of non-breaking spaces as useful, but other cultures may very plausibly have other use cases.)
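One workaround that might be worth trying is to hand strftime plain UTF-8 octets and turn its result back into characters afterwards. This is a minimal sketch under assumptions, not a confirmed fix: strftime_chars is a hypothetical helper (not part of POSIX), the format is written with escapes so that it does not depend on the script's encoding, and month names are assumed to be ASCII or UTF-8 in the current locale.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: takes the format as a character string and
# returns the formatted date as a character string.
sub strftime_chars {
    my ($format, @time) = @_;
    my $octets = $format;
    utf8::encode($octets);                    # characters -> UTF-8 octets
    my $result = strftime($octets, @time);
    # Depending on Perl version and locale, strftime may already have
    # flagged its result as UTF-8; decode only if it has not.
    utf8::decode($result) unless utf8::is_utf8($result);
    return $result;
}

binmode STDOUT, ':encoding(UTF-8)';           # print characters as UTF-8

my @t = (0, 0, 0, 23, 5, 120);                # 23 June 2020 (year counts from 1900)
print strftime_chars("%d\x{A0}%b", @t), "\n";                   # nbsp only
print strftime_chars("\x{2018}%d\x{A0}%b\x{2019}", @t), "\n";   # nbsp plus quotes
```

Any string concatenated with the result should then also be a character string (for example a literal under "use utf8", or decoded input), so that Perl does not have to guess at an encoding when joining them.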


In reply to strftime does not handle Unicode characters in format argument properly (at least, not consistently) by Bruder Savigny
