Note: I have a solution/workaround to my problem, BUT I was wondering if there was an explanation for this issue.

Background:

I have use utf8 (or, more precisely, BEGIN{$^H |= 8388608}) in a Perl script of mine, which extracts a string from an HTML file (decoding entities in the process), replaces/transliterates certain strings/characters, and prints the result back out. There was a time when I didn't have the pragma, and it caused issues: I got weird characters like U+FFFD � in the output. I'm guessing (correct me if I'm wrong) that this happened because I have literal non-ASCII characters in my code, specifically in tr/.../..non-ASCII chars here../s, rather than escape sequences like \x{...}, so that I can easily tell the characters apart. One of the purposes of the script is to transliterate printable ASCII characters that are forbidden in filenames (Windows/Linux) into different Unicode characters that are allowed (e.g. ? to ‽, \ to \).
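To make the guess above concrete, here is a minimal sketch (not the actual script) of a transliteration with a literal non-ASCII character in the source. The helper name `sanitize_name` is hypothetical, the file is assumed to be saved as UTF-8, and only the ? to ‽ mapping comes from the description above.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# Hypothetical helper, not the original script: replace a character that
# is forbidden in file names with an allowed look-alike.
sub sanitize_name {
    my ($name) = @_;
    $name =~ tr/?/‽/;    # '?' becomes U+203D INTERROBANG
    return $name;
}

# With the pragma, the literal is a single character. Without it, the same
# literal is seen as the three bytes of its UTF-8 encoding, which is why
# tr/// on literal non-ASCII characters can misbehave without use utf8.
my $chars = do { use utf8; length "‽" };
my $bytes = do { no utf8;  length "‽" };
```

The pragma only affects how the source text itself is read; it does not change how data is decoded on input or encoded on output.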

So I think I understand what use utf8 does.

Problem:

But I recently stumbled across another issue involving strings that contain some non-ASCII characters. (Note that I only transliterate a handful of characters, so the vast majority of characters in the strings I extract are not replaced.) After parsing such an arbitrary string and calling CORE::mkdir or CORE::print on it, some non-ASCII characters are messed up and replaced with some other character.

An example of one of the characters that caused issues: ☆.

The HTML page originally contained &#9734; (the HTML decimal entity equivalent), which was then converted to ☆ by my HTML parser.

print outputs the character â instead of ☆.

What's interesting is that if I remove use utf8 (or BEGIN{$^H |= 8388608}) from the script, the problematic character is printed just fine, BUT basically every other non-letter, non-digit (ASCII?) character, such as !, space, and &, is replaced with the aforementioned � character.

What's also interesting is that if I utf8::downgrade or utf8::decode the string before printing, everything prints fine.
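For anyone who wants to reproduce this, here is a minimal sketch of the pipeline, under stated assumptions: `decode_numeric_entities` is a hypothetical stand-in for the HTML parser (which is not named here), and encoding explicitly with the core Encode module is one common way to control what reaches the output, not necessarily the explanation for the behavior above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for the HTML parser's entity decoding: turn decimal
# entities such as &#9734; into the corresponding character.
sub decode_numeric_entities {
    my ($html) = @_;
    $html =~ s/&#(\d+);/chr($1)/ge;
    return $html;
}

# A character string whose first character is U+2606 (the star).
my $text = decode_numeric_entities('&#9734; star');

# Encode at the output boundary so that print receives bytes, not characters.
my $bytes = encode('UTF-8', $text);
print STDOUT $bytes, "\n";
```

The same encoded bytes could be handed to mkdir, although which encoding the file system expects depends on the platform, so treat that part as an assumption to check.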

So basically I'm asking if anyone has an explanation for this behavior. Thanks.


In reply to Curious about why some characters cause issues with mkdir/print by YenForYang
