Note: I have a solution/workaround to my problem, BUT I was wondering if there was an explanation for this issue.
Background:
My Perl script has use utf8 (or, more precisely, BEGIN{$^H |= 8388608}). The script extracts a string from an HTML file (decoding entities in the process), replaces/transliterates certain strings/characters, and prints the result. Before I added the pragma, I got weird characters like U+FFFD � in the output. I'm guessing (correct me if I'm wrong) that this happened because my code contains literal non-ASCII characters, specifically in tr/.../..non-ASCII chars here../s. I use literals rather than escape sequences like \x{...} so that I can easily distinguish the characters, since one purpose of the script is to transliterate printable ASCII characters that are forbidden in filenames (Windows/Linux) into different Unicode characters that are allowed (e.g. ? to ‽, \ to \).
So I think I understand what use utf8 does.
Problem:
But I recently stumbled across another issue involving strings that contain some non-ASCII characters. (Note that I only transliterate a handful of characters, so the vast majority of characters in the strings I extract are not replaced.) After parsing such a string and calling CORE::mkdir or CORE::print on it, some non-ASCII characters are messed up and replaced with other characters.
An example of one of the characters that caused issues: ☆.
The HTML page originally contained &#9734; (the HTML decimal entity equivalent), which was then converted to ☆ by my HTML parser.
print outputs the character â instead of ☆.
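As a data point, the â itself looks consistent with the UTF-8 bytes of ☆ being read as Latin-1 somewhere along the way: the first of those bytes is 0xE2, which is â in Latin-1. This is only a minimal sketch of that byte-level effect, not my actual script; the variable names are made up for illustration, and I'm assuming Latin-1 is the encoding involved.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+2606 WHITE STAR, written as an escape so this sketch does not depend
# on the encoding of the source file.
my $star = "\x{2606}";

# Encode the character string into UTF-8 bytes.
my $bytes = encode('UTF-8', $star);
print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";   # E2 98 86

# Assumption: something reads those bytes as Latin-1 instead of UTF-8.
# The first byte (0xE2) then becomes U+00E2, which is the character â.
my $misread = decode('ISO-8859-1', $bytes);
printf "U+%04X\n", ord(substr($misread, 0, 1));                        # U+00E2
```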
What's interesting is that if I remove the use utf8 (or BEGIN{$^H |= 8388608}) from the script, the problematic character is printed just fine, BUT basically every other (ASCII?) character that is not a letter or digit, such as !, space, and &, is replaced with the aforementioned � character.
What's also interesting is that if I utf8::downgrade or utf8::decode the string before printing, everything prints fine.
So basically I'm asking if anyone has an explanation for this behavior. Thanks.