Re^2: Malformed UTF-8 character

Re^3: Malformed UTF-8 character by pryrt (Abbot) on Dec 01, 2022 at 19:44 UTC
> The best solution probably is to download notepad++, but it seems like overkill to learn another editor to solve such a rare problem.

As much as it pains me to say it (given my Notepad++ fandom), it does seem like overkill. But `iconv.exe` comes with my Strawberry perl... and if it does with yours, then it can handle the translation. (Or gnuwin32's iconv.) I believe one of the following two would properly translate the CP1252 encoding of the en dash into UTF-8.

`iconv -f ISO-8859-1 -t utf-8 savedfile > outfile.pl`
`iconv -f CP1252 -t utf-8 savedfile > outfile.pl`

(Of course, the other fix is to not `use utf8;` after you download the script; perl will default to your native Windows encoding {if I understand things correctly}, so that should work -- at least, it did for me from that same downloaded source code.)
Re^4: Malformed UTF-8 character by BillKSmith (Monsignor) on Dec 02, 2022 at 16:25 UTC
Thanks for the reminder to check Strawberry (and perl) utilities occasionally.

Note: The character in question (\x96) is one of the differences between ISO-8859-1 and CP1252. See the difference in the character starting at offset 0x12. The savedfile is from the original post Regex: matching any Number then a hyphen.

C:\Users\Bill\forums\monks>xxd savedfile
00000000: 3132 3334 202d 2046 6f6f 0d0a 3536 3737  1234 - Foo..5677
00000010: 3820 9620 4261 720d 0a39 3939 392e 2042  8 . Bar..9999. B
00000020: 617a 0d0a                                az..

C:\Users\Bill\forums\monks>iconv -f ISO-8859-1 -t utf-8 savedfile > outfile.txt
C:\Users\Bill\forums\monks>xxd outfile.txt
00000000: 3132 3334 202d 2046 6f6f 0d0a 3536 3737  1234 - Foo..5677
00000010: 3820 c296 2042 6172 0d0a 3939 3939 2e20  8 .. Bar..9999.
00000020: 4261 7a0d 0a                             Baz..

C:\Users\Bill\forums\monks>iconv -f CP1252 -t utf-8 savedfile > outfile.txt
C:\Users\Bill\forums\monks>xxd outfile.txt
00000000: 3132 3334 202d 2046 6f6f 0d0a 3536 3737  1234 - Foo..5677
00000010: 3820 e280 9320 4261 720d 0a39 3939 392e  8 ... Bar..9999.
00000020: 2042 617a 0d0a                            Baz..

Life was so much easier fifty years ago. Oh, there really were two keypunch codes.

Bill
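The two dumps above can be reproduced without iconv. What follows is only a sketch using the core Encode module; the helper name `utf8_hex_from` is made up for illustration, and `cp1252` is the name Encode uses for that code page.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: read $bytes as if they were in the legacy
# encoding $from, then return the resulting UTF-8 bytes as hex.
sub utf8_hex_from {
    my ($from, $bytes) = @_;
    my $chars = decode($from, $bytes);           # bytes -> characters
    return unpack 'H*', encode('UTF-8', $chars); # characters -> UTF-8 hex
}

# The single byte that differs between the two encodings.
my $byte = "\x96";
printf "%-10s -> %s\n", $_, utf8_hex_from($_, $byte) for qw(ISO-8859-1 cp1252);
```

Read as ISO-8859-1 the byte becomes `c296` (the C1 control U+0096); read as CP1252 it becomes `e28093` (U+2013), which matches the second and third dumps.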
Re^5: Malformed UTF-8 character by choroba (Cardinal) on Dec 02, 2022 at 23:22 UTC
> Life was so much easier fifty years ago

If your native tongue didn't look like "šířící žďáření"...

`map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`
Re^3: Malformed UTF-8 character by soonix (Chancellor) on Dec 02, 2022 at 09:04 UTC
If you `use utf8` only for strings and not for variable names, you could convert your special characters to `\N{...}` notation, such as `"\N{EN DASH}"`.

Of course, it's your decision whether `"Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk"` is more readable than something like "Björk" or not ;-)

N.B.: For the \N escape to work in Perl older than 5.16, you need an explicit `use charnames;`
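As a rough illustration of the `\N{...}` approach, here is a sketch whose source file stays pure ASCII, so it does not matter how the file was saved. The sub name `split_entry` and the sample lines are invented for the example, not taken from the original script.

```perl
use strict;
use warnings;
use charnames ':full';    # needed before Perl 5.16; harmless on newer perls

# Hypothetical helper: split "NUMBER <hyphen or en dash> LABEL" into its
# two parts. The dash is spelled by name, so no `use utf8;` is required.
sub split_entry {
    my ($line) = @_;
    return $line =~ /^(\d+)\s*[-\N{EN DASH}]\s*(.*)$/ ? ($1, $2) : ();
}

my ($num, $label) = split_entry("56778 \N{EN DASH} Bar");
print "$num => $label\n";
```

The same character class accepts a plain ASCII hyphen, so lines written either way are matched.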
Re^4: Malformed UTF-8 character by pryrt (Abbot) on Dec 02, 2022 at 15:34 UTC
soonix wrote the following, in reply to BillKSmith,

> if you `use utf8` only for strings.... it's your decision

Actually, 1nickt wrote the code that BillKSmith was trying to run. It was not a decision on BillKSmith's part at all; he just downloaded code from perlmonks, expecting code from a longstanding monk to run without edits. Because perlmonks serves its pages as ISO-8859-1, not UTF-8, code that is saved from the site ends up in a file encoded as ISO-8859-1. And because there was a `use utf8;` in that code, the perl executable gives the "Malformed UTF-8 character" message: the file encoding and the pragma do not match.

The best would be if perlmonks would serve posts and `[download]`s as UTF-8, or at least give us an option for it to do so.

The next best is for the monk who `[download]`s the code to convert the file (whether by iconv or a perl oneliner¤ or by a text editor that can change a file's encoding) before running.

The suggestion that requires the most effort so far would be for the monk who `[download]`s the code from perlmonks to search through every piece of downloaded code that has `use utf8;`, check whether the code actually relies on it, and then either comment out that pragma if it's not needed (as I hinted at earlier) or change every non-ASCII character in a quoted string from the actual character to a named character.

¤: oneliner = `perl -pi -MEncode=encode,decode -e "$_ = encode('utf-8', decode('iso-8859-1', $_));" save-as.pl`
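The mismatch described above can be detected before perl ever sees the `use utf8;`. This is a minimal sketch, assuming the downloaded file has been read as raw bytes; the sub names `is_valid_utf8` and `ensure_utf8` are hypothetical, and the source encoding to fall back on is something the caller has to supply.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if $bytes is well-formed UTF-8.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
# the caller's buffer untouched.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Hypothetical helper: return UTF-8 bytes, re-encoding from the legacy
# encoding $from only when $bytes is not already valid UTF-8.
sub ensure_utf8 {
    my ($bytes, $from) = @_;
    return $bytes if is_valid_utf8($bytes);
    return encode('UTF-8', decode($from, $bytes));
}

my $saved = "56778 \x96 Bar\r\n";    # bytes as saved from the site
print is_valid_utf8($saved) ? "already UTF-8\n" : "needs converting\n";
print unpack('H*', ensure_utf8($saved, 'cp1252')), "\n";
```

One caveat: a legacy-encoded file that happens to contain only ASCII, or byte sequences that look like UTF-8, passes the check unchanged, so this is a heuristic rather than a guarantee.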
Re^4: Malformed UTF-8 character by BillKSmith (Monsignor) on Dec 02, 2022 at 17:08 UTC
I use the \N{} notation frequently. At the time that I opened this thread, I did not know what Unicode character the \x96 was meant to represent.

Bill
Re^5: Malformed UTF-8 character by kcott (Archbishop) on Dec 03, 2022 at 04:45 UTC
G'day Bill,

"I did not know what unicode character the \x96 was meant to represent."

A quick way to determine this is via "Unicode Character Code Charts": it has "Find chart by hex code:" near the top of the page.

[Aside: Although that's a standard URL, I noted, when checking it, that it has: "Unicode 15.0 Character Code Charts". I thought that I'd just mention that Perl does a pretty good job of supporting the latest Unicode versions. Perl v5.36.0 (released in May this year) supports Unicode 14.0 (the current version at the time); if you're desperate for 15.0 support, it was added in v5.37.5 (or just wait for 5.38.0 to be released in May next year, or thereabouts).]

That will give you the name, `<control>`, and the informative alias, `START OF GUARDED AREA`; you can use the latter in `\N{}`.

`$ perl -E 'say sprintf "%x", ord("\N{START OF GUARDED AREA}")'`
`96`

In a script or one-liner, you can use Unicode::UCD, but it's not always straightforward. Compare:

`$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{name}'`
`DIGIT FOUR`
`$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{unicode10} || "<blank>"'`
`<blank>`
`$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{name}'`
`<control>`
`$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{unicode10} || "<blank>"'`
`START OF GUARDED AREA`

— Ken
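Since the xxd dumps earlier in the thread suggest the byte was really CP1252, it may help to decode the byte first and only then look up the name. This is a sketch using the core Encode and charnames modules; `name_of_byte` is a made-up helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();    # load charnames for viacode() without importing anything

# Hypothetical helper: interpret one byte in the given legacy encoding
# and report the code point plus whatever name charnames knows for it.
sub name_of_byte {
    my ($encoding, $byte) = @_;
    my $char = decode($encoding, $byte);
    return sprintf 'U+%04X %s', ord($char),
        charnames::viacode(ord $char) // '<unnamed>';
}

print name_of_byte('cp1252', "\x96"), "\n";
```

With `cp1252` this reports U+2013 EN DASH, which is the character the original script was using; passing `ISO-8859-1` instead lands on the U+0096 control discussed above.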
Re^6: Malformed UTF-8 character by BillKSmith (Monsignor) on Dec 03, 2022 at 13:39 UTC