Re^3: Malformed UTF-8 character
by pryrt (Abbot) on Dec 01, 2022 at 19:44 UTC
> The best solution is probably to download Notepad++, but it seems like overkill to learn another editor to solve such a rare problem.
As much as it pains me to say it (given my Notepad++ fandom), it does seem like overkill. But iconv.exe comes with my Strawberry Perl, and if it comes with yours too, it can handle the translation (as can gnuwin32's iconv). I believe one of the following two would properly translate the CP1252 encoding of the en dash into UTF-8:
iconv -f ISO-8859-1 -t utf-8 savedfile > outfile.pl
iconv -f CP1252 -t utf-8 savedfile > outfile.pl
(Of course, the other fix is to drop the use utf8; line after you download the script. Without that pragma, perl will default to your native Windows encoding {if I understand things correctly}, so that should work; at least, it did for me with that same downloaded source code.)
C:\Users\Bill\forums\monks>xxd savedfile
00000000: 3132 3334 202d 2046 6f6f 0d0a 3536 3737  1234 - Foo..5677
00000010: 3820 9620 4261 720d 0a39 3939 392e 2042  8 . Bar..9999. B
00000020: 617a 0d0a                                az..
C:\Users\Bill\forums\monks>iconv -f ISO-8859-1 -t utf-8 savedfile > outfile.txt
C:\Users\Bill\forums\monks>xxd outfile.txt
00000000: 3132 3334 202d 2046 6f6f 0d0a 3536 3737  1234 - Foo..5677
00000010: 3820 c296 2042 6172 0d0a 3939 3939 2e20  8 .. Bar..9999.
00000020: 4261 7a0d 0a                             Baz..
C:\Users\Bill\forums\monks>iconv -f CP1252 -t utf-8 savedfile > outfile.txt
C:\Users\Bill\forums\monks>xxd outfile.txt
00000000: 3132 3334 202d 2046 6f6f 0d0a 3536 3737  1234 - Foo..5677
00000010: 3820 e280 9320 4261 720d 0a39 3939 392e  8 ... Bar..9999.
00000020: 2042 617a 0d0a                            Baz..
Life was so much easier fifty years ago. Oh, there really were two keypunch codes.
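The difference between the two conversions in the transcript can also be reproduced in Perl itself. The following is only a minimal sketch, assuming the core Encode module; the helper name recode_to_utf8 is made up for illustration and is not from any post in this thread. It shows that reading byte 0x96 as ISO-8859-1 gives the C1 control U+0096, while reading it as CP1252 gives U+2013 (EN DASH).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the thread): interpret raw bytes as
# $from_encoding, re-encode as UTF-8, and return the result as hex pairs.
sub recode_to_utf8 {
    my ($from_encoding, $bytes) = @_;
    my $chars = decode($from_encoding, $bytes);   # bytes -> characters
    my $utf8  = encode('UTF-8', $chars);          # characters -> UTF-8 bytes
    return join ' ', map { sprintf '%02x', ord($_) } split //, $utf8;
}

my $byte = "\x96";    # the byte between "8 " and " Bar" in savedfile

printf "ISO-8859-1: %s\n", recode_to_utf8('ISO-8859-1', $byte);   # c2 96
printf "CP1252:     %s\n", recode_to_utf8('cp1252',     $byte);   # e2 80 93
```

The two results match the c296 and e280 93 sequences in the xxd output, which suggests the saved file should be treated as CP1252 rather than ISO-8859-1.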
Re^3: Malformed UTF-8 character
by soonix (Chancellor) on Dec 02, 2022 at 09:04 UTC
If you use utf8 only for strings and not for variable names, you could convert your special characters to \N{...} notation, such as "\N{EN DASH}".
Of course, it's your decision whether "Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk" is more readable than something like "Björk" or not ;-)
N.B.: For the \N escape to work in Perl older than 5.16, you need an explicit use charnames ':full';
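As a small illustration of the \N{...} approach, here is a sketch with made-up variable names. It assumes nothing beyond core Perl; the use charnames line is only needed before 5.16 and is harmless afterwards.

```perl
use strict;
use warnings;
use charnames ':full';    # only required before Perl 5.16

# Named escapes keep the source file pure ASCII, so the encoding the file
# was saved in no longer matters for these strings.
my $dash = "\N{EN DASH}";
my $name = "Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk";

printf "U+%04X\n", ord $dash;                  # U+2013
printf "U+%04X\n", ord substr($name, 2, 1);    # U+00F6
printf "%d characters\n", length $name;        # 5 characters
```

Printing such strings still needs an output encoding layer on the filehandle, but that is separate from how the source file itself is encoded.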
soonix wrote the following, in reply to BillKSmith,
> if you use utf8 only for strings.... it's your decision
Actually, 1nickt wrote the code that BillKSmith was trying to run. It was not a decision on BillKSmith's part at all; he just downloaded code from perlmonks, expecting code from a longstanding monk to run without edits. Because perlmonks serves pages as ISO-8859-1, not UTF-8, the downloaded code will Save As a file encoded with ISO-8859-1. And because that code contains use utf8;, which tells perl that the source file is UTF-8, the perl executable gives the "Malformed UTF-8 character" message: the file encoding and the pragma don't match.
The best would be if perlmonks would serve posts and [download]s as UTF-8, or at least give us an option for it to do so. The next best is for the monk who [download]s the code to convert the file (whether by iconv or a perl oneliner¤ or by a text editor that can change a file's encoding) before running it. The suggestion that requires the most effort so far is for that monk to search through every piece of code they download from perlmonks that has use utf8;, check whether the code actually relies on it, and then either comment out the pragma if it's not needed (as I hinted at earlier) or change every non-ASCII character in a quoted string from the actual character to a named character.
¤: oneliner = perl -pi -MEncode=encode,decode -e "$_ = encode('utf-8', decode('iso-8859-1', $_));" save-as.pl
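When it's unclear whether a download still needs converting, the oneliner's logic can be wrapped in a check. This is only a sketch, assuming the core Encode module and assuming the download is really CP1252 (which is what the xxd output earlier in the thread suggests for byte 0x96); to_utf8_bytes is a made-up helper name.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return $bytes unchanged if they already form valid
# UTF-8; otherwise treat them as CP1252 (an assumption) and re-encode.
sub to_utf8_bytes {
    my ($bytes) = @_;
    # Strict decoding croaks on malformed input such as a lone 0x96 byte.
    my $is_utf8 = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $bytes if $is_utf8;
    return encode('UTF-8', decode('cp1252', $bytes));
}

my $saved = "5678 \x96 Bar\r\n";    # one line as saved from the browser
my $fixed = to_utf8_bytes($saved);

print $fixed eq "5678 \xE2\x80\x93 Bar\r\n" ? "converted\n" : "unchanged\n";
print to_utf8_bytes($fixed) eq $fixed ? "second pass is a no-op\n" : "changed again\n";
```

The validity check makes the conversion safe to run twice, at the cost of leaving alone the rare CP1252 file whose bytes happen to form valid UTF-8.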
I use the \N{} notation frequently. At the time that I opened this thread, I did not know what Unicode character the \x96 was meant to represent.
G'day Bill,
"I did not know what unicode character the \x96 was meant to represent."
A quick way to determine this is via "Unicode Character Code Charts" —
it has "Find chart by hex code:" near the top of the page.
[Aside:
Although that's a standard URL, I noted, when checking it, that it has: "Unicode 15.0 Character Code Charts".
I thought that I'd just mention that Perl does a pretty good job of supporting the latest Unicode versions.
Perl v5.36.0 (released in May this year) supports Unicode 14.0 (the current version at the time);
if you're desperate for 15.0 support, it was added in v5.37.5
(or just wait for 5.38.0 to be released in May next year, or thereabouts).]
That will give you the name, <control>, and the informative alias, START OF GUARDED AREA;
you can use the latter in \N{}.
$ perl -E 'say sprintf "%x", ord("\N{START OF GUARDED AREA}")'
96
In a script or one-liner, you can use Unicode::UCD, but it's not always straightforward.
Compare:
$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{name}'
DIGIT FOUR
$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{unicode10} || "<blank>"'
<blank>
$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{name}'
<control>
$ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{unicode10} || "<blank>"'
START OF GUARDED AREA
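The two lookups compared above can be folded into one helper. This is a sketch only: char_label is a made-up name, and the fallback rule (use the unicode10 field whenever the name is just "<control>") is my assumption about what is useful here, not something Unicode::UCD does for you.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper: prefer the regular name, but fall back to the
# unicode10 field when the regular name is only the "<control>" label.
sub char_label {
    my ($code_point) = @_;
    my $info = charinfo($code_point) or return;   # undef if unassigned
    return $info->{name} unless $info->{name} eq '<control>';
    return $info->{unicode10} || $info->{name};
}

printf "U+%04X %s\n", $_, char_label($_) for 0x34, 0x96, 0x2013;
```

Note that charinfo takes a Unicode code point, so 0x96 here means U+0096. The CP1252 byte 0x96 in the saved file decodes to U+2013 instead, as the iconv transcript earlier in the thread shows, which is why 0x2013 is included in the loop.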