Re^2: Malformed UTF-8 character

by BillKSmith (Monsignor)
on Dec 01, 2022 at 19:09 UTC ( [id://11148483] )


in reply to Re: Malformed UTF-8 character
in thread Malformed UTF-8 character

You have put me on the right track. I have found gvim commands to tell it that the input is in CP1252 and the output should be in UTF-8. This converts the byte 96 to e2 80 93 (U+2013 EN DASH). The resulting file runs in perl and pastes back into perlmonks correctly. The character still does not display correctly in gvim or the Windows command prompt. The best solution is probably to download notepad++, but it seems like overkill to learn another editor to solve such a rare problem.
Bill

Replies are listed 'Best First'.
Re^3: Malformed UTF-8 character
by pryrt (Abbot) on Dec 01, 2022 at 19:44 UTC
    > The best solution is probably to download notepad++, but it seems like overkill to learn another editor to solve such a rare problem.

    As much as it pains me to say it (given my Notepad++ fandom), it does seem like overkill. But iconv.exe comes with my Strawberry Perl, and if it comes with yours, then it can handle the translation (or you can use gnuwin32's iconv). I believe one of the following two would properly translate the CP1252 encoding of the en dash into UTF-8.

    iconv -f ISO-8859-1 -t utf-8 savedfile > outfile.pl
    iconv -f CP1252 -t utf-8 savedfile > outfile.pl

    (Of course, the other fix is to remove use utf8; from the script after you download it; perl will then default to your native Windows encoding {if I understand things correctly}, so that should work -- at least, it did for me from that same downloaded source code.)

      Thanks for the reminder to check Strawberry (and perl) utilities occasionally.

      Note: The character in question (\x96) is one of the differences between ISO-8859-1 and CP1252. Compare the bytes starting at offset 0x12 in the dumps below. The savedfile is from the original post Regex: matching any Number then a hyphen.

      C:\Users\Bill\forums\monks>xxd savedfile
      00000000: 3132 3334 202d 2046 6f6f 0d0a 3536 3737  1234 - Foo..5677
      00000010: 3820 9620 4261 720d 0a39 3939 392e 2042  8 . Bar..9999. B
      00000020: 617a 0d0a                                az..

      C:\Users\Bill\forums\monks>iconv -f ISO-8859-1 -t utf-8 savedfile > outfile.txt

      C:\Users\Bill\forums\monks>xxd outfile.txt
      00000000: 3132 3334 202d 2046 6f6f 0d0a 3536 3737  1234 - Foo..5677
      00000010: 3820 c296 2042 6172 0d0a 3939 3939 2e20  8 .. Bar..9999. 
      00000020: 4261 7a0d 0a                             Baz..

      C:\Users\Bill\forums\monks>iconv -f CP1252 -t utf-8 savedfile > outfile.txt

      C:\Users\Bill\forums\monks>xxd outfile.txt
      00000000: 3132 3334 202d 2046 6f6f 0d0a 3536 3737  1234 - Foo..5677
      00000010: 3820 e280 9320 4261 720d 0a39 3939 392e  8 ... Bar..9999.
      00000020: 2042 617a 0d0a                            Baz..
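
The two iconv runs above can probably be reproduced in pure Perl with the core Encode module. The following is only a minimal sketch: the helper name utf8_hex and the sample string are made up for illustration, and the string contains just the bytes around offset 0x12 of savedfile.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode $octets as $encoding, re-encode the characters as UTF-8, and
# return the resulting bytes as space-separated hex.
# (utf8_hex is an illustrative name, not part of any module.)
sub utf8_hex {
    my ($encoding, $octets) = @_;
    my $chars = decode($encoding, $octets);
    my $utf8  = encode('UTF-8', $chars);
    return join ' ', map { sprintf '%02x', ord } split //, $utf8;
}

my $raw = "8 \x96 Bar";    # bytes 38 20 96 20 42 61 72, as in savedfile

for my $encoding ('iso-8859-1', 'cp1252') {
    printf "%-10s => %s\n", $encoding, utf8_hex($encoding, $raw);
}
# iso-8859-1 => 38 20 c2 96 20 42 61 72
# cp1252     => 38 20 e2 80 93 20 42 61 72
```

The byte 96 becomes c2 96 (the C1 control U+0096) when read as ISO-8859-1 and e2 80 93 (U+2013 EN DASH) when read as CP1252, which matches the two xxd dumps.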

      Life was so much easier fifty years ago. Oh, there really were two keypunch codes.

      Bill
        > Life was so much easier fifty years ago

        If your native tongue didn't look like "šířící žďáření"...

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re^3: Malformed UTF-8 character
by soonix (Canon) on Dec 02, 2022 at 09:04 UTC
    If you use utf8 only for strings and not for variable names, you could convert your special characters to \N{...} notation, such as "\N{EN DASH}".

    Of course, it's your decision whether "Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk" is more readable than something like "Björk" or not ;-)

    N.B.: For the \N escape to work in Perl older than 5.16, you need an explicit use charnames;
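
A minimal sketch of the \N{...} approach, assuming the only non-ASCII characters are inside string literals. The variable names are illustrative. Because the source file stays pure ASCII, it should not matter which encoding it is saved or downloaded in.

```perl
use strict;
use warnings;
use charnames ':full';    # required before Perl 5.16; harmless on newer perls

# Named escapes keep the source file pure ASCII.
my $dash = "\N{EN DASH}";
my $name = "Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk";

printf "U+%04X\n", ord $dash;              # prints U+2013
printf "%d characters\n", length $name;    # prints 5 characters
```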
      soonix wrote the following, in reply to BillKSmith,
      > if you use utf8 only for strings.... it's your decision

      Actually, 1nickt wrote the code that BillKSmith was trying to run. It was not a decision on BillKSmith's part at all; he just downloaded code from perlmonks, expecting code from a longstanding monk to run without edits. And because perlmonks sends ISO-8859-1 encoding, not UTF-8, then code that is served as ISO-8859-1 will Save As a file encoded with ISO-8859-1. And then because there was a use utf8; in the code that perlmonks serves as ISO-8859-1, the perl executable gives the "Malformed UTF-8 character" message because of the mismatch between the file encoding and the pragma.

      The best would be if perlmonks would serve posts and [download]s as UTF-8, or at least give us an option for it to do so. The next best is for the monk who [download]s the code to convert the file (whether by iconv or a perl oneliner¤ or by a text editor that can change a file's encoding) before running. The suggestion that requires the most effort so far would be for the monk who [download]s the code from perlmonks to search through every piece of downloaded code that has use utf8; and check whether the code actually relies on it, and then either comment out that pragma if it's not needed (as I hinted at earlier) or change every non-ASCII character in a quote from the actual character to a named character.

      ¤: oneliner = perl -pi -MEncode=encode,decode -e "$_ = encode('utf-8', decode('iso-8859-1', $_));" save-as.pl
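
The same conversion as a small script might look like the sketch below. Note two assumptions: it decodes as cp1252 rather than iso-8859-1, because the xxd output earlier in the thread shows that only CP1252 maps byte 96 to EN DASH, and the sub name and file arguments are made up for illustration.

```perl
use strict;
use warnings;

# Re-encode a file from CP1252 to UTF-8.
# (cp1252_file_to_utf8 is an illustrative name.)
# The :raw layer keeps CRLF line endings byte-for-byte on Windows.
sub cp1252_file_to_utf8 {
    my ($src, $dst) = @_;
    open my $in,  '<:raw:encoding(cp1252)', $src or die "Cannot read $src: $!";
    open my $out, '>:raw:encoding(UTF-8)',  $dst or die "Cannot write $dst: $!";
    while (my $line = <$in>) {
        print {$out} $line;
    }
    close $out or die "Cannot close $dst: $!";
    close $in;
    return;
}

# Usage: perl convert.pl save-as.pl converted.pl
cp1252_file_to_utf8(@ARGV) if @ARGV == 2;
```

Writing to a second file instead of editing in place leaves the download untouched if the guess about the source encoding turns out to be wrong.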

      I use the \N{} notation frequently. At the time that I opened this thread, I did not know what Unicode character the \x96 was meant to represent.
      Bill

        G'day Bill,

        "I did not know what unicode character the \x96 was meant to represent."

        A quick way to determine this is via "Unicode Character Code Charts" — it has "Find chart by hex code:" near the top of the page.

        [Aside: Although that's a standard URL, I noted, when checking it, that it has: "Unicode 15.0 Character Code Charts". I thought that I'd just mention that Perl does a pretty good job of supporting the latest Unicode versions. Perl v5.36.0 (released in May this year) supports Unicode 14.0 (the current version at the time); if you're desperate for 15.0 support, it was added in v5.37.5 (or just wait for 5.38.0 to be released in May next year, or thereabouts).]

        That will give you the name, <control>, and the informative alias, START OF GUARDED AREA; you can use the latter in \N{}.

        $ perl -E 'say sprintf "%x", ord("\N{START OF GUARDED AREA}")'
        96

        In a script or one-liner, you can use Unicode::UCD, but it's not always straightforward. Compare:

        $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{name}'
        DIGIT FOUR
        $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x34)->{unicode10} || "<blank>"'
        <blank>
        $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{name}'
        <control>
        $ perl -MUnicode::UCD=charinfo -E 'say charinfo(0x96)->{unicode10} || "<blank>"'
        START OF GUARDED AREA

        — Ken
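
When the byte comes from a legacy-encoded file rather than from Unicode text, one possible shortcut is to decode it first and then ask charnames for the name. This is a sketch under the assumption that the file is CP1252; the sub name name_of_byte is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Report the code point and Unicode name that a single byte stands for
# in a given legacy encoding. (name_of_byte is an illustrative name.)
sub name_of_byte {
    my ($encoding, $byte) = @_;
    my $char = decode($encoding, $byte);
    my $name = charnames::viacode(ord $char) // '<no name>';
    return sprintf 'U+%04X %s', ord($char), $name;
}

print name_of_byte('cp1252', "\x96"), "\n";    # prints U+2013 EN DASH
```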