I think Word is probably at least half-responsible for the mangling here. If it's displaying each byte as a separate character, then it isn't actually reading the file as UTF-8, but in some other (single-byte) encoding!
I'll give an example using Windows. First, here's utf8.pl:
# U+73E0 ("pearl")
print "\xe7\x8f\xa0";
Now I run that and redirect its output to both utf8.html and utf8.txt.
Chrome displays the character correctly, because it assumes UTF-8 by default. Notepad also appears smart enough to guess the encoding.
On my system at least, opening the file with Word prompts me to select the encoding; by default, it guesses UTF-8 and renders the character correctly. Note that if I pick "Windows (Default)" or "MS-DOS" instead, I get garbage.
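If you want to see the same effect without Word, here's a rough sketch using the core Encode module. It decodes the three bytes from utf8.pl once as UTF-8 and once as a single-byte code page. I'm assuming that "Windows (Default)" corresponds to cp1252 and "MS-DOS" to cp437, which is only true on a Western-locale system; yours may differ.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same three bytes that utf8.pl prints.
my $bytes = "\xe7\x8f\xa0";

# Read as UTF-8, the three bytes form a single character.
my $as_utf8 = decode('UTF-8', $bytes);

# Read as a single-byte code page, every byte becomes its own character.
# cp1252 and cp437 are assumptions about what Word's options map to.
my $as_cp1252 = decode('cp1252', $bytes);
my $as_cp437  = decode('cp437',  $bytes);

printf "UTF-8:  %d character(s), U+%04X\n", length($as_utf8), ord($as_utf8);
printf "cp1252: %d character(s)\n", length($as_cp1252);
printf "cp437:  %d character(s)\n", length($as_cp437);
```

One character under UTF-8 versus three under the code pages is exactly the "one glyph per byte" symptom, so if that's what Word shows you, the file itself is probably fine.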
So try messing with Word a bit; if you use the File -> Open menu (instead of just opening the file from Explorer directly), you can get additional conversion options (sometimes!).
Anne
In reply to Re^7: Parsing a .xlsx file with chinese characters
by anneli
in thread Parsing a .xlsx file with chinese characters
by Sithiris