UTF-8

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

My searches have turned up information that is not only over my head but confusing as well. What I would like to do is read a text file that may contain non-UTF-8 characters, convert those characters to their proper UTF-8 counterparts, and then print out the entire converted file.

Any suggestions where to start or should I abandon this idea altogether?

Re: UTF-8 by fokat (Deacon) on Aug 26, 2005 at 18:56 UTC
Anonymous, it might make more sense to use `iconv`, a command created to do exactly what you want. Simply invoking it from your shell with the proper parameters will likely accomplish what you need. Text::Iconv would let you access that functionality from Perl. Take a look here and see if this tool does what you need. If you need further help, please update your question with more details and/or the code you've tried.

Best regards
-lem, but some call me fokat
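For illustration, a minimal sketch of such an `iconv` invocation. It assumes the input is ISO-8859-1, which the question does not confirm; the file names and the scratch directory are hypothetical.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d)

# Build a tiny ISO-8859-1 sample: \351 is octal for byte 0xE9, i.e. "é".
printf 'caf\351\n' > "$tmp/latin1.txt"

# -f names the source encoding (an assumption here), -t the target encoding.
iconv -f ISO-8859-1 -t UTF-8 "$tmp/latin1.txt" > "$tmp/utf8.txt"

# Dump the result as hex; "é" is now the two bytes c3 a9.
od -An -tx1 "$tmp/utf8.txt"
```

For a real file, only the `iconv` line is needed, with `-f` set to whatever the file's actual encoding turns out to be.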
Re: UTF-8 by halley (Prior) on Aug 26, 2005 at 21:20 UTC
"may contain non-UTF-8 characters"

Well, the first question to answer is whether you know what encoding those characters are in. If you don't know the original encoding, it will be hard (or a matter of trial and error) to convert them reliably to UTF-8. Also see: FMTYEWTK about Characters vs Bytes

-- `[ e d @ h a l l e y . c c ]`
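One possible first step in that trial and error is to check whether the file already decodes as UTF-8. The sketch below relies on `iconv` exiting with a non-zero status when it meets an invalid input sequence; `is_utf8` and the sample file names are hypothetical, not part of any tool.

```shell
# is_utf8 FILE: succeed only if FILE decodes cleanly as UTF-8.
# (Hypothetical helper; relies on iconv exiting non-zero on bad input.)
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

tmp=$(mktemp -d)
printf 'caf\303\251\n' > "$tmp/good.txt"   # "é" as UTF-8 (bytes c3 a9)
printf 'caf\351\n'     > "$tmp/bad.txt"    # "é" as a lone ISO-8859-1 byte (e9)

for f in "$tmp/good.txt" "$tmp/bad.txt"; do
    if is_utf8 "$f"; then
        echo "$(basename "$f"): valid UTF-8"
    else
        echo "$(basename "$f"): not valid UTF-8, original encoding must be identified"
    fi
done
```

A failed check only says the file is not UTF-8; it does not say which encoding it is. A passing check is also not proof, since plain ASCII passes trivially.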
Re: UTF-8 by mpolo (Chaplain) on Aug 27, 2005 at 20:15 UTC
As mentioned above, the iconv program will do this. If you're on Linux or some other system with the GNU tools, you're all set. If you're on Windows, fret not, it's been ported: Link! You do need to know the original encoding of the file, however. If it is a Western European file, chances are that it's encoded in ISO-8859-1 or in ISO-8859-15, which adds the € and a couple of extra accented characters (or in Windows-1252, which is almost the same, but not quite; see Wikipedia for more info). If it's Russian or Vietnamese or Ancient Greek, the encoding is going to be different, and you'll have to look up the right one.
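To see why the choice among these three matters, here is a small sketch that converts single bytes with different `-f` values. The `hex` helper is hypothetical, and the encoding names are the spellings GNU iconv accepts; other implementations may want different ones.

```shell
# hex: print stdin as one compact hex string (hypothetical helper).
hex() { od -An -tx1 | tr -d ' \n'; echo; }

# Byte 0xA4 (octal \244): "¤" in ISO-8859-1, "€" in ISO-8859-15.
printf '\244' | iconv -f ISO-8859-1  -t UTF-8 | hex   # c2a4   (U+00A4, currency sign)
printf '\244' | iconv -f ISO-8859-15 -t UTF-8 | hex   # e282ac (U+20AC, euro sign)

# Windows-1252 keeps "¤" at 0xA4 and puts "€" at 0x80 (octal \200).
printf '\200' | iconv -f WINDOWS-1252 -t UTF-8 | hex  # e282ac (U+20AC, euro sign)
```

The same input byte yields different characters depending on `-f`, so a wrong guess produces valid UTF-8 with the wrong text and no error.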