Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

My searches have resulted in information that is not only over my head, but confusing as well. What I would like to do is read a text file that may contain non-UTF-8 characters, convert those characters to their proper UTF-8 counterparts, and then print out the entire converted file.

Any suggestions on where to start, or should I abandon this idea altogether?

Re: UTF-8
by fokat (Deacon) on Aug 26, 2005 at 18:56 UTC

    Anonymous,

    It might make more sense to use iconv, a command created to do exactly what you want. Invoking it from your shell with the proper source and target encodings will likely accomplish what you need.
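
A minimal sketch of such an invocation, assuming the input happens to be ISO-8859-1 (the thread does not say what the real encoding is). The sample file and the variable names are made up for illustration; `-f` names the source encoding and `-t` the target.

```shell
# Hypothetical sample file: "café" with é stored as the single ISO-8859-1 byte 0xE9.
workdir=$(mktemp -d)
printf 'caf\351\n' > "$workdir/input.txt"

# -f is the source encoding (an assumption here), -t is the target encoding.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/input.txt" > "$workdir/output.txt"

# Dump the result as hex to confirm that 0xE9 became the two UTF-8 bytes C3 A9.
converted_hex=$(od -An -tx1 "$workdir/output.txt" | tr -d ' \n')
echo "$converted_hex"
```

If the guess about the source encoding is wrong, iconv may still succeed but produce the wrong characters, so it is worth checking the output by eye.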

    Text::Iconv would let you access that functionality from Perl.

    Take a look here and see if this tool does what you need. If you need further help, please update your question with more details and/or code you've tried.

    Best regards

    -lem, but some call me fokat

Re: UTF-8
by halley (Prior) on Aug 26, 2005 at 21:20 UTC
    may contain non-UTF-8 characters

    Well, the first question to answer is whether you know what encoding those characters are in. If you don't know the original encoding, converting them reliably to UTF-8 will be hard, or at best a matter of trial and error.
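
One cheap first step in that trial and error is to check whether a file is already valid UTF-8. The sketch below is an illustration, not a detector: it relies on iconv exiting with a non-zero status when it meets a byte sequence that is illegal in the declared source encoding. The function name `is_valid_utf8` and the sample files are hypothetical. A failure only tells you the file is something other than UTF-8, not what it is.

```shell
# Succeeds (exit status 0) only if every byte sequence in the file is valid UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Two hypothetical sample files containing "café".
sample_dir=$(mktemp -d)
printf 'caf\303\251\n' > "$sample_dir/utf8.txt"     # é as the UTF-8 bytes C3 A9
printf 'caf\351\n'     > "$sample_dir/latin1.txt"   # é as the lone byte E9

for f in "$sample_dir/utf8.txt" "$sample_dir/latin1.txt"; do
    if is_valid_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not UTF-8, the original encoding still has to be determined"
    fi
done
```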

    Also see: FMTYEWTK about Characters vs Bytes

    --
    [ e d @ h a l l e y . c c ]

Re: UTF-8
by mpolo (Chaplain) on Aug 27, 2005 at 20:15 UTC
    As mentioned above, the iconv program will do this. If you're on Linux or some other system with the GNU tools, you're all set. If you're on Windows, fret not: it's been ported: Link!

    You do need to know the original encoding of the file, however. If it is a Western European file, chances are that it's encoded in iso-8859-1, in iso-8859-15 (which adds the € and a few extra accented characters), or in Windows-1252 (which is almost the same as iso-8859-1, but not quite; see Wikipedia for more info). If it's Russian or Vietnamese or Ancient Greek, the encoding is going to be different, and you'll have to look up the right one.
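
To see why the choice among these three matters, here is a small sketch that feeds single bytes through iconv under each assumed source encoding and prints the resulting UTF-8 bytes as hex. The helper `to_utf8_hex` and the variable names are hypothetical; `CP1252` is the name under which iconv implementations commonly know Windows-1252.

```shell
# Print the UTF-8 bytes (as hex) that iconv produces for one input byte.
# $1 = the byte as a printf octal escape, $2 = the assumed source encoding.
to_utf8_hex() {
    printf "$1" | iconv -f "$2" -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

latin1_a4=$(to_utf8_hex '\244' ISO-8859-1)    # byte 0xA4: generic currency sign
latin9_a4=$(to_utf8_hex '\244' ISO-8859-15)   # byte 0xA4: euro sign
cp1252_80=$(to_utf8_hex '\200' CP1252)        # byte 0x80: euro sign

echo "0xA4 read as ISO-8859-1:  $latin1_a4"
echo "0xA4 read as ISO-8859-15: $latin9_a4"
echo "0x80 read as CP1252:      $cp1252_80"
```

The same byte 0xA4 comes out as two different characters depending on the declared encoding, and the euro sign sits at a different byte in Windows-1252, which is why a wrong guess converts without error but yields the wrong text.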