PerlMonks  

UTF-8

by Anonymous Monk
on Aug 26, 2005 at 18:49 UTC ( [id://486985]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

My searches have turned up information that is not only over my head but confusing as well. I would like to read a text file that may contain non-UTF-8 characters, convert those characters to their proper UTF-8 counterparts, and then print out the entire converted file.

Any suggestions on where to start, or should I abandon this idea altogether?

Replies are listed 'Best First'.
Re: UTF-8
by fokat (Deacon) on Aug 26, 2005 at 18:56 UTC

    Anonymous,

    It might make more sense to use iconv, a command created to do exactly what you want. Invoking it from your shell with the proper parameters (the source and target encodings) will likely accomplish what you need.
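
A minimal sketch of such an invocation, assuming the input happens to be ISO-8859-1 (the thread never states the real encoding, and the file names here are made up for illustration):

```shell
# Build a small sample file containing "café" encoded as ISO-8859-1.
# (\351 is the octal escape for the single Latin-1 byte 0xE9, "é".)
tmpdir=$(mktemp -d)
printf 'caf\351\n' > "$tmpdir/latin1.txt"

# -f names the source encoding, -t names the target encoding.
iconv -f ISO-8859-1 -t UTF-8 "$tmpdir/latin1.txt" > "$tmpdir/utf8.txt"

# Print the converted file.
cat "$tmpdir/utf8.txt"
```

Running `iconv -l` lists the encoding names your local iconv accepts, which is worth checking before settling on a value for `-f`.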

    Text::Iconv would let you access that functionality from Perl.

    Take a look here and see if this tool does what you need. If you need further help, please update your question with more details and/or code you've tried.

    Best regards

    -lem, but some call me fokat

Re: UTF-8
by halley (Prior) on Aug 26, 2005 at 21:20 UTC
    may contain non-UTF-8 characters

    Well, the first question to answer is whether you know what encoding those characters are in. If you don't know the original encoding, it will be hard (or a matter of trial and error) to convert them reliably to UTF-8.
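
One way to begin the trial and error is to test whether a file is already valid UTF-8 by asking iconv to convert it from UTF-8 to UTF-8. The sketch below assumes an iconv that exits with a non-zero status on an illegal input sequence (GNU iconv does); `is_valid_utf8` and the sample file names are hypothetical.

```shell
# is_valid_utf8 is a hypothetical helper name, not a standard command.
# iconv exits with a non-zero status when it meets a byte sequence that is
# illegal in the declared source encoding.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

tmpdir=$(mktemp -d)
printf 'caf\303\251\n' > "$tmpdir/good.txt"   # "café" as UTF-8
printf 'caf\351\n'     > "$tmpdir/bad.txt"    # "café" as ISO-8859-1

for f in "$tmpdir/good.txt" "$tmpdir/bad.txt"; do
    if is_valid_utf8 "$f"; then
        echo "$(basename "$f"): valid UTF-8"
    else
        echo "$(basename "$f"): not valid UTF-8, needs converting"
    fi
done
```

A failed check only says the file is not UTF-8; it does not reveal which encoding it actually uses. A passing check is also weak evidence for short or pure-ASCII files, since plain ASCII is valid UTF-8 as well.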

    Also see: FMTYEWTK about Characters vs Bytes

    --
    [ e d @ h a l l e y . c c ]

Re: UTF-8
by mpolo (Chaplain) on Aug 27, 2005 at 20:15 UTC
    As mentioned above, the iconv program will do this. If you're on Linux or some other system with the GNU tools, you're all set. If you're on Windows, fret not: it's been ported (Link!). You do need to know the original encoding of the file, however. If it is a Western European file, chances are that it's encoded in iso-8859-1 or in iso-8859-15, which adds the € and a couple of extra accented characters, or in Windows-1252, which is almost the same but not quite (see Wikipedia for more info). If it's Russian or Vietnamese or Ancient Greek, the encoding is going to be different, and you'll have to look up the right one.
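
To illustrate why the choice among these three encodings matters, the sketch below decodes the same single byte under each one and prints the resulting UTF-8 bytes in hex. `show_bytes` is a hypothetical helper, and the encoding names are the spellings GNU iconv accepts; other iconv builds may spell them differently.

```shell
# Print, as hex, the UTF-8 bytes that one input byte decodes to under a
# given source encoding. show_bytes is a hypothetical helper name.
show_bytes() {
    # $1 = source encoding, $2 = printf octal escape for the input byte
    printf "$2" | iconv -f "$1" -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

# Byte 0xA4 is the generic currency sign in ISO-8859-1 ...
echo "0xA4 as ISO-8859-1:   $(show_bytes ISO-8859-1 '\244')"
# ... but the euro sign in ISO-8859-15.
echo "0xA4 as ISO-8859-15:  $(show_bytes ISO-8859-15 '\244')"
# Windows-1252 puts the euro sign at 0x80 instead.
echo "0x80 as WINDOWS-1252: $(show_bytes WINDOWS-1252 '\200')"
```

The first line reports `c2a4` (U+00A4, ¤) and the other two report `e282ac` (U+20AC, €), so guessing the wrong source encoding silently produces the wrong character rather than an error.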

Approved by jbrugger