mohiddinb has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, this is my first post. I need to convert text into UTF-8 from any charset. As of now I am managing to convert from iso-8859-1 into utf-8, but I need to generalize it, so I would be really thankful if anyone can help me with this. Thanks & regards, Mohiddin Baig

Re: I need to convert the text into UTF-8 from any charset
by Joost (Canon) on Aug 08, 2007 at 12:10 UTC
      Thank you, Joost, but it seems it's not working properly: some of the characters are still malformed. I have also tried 1) Text::Iconv, 2) use Unicode::MapUTF8 qw(to_utf8 from_utf8 utf8_supported_charset); and finally your suggestion (Encode), but I am still getting malformed characters. Anyhow, thanks for your help. Regards, Mohiddin Baig
Re: I need to convert the text into UTF-8 from any charset
by graff (Chancellor) on Aug 08, 2007 at 13:18 UTC
    Here's a basic script that will run as a "stdin-stdout filter" -- that is, it always reads data from STDIN and prints it to STDOUT, so to run it you always pipe or redirect its input and output, like this:
    enc-converter cp1252 < file.txt > file_utf8.txt
    #or
    some-process | enc-converter shiftjis | some-utf8-process
    #or a mix:
    some-process | enc-converter koi8-r > file_utf8.txt
    enc-converter iso-8859-1 < file.txt | utf8-process
    Note that you must provide the name of the input encoding as the command-line argument ($ARGV[0]):
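    #!/usr/bin/perl
    use strict;

    # Require exactly one argument that looks like an encoding name, and
    # refuse to run when STDIN is a terminal rather than a pipe or a file.
    ( @ARGV == 1 and $ARGV[0] =~ /^\w[-\w]+$/ and ! -t )
        or die "Usage: $0 inp-enc < file.inp-enc > file.utf8\n";

    # Decode STDIN from the named encoding; write STDOUT as UTF-8.
    my $inp_enc = sprintf( ":encoding(%s)", shift );
    binmode STDIN, $inp_enc;
    binmode STDOUT, ":utf8";

    print while (<>);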
    The Encode manual provides some instructions on how to get a listing of the names of known encodings usable with the ":encoding(...)" technique. This command will print the list:
    perl -MEncode -le 'print for(Encode->encodings(":all"))'
    (In a windows/dos shell, you need to invert the single- and double-quotes.)
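
    Before handing a user-supplied name to ":encoding(...)", it can help to check that Encode actually knows it. The following is only a sketch: canonical_encoding_name is a hypothetical helper name, and the sample names in the loop are just illustrations. It relies on Encode's find_encoding, which returns an encoding object for a known name or alias and a false value otherwise.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: return Encode's canonical name for an encoding
# name or alias, or undef when Encode does not recognise it.
sub canonical_encoding_name {
    my ($name) = @_;
    my $enc = find_encoding($name);
    return $enc ? $enc->name : undef;
}

# Try the names given on the command line, or a few sample names.
for my $name ( @ARGV ? @ARGV : qw(latin1 cp1252 no-such-charset) ) {
    my $canon = canonical_encoding_name($name);
    printf "%-16s => %s\n", $name, defined $canon ? $canon : '(unknown)';
}
```

    A filter script could call such a check first and die with a clearer message than the one binmode would produce for an unknown encoding.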

    As Joost pointed out, you need to know in advance what the input encoding is, because it would be a lot more work to write code that would guess the input encoding automatically. (This can be done, but you need valid training data for each combination of language + encoding you might encounter in order to build models, then you test each input stream against each model and hope that the best match is the right one.)
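
    For a limited form of guessing without trained models, the Encode distribution ships Encode::Guess, which simply tests the data against a list of "suspect" encodings that you supply. The sketch below is an illustration under assumptions: the suspect lists and sample strings are made up for the demo, and guess_or_undef is a hypothetical wrapper. It works when only one suspect can decode the bytes, and gives up when several can, which is the usual case for single-byte encodings.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding()

# Hypothetical wrapper: return the decoded text when exactly one suspect
# fits the octets, or undef when the guess fails or is ambiguous.
sub guess_or_undef {
    my ( $octets, @suspects ) = @_;
    my $decoder = guess_encoding( $octets, @suspects );
    return unless ref $decoder;    # on failure it returns an error string
    return $decoder->decode($octets);
}

# Sample data: the Japanese greeting "konnichiwa" encoded as Shift_JIS.
my $text  = "\x{3053}\x{3093}\x{306B}\x{3061}\x{306F}";
my $sjis  = encode( 'shiftjis', $text );
my $guess = guess_or_undef( $sjis, qw(euc-jp shiftjis) );
print defined $guess && $guess eq $text
    ? "multi-byte sample: recovered\n"
    : "multi-byte sample: not recovered\n";

# A lone high byte is valid in both of these single-byte suspects,
# so no unique guess is possible.
my $ambiguous = guess_or_undef( "caf\xE9", qw(iso-8859-1 cp1252) );
print defined $ambiguous
    ? "single-byte sample: guessed\n"
    : "single-byte sample: too ambiguous\n";
```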

    update: As you might expect, given the simplicity of the script shown above, it's not that much more typing just to do character conversion as a perl one-liner:

    perl -CO -pe 'BEGIN{binmode STDIN,":encoding(cp936)"}' < file.txt > file.utf8
    The "-C" option with capital letter "O" sets STDOUT to utf8 (so does "-C2"); the script itself is just the BEGIN block to set the encoding for STDIN; the "-p" option does the rest.
Re: I need to convert the text into UTF-8 from any charset
by ww (Archbishop) on Aug 08, 2007 at 10:07 UTC
    Perhaps you haven't realized, but PM is an educational institution; not a code factory that automagically hands out free samples. While we sometimes do your work for you, we're far more apt -- and often, more able -- to help if you show that you've made some effort. In cases like yours, that means: show us what you've tried, and be a bit more specific if you can: what isn't doing what you expect or want.

    So, please read How do I post a question effectively?; then post your code (as an update to the above).

    And welcome to the Monastery, with this hint: check out the Encode::... family.
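
    As a starting point, here is a minimal sketch of a conversion with the core Encode module, assuming the source encoding is already known. to_utf8_octets is a hypothetical helper name, not part of Encode. Passing FB_CROAK makes the call die on bytes that are not valid in the named encoding, instead of silently substituting replacement characters, which is one way to find out whether the encoding name is the wrong one.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn octets in a known encoding into UTF-8 octets.
# Dies if the input contains bytes that are invalid in that encoding.
sub to_utf8_octets {
    my ( $octets, $from ) = @_;
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;
    my $text  = decode( $from, $octets, $check );    # octets -> characters
    return encode( 'UTF-8', $text, $check );         # characters -> UTF-8
}

my $latin1 = "caf\xE9";    # "cafe" with e-acute, as ISO-8859-1 octets
my $utf8   = to_utf8_octets( $latin1, 'iso-8859-1' );
printf "%d octets in, %d octets out\n", length $latin1, length $utf8;
```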

Re: I need to convert the text into UTF-8 from any charset
by ForgotPasswordAgain (Vicar) on Aug 08, 2007 at 10:14 UTC
    I think it's impossible in principle to do it generally (a given string is a bunch of numbers that can be valid in more than one charset), but I'd like to see a module that does it well based on whatever heuristics. I see a couple CPAN modules, but I'm not sure how well they work.
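
    To illustrate the ambiguity, here is a small sketch that decodes the same single byte under three single-byte charsets (the charsets and the byte are picked arbitrarily for the demo, and code_point_in is a hypothetical helper). Each decode succeeds, but each yields a different character, so the byte alone cannot say which charset was intended.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode one octet under the given charset and
# return the resulting Unicode code point.
sub code_point_in {
    my ( $charset, $octet ) = @_;
    return ord decode( $charset, $octet );
}

my $octet = "\xE9";
for my $charset (qw(iso-8859-1 koi8-r cp1251)) {
    printf "%-10s U+%04X\n", $charset, code_point_in( $charset, $octet );
}
```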