I need to convert the text into UTF-8 from any charset

mohiddinb has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: I need to convert the text into UTF-8 from any charset by Joost (Canon) on Aug 08, 2007 at 12:10 UTC
It's simple:
`use Encode qw(decode);`
`my $input_string = 'some octets';`
`my $input_encoding = 'cp1250'; # or 'iso-8859-1', or 'shiftjis' or whatever`
`my $utf8 = decode($input_encoding,$input_string);`
Note that there is no reliable way of determining $input_encoding automatically. "What should it profit a man, if he should win a flame war, yet lose his cool?"
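A minimal sketch of the full round trip, assuming the input is iso-8859-1 (the sample octets and variable names are illustrative, not taken from the post above). `decode` returns a string of Perl characters rather than UTF-8 octets, so an explicit `encode` step is one way to get bytes that can be written out as UTF-8:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample input: "café" as iso-8859-1 octets (4 bytes).
my $input_octets   = "caf\xE9";
my $input_encoding = 'iso-8859-1';

# Step 1: octets in the source encoding -> Perl character string.
my $chars = decode($input_encoding, $input_octets);

# Step 2: Perl character string -> UTF-8 octets.
my $utf8_octets = encode('UTF-8', $chars);

printf "%d characters, %d UTF-8 bytes\n", length($chars), length($utf8_octets);
```

If the output handle already has a `:utf8` or `:encoding(UTF-8)` layer, the second step is normally left to that layer instead; doing both would encode the data twice.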
Re^2: I need to convert the text into UTF-8 from any charset by mohiddinb (Initiate) on Aug 09, 2007 at 12:40 UTC
Thank you, Joost, but it does not seem to be working: some of the characters are still malformed. I have also tried 1) Text::Iconv and 2) `use Unicode::MapUTF8 qw(to_utf8 from_utf8 utf8_supported_charset);`, and finally your suggestion (Encode), but I am still getting malformed characters. Anyhow, thanks for your help. Regards, Mohiddin Baig
Re^3: I need to convert the text into UTF-8 from any charset by Joost (Canon) on Aug 09, 2007 at 12:43 UTC
Then you're doing something else wrong. Take a look at Writing unicode characters to file using open($fh, ">:utf8, $name) mangles unicode? for a discussion with several examples.
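Since the failing code was never posted, the cause can only be guessed at. One common source of such symptoms is printing decoded characters to a handle that has no encoding layer, or encoding the same data twice. A sketch, under that assumption, that writes through an `:encoding(UTF-8)` layer and then reads the raw bytes back to check what actually landed on disk (the temporary file and sample text are illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Assumed sample: "café" in cp1252, decoded to Perl characters.
my $text = decode('cp1252', "caf\xE9");

# Write through an encoding layer so characters become UTF-8 bytes on disk.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
close $tmp_fh;
open my $out, '>:encoding(UTF-8)', $tmp_name or die "open $tmp_name: $!";
print {$out} $text, "\n";
close $out or die "close $tmp_name: $!";

# Read the file back as raw bytes to see what was actually written.
open my $in, '<:raw', $tmp_name or die "open $tmp_name: $!";
my $raw = do { local $/; <$in> };
close $in;

# Hex dump of the bytes; the accented character should occupy two bytes.
print join(' ', map { sprintf '%02X', ord } split //, $raw), "\n";
```

A hex dump like this is usually more trustworthy than looking at the text in a terminal or editor, which may itself be misinterpreting the encoding.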
Re: I need to convert the text into UTF-8 from any charset by graff (Chancellor) on Aug 08, 2007 at 13:18 UTC
Here's a basic script that will run as a "stdin-stdout filter" -- that is, it always reads data from STDIN and prints it to STDOUT, so to run it you always pipe or redirect its input and output, like this:
`enc-converter cp1252 < file.txt > file_utf8.txt`
`# or`
`some-process | enc-converter shiftjis | some-utf8-process`
`# or a mix:`
`some-process | enc-converter koi8-r > file_utf8.txt`
`enc-converter iso-8859-1 < file.txt | utf8-process`

Note that you must provide the name of the input encoding as the command-line argument (`$ARGV[0]`):
`#!/usr/bin/perl`
`use strict;`
`( @ARGV == 1 and $ARGV[0] =~ /^\w[-\w]+$/ and ! -t ) or die "Usage: $0 inp-enc < file.inp-enc > file.utf8\n";`
`my $inp_enc = sprintf( ":encoding(%s)", shift );`
`binmode STDIN, $inp_enc;`
`binmode STDOUT, ":utf8";`
`print while (<>);`

The Encode manual provides some instructions on how to get a listing of the names of known encodings usable with the ":encoding(...)" technique. This command will print the list:
`perl -MEncode -le 'print for(Encode->encodings(":all"))'`
(In a windows/dos shell, you need to invert the single- and double-quotes.)

As Joost pointed out, you need to know in advance what the input encoding is, because it would be a lot more work to write code that would guess the input encoding automatically. (This can be done, but you need valid training data for each combination of language + encoding you might encounter in order to build models, then you test each input stream against each model and hope that the best match is the right one.)

update: As you might expect, given the simplicity of the script shown above, it's not that much more typing just to do character conversion as a perl one-liner:
`perl -CO -pe 'BEGIN{binmode STDIN,":encoding(cp936)"}' < file.txt > file.utf8`
The "-C" option with capital letter "O" sets STDOUT to utf8 (so does "-C2"); the script itself is just the BEGIN block to set the encoding for STDIN; the "-p" option does the rest.
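The same filter idea can be sketched as a subroutine that works on any pair of filehandles, which makes it easy to try without touching STDIN or STDOUT. The helper name `convert_stream` and the in-memory sample data are hypothetical. `find_encoding` from Encode is used here to reject unknown encoding names up front, and the output layer is `:encoding(UTF-8)` rather than `:utf8`, which is an assumption about what is wanted, not a change the post above calls for:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: copy $in_fh to $out_fh, decoding from $enc_name
# and writing UTF-8. Dies early if Encode does not know the encoding.
sub convert_stream {
    my ($in_fh, $out_fh, $enc_name) = @_;
    my $enc = find_encoding($enc_name)
        or die "Unknown encoding: $enc_name\n";
    binmode $in_fh,  sprintf(':encoding(%s)', $enc->name);
    binmode $out_fh, ':encoding(UTF-8)';
    while (my $line = <$in_fh>) {
        print {$out_fh} $line;
    }
    return;
}

# Usage with in-memory handles; "\x80" is the euro sign in cp1252.
my $cp1252_bytes = "price: \x80 5\n";
my $utf8_bytes   = '';
open my $src, '<', \$cp1252_bytes or die "open input: $!";
open my $dst, '>', \$utf8_bytes   or die "open output: $!";
convert_stream($src, $dst, 'cp1252');
close $dst or die "close output: $!";
```

Passing `\*STDIN` and `\*STDOUT` with the encoding name from `$ARGV[0]` would give roughly the behaviour of the command-line filter.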
Re: I need to convert the text into UTF-8 from any charset by ww (Archbishop) on Aug 08, 2007 at 10:07 UTC
Perhaps you haven't realized, but PM is an educational institution; not a code factory that automagically hands out free samples. While we sometimes do your work for you, we're far more apt -- and often, more able -- to help if you show that you've made some effort. In cases like yours, that means: show us what you've tried, and be a bit more specific if you can: what isn't doing what you expect or want. So, please read How do I post a question effectively?; then post your code (as an update to the above). And welcome to the Monastery, with this hint: check out the `Encode::...` family.
Re^2: I need to convert the text into UTF-8 from any charset by atemon (Chaplain) on Aug 08, 2007 at 10:27 UTC
Hi, welcome to the Monastery! Almost the same thing was discussed yesterday on PerlMonks at Is utf8, ascii ?. That thread was a little more database-related, but it covers UTF-8 conversion. Have a look at PerlMonks FAQ and The Perl Monks Guide to the Monastery for more information on how the Monastery works. How (Not) To Ask A Question will be a good guide. There are hundreds of thousands of posts here, so Super Search is your friend. Cheers! --VC There are three sides to any argument..... your side, my side and the right side.
Re: I need to convert the text into UTF-8 from any charset by ForgotPasswordAgain (Vicar) on Aug 08, 2007 at 10:14 UTC
I think it's impossible in principle to do it generally (a given string is a bunch of numbers that can be valid in more than one charset), but I'd like to see a module that does it well based on whatever heuristics. I see a couple CPAN modules, but I'm not sure how well they work.
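One heuristic module of this kind is Encode::Guess, which is distributed with Encode. The sketch below illustrates the ambiguity described above rather than recommending the module: the wrapper `guess_name` and the sample byte strings are hypothetical, and Encode::Guess's own documentation cautions that single-byte encodings are hard to tell apart. `guess_encoding` returns an encoding object when exactly one candidate fits, and a plain error string otherwise:

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding()

# Hypothetical wrapper: returns the canonical encoding name on success,
# or undef when the guess fails or is ambiguous (guess_encoding then
# returns an error string instead of an object).
sub guess_name {
    my ($octets, @suspects) = @_;
    my $guess = guess_encoding($octets, @suspects);
    return ref $guess ? $guess->name : undef;
}

my $latin1_bytes = "caf\xE9";       # not valid UTF-8, valid iso-8859-1
my $utf8_bytes   = "caf\xC3\xA9";   # valid UTF-8, but also valid iso-8859-1

# ascii and utf8 are always tried; iso-8859-1 is added as a suspect.
my $only_latin1 = guess_name($latin1_bytes, 'iso-8859-1');
my $ambiguous   = guess_name($utf8_bytes,   'iso-8859-1');

print defined $only_latin1 ? "guessed $only_latin1\n" : "no unique guess\n";
print defined $ambiguous   ? "guessed $ambiguous\n"   : "no unique guess\n";
```

The second case shows the problem in miniature: the same bytes are acceptable in more than one charset, so the module declines to pick one.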