Conversion from UTF-8 to windows-1256 encoding

iman_saleh has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I wrote the following piece of code to convert UTF-8 text into the windows-1256 encoding. I don't know why the output file ends up encoded as Unicode big endian. If anyone can help, please tell me where the problem is.

#-----------------------------------------------
use Unicode::UTF8simple;
$uref = new Unicode::UTF8simple;

open(IN, $ARGV[0]) or die;
open(OUT, ">$ARGV[1]") or die;

while(<IN>)
{
        $string = $uref->fromUTF8("windows-1256",$_);
        print OUT $string;
}
close(IN);
close(OUT);
#-----------------------------------------------

Replies are listed 'Best First'.
Re: Conversion from UTF-8 to windows-1256 encoding by Sixtease (Friar) on Oct 29, 2007 at 10:09 UTC
Hello. I don't know about Unicode::UTF8simple, but I think you can do without it: `open IN, "<:encoding(utf8)", $ARGV[0]; open OUT, ">:encoding(cp-1256)", $ARGV[1]; while (<IN>) { print OUT } __END__` Update: open itself lets you specify the encoding for each filehandle. Perl's input/output layer does the conversion for you.
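For readers who want the same idea with error checking, here is a minimal sketch using three-argument open and lexical filehandles. The subroutine name `convert_utf8_to_cp1256` is made up for illustration and is not from the thread; `cp1256` is the Encode name for windows-1256.

```perl
use strict;
use warnings;

# Copy a UTF-8 text file to a windows-1256 (cp1256) file.
# The :encoding layers decode on read and encode on write.
# (Hypothetical helper name, chosen for this sketch.)
sub convert_utf8_to_cp1256 {
    my ($in_path, $out_path) = @_;

    open(my $in, '<:encoding(UTF-8)', $in_path)
        or die "Cannot read $in_path: $!";
    open(my $out, '>:encoding(cp1256)', $out_path)
        or die "Cannot write $out_path: $!";

    while (my $line = <$in>) {
        print {$out} $line;
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return;
}

# When run as a script: perl convert.pl input.txt output.txt
convert_utf8_to_cp1256(@ARGV) if @ARGV == 2;
```

Characters that have no windows-1256 equivalent will still trigger a "does not map" warning on output, so this does not by itself deal with unmappable input.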
Re^2: Conversion from UTF-8 to windows-1256 encoding by iman_saleh (Novice) on Oct 29, 2007 at 11:25 UTC
Thanks, I tried your code and it works, but it displays the following message when I run it: "\x{feff}" does not map to cp1256, <IN> line 1. The character \x{feff} also appears at the beginning of the output file, before the text. I don't know why.
Re^3: Conversion from UTF-8 to windows-1256 encoding by almut (Canon) on Oct 29, 2007 at 12:21 UTC
Quoting iman_saleh: "\x{feff}" does not map to cp1256, <IN> line 1. And the character \x{feff} is displayed at the beginning of the file
`FEFF` is the Unicode code point of the BOM (Byte Order Mark). You just have to ignore it, i.e. skip over it or remove it from the input. With UTF-8 the BOM has no real use, because the byte order is always the same, but on Windows the BOM is generally used to identify the file as being Unicode-encoded.
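One way to skip the BOM, as a rough sketch: once the input layer has decoded the text, the BOM is the single character U+FEFF at the start of the first line, so it can be removed with a substitution before printing. The names `strip_bom` and `convert_without_bom` are invented for this example.

```perl
use strict;
use warnings;

# Remove a leading U+FEFF (BOM) from an already-decoded string.
# Strings without a BOM are returned unchanged.
sub strip_bom {
    my ($text) = @_;
    $text =~ s/\A\x{FEFF}//;
    return $text;
}

# UTF-8 to windows-1256 conversion that drops the BOM from line 1.
# (Hypothetical helper name, chosen for this sketch.)
sub convert_without_bom {
    my ($in_path, $out_path) = @_;

    open(my $in, '<:encoding(UTF-8)', $in_path)
        or die "Cannot read $in_path: $!";
    open(my $out, '>:encoding(cp1256)', $out_path)
        or die "Cannot write $out_path: $!";

    while (my $line = <$in>) {
        # $. is the current line number of the handle last read from
        $line = strip_bom($line) if $. == 1;
        print {$out} $line;
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return;
}
```

Because the substitution only matches when the BOM is actually present, the same code should be safe for input files that were saved without one.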
Re^4: Conversion from UTF-8 to windows-1256 encoding by ikegami (Patriarch) on Oct 29, 2007 at 17:48 UTC
Re^3: Conversion from UTF-8 to windows-1256 encoding by Sixtease (Friar) on Oct 29, 2007 at 11:50 UTC
I guess your input file contains non-UTF-8 characters then. I was in a similar situation when I was mailed some UTF-8-encoded text documents and, for a reason I don't know, there was a non-UTF-8 character at the very beginning. I guess the easiest way to go is to call `getc(IN)` just before the loop. This assumes that there really is an invalid character there; adding some tests on the return value of getc may be necessary if you're not sure. ...I could also be totally wrong: your input file may be OK and the problem may be somewhere else. Update: FEFF is the Unicode code point of the BOM (Byte Order Mark). I wrongly said that there is a non-UTF-8 character (meaning a non-UTF-8 byte sequence). According to the error message, the Unicode character simply has no equivalent in cp-1256. Thanks to almut for a proper explanation.
Re^4: Conversion from UTF-8 to windows-1256 encoding by iman_saleh (Novice) on Oct 29, 2007 at 12:19 UTC
