Text file to UTF-8 encoding

dexter29 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Text file to UTF-8 encoding by FunkyMonk (Bishop) on Jul 31, 2008 at 10:39 UTC
`map {Encode::encode("utf-8", $_)} @FileData;` Shouldn't you be saving the result of the map? `@FileData = map {Encode::encode("utf-8", $_)} @FileData;` Unless I state otherwise, all my code runs with strict and warnings
Re: Text file to UTF-8 encoding by moritz (Cardinal) on Jul 31, 2008 at 10:36 UTC
In what encoding is it now? Suppose it's in encoding `$x`; try this: `use Encode qw(from_to); my $str = do { local(*ARGV, $/); @ARGV = $File; <> }; from_to($str, $x, 'UTF-8'); open my $out, '>', 'X:\TestData.txt.UTF-8' or die $!; print $out $str; close $out or die $!;` Of course, you can also just use IO layers: `open my $in, "<:encoding($x)", $filename or die $!; open my $out, ">:encoding(UTF-8)", "$filename.utf8" or die $!; select $out; print while (<$in>); close $in or die $!; close $out or die $!;` Your version can't work because `Encode::encode` doesn't modify its arguments, and you're ignoring the return value of the `map`. (Update: fixed typo in `from_to`, ambrus++)
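A minimal sketch of how `from_to` behaves, assuming for illustration that the input is ISO-8859-1 (the thread never says what `$x` actually is). It shows that the conversion happens in place on the first argument, and that the return value is the length of the result in octets rather than the converted data.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical input: the ISO-8859-1 octets for "cafe" with an e-acute.
my $octets = "caf\xe9";

# from_to rewrites $octets in place and returns the new length in octets.
my $length = from_to($octets, 'ISO-8859-1', 'UTF-8');

# Show the converted octets as hex so the two-byte e-acute is visible.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $octets;
print "$length octets: $hex\n";
```

So the thing to write to the output file is the variable that was passed in, not the value `from_to` returns.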
Re: Text file to UTF-8 encoding by davorg (Chancellor) on Jul 31, 2008 at 11:03 UTC
FunkyMonk is right. `encode` doesn't alter its arguments in place. You need to catch them in a variable. `@FileData = map { Encode::encode('utf-8', $_) } @FileData;` You're currently doing a lot of work but ignoring the results. -- See the Copyright notice on my home node. "The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg
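To make the point concrete, here is a small sketch with made-up sample data standing in for `@FileData` (the original poster's data is not shown in the thread). It calls `Encode::encode` once while discarding the result and once while keeping it, then compares lengths.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample data: character strings, one containing U+00E9.
my @FileData = ("caf\x{e9}\n", "plain\n");

# Discarding the return value leaves the input untouched.
Encode::encode('UTF-8', $_) for @FileData;
my $unchanged_len = length $FileData[0];    # still counts 5 characters

# Keeping the return value yields the UTF-8 octets.
my @encoded = map { Encode::encode('UTF-8', $_) } @FileData;
my $encoded_len = length $encoded[0];       # U+00E9 is now two octets

printf "before: %d, after: %d\n", $unchanged_len, $encoded_len;
```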
Re: Text file to UTF-8 encoding by AltBlue (Chaplain) on Jul 31, 2008 at 14:39 UTC
Perl distributions usually include the piconv utility program (GNU `iconv(1)` "reinvented" in Perl). E.g.: `$ piconv -f iso-8859-2 -t utf8 latin2file.txt > utf8file.txt` Update: `s/cpan:/doc:/` in `piconv` link
Re: Text file to UTF-8 encoding by sasdrtx (Friar) on Jul 31, 2008 at 14:42 UTC
Depending on your input text, converting to UTF-8 may well not increase the file size by much. I believe a two-byte indicator must be added at the beginning of the file; but the common ASCII characters from 0-127 are UTF-8 encoded as-is. Of course, you do have to actually modify your data to get any difference (see previous replies). sas
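A rough sketch of the size point, using two invented sample strings: pure ASCII text takes exactly one octet per character in UTF-8, while characters above 127 take two or more. Only the `Encode` module is assumed.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical samples: one pure ASCII, one with U+00EF and U+20AC (euro sign).
my $ascii = "Hello, world\n";
my $mixed = "na\x{ef}ve \x{20ac}5\n";

# Octet counts after encoding to UTF-8.
my $ascii_bytes = length Encode::encode('UTF-8', $ascii);
my $mixed_bytes = length Encode::encode('UTF-8', $mixed);

printf "ascii: %d chars, %d octets\n", length $ascii, $ascii_bytes;
printf "mixed: %d chars, %d octets\n", length $mixed, $mixed_bytes;
```

Here U+00EF needs two octets and U+20AC needs three, so the growth depends entirely on how many non-ASCII characters the file contains.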
Re^2: Text file to UTF-8 encoding by massa (Hermit) on Jul 31, 2008 at 15:04 UTC
"I believe a two-byte indicator must be added at the beginning of the file;" `s/two-byte/three-byte/; s/must /, the codepoint 0xFEFF (in UTF-8, "\xEF\xBB\xBF"), can/;` []s, HTH, Massa (κς,πμ,πλ)
Re^2: Text file to UTF-8 encoding by moritz (Cardinal) on Jul 31, 2008 at 15:09 UTC
"I believe a two-byte indicator must be added at the beginning of the file" I believe you are referring to the Byte Order Mark, which is by no means mandatory. It is used for UTF-16 and UTF-32 because endianness matters there. And the byte order mark in UTF-8 is three bytes (EF BB BF), not two.
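The three-byte figure is easy to check: encoding the code point U+FEFF as UTF-8 gives EF BB BF. The sketch below also shows one way to drop a leading BOM from UTF-8 octets; the `$data` value is invented for the example.

```perl
use strict;
use warnings;
use Encode ();

# The byte order mark is the code point U+FEFF; encode it as UTF-8.
my $bom = Encode::encode('UTF-8', "\x{FEFF}");
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bom;
print "BOM in UTF-8: $hex\n";

# Hypothetical file content that happens to start with a BOM.
my $data = $bom . "text";

# Remove the BOM only if it sits at the very start of the octets.
(my $stripped = $data) =~ s/\A\xEF\xBB\xBF//;
print "stripped: $stripped\n";
```

`Encode::encode('UTF-8', ...)` does not add a BOM by itself, so one appears in the output only if the data already contains U+FEFF.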