dexter29 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am looking to convert a .txt file to UTF-8 encoding. Any help with this would be fantastic. I have tried the following but the file is still the same size and has not changed to UTF-8 encoding.
my $File = 'X:\TestData.txt';
my @FileData = do { local(*ARGV); @ARGV=$File; <> };
map {Encode::encode("utf-8", $_)} @FileData;
WriteFile('X:\Test.txt', @FileData);
### Got it due to the info below. Many thanks.

Re: Text file to UTF-8 encoding
by FunkyMonk (Bishop) on Jul 31, 2008 at 10:39 UTC
    map {Encode::encode("utf-8", $_)} @FileData;
    Shouldn't you be saving the result of the map?
    @FileData = map {Encode::encode("utf-8", $_)} @FileData;


    Unless I state otherwise, all my code runs with strict and warnings
Re: Text file to UTF-8 encoding
by moritz (Cardinal) on Jul 31, 2008 at 10:36 UTC
    In what encoding is it now? Suppose it's in encoding $x, try this:
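    use Encode qw(from_to);
    # Slurp the whole file as raw octets ($File is the input path from the question)
    my $str = do { local(*ARGV, $/); @ARGV = $File; <> };
    open my $out, '>', 'X:\TestData.txt.UTF-8' or die $!;
    # from_to converts $str in place and returns the new length, not the string
    from_to($str, $x, 'UTF-8');
    print $out $str;
    close $out or die $!;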

    Of course you can also just use IO layers.

    open my $in, "<:encoding($x)", $filename or die $!;
    open my $out, ">:encoding(UTF-8)", "$filename.utf8" or die $!;
    select $out;
    print while (<$in>);
    close $in or die $!;
    close $out or die $!;

    Your version can't work because Encode::encode doesn't modify its arguments, and you're ignoring the return value of the map.

    (Update: fixed typo in from_to, ambrus++)
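
To illustrate the point about the ignored return value, here is a minimal sketch. The sample string and variable names are made up for the demonstration; only Encode::encode comes from the thread. It assumes the default CHECK argument, in which case encode leaves its input alone.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample data: one character string containing U+00E9.
my @lines = ("caf\x{e9}\n");

# Calling encode and discarding the result leaves @lines untouched.
encode('UTF-8', $_) for @lines;
print "before: ", length($lines[0]), " characters\n";

# Keeping the return value yields the UTF-8 octets (U+00E9 becomes C3 A9).
my @encoded = map { encode('UTF-8', $_) } @lines;
print "after:  ", length($encoded[0]), " octets\n";
```

The file only grows where the input contains non-ASCII characters, which is why an unchanged size alone does not prove the conversion failed.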

Re: Text file to UTF-8 encoding
by davorg (Chancellor) on Jul 31, 2008 at 11:03 UTC
Re: Text file to UTF-8 encoding
by AltBlue (Chaplain) on Jul 31, 2008 at 14:39 UTC
    Perl distributions usually include the piconv utility program (GNU iconv(1) "reinvented" in Perl). E.g.:
    $ piconv -f iso-8859-2 -t utf8 latin2file.txt > utf8file.txt
    Update: s/cpan:/doc:/ in piconv link
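
For comparison, the same kind of Latin-2 to UTF-8 conversion can be done in-process with Encode::from_to. This is only a sketch on a hard-coded sample string (an assumption for illustration, not the poster's data). Note that from_to rewrites its first argument in place and returns the new length in octets.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical sample: the octets B3 F3 64 BC spell a four-letter
# Polish word in ISO-8859-2 (0x64 is the ASCII letter "d").
my $octets = "\xB3\xF3\x64\xBC";

# Convert in place; the return value is the length of the result.
my $len = from_to($octets, 'iso-8859-2', 'UTF-8');

# Show the converted octets as hex.
print join(' ', map { sprintf '%02X', ord } split //, $octets), "\n";
print "length: $len\n";
```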
Re: Text file to UTF-8 encoding
by sasdrtx (Friar) on Jul 31, 2008 at 14:42 UTC
    Depending on your input text, converting to UTF-8 may not increase the file size by much. I believe a two-byte indicator must be added at the beginning of the file; but the common ASCII characters from 0-127 are UTF-8 encoded as-is.

    Of course, you do have to actually modify your data to get any difference (see previous replies).


    sas
      believe a two-byte indicator must be added at the beginning of the file;
      s/two-byte/three-byte/; s/must /, the codepoint 0xFEFF (in UTF-8, "\xEF\xBB\xBF"), can/;
      []s, HTH, Massa (κς,πμ,πλ)
      much. I believe a two-byte indicator must be added at the beginning of the file

      I believe you are referring to the Byte Order Mark, which is by no means mandatory. It is used for UTF-16 and UTF-32 because endianness matters there.

      And the byte order mark in UTF-8 is three bytes (EF BB BF), not two.
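
Both claims in this sub-thread are easy to check with Encode. The sketch below uses a made-up sample string. It shows that pure ASCII text is byte-for-byte unchanged by UTF-8 encoding, and that the optional byte order mark (code point U+FEFF) takes three octets.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical ASCII-only sample: encoding to UTF-8 changes nothing.
my $ascii      = "Hello, world\n";
my $ascii_utf8 = encode('UTF-8', $ascii);
print $ascii_utf8 eq $ascii ? "ASCII unchanged\n" : "ASCII changed\n";

# The byte order mark is the code point U+FEFF.
my $bom     = encode('UTF-8', "\x{FEFF}");
my $bom_hex = join ' ', map { sprintf '%02X', ord } split //, $bom;
print "BOM octets: $bom_hex (", length($bom), " bytes)\n";
```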