snaporaz has asked for the wisdom of the Perl Monks concerning the following question:

Hello, this is my first posting here. I need help with something basic, but I couldn't figure it out myself, and I browsed through several sites besides this one before posting. So: I need to convert files from ASCII into Unicode UTF-8. I work on Windows XP and use ActiveState Perl 5.8. I tried this (and many other variants...):
    use Encode;
    $infile=shift;
    $outfile=shift;
    open IN, $infile or die;
    open OUT, ">:utf8", "$outfile" or die;
    while(<IN>) { print OUT $_; }
It does not work. It actually screws up upper ASCII characters that are OK when NOT using the utf8 part in opening the output file. Any idea about what I am missing here?

Replies are listed 'Best First'.
Re: writing utf8 files under WindowsXP
by kvale (Monsignor) on Apr 28, 2004 at 16:46 UTC
    The ASCII standard only defines characters using the lower 7 bits of a byte; the upper bit is either set to zero or used as a parity bit. The 'upper' ASCII characters are not ASCII at all, but part of some extended encoding. You first need to discover which encoding your files use before you can convert them to UTF-8.
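
    As a rough first check, one can count how many bytes in the data fall outside the 7-bit range. This is only a sketch: count_high_bytes is a made-up helper name, and a non-zero count says that the data is not plain ASCII, not which encoding it is in.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): count the bytes in a
# byte string that lie outside the 7-bit ASCII range, i.e. 0x80-0xFF.
sub count_high_bytes {
    my ($bytes) = @_;
    # The "= () =" idiom forces list context and yields the match count.
    my $count = () = $bytes =~ /[\x80-\xFF]/g;
    return $count;
}

# "caf\xE9" is "cafe" with an accented e as a single Latin-1 byte.
print count_high_bytes("caf\xE9"), "\n";      # one byte above 0x7F
print count_high_bytes("plain ASCII"), "\n";  # none
```

    If the count is zero, the file is already valid UTF-8, because ASCII is a subset of UTF-8.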

    -Mark

Re: writing utf8 files under WindowsXP
by matija (Priest) on Apr 28, 2004 at 18:02 UTC
    Assuming your source is in a character set that shows the correct characters for codes above 127, you're probably using windows-1250 or similar.

    So you should probably replace the loop with:

    while (<IN>) {from_to($_,"cp1250", "utf8"); print OUT $_;}
      That will change $_ to contain UTF-8 data, but Perl will not mark it as utf8. So if you do that, call binmode on OUT and omit the ":utf8" on open; otherwise the already-encoded bytes get encoded a second time on output.
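
      Putting the two suggestions together, a minimal sketch of the whole script might look like this. It assumes the input really is cp1250, which is only a guess in this thread; to_utf8_bytes is a hypothetical helper name introduced for illustration.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: transcode one line of raw bytes to UTF-8 bytes.
# from_to() converts its first argument in place.
sub to_utf8_bytes {
    my ($line, $source_encoding) = @_;
    from_to($line, $source_encoding, "utf8");
    return $line;
}

# Usage: perl convert.pl infile outfile
my ($infile, $outfile) = @ARGV;
if (defined $infile && defined $outfile) {
    open my $in,  '<', $infile  or die "Cannot read $infile: $!";
    open my $out, '>', $outfile or die "Cannot write $outfile: $!";
    binmode $in;   # read raw bytes
    binmode $out;  # write raw bytes: no ":utf8" layer, no second encoding
    while (my $line = <$in>) {
        # "cp1250" is an assumption; use the encoding your files are in.
        print {$out} to_utf8_bytes($line, "cp1250");
    }
    close $in;
    close $out or die "Cannot close $outfile: $!";
}
```

      Because both handles are in binmode, line endings pass through unchanged.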
      Sorry for my ignorance: I tried your suggestion:
      while (<IN>) {from_to($_,"cp1250", "utf8"); print OUT $_;}
      but it complains about from_to, and I do have "use Encode" at the top of the script. What else does from_to need? Thanks
        You need to use Encode like this:
        use Encode qw/from_to/; ... from_to( $_, "cp1250", "utf8" ); ...
        Alternatively, you could use it like this:
        use Encode; ... Encode::from_to( $_, "cp1250", "utf8" ); ...
        This is because by default, Encode does not export the "from_to" function into the "main" namespace of your script.
Re: writing utf8 files under WindowsXP
by borisz (Canon) on Apr 28, 2004 at 18:33 UTC
    Perhaps you need to change the input charset.
    perl -Mencoding=latin1,STDOUT,utf8 -pe1
    Boris
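
The same idea, declaring the input charset and letting Perl do the conversion, can also be sketched with PerlIO encoding layers on both handles instead of calling from_to by hand. This is an illustrative sketch rather than anything posted above: convert_file is a made-up helper name, and "latin1" is only an example source encoding. The demo writes a small temporary file so that the snippet is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: copy $infile to $outfile, decoding on read from
# $source_encoding and encoding on write to UTF-8.
sub convert_file {
    my ($infile, $outfile, $source_encoding) = @_;
    open my $in, "<:encoding($source_encoding)", $infile
        or die "Cannot read $infile: $!";
    open my $out, ">:encoding(UTF-8)", $outfile
        or die "Cannot write $outfile: $!";
    print {$out} $_ while <$in>;
    close $in;
    close $out or die "Cannot close $outfile: $!";
}

# Demo: a temporary Latin-1 file containing one byte above 0x7F.
my ($src_fh, $src) = tempfile();
binmode $src_fh;
print {$src_fh} "caf\xE9\n";
close $src_fh;

my ($dst_fh, $dst) = tempfile();
close $dst_fh;

convert_file($src, $dst, "latin1");
```

With this approach the strings inside the loop are character strings, so the output layer encodes them exactly once.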