maylin has asked for the wisdom of the Perl Monks concerning the following question:

Can we use Perl to open or save a file in the MS Windows "Unicode" format? To avoid confusion, the "Unicode" format I'm talking about is the following: when you save a text file in Windows Notepad, you click "Save As" and get the following 4 options in the "Encoding" box: ANSI, Unicode, Unicode big endian, UTF-8. If you choose "Unicode" and save, that is the format I mean. It's not UTF-8, but I don't know what it is; maybe UCS-2 LE? Anyway, I used code like this:

open FH, "<:encoding(UCS-2LE)", "$myfile" or die "can't open file";

I tried all the different Unicode formats I know, like UCS-2BE, UCS-2LE, Utf-16, Utf16LE, etc., but none of them worked. Thanks a lot for your help!!

Replies are listed 'Best First'.
Re: Perl to Read/Write Window Unicode Text files
by ikegami (Patriarch) on Aug 10, 2010 at 20:48 UTC

    It's UCS-2le. UTF-16le, a superset of UCS-2le, also works. (Case doesn't matter.) You should use UTF-16le when reading. You could use UCS-2le for writing if you wanted to be strict, but UTF-16le should be fine too. (Just be aware that UTF-16 is a variable-width encoding, so there might be some differences of opinion as to how long a string is.)

    Note that Perl might mangle some characters due to the unfortunate order in which the :crlf layer is placed in relation to the :encoding layer. You can avoid that problem using

    # Preserve CRLF
    open(my $fh, '<:raw:perlio:encoding(UTF-16le)', ...)

    # CRLF->LF
    open(my $fh, '<:raw:perlio:encoding(UTF-16le):crlf', ...)
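    As a rough sketch of how those layers fit together, the round trip below writes a small UTF-16le file with CRLF line endings and reads it back. The temporary file, the sample line, and the explicit BOM handling are assumptions for illustration: Notepad's "Unicode" files normally start with a byte order mark, which encoding(UTF-16le) neither writes nor strips for you.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file standing in for the real data file.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write: LF -> CRLF happens above the encoding layer, so the CR and LF
# are each encoded as a full 16-bit code unit.
open(my $out, '>:raw:perlio:encoding(UTF-16le):crlf', $path)
    or die "Can't write '$path': $!";
print {$out} "\x{FEFF}";            # BOM, written by hand
print {$out} "5||1|||SFDC||\n";
close $out or die "Can't close '$path': $!";

# Read it back with the same layer stack (CRLF -> LF).
open(my $in, '<:raw:perlio:encoding(UTF-16le):crlf', $path)
    or die "Can't read '$path': $!";
my $line = <$in>;
close $in;

$line =~ s/^\x{FEFF}//;             # drop the BOM if present
chomp $line;
print "$line\n";
```

    If the BOM is not removed after reading, it stays at the front of the first line as an invisible U+FEFF character, which can throw off comparisons on that line.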
Re: Perl to Read/Write Window Unicode Text files
by morgon (Priest) on Aug 10, 2010 at 20:19 UTC
    Try UTF-16 as the encoding and post the full code, including error messages.
      Morgon - Thanks for your help. When I run it using UTF-16, Perl finishes without any error message, but my output includes some junk like the following:

      original data line: 5||1|||SFDC||
      new data line: 簀㄀簀㄀㐀㘀簀簀䠀䠀簀簀ഀഊ409||1|||SFDC||

      Following is my code. It works fine on an ANSI text file.

      # this program is to fix line breaks in a "|" (or "," with minor changes in the code) delimted file
      #!/usr/bin/perl -w
      ##############################################################################
      # Following are some parameters you may need to change before running program
      ##############################################################################
      $file_folder="C:\\Test";
      $log_folder="C:\\Test";
      $ori_datafile="CT_Vendor_Summary.txt";
      $new_datafile="CT_Vendor_Summary_new.txt";
      $fix_log="LineBreak_fix_log.txt";
      $error_log="LineBreak_error_log.txt";
      $rptfile="LineBreak_fix_report.txt";
      ########################################################################
      $brklinenum=0;
      $fixlinenum=0;
      $newline;
      #open FH, "$file_folder\\$ori_datafile" or die "can't open file";
      open FH, "<:encoding(UTF-16)", "$file_folder\\$ori_datafile" or die "can't open file";
      $/="\n";
      $ori_line_number=0;
      $pipe_thisline=0;
      $pipe_sum=0;
      $right_pipes=7;
      #open (NEWFILE, ">$file_folder\\$new_datafile") or die "can't open file";
      open (NEWFILE, ">:encoding(UTF-16)","$file_folder\\$new_datafile") or die "can't open file";
      open (FIXLOG, ">$log_folder\\$fix_log") or die "can't open file";
      open (ERRLOG, ">$log_folder\\$error_log") or die "can't open file";
      open (FIXRPT, ">$log_folder\\$rptfile") or die "can't open file";
      while (<FH>) {
          chop; # aviod \n in last field;
          $ori_line_number=$ori_line_number+1;
          # if ($_ =~ /\r/) {print OUT1 "$count\n";}
          $pipe_thisline=($_ =~ tr/\|/\|/);
          if ($ori_line_number eq 1) {
              print FIXRPT "Report on Fixing Line Breaks in CT_Vendor_Summary File\n\n";
              print FIXRPT "Correct number of pipes in each line is $right_pipes \n\n";
          }
          if ($pipe_thisline eq $right_pipes) {
              if ($pipe_sum eq 0) {
                  print NEWFILE "$_" . "\n";
              }
              else {
                  print ERRLOG "A: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n";
                  print ERRLOG " " . "$_" . "\n";
              }
          }
          ###
          else {
              if ( $pipe_thisline > $right_pipes ) {
                  print ERRLOG "B: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n";
                  print ERRLOG " " . "$_" . "\n";
              }
              else {
                  if ($pipe_sum eq 0 ) {
                      $pipe_sum=$pipe_thisline;
                      print FIXLOG "Break: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n";
                      print FIXLOG " " . "$_"."\n";
                      $newline = $_;
                      $brklinenum=$brklinenum+1;
                  }
                  else {
                      $pipe_sum = $pipe_sum + $pipe_thisline;
                      $newline = "$newline" . " " . "$_";
                      $brklinenum=$brklinenum+1;
                      if ($pipe_sum > $right_pipes) {
                          print ERRLOG "C: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n";
                          print ERRLOG " " . "$_" . "\n";
                      }
                      elsif ($pipe_sum eq $right_pipes ) {
                          $fixlinenum=$fixlinenum+1;
                          print NEWFILE "$newline" . "\n";
                          print FIXLOG "Fixed: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n";
                          print FIXLOG " " . "$newline" . "\n";
                          $pipe_sum=0;
                          $newline="";
                      }
                      else {
                          print FIXLOG "Break: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n";
                          print FIXLOG " " . "$newline" . "\n";
                      }
                  }
              }
          }
      }
      $newlinenum=$ori_line_number + $fixlinenum - $brklinenum;
      print FIXRPT "Total Line # in Original File: $ori_line_number\n";
      print FIXRPT "Total #of Line breaks: $brklinenum \n";
      print FIXRPT "After fixing: $fixlinenum \n";
      print FIXRPT "Total Line # in New File: $newlinenum \n";

        First, that's completely unreadable. Please add <c>..</c> tags around computer text such as code, data, output, etc.

        Secondly, it's way too long. It should not require more than 5 lines to demonstrate this problem.

        Finally, it would be more fruitful to provide a hex dump of the data than to post funky characters.

        #!/usr/bin/perl -w
        ##############################################################################

        It is usually good practice to use the warnings and strict pragmas:

        #!/usr/bin/perl
        use warnings;
        use strict;
        ##############################################################################

        open FH, "<:encoding(UTF-16)", "$file_folder\\$ori_datafile" or die "can't open file";
        ...
        #open (NEWFILE, ">$file_folder\\$new_datafile") or die "can't open file";
        open (NEWFILE, ">:encoding(UTF-16)","$file_folder\\$new_datafile") or die "can't open file";
        open (FIXLOG, ">$log_folder\\$fix_log") or die "can't open file";
        open (ERRLOG, ">$log_folder\\$error_log") or die "can't open file";
        open (FIXRPT, ">$log_folder\\$rptfile") or die "can't open file";

        It is usually a good idea to include the $! or $^E variable in your error message so that you know why open failed.
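        For illustration, a minimal sketch of an error message that includes $!. The path here is a made-up one chosen so that open fails; it is not from the original script.

```perl
use strict;
use warnings;

# Hypothetical path that should not exist, so that open fails.
my $path = 'no_such_folder_for_demo/CT_Vendor_Summary.txt';

my $message = '';
if (open(my $fh, '<', $path)) {
    close $fh;
}
else {
    # $! holds the system error text for the failed call.
    $message = "Can't open '$path': $!";
}
print "$message\n";
```

        In a real script the else branch would normally just be an "or die" with the same message.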


        chop; # aviod \n in last field;

        Most modern Perl programs use chomp instead of chop.
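        A small sketch of the difference, using a made-up line with no trailing newline (as the last line of a file often is):

```perl
use strict;
use warnings;

my $chopped = "a|b";      # no trailing newline
my $chomped = "a|b";

chop $chopped;            # removes the last character, whatever it is
chomp $chomped;           # removes a trailing $/ (newline by default) only

print "chop:  '$chopped'\n";
print "chomp: '$chomped'\n";
```

        Here chop eats the "b", while chomp leaves the string alone.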


        $ori_line_number=$ori_line_number+1;

        That is usually written as:

        $ori_line_number += 1;

        Or simply:

        $ori_line_number++;

        $pipe_thisline=($_ =~ tr/\|/\|/);

        tr/// does not interpolate so the back-slashes are not necessary.    Also, if the replacement character list is the same as the search character list then the replacement character list can be omitted, and the default binding is to the $_ variable so that can be omitted as well, so:

        $pipe_thisline = tr/|//;
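        A quick sketch of that counting idiom on the sample data line from the original post, bound to a named variable instead of $_:

```perl
use strict;
use warnings;

my $line = '5||1|||SFDC||';    # sample line from the original post

# With an empty replacement list and no /d modifier, tr/// only counts
# the matching characters and leaves the string unchanged.
my $pipes = ($line =~ tr/|//);

print "pipes: $pipes\n";
```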

        if ($pipe_thisline eq $right_pipes) {
        ...
        if ( $pipe_thisline > $right_pipes ) {

        Are $pipe_thisline and $right_pipes numeric or text? You are using numeric comparison in one place and text comparison in another.


        if ($ori_line_number eq 1) {
            print FIXRPT "Report on Fixing Line Breaks in CT_Vendor_Summary File\n\n";
        ...
        else {
            if ($pipe_sum eq 0 ) {
        ...
        elsif ($pipe_sum eq $right_pipes ) {

        Why are you using text comparison on numeric values?
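        To show where the two operators can disagree, here is a small sketch. The value '7.0' is hypothetical and only there to make the difference visible; the counters in the posted script are plain integers, for which eq and == happen to agree.

```perl
use strict;
use warnings;

my $pipe_sum    = '7.0';   # hypothetical value that is numerically 7
my $right_pipes = 7;

# eq compares the strings "7.0" and "7"; == compares the numbers 7 and 7.
my $same_as_text   = ($pipe_sum eq $right_pipes) ? 1 : 0;
my $same_as_number = ($pipe_sum == $right_pipes) ? 1 : 0;

print "eq: $same_as_text, ==: $same_as_number\n";
```

        Using == for counts keeps the intent clear and avoids surprises if a value ever arrives in a different textual form.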