in reply to Perl to Read/Write Window Unicode Text files

Try UTF-16 as the encoding and post the full code, including any error messages.

Re^2: Perl to Read/Write Window Unicode Text files
by maylin (Initiate) on Aug 10, 2010 at 20:51 UTC
    Morgon - Thanks for your help. When I run it using UTF-16, Perl finishes without any error message, but my output includes junk like the following:

        original data line: 5||1|||SFDC||
        new data line: 簀㄀簀㄀㐀㘀簀簀䠀䠀簀簀ഀഊ409||1|||SFDC||

    Here is my code. It works fine on an ANSI text file.

        # this program is to fix line breaks in a "|" (or "," with minor changes in the code) delimted file
        #!/usr/bin/perl -w
        ##############################################################################
        # Following are some parameters you may need to change before running program
        ##############################################################################
        $file_folder="C:\\Test";
        $log_folder="C:\\Test";
        $ori_datafile="CT_Vendor_Summary.txt";
        $new_datafile="CT_Vendor_Summary_new.txt";
        $fix_log="LineBreak_fix_log.txt";
        $error_log="LineBreak_error_log.txt";
        $rptfile="LineBreak_fix_report.txt";
        ########################################################################
        $brklinenum=0;
        $fixlinenum=0;
        $newline;
        #open FH, "$file_folder\\$ori_datafile" or die "can't open file";
        open FH, "<:encoding(UTF-16)", "$file_folder\\$ori_datafile" or die "can't open file";
        $/="\n";
        $ori_line_number=0;
        $pipe_thisline=0;
        $pipe_sum=0;
        $right_pipes=7;
        #open (NEWFILE, ">$file_folder\\$new_datafile") or die "can't open file";
        open (NEWFILE, ">:encoding(UTF-16)","$file_folder\\$new_datafile") or die "can't open file";
        open (FIXLOG, ">$log_folder\\$fix_log") or die "can't open file";
        open (ERRLOG, ">$log_folder\\$error_log") or die "can't open file";
        open (FIXRPT, ">$log_folder\\$rptfile") or die "can't open file";
        while (<FH>) {
            chop; # aviod \n in last field;
            $ori_line_number=$ori_line_number+1;
            # if ($_ =~ /\r/) {print OUT1 "$count\n";}
            $pipe_thisline=($_ =~ tr/\|/\|/);
            if ($ori_line_number eq 1) {
                print FIXRPT "Report on Fixing Line Breaks in CT_Vendor_Summary File\n\n";
                print FIXRPT "Correct number of pipes in each line is $right_pipes \n\n";
            }
            if ($pipe_thisline eq $right_pipes) {
                if ($pipe_sum eq 0) {
                    print NEWFILE "$_" . "\n";
                }
                else {
                    print ERRLOG "A: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n";
                    print ERRLOG " " . "$_" . "\n";
                }
            }
            ###
            else {
                if ( $pipe_thisline > $right_pipes ) {
                    print ERRLOG "B: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n";
                    print ERRLOG " " . "$_" . "\n";
                }
                else {
                    if ($pipe_sum eq 0 ) {
                        $pipe_sum=$pipe_thisline;
                        print FIXLOG "Break: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n";
                        print FIXLOG " " . "$_"."\n";
                        $newline = $_;
                        $brklinenum=$brklinenum+1;
                    }
                    else {
                        $pipe_sum = $pipe_sum + $pipe_thisline;
                        $newline = "$newline" . " " . "$_";
                        $brklinenum=$brklinenum+1;
                        if ($pipe_sum > $right_pipes) {
                            print ERRLOG "C: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n";
                            print ERRLOG " " . "$_" . "\n";
                        }
                        elsif ($pipe_sum eq $right_pipes ) {
                            $fixlinenum=$fixlinenum+1;
                            print NEWFILE "$newline" . "\n";
                            print FIXLOG "Fixed: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n";
                            print FIXLOG " " . "$newline" . "\n";
                            $pipe_sum=0;
                            $newline="";
                        }
                        else {
                            print FIXLOG "Break: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n";
                            print FIXLOG " " . "$newline" . "\n";
                        }
                    }
                }
            }
        }
        $newlinenum=$ori_line_number + $fixlinenum - $brklinenum;
        print FIXRPT "Total Line # in Original File: $ori_line_number\n";
        print FIXRPT "Total #of Line breaks: $brklinenum \n";
        print FIXRPT "After fixing: $fixlinenum \n";
        print FIXRPT "Total Line # in New File: $newlinenum \n";

      First, that's completely unreadable. Please add <c>..</c> tags around computer text such as code, data, output, etc.

      Secondly, it's way too long. It should not require more than 5 lines to demonstrate this problem.

      Finally, it would be more fruitful to provide a hex dump of the data than to post funky characters.

        I found that, when you use UTF-16, Perl actually outputs a "Unicode big endian" text file, not "Unicode". Probably only the following two lines of code matter for checking the issue:

        open FH, "<:encoding(UTF-16)", "$inputFile" or die "can't open file";
        open (NEWFILE, ">:encoding(UTF-16)","$outputFile") or die "can't open file";
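
One possible way around the big-endian output, as a minimal sketch rather than a tested fix for the script above: Encode's plain UTF-16 encoder writes big-endian data with a BOM, while Notepad's "Unicode" format is little-endian. The sketch therefore names UTF-16LE explicitly, writes the BOM itself (UTF-16LE does not add one), and uses :raw so that a default :crlf layer cannot insert single bytes into the 16-bit data. The helper names write_utf16le and read_utf16 are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: write lines as little-endian UTF-16 with a BOM
# and explicit CRLF line endings.
sub write_utf16le {
    my ($path, @lines) = @_;
    # :raw drops any default :crlf layer before the encoding layer is added.
    open my $out, '>:raw:encoding(UTF-16LE)', $path
        or die "Can't open '$path' for writing: $!";
    print {$out} "\x{FEFF}";             # BOM; UTF-16LE does not write one itself
    print {$out} "$_\r\n" for @lines;    # CRLF written explicitly
    close $out or die "Can't close '$path': $!";
    return;
}

# Hypothetical helper: read a UTF-16 file that starts with a BOM.
# Plain "UTF-16" uses the BOM to pick the byte order when decoding.
sub read_utf16 {
    my ($path) = @_;
    open my $in, '<:raw:encoding(UTF-16)', $path
        or die "Can't open '$path' for reading: $!";
    my @lines = <$in>;
    close $in or die "Can't close '$path': $!";
    s/\r?\n\z// for @lines;              # strip LF or CRLF line endings
    return @lines;
}
```

With these, something like write_utf16le($outputFile, read_utf16($inputFile)) should round-trip a BOM-marked file; whether that matches your data is best confirmed with a hex dump of the result.
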
      #!/usr/bin/perl -w
      ##############################################################################

      It is usually good practice to use the warnings and strict pragmas:

      #!/usr/bin/perl
      use warnings;
      use strict;
      ##############################################################################

      open FH, "<:encoding(UTF-16)", "$file_folder\\$ori_datafile" or die "can't open file";
      ...
      #open (NEWFILE, ">$file_folder\\$new_datafile") or die "can't open file";
      open (NEWFILE, ">:encoding(UTF-16)","$file_folder\\$new_datafile") or die "can't open file";
      open (FIXLOG, ">$log_folder\\$fix_log") or die "can't open file";
      open (ERRLOG, ">$log_folder\\$error_log") or die "can't open file";
      open (FIXRPT, ">$log_folder\\$rptfile") or die "can't open file";

      It is usually a good idea to include the $! or $^E variable in your error message so that you know why open failed.
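
A minimal sketch of what that could look like, using a lexical filehandle and three-argument open. The helper name open_for_reading is made up for illustration; on Windows, $^E could be appended to the message for the OS-specific error text.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file for reading and say why it failed.
sub open_for_reading {
    my ($path) = @_;
    open my $fh, '<', $path
        or die "Can't open '$path' for reading: $!\n";   # $! holds the system error
    return $fh;
}
```

Including the path in the message also shows which of several open calls failed, which the identical "can't open file" messages in the original do not.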


      chop; # aviod \n in last field;

      Most modern Perl programs use chomp instead of chop.
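
The difference matters when the last line of a file has no trailing newline: chop always removes the final character, whatever it is, while chomp removes only a trailing input record separator ($/, "\n" by default). A small sketch, with made-up helper names:

```perl
use strict;
use warnings;

# Hypothetical helpers that return a modified copy of their argument.
sub after_chop  { my ($s) = @_; chop $s;  return $s }
sub after_chomp { my ($s) = @_; chomp $s; return $s }

# With a trailing newline both give the same result.
print after_chop("a|b\n"), "\n";    # a|b
print after_chomp("a|b\n"), "\n";   # a|b

# Without one, chop eats a character of real data.
print after_chop("a|b"), "\n";      # a|
print after_chomp("a|b"), "\n";     # a|b
```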


      $ori_line_number=$ori_line_number+1;

      That is usually written as:

      $ori_line_number += 1;

      Or simply:

      $ori_line_number++;

      $pipe_thisline=($_ =~ tr/\|/\|/);

      tr/// does not interpolate, so the backslashes are not necessary. Also, if the replacement list is the same as the search list, the replacement list can be omitted, and since tr/// operates on $_ by default, the binding can be omitted as well:

      $pipe_thisline = tr/|//;
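
As a quick check of the counting idiom, here is a sketch that applies it to the sample data line from the original post. The helper name count_pipes is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count "|" characters without changing the string.
# With an empty replacement list and no /d flag, tr/// only counts.
sub count_pipes {
    my ($line) = @_;
    my $count = ($line =~ tr/|//);
    return $count;
}

print count_pipes('5||1|||SFDC||'), "\n";   # 7
```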

      if ($pipe_thisline eq $right_pipes) {
      ...
      if ( $pipe_thisline > $right_pipes ) {

      Are $pipe_thisline and $right_pipes numeric or text? You are using a numeric comparison in one place and a text comparison in another.


      if ($ori_line_number eq 1) {
          print FIXRPT "Report on Fixing Line Breaks in CT_Vendor_Summary File\n\n";
      ...
      else {
          if ($pipe_sum eq 0 ) {
      ...
      elsif ($pipe_sum eq $right_pipes ) {

      Why are you using text comparison on numeric values?
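
To illustrate why the choice of operator matters, here is a small sketch. With plain integer counters like the ones in the script, eq and == happen to agree, but they diverge as soon as the same number can be written in more than one way. The helper names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two kinds of comparison.
sub same_number { my ($x, $y) = @_; return $x == $y }
sub same_text   { my ($x, $y) = @_; return $x eq $y }

print "7 == 7.0\n"     if same_number('7', '7.0');    # printed: equal as numbers
print "7 eq 7.0\n"     if same_text('7', '7.0');      # not printed: different strings
print "'10' lt '9'\n"  if '10' lt '9';                # printed: string order compares '1' with '9'
print "10 < 9\n"       if 10 < 9;                     # not printed
```

Using ==, <, and > for counters throughout keeps the intent clear and avoids surprises if a value ever arrives as "07" or "7.0".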