Perl to Read/Write Window Unicode Text files

maylin has asked for the wisdom of the Perl Monks concerning the following question:

Re: Perl to Read/Write Window Unicode Text files by ikegami (Patriarch) on Aug 10, 2010 at 20:48 UTC
It's UCS-2le. UTF-16le, a superset of UCS-2le, also works. (Case doesn't matter.) You should use UTF-16le when reading. You could use UCS-2le for writing if you wanted to be strict, but UTF-16le should be fine too. (Just be aware that UTF-16 is a variable-width encoding, so there might be some differences of opinion as to how long a string is.) Note that Perl might mangle some characters due to the unfortunate order in which the :crlf layer is placed in relation to the :encoding layer. You can avoid that problem by spelling out the layers yourself. To preserve CRLF: `open(my $fh, '<:raw:perlio:encoding(UTF-16le)', ...)`. To convert CRLF to LF: `open(my $fh, '<:raw:perlio:encoding(UTF-16le):crlf', ...)`.
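A minimal round-trip sketch of the layer stack described above, assuming a reasonably recent Perl. The helper names `write_utf16le` and `read_utf16le` and the temporary file are made up for illustration; only the layer strings come from the post.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write lines as UTF-16le with CRLF line endings.
sub write_utf16le {
    my ($path, @lines) = @_;
    open(my $out, '>:raw:perlio:encoding(UTF-16le):crlf', $path)
        or die "Can't write $path: $!";
    print {$out} "$_\n" for @lines;    # :crlf turns "\n" into "\r\n" before encoding
    close($out) or die "Can't close $path: $!";
    return;
}

# Hypothetical helper: read the file back, with CRLF translated to LF.
sub read_utf16le {
    my ($path) = @_;
    open(my $in, '<:raw:perlio:encoding(UTF-16le):crlf', $path)
        or die "Can't read $path: $!";
    chomp(my @lines = <$in>);
    close($in);
    return @lines;
}

my (undef, $path) = tempfile(UNLINK => 1);
write_utf16le($path, "5||1|||SFDC||", "caf\x{e9}");
my @back = read_utf16le($path);
print scalar(@back), " lines read back\n";
```

The point of the explicit stack is that `:crlf` sits above `:encoding`, so line-ending translation happens on characters rather than on the encoded bytes. One thing to watch for, as far as I know: a file saved by Windows as "Unicode" usually starts with a BOM, and an explicit UTF-16le layer leaves it in place as U+FEFF at the start of the first line.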
Re: Perl to Read/Write Window Unicode Text files by morgon (Priest) on Aug 10, 2010 at 20:19 UTC
Try UTF-16 as the encoding and post the full code, including any error messages.
Re^2: Perl to Read/Write Window Unicode Text files by maylin (Initiate) on Aug 10, 2010 at 20:51 UTC
Morgon - Thanks for your help. When I run it using UTF-16, Perl finishes without any error message, but my output includes some junk like the following:

original data line: 5||1|||SFDC||
new data line: 簀㄀簀㄀㐀㘀簀簀䠀䠀簀簀ഀഊ409||1|||SFDC||

Following is my code. It works fine on an ANSI text file. # this program is to fix line breaks in a "|" (or "," with minor changes in the code) delimted file #!/usr/bin/perl -w ############################################################################## # Following are some parameters you may need to change before running program ############################################################################## $file_folder="C:\\Test"; $log_folder="C:\\Test"; $ori_datafile="CT_Vendor_Summary.txt"; $new_datafile="CT_Vendor_Summary_new.txt"; $fix_log="LineBreak_fix_log.txt"; $error_log="LineBreak_error_log.txt"; $rptfile="LineBreak_fix_report.txt"; ######################################################################## $brklinenum=0; $fixlinenum=0; $newline; #open FH, "$file_folder\\$ori_datafile" or die "can't open file"; open FH, "<:encoding(UTF-16)", "$file_folder\\$ori_datafile" or die "can't open file"; $/="\n"; $ori_line_number=0; $pipe_thisline=0; $pipe_sum=0; $right_pipes=7; #open (NEWFILE, ">$file_folder\\$new_datafile") or die "can't open file"; open (NEWFILE, ">:encoding(UTF-16)","$file_folder\\$new_datafile") or die "can't open file"; open (FIXLOG, ">$log_folder\\$fix_log") or die "can't open file"; open (ERRLOG, ">$log_folder\\$error_log") or die "can't open file"; open (FIXRPT, ">$log_folder\\$rptfile") or die "can't open file"; while (<FH>) { chop; # aviod \n in last field; $ori_line_number=$ori_line_number+1; # if ($_ =~ /\r/) {print OUT1 "$count\n";} $pipe_thisline=($_ =~ tr/\|/\|/); if ($ori_line_number eq 1) { print FIXRPT "Report on Fixing Line Breaks in CT_Vendor_Summary File\n\n"; print FIXRPT "Correct number of pipes in each line is $right_pipes \n\n"; } if ($pipe_thisline eq $right_pipes) { if ($pipe_sum eq 0) { print NEWFILE "$_" . "\n"; } else { print ERRLOG "A: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n"; print ERRLOG " " . "$_" . "\n"; } } ### else { if ( $pipe_thisline > $right_pipes ) { print ERRLOG "B: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n"; print ERRLOG " " . "$_" . "\n"; } else { if ($pipe_sum eq 0 ) { $pipe_sum=$pipe_thisline; print FIXLOG "Break: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n"; print FIXLOG " " . "$_"."\n"; $newline = $_; $brklinenum=$brklinenum+1; } else { $pipe_sum = $pipe_sum + $pipe_thisline; $newline = "$newline" . " " . "$_"; $brklinenum=$brklinenum+1; if ($pipe_sum > $right_pipes) { print ERRLOG "C: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n"; print ERRLOG " " . "$_" . "\n"; } elsif ($pipe_sum eq $right_pipes ) { $fixlinenum=$fixlinenum+1; print NEWFILE "$newline" . "\n"; print FIXLOG "Fixed: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n"; print FIXLOG " " . "$newline" . "\n"; $pipe_sum=0; $newline=""; } else { print FIXLOG "Break: Original Line #: $ori_line_number; Pipes this line: $pipe_thisline; \$pipe_sum: $pipe_sum\n"; print FIXLOG " " . "$newline" . "\n"; } } } } } $newlinenum=$ori_line_number + $fixlinenum - $brklinenum; print FIXRPT "Total Line # in Original File: $ori_line_number\n"; print FIXRPT "Total #of Line breaks: $brklinenum \n"; print FIXRPT "After fixing: $fixlinenum \n"; print FIXRPT "Total Line # in New File: $newlinenum \n";
Re^3: Perl to Read/Write Window Unicode Text files by ikegami (Patriarch) on Aug 10, 2010 at 20:54 UTC
First, that's completely unreadable. Please add `<c>..</c>` tags around computer text such as code, data, output, etc. Secondly, it's way too long. It should not require more than 5 lines to demonstrate this problem. Finally, it would be more fruitful to provide a hex dump of the data than to post funky characters.
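One way to produce such a hex dump from Perl itself, as a rough sketch. The helper name `hex_dump` and the sample bytes are invented for illustration; they are not taken from the poster's file.

```perl
use strict;
use warnings;

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Sample bytes (an assumption): a BOM, then "5|" and CRLF, as they would
# be laid out in a UTF-16le file.
my $sample = "\xFF\xFE" . "5\x00|\x00" . "\x0D\x00\x0A\x00";
my $dump   = hex_dump($sample);
print "$dump\n";    # prints: FF FE 35 00 7C 00 0D 00 0A 00
```

To dump a real file, open it with the `<:raw` layer so that no translation happens, `read` the first few dozen bytes into a scalar, and pass that to the helper.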
Re^4: Perl to Read/Write Window Unicode Text files by maylin (Initiate) on Aug 10, 2010 at 21:06 UTC
Re^5: Perl to Read/Write Window Unicode Text files by ikegami (Patriarch) on Aug 10, 2010 at 21:31 UTC
Re^3: Perl to Read/Write Window Unicode Text files by Anonymous Monk on Aug 10, 2010 at 22:19 UTC
`#!/usr/bin/perl -w`

It is usually good practice to use the warnings and strict pragmas instead: `#!/usr/bin/perl` followed by `use warnings;` and `use strict;`.

`open FH, "<:encoding(UTF-16)", "$file_folder\\$ori_datafile" or die "can't open file";`
...
`#open (NEWFILE, ">$file_folder\\$new_datafile") or die "can't open file";`
`open (NEWFILE, ">:encoding(UTF-16)","$file_folder\\$new_datafile") or die "can't open file";`
`open (FIXLOG, ">$log_folder\\$fix_log") or die "can't open file";`
`open (ERRLOG, ">$log_folder\\$error_log") or die "can't open file";`
`open (FIXRPT, ">$log_folder\\$rptfile") or die "can't open file";`

It is usually a good idea to include the `$!` or `$^E` variable in your error message so that you know why open failed.

`chop; # aviod \n in last field;`

Most modern Perl programs use chomp instead of chop.

`$ori_line_number=$ori_line_number+1;`

That is usually written as: `$ori_line_number += 1;` Or simply: `$ori_line_number++;`

`$pipe_thisline=($_ =~ tr/\|/\|/);`

`tr///` does not interpolate, so the backslashes are not necessary. Also, if the replacement character list is the same as the search character list, the replacement list can be omitted, and the default binding is to the `$_` variable, so that can be omitted as well. So: `$pipe_thisline = tr/|//;`

`if ($pipe_thisline eq $right_pipes) {` ... `if ( $pipe_thisline > $right_pipes ) {`

Are `$pipe_thisline` and `$right_pipes` numeric or text? You are using numeric comparison in one place and text comparison in another.

`if ($ori_line_number eq 1) { print FIXRPT "Report on Fixing Line Breaks in CT_Vendor_Summary File\n\n";` ... `else { if ($pipe_sum eq 0 ) {` ... `elsif ($pipe_sum eq $right_pipes ) {`

Why are you using text comparison on numeric values?
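Putting several of these suggestions together, a small sketch of the counting step might look like the following. The `count_pipes` helper and the sample lines are made up for illustration; `$right_pipes` and its value of 7 are taken from the posted script.

```perl
use strict;
use warnings;

my $right_pipes = 7;    # expected number of delimiters per record

# Hypothetical helper: count "|" characters in a line.
sub count_pipes {
    my ($line) = @_;
    # With an empty replacement list, tr only counts; the string is unchanged.
    return ($line =~ tr/|//);
}

my @status;
for my $line ("5||1|||SFDC||\n", "5||1|\n") {
    chomp(my $copy = $line);            # chomp removes only a trailing newline
    my $n = count_pipes($copy);
    # Numeric values get numeric comparison (==), not string comparison (eq).
    push @status, $n == $right_pipes ? 'complete' : 'broken';
    printf "%d pipes: %s\n", $n, $status[-1];
}
```

The first sample line has seven pipes and is treated as a complete record; the second has three and would be a candidate for joining with the following line.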