script inserts \x00 bytes on WinXP

dwhite20899 has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a script for a friend which takes a file as input, compares lines, and writes sorted, unduplicated lines to a file. (effectively "sort file | uniq")

When I run it on OSX, it works fine; when he runs it on WinXP, the output file contains \x00 as every other byte.

The input has a few \xb0 characters. I've tried binmode, I've tried s/[^[:ascii:]]/ /g to deal with only ascii, but the \x00 characters continue to be inserted into the output.

What am I missing? I don't think it's a UTF8 issue...

Thanks, Doug

complete script

#!/usr/bin/perl -w
# This script expects two file names for input and output.
# it reads in the whole input file, separates lines based on \n character,
# gets rid of duplicate lines, sorts the remaining lines,
# and prints them to a file.
use strict;
use Getopt::Std;
use vars qw( $content @list %seen @uniqu @sorted $opt_i $opt_o );

# get the in/out file names and do some error checks
getopts('i:o:') or die "Usage : $0 -i infile -o outfile\n";
print "reading \"$opt_i\" and writing to \"$opt_o\"\n";
(-e "$opt_i")  or die "Usage : $0 -i infile -o outfile\n";
($opt_o)  or die "Usage : $0 -i infile -o outfile\n";
if (-e "$opt_o") { die "$0 : will not write to existing file \"$opt_o\"\n"; }

# read in the input file
undef $/;
open(FIN,"$opt_i") or die "$0 : cannot read input file \"$opt_i\"\n";
binmode(FIN);
$content = <FIN>;
close(FIN);

@list = split(/\n/,$content);
print "read ", scalar(@list), " lines in from \"$opt_i\", ";

# get rid of the duplicate lines
%seen = ();
@uniqu = grep { ! $seen{$_} ++ } @list;
# sort with "cmp" is alphabetic, with "<=>" is numeric
@sorted = sort { $a cmp $b } @uniqu;

# print the sorted, unique lines to a file
print "writing ", scalar(@sorted), " lines out to \"$opt_o\"\n";
open(FOUT,">$opt_o") or die "$0 : cannot write output file \"$opt_o\"\n";
binmode(FOUT);
print FOUT join("\n", @sorted);
close(FOUT);

exit;

sample data; the "degree" character is \xb0.

Loremn Ipsum; Tseribow;26/06/04 16:28; 49?17'010N - 050?36'073W; WGS84;Seg; 11.1Tidos; Fog; 186.6?;
Loremn Ipsum; Tseria;25/08/07 23:16; 43?49'528S - 065?29'077E; ED50;Seg; 4.1Tidos; Fog; 132.8?;
Loremn Ipsum; Tseribow;26/06/04 16:48; 39?16'733N - 040?36'086W; WGS84;Seg; 10.7Tidos; Fog; 207.7?;

Update:

It's Saturday, and I have good questions to ask Jack on Monday, so I'm going to back-burner this until I get some feedback from him. Thanks, everyone!

Final update:

He was running another piece of software that changed the output to Unicode. D'oh! But what I learned here helped to deal with that issue.


Replies are listed 'Best First'.
Re: script inserts \x00 bytes on WinXP by shmem (Chancellor) on Sep 05, 2008 at 18:45 UTC
"I don't think it's a UTF8 issue..."

You're probably right; it looks more like a UTF-16 (a.k.a. Windows Unicode) issue. See Encode and Encode::Supported. Try the following in your script:

use Encode qw(from_to);
...
from_to( $_, 'UTF-16LE', 'latin-1') for @sorted;
print FOUT join("\n", @sorted);

Change the 'latin-1' to the actual encoding you want.
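To illustrate the UTF-16 diagnosis above, here is a minimal sketch of why UTF-16LE data looks like "every other byte is \x00" and what from_to() does to it. The sample string and the helper name hex_bytes are made up for this example; only encode() and from_to() come from the Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode from_to);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# Made-up sample containing the degree sign (U+00B0), as in the OP's data.
my $text = "49\x{b0}17'N";

# In UTF-16LE every character below U+0100 is followed by a \x00 byte.
my $utf16 = encode('UTF-16LE', $text);
print hex_bytes($utf16), "\n";

# from_to() converts the byte string in place (UTF-16LE to Latin-1 here).
my $latin1 = $utf16;
from_to($latin1, 'UTF-16LE', 'iso-8859-1');
print hex_bytes($latin1), "\n";
```

The first dump shows a 00 after each data byte; the second shows the same text with one byte per character. Note that from_to() only helps if the data really is UTF-16LE at that point in the script.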
Re^2: script inserts \x00 bytes on WinXP by ikegami (Patriarch) on Sep 05, 2008 at 22:37 UTC
I think you meant that as a debugging tool? Why else would you do the conversion so late? Here's what the final code should probably look like:

...
my $content;
{
    open(my $fin, '<:encoding(UTF-16le)', $opt_i)
        or die "$0 : cannot read input file \"$opt_i\"\n";
    local $/;
    $content = <$fin>;
}
...
{
    open(my $fout, '>:encoding(iso-latin-1)', $opt_o)
        or die "$0 : cannot write output file \"$opt_o\"\n";
    print $fout join("\n", @sorted);
}
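The "decode early, encode late" idea in the reply above can also be sketched without PerlIO layers, using Encode's decode() and encode() on an in-memory string. This is a swapped-in variant for illustration, not ikegami's code; the function name sort_unique_lines is hypothetical, and the encoding names are assumptions to be replaced with whatever the real files use.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode once on input, work on characters, encode once on output.
# $in_enc and $out_enc are assumptions supplied by the caller.
sub sort_unique_lines {
    my ($raw, $in_enc, $out_enc) = @_;
    my $text = decode($in_enc, $raw);

    # Drop duplicate lines, keeping the first occurrence of each.
    my %seen;
    my @unique = grep { !$seen{$_}++ } split /\r?\n/, $text;

    # Sort as strings and return the result as bytes in the output encoding.
    return encode($out_enc, join("\n", sort @unique));
}
```

Called as sort_unique_lines($bytes, 'UTF-16LE', 'iso-8859-1'), it returns Latin-1 bytes with the lines de-duplicated and sorted.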
Re^3: script inserts \x00 bytes on WinXP by dwhite20899 (Friar) on Sep 06, 2008 at 01:38 UTC
The output file contains one huge line of

\x{4445}\x{3035}\x{533b}\x{676f}\x{203b}\x{2e34}\x{4e37}\x{6475}\x{736f}\x{203b}\x{6f43}

Seriously, an "od -c" of the output file shows this:

0000000 \ x { 7 6 4 5 } \ x { 6 e 6 5 }
0000020 \ x { 6 f 7 4 } \ x { 4 2 2 0 }
0000040 \ x { 7 1 7 5 } \ x { 6 5 7 5 }
0000060 \ x { 2 0 3 b } \ x { 5 f 5 2 }
0000100 \ x { 7 5 5 4 } \ x { 6 2 7 2 }

I actually got my hands on a WinXP machine running ActivePerl and ran the script with this code on XP, and got the above output.

Horrible thing is (for my friend) when I run my original script on XP, it works like I expect - NOT producing the \x00 bytes! I don't understand this, and I can't replicate it, and I can't visit him (10 time zones away) to see WTF is going on.
Re^4: script inserts \x00 bytes on WinXP by ikegami (Patriarch) on Sep 06, 2008 at 01:53 UTC
Re^2: script inserts \x00 bytes on WinXP by dwhite20899 (Friar) on Sep 06, 2008 at 01:45 UTC
I just get lines of question marks (joined by \n) as output.

???????????????????????????????????????????????
????????????????????????????????????????????????
????????????????????????????????????????????????
Re^3: script inserts \x00 bytes on WinXP by Anonymous Monk on Sep 06, 2008 at 07:43 UTC
Try the hexdump/od utilities:

perl blah ... | hexdump

or

perl blah ... | od -tacx1

You can get them from
http://gnuwin32.sourceforge.net/packages/hextools.htm
http://gnuwin32.sourceforge.net/packages/coreutils.htm
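If installing those tools on the WinXP box is not an option, a rough equivalent can be sketched in Perl itself. The function name hex_head and the 32-byte default are arbitrary choices for this example; it only uses core built-ins (open, binmode, read, unpack).

```perl
use strict;
use warnings;

# Return the first $count bytes of $path as space-separated hex pairs.
sub hex_head {
    my ($path, $count) = @_;
    open(my $fh, '<', $path) or die "$path: $!\n";
    binmode $fh;                       # no CRLF translation on Windows
    my $got = read($fh, my $buf, $count);
    defined $got or die "$path: read failed: $!\n";
    close $fh;
    return join ' ', unpack('(H2)*', $buf);
}

# Usage: perl hexhead.pl somefile
print hex_head($ARGV[0], 32), "\n" if @ARGV;
```

Running it on both the input and the output file shows whether the \x00 bytes are already present before the script touches the data.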
Re^4: script inserts \x00 bytes on WinXP by dwhite20899 (Friar) on Sep 06, 2008 at 11:24 UTC
Re^5: script inserts \x00 bytes on WinXP by Anonymous Monk on Sep 06, 2008 at 11:56 UTC
Re: script inserts \x00 bytes on WinXP by graff (Chancellor) on Sep 06, 2008 at 00:48 UTC
Since you have "binmode" where it belongs on both file handles, and it works on macosx, I have to assume the problem on the WinXP system is either a brain-damaged configuration around the perl installation on that box, or something else affecting the output file after perl has finished writing it, or else the input file happens to already have the same set of null bytes as you find in the output file.

Have you checked the input file? What exactly happens to the output file after perl writes it? What is the first thing used to open and inspect the file's content?

As mentioned in a previous reply, when there is a null byte next to each character byte, it's a sure sign of UTF-16 encoding; if the initial byte is null (or the first two bytes are "\xFE \xFF") it's UTF-16BE; if the second byte is null (or the first two bytes are "\xFF \xFE") it's UTF-16LE.

You can do a simple test on the victim's WinXP box to see if perl is brain-damaged -- e.g.:

#!/usr/bin/perl
$out = "test\xb0 test";
open(O, ">test.txt") or die "$!";
binmode O;
print O $out;
close O;
$s = -s "test.txt";
print "wrote ".length($out)." bytes to test.txt; file size is $s bytes\n";

The report should show the same number of bytes for the string length and the file size (and that should be 10); next you check the file by other means, and see whether, at some point, its contents change when you open it with some particular windows tool.

As for your script, I don't understand why you want to have three copies of the file data in memory (single slurped scalar, array of lines, hash of lines). Why not do it like this?

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;

our ( $opt_i, $opt_o );
my ( $ifh, $ofh );

getopts( 'i:o:' ) and $opt_i and $opt_o
    or die "Usage: $0 -i infile -o outfile\n";

warn "reading \"$opt_i\" and writing to \"$opt_o\"\n";

open( $ifh, "<", $opt_i ) or die "$opt_i: $!\n";
open( $ofh, ">", $opt_o ) or die "$opt_o: $!\n";
binmode $ifh;
binmode $ofh;

my %lines;
$lines{$_}++ while (<$ifh>);

my $line_count = 0;
for (sort keys %lines) {
    print $ofh $_;
    $line_count += $lines{$_};
}
close $ofh or die "error on closing output file: $!\n";

warn "read $line_count lines from $opt_i, wrote ".scalar(keys %lines).
     " lines to $opt_o\n";

(Note the extra conditions after getopts: it will return true if no option flags are given at all -- that's why they are called "option" flags...)

Update: after posting, I noticed that the OP code reported input and output line counts, so I added stuff to my version of the script accordingly.
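The UTF-16 rule of thumb in the reply above (BOM first, then the position of the null byte) can be sketched as a small check on the first two bytes of a file. The function name guess_utf16 is hypothetical, and this is only a heuristic under the stated rule: it says nothing about files that start with a non-Latin character or contain no BOM and no nulls.

```perl
use strict;
use warnings;

# Guess the UTF-16 flavour from the first two bytes of a byte string.
# Returns 'UTF-16BE', 'UTF-16LE', or undef when neither pattern matches.
sub guess_utf16 {
    my ($bytes) = @_;
    return if length($bytes) < 2;
    my ($b0, $b1) = unpack('C2', $bytes);

    return 'UTF-16BE' if $b0 == 0xFE && $b1 == 0xFF;   # big-endian BOM
    return 'UTF-16LE' if $b0 == 0xFF && $b1 == 0xFE;   # little-endian BOM
    return 'UTF-16BE' if $b0 == 0x00 && $b1 != 0x00;   # initial byte is null
    return 'UTF-16LE' if $b0 != 0x00 && $b1 == 0x00;   # second byte is null
    return;
}
```

Feeding it the first bytes of the friend's input file (read with binmode on) would show whether the data is already UTF-16 before the script runs.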
Re^2: script inserts \x00 bytes on WinXP by dwhite20899 (Friar) on Sep 06, 2008 at 01:59 UTC
graff:

I was only provided a snippet of the real input file; I don't know if I have the actual first two bytes of the file. I'll ask my friend about that.

What should happen to the output file is that another perl script is used to sort the lines by date/time for a different view of the data. My friend has looked at it with Notepad, Notepad2 and a hex editor, and he says it comes out of the undupe script padded with \x00 bytes.

My OSX passes the braindead test, so does my WinXP which is supposedly now set up the way his is. I'll send him that test, thanks!

3 copies - because I'm an idiot, and have RAM to burn. :-) That is a very nice way you did it. That should scale up nicely.
Re^3: script inserts \x00 bytes on WinXP by Anonymous Monk on Sep 06, 2008 at 07:47 UTC
Maybe he has funny stuff in PERL5OPT; try comparing the output of "set" on both machines and cross-referencing it with perlrun.
Re^4: script inserts \x00 bytes on WinXP by dwhite20899 (Friar) on Sep 06, 2008 at 13:30 UTC
Re: script inserts \x00 bytes on WinXP by gok8000 (Scribe) on Sep 05, 2008 at 20:21 UTC
I've looked at a comparative ASCII table between Mac and Win. x00 is the null on both platforms. xb0 is infinity on Mac, but has no correspondence on Win. So probably on Win x00 comes out in place of xb0. A solution might be to globally substitute xb0 with x27, the "degree", before any processing. And probably you will be off.
Re^2: script inserts \x00 bytes on WinXP by ikegami (Patriarch) on Sep 05, 2008 at 22:28 UTC
Byte 0xB0 exists on Windows as well. It may not display the same, but instances of 0xB0 don't suddenly become 0x00 based on the OS being used to read the file.