in reply to script inserts \x00 bytes on WinXP

Since you have "binmode" where it belongs on both file handles, and it works on Mac OS X, I have to assume the problem on the WinXP system is one of three things: a brain-damaged configuration around the perl installation on that box, something else affecting the output file after perl has finished writing it, or an input file that already has the same set of null bytes as you find in the output file.

Have you checked the input file? What exactly happens to the output file after perl writes it? What is the first thing used to open and inspect the file's content? As mentioned in a previous reply, when there is a null byte next to each character byte, it's a sure sign of UTF-16 encoding; if the initial byte is null (or the first two bytes are "\xFE\xFF") it's UTF-16BE; if the second byte is null (or the first two bytes are "\xFF\xFE") it's UTF-16LE.
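
If it helps, here is a minimal sketch of that byte-order check, so the file can be classified before any Windows tool touches it. The sub name guess_utf16 and its return labels are made up for this example, and the no-BOM branch only works when the first character is plain ASCII, so treat it as a rough heuristic rather than a real encoding detector.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Guess at UTF-16 from the first two raw bytes of a file.
# Hypothetical helper: the name and the labels are illustrative only.
sub guess_utf16 {
    my ($head) = @_;
    return "too short to tell" if length($head) < 2;

    # Byte-order marks come first: FE FF is big-endian, FF FE little-endian.
    return "UTF-16BE (BOM)" if substr($head, 0, 2) eq "\xFE\xFF";
    return "UTF-16LE (BOM)" if substr($head, 0, 2) eq "\xFF\xFE";

    # No BOM: see which byte of the first pair is null.
    my ($b0, $b1) = unpack("C2", $head);
    return "UTF-16BE (no BOM)" if $b0 == 0 && $b1 != 0;
    return "UTF-16LE (no BOM)" if $b0 != 0 && $b1 == 0;
    return "no UTF-16 pattern";
}

# Usage: perl guess.pl somefile
if (@ARGV) {
    my $file = shift;
    open(my $fh, "<", $file) or die "$file: $!\n";
    binmode $fh;
    my $head = "";
    defined( read($fh, $head, 2) ) or die "$file: $!\n";
    close $fh;
    print "$file: ", guess_utf16($head), "\n";
}
```

Running it on both the input file and the output file should show whether the null bytes were already there before the script ran.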

You can do a simple test on the victim's WinXP box to see if perl is brain-damaged -- e.g.:

#!/usr/bin/perl
$out = "test\xb0 test";
open(O, ">test.txt") or die "$!";
binmode O;
print O $out;
close O;
$s = -s "test.txt";
print "wrote ".length($out)." bytes to test.txt; file size is $s bytes\n";
The report should show the same number of bytes for the string length and the file size (and that should be 10); next you check the file by other means, and see whether, at some point, its contents change when you open it with some particular windows tool.
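
Another thing that could be run on that box, assuming perl 5.8 or later: ask perl which I/O layers it puts on an output handle, and whether the PERL5OPT or PERLIO environment variables are set. This is only a sketch; the sub name layers_of and the scratch file name layers_test.txt are made up for the example. PerlIO::get_layers is the documented built-in for listing layers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the PerlIO layers on a handle as one space-separated string.
# Hypothetical helper name, used only for this illustration.
sub layers_of {
    my ($fh) = @_;
    return join(" ", PerlIO::get_layers($fh));
}

# Environment variables that can change perl's default behavior.
for my $var (qw(PERL5OPT PERLIO)) {
    printf "%s=%s\n", $var, defined $ENV{$var} ? $ENV{$var} : "(not set)";
}

# Open a scratch file the same way the script does and report its layers.
open(my $ofh, ">", "layers_test.txt") or die "layers_test.txt: $!\n";
print "layers before binmode: ", layers_of($ofh), "\n";
binmode $ofh;
print "layers after binmode:  ", layers_of($ofh), "\n";
close $ofh;
unlink "layers_test.txt";
```

If the report differs between the machine that works and the one that doesn't, especially if anything that looks like an encoding layer shows up, that difference would be worth chasing.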

As for your script, I don't understand why you want to have three copies of the file data in memory (single slurped scalar, array of lines, hash of lines). Why not do it like this?

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;

our ( $opt_i, $opt_o );
my ( $ifh, $ofh );

getopts( 'i:o:' ) and $opt_i and $opt_o
    or die "Usage: $0 -i infile -o outfile\n";

warn "reading \"$opt_i\" and writing to \"$opt_o\"\n";

open( $ifh, "<", $opt_i ) or die "$opt_i: $!\n";
open( $ofh, ">", $opt_o ) or die "$opt_o: $!\n";
binmode $ifh;
binmode $ofh;

my %lines;
$lines{$_}++ while (<$ifh>);

my $line_count = 0;
for (sort keys %lines) {
    print $ofh $_;
    $line_count += $lines{$_};
}
close $ofh or die "error on closing output file: $!\n";

warn "read $line_count lines from $opt_i, wrote ".scalar(keys %lines).
     " lines to $opt_o\n";
(Note the extra conditions after getopts: it will return true if no option flags are given at all -- that's why they are called "option" flags...)

Update: after posting, I noticed that the OP code reported input and output line counts, so I added stuff to my version of the script accordingly.

Re^2: script inserts \x00 bytes on WinXP
by dwhite20899 (Friar) on Sep 06, 2008 at 01:59 UTC
    graff:

    I was only provided a snippet of the real input file; I don't know if I have the actual first two bytes of the file. I'll ask my friend about that.

    What *should* happen to the output file is that another perl script is used to sort the lines by date/time for a different view of the data. My friend has looked at it with Notepad, Notepad2 and a hex editor, and he says it comes out of the undupe script padded with \x00 bytes.

    My OSX passes the braindead test, so does my WinXP which is supposedly now set up the way his is. I'll send him that test, thanks!

    3 copies - because I'm an idiot, and have RAM to burn. :-) That is a very nice way you did it. That should scale up nicely.

      Maybe he has funny stuff in PERL5OPT. Try comparing the output of "set" on both machines and cross-referencing it with perlrun.
        He says he's not doing anything custom, and I believe him. D'oh! I'll ask him for his "set" output. :-)