I'm writing a script for a friend which takes a file as input, compares lines, and writes the sorted, de-duplicated lines to a file (effectively "sort file | uniq").

When I run it on OSX, it works fine; when he runs it on WinXP, the output file contains \x00 as every other byte.

The input has a few \xb0 characters. I've tried binmode, and I've tried s/[^[:ascii:]]/ /g to restrict the data to ASCII, but the \x00 characters continue to be inserted into the output.

What am I missing? I don't think it's a UTF8 issue...

Thanks, Doug

complete script

#!/usr/bin/perl -w
# This script expects two file names for input and output.
# it reads in the whole input file, separates lines based on \n character,
# gets rid of duplicate lines, sorts the remaining lines,
# and prints them to a file.

use strict;
use Getopt::Std;
use vars qw( $content @list %seen @uniqu @sorted $opt_i $opt_o );

# get the in/out file names and do some error checks
getopts('i:o:') or die "Usage : $0 -i infile -o outfile\n";
print "reading \"$opt_i\" and writing to \"$opt_o\"\n";
(-e "$opt_i") or die "Usage : $0 -i infile -o outfile\n";
($opt_o) or die "Usage : $0 -i infile -o outfile\n";
if (-e "$opt_o") { die "$0 : will not write to existing file \"$opt_o\"\n"; }

# read in the input file
undef $/;
open(FIN,"$opt_i") or die "$0 : cannot read input file \"$opt_i\"\n";
binmode(FIN);
$content = <FIN>;
close(FIN);
@list = split(/\n/,$content);
print "read ", scalar(@list), " lines in from \"$opt_i\", ";

# get rid of the duplicate lines
%seen = ();
@uniqu = grep { ! $seen{$_} ++ } @list;

# sort with "cmp" is alphabetic, with "<=>" is numeric
@sorted = sort { $a cmp $b } @uniqu;

# print the sorted, unique lines to a file
print "writing ", scalar(@sorted), " lines out to \"$opt_o\"\n";
open(FOUT,">$opt_o") or die "$0 : cannot write output file \"$opt_o\"\n";
binmode(FOUT);
print FOUT join("\n", @sorted);
close(FOUT);

exit;

sample data; the "degree" character is \xb0.

Loremn Ipsum; Tseribow;26/06/04 16:28; 49°17'010N - 050°36'073W; WGS84;Seg; 11.1Tidos; Fog; 186.6°;
Loremn Ipsum; Tseria;25/08/07 23:16; 43°49'528S - 065°29'077E; ED50;Seg; 4.1Tidos; Fog; 132.8°;
Loremn Ipsum; Tseribow;26/06/04 16:48; 39°16'733N - 040°36'086W; WGS84;Seg; 10.7Tidos; Fog; 207.7°;

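A quick way to see what is really on disk is to dump the first few bytes of a file in hex, on both machines. This is only a minimal sketch: hex_head is a hypothetical helper name, not part of the script above, and 32 bytes is an arbitrary choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first $count bytes of a file as space-separated hex pairs.
# hex_head is a hypothetical helper, not part of the original script.
sub hex_head {
    my ($path, $count) = @_;
    open(my $fh, '<', $path) or die "$0 : cannot read \"$path\": $!\n";
    binmode($fh);                 # raw bytes, no CRLF translation on Windows
    my $buf = '';
    read($fh, $buf, $count);      # may return fewer bytes for a short file
    close($fh);
    return join(' ', map { sprintf('%02x', ord($_)) } split(//, $buf));
}

# Usage: perl hexhead.pl somefile
print hex_head($ARGV[0], 32), "\n" if @ARGV;
```

If every other pair in the dump is 00, the file is most likely stored as two-byte characters (UTF-16) rather than one byte per character, and the place to look is whatever wrote that file.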
Update:

It's Saturday, and I have good questions to ask Jack on Monday, so I'm going to put this on the back burner until I get some feedback from him. Thanks, everyone!

Final update:

He was running another piece of software that changed the output to Unicode. D'oh! But what I learned here helped to deal with that issue.
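For anyone who lands here with the same symptom, here is a minimal sketch of one way to cope with such a file, assuming the "Unicode" data is UTF-16 with a byte-order mark at the start (an assumption; I have not confirmed what that other program writes). decode_guess is a hypothetical helper name, and falling back to Latin-1 when there is no BOM is also an assumption, chosen because \xb0 is the degree sign there.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode raw file bytes into Perl characters.
# Assumption: UTF-16 data starts with a BOM (ff fe or fe ff);
# anything else is treated as ISO-8859-1 (Latin-1).
# decode_guess is a hypothetical helper, not part of the original script.
sub decode_guess {
    my ($bytes) = @_;
    if ($bytes =~ /\A(?:\xff\xfe|\xfe\xff)/) {
        # 'UTF-16' uses the BOM to pick the byte order and drops it
        return decode('UTF-16', $bytes);
    }
    return decode('ISO-8859-1', $bytes);
}
```

The decoded string can then be split, de-duplicated, and sorted as before; the output file should be written with an explicit encoding layer rather than raw.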


In reply to script inserts \x00 bytes on WinXP by dwhite20899
