in reply to script inserts \x00 bytes on WinXP

I don't think it's a UTF8 issue...

You're probably right; it looks more like a UTF-16 (a.k.a. Windows Unicode) issue. See Encode and Encode::Supported.

Try the following in your script:

    use Encode qw(from_to);
    ...
    from_to( $_, 'UTF-16LE', 'latin-1') for @sorted;
    print FOUT join("\n", @sorted);

Change the 'latin-1' to the actual encoding you want.
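For anyone who wants to try the suggestion in isolation, here is a minimal self-contained sketch. The sample strings are made up, and it assumes @sorted really does hold UTF-16LE octets, which may not be true of the original script's data.

```perl
use strict;
use warnings;
use Encode qw(encode from_to);

# Hypothetical sample data: two strings stored as UTF-16LE octets.
# Every ASCII character is followed by a \x00 byte in this form.
my @sorted = map { encode('UTF-16LE', $_) } ('alpha', 'beta');

# from_to() converts octets to octets, in place, one element at a time.
from_to($_, 'UTF-16LE', 'latin-1') for @sorted;

# The \x00 bytes are gone after the conversion.
print join("\n", @sorted), "\n";
```

If the data is not actually UTF-16LE, from_to will still run, but the result will be garbage rather than readable text.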

Replies are listed 'Best First'.
Re^2: script inserts \x00 bytes on WinXP
by ikegami (Patriarch) on Sep 05, 2008 at 22:37 UTC
    I think you meant that as a debugging tool? Why else would you do the conversion so late? Here's what the final code should probably look like:
    ...
    my $content;
    {
        open(my $fin, '<:encoding(UTF-16le)', $opt_i)
            or die "$0 : cannot read input file \"$opt_i\"\n";
        local $/;
        $content = <$fin>;
    }
    ...
    {
        open(my $fout, '>:encoding(iso-latin-1)', $opt_o)
            or die "$0 : cannot write output file \"$opt_o\"\n";
        print $fout join("\n", @sorted);
    }
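A runnable sketch of the same idea, decoding on input and encoding on output. The temp files stand in for $opt_i and $opt_o, and the split/sort step is a guess at what the elided "..." does, so treat both as assumptions rather than the original script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-ins for the script's $opt_i and $opt_o.
my ($tmp_in,  $opt_i) = tempfile(UNLINK => 1);
my ($tmp_out, $opt_o) = tempfile(UNLINK => 1);
close $tmp_in;
close $tmp_out;

# Create a small UTF-16LE input file to work with (sample data, made up).
{
    open(my $fh, '>:raw:encoding(UTF-16LE)', $opt_i)
        or die "$0 : cannot create sample file \"$opt_i\": $!\n";
    print $fh "pear\napple\n";
}

# Decode while reading: $content holds characters, not UTF-16 octets.
my $content;
{
    open(my $fin, '<:encoding(UTF-16LE)', $opt_i)
        or die "$0 : cannot read input file \"$opt_i\"\n";
    local $/;
    $content = <$fin>;
}

# Assumed processing step; the thread elides what really happens here.
my @lines  = split /\n/, $content;
my @sorted = sort @lines;

# Encode while writing: the output file gets Latin-1 octets.
{
    open(my $fout, '>:encoding(iso-latin-1)', $opt_o)
        or die "$0 : cannot write output file \"$opt_o\"\n";
    print $fout join("\n", @sorted);
}

# Read the result back as raw bytes and look for NULs.
open(my $check, '<:raw', $opt_o) or die "$0 : cannot reread \"$opt_o\": $!\n";
my $bytes = do { local $/; <$check> };
close $check;
print $bytes =~ /\x00/ ? "NUL bytes present\n" : "no NUL bytes\n";
```

Doing the conversion at the I/O boundary means everything in between (sorting, matching) works on characters instead of on byte pairs.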
      the output file contains one huge line of

      \x{4445}\x{3035}\x{533b}\x{676f}\x{203b}\x{2e34}\x{4e37}\x{6475}\x{736f}\x{203b}\x{6f43}

      Seriously, an "od -c" of the output file shows this:

      0000000 \ x { 7 6 4 5 } \ x { 6 e 6 5 }
      0000020 \ x { 6 f 7 4 } \ x { 4 2 2 0 }
      0000040 \ x { 7 1 7 5 } \ x { 6 5 7 5 }
      0000060 \ x { 2 0 3 b } \ x { 5 f 5 2 }
      0000100 \ x { 7 5 5 4 } \ x { 6 2 7 2 }

      I actually got my hands on a WinXP machine running ActivePerl and ran the script with this code on XP, and got the above output.

      Horrible thing is (for my friend) when I run my original script on XP, it works like I expect - NOT producing the \x00 bytes! I don't understand this, and I can't replicate it, and I can't visit him (10 time zones away) to see WTF is going on.

        It's not Perl. The input file he used is different than yours, the output file he gave you is not the output he really got, or the script he used is different. The most likely explanation is that somewhere before or after the script was run, the input or output file was converted to UTF-16.
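One way to see how a literal \x{7645} could end up in the output: splitting the code points in the od dump into low and high bytes gives plain ASCII text, which suggests the input on that machine was not UTF-16 at all. The sketch below assumes that, and uses a four-byte sample that matches the start of the dump. It also assumes the encoding layer falls back to Perl-style escapes for unmappable characters, which is my reading of the PerlIO::encoding docs, so here the fallback is requested explicitly.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Plain ASCII octets, i.e. input that was never UTF-16 (assumed sample).
my $octets = 'Even';

# Decoding as UTF-16LE pairs the bytes up: "Ev" becomes one character,
# "en" another. The first byte of each pair is the low byte.
my $chars = decode('UTF-16LE', $octets);
my @code_points = map { sprintf 'U+%04X', ord($_) } split //, $chars;
print "@code_points\n";    # U+7645 U+6E65

# Latin-1 cannot represent those characters. With the PERLQQ fallback,
# each one is written out as a \x{....} escape instead.
my $out = encode('iso-8859-1', $chars, Encode::FB_PERLQQ());
print "$out\n";
```

The second line printed has the same shape as the output quoted above, which fits the explanation that the files differ, not Perl.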
Re^2: script inserts \x00 bytes on WinXP
by dwhite20899 (Friar) on Sep 06, 2008 at 01:45 UTC
    I just get lines of question marks (joined by \n) as output.
    ???????????????????????????????????????????????
    ????????????????????????????????????????????????
    ????????????????????????????????????????????????
        Anon - od -c shows the file is literally full of only "?" chars and a few "\n" chars.
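The question marks are what from_to produces when the data was not UTF-16LE to begin with. A minimal sketch, assuming a plain ASCII line as input (the sample string is made up):

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Plain ASCII octets that are wrongly treated as UTF-16LE (assumed sample).
my $line = 'Even';

# Each byte pair decodes to a character Latin-1 cannot represent, and
# from_to's default behaviour substitutes "?" for every such character.
from_to($line, 'UTF-16LE', 'latin-1');

print "$line\n";    # ??
```

Four input bytes become two question marks, so the output is about half the length of the input, which is a quick way to tell this case apart from real data loss.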