in reply to Mysterious Whitespaces between each character in a file

(Presuming the file actually is in UCS-2le or UTF-16le encoding (which is likely) ...)

If you need/want to stick with 5.6.1, you could use the following crude hack:

$/ = "\n\0";
while (my $line = <>) {
    print pack("C*", map $_ & 0xff, unpack("v*", $line));
}

This simply drops the high byte of each 16-bit character. The extra "spaces" you see are actually those high bytes, which are zero for every character with an ordinal value <= 0xff. As the sample text you've shown seems to contain only plain ASCII characters, this approach should work pretty well.
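To make the byte-level effect concrete, here is a minimal sketch of the same unpack/pack step applied to an in-memory string instead of a file. The sample string is made up for illustration; it stands in for one line of the real input.

```perl
use strict;
use warnings;

# Hypothetical sample: "Hi\n" as UTF-16LE, i.e. each ASCII byte followed by a zero byte.
my $utf16le = "H\0i\0\n\0";

# unpack "v*" reads little-endian 16-bit values; "& 0xff" keeps only the low byte,
# and pack "C*" writes each result back out as a single byte.
my $bytes = pack("C*", map { $_ & 0xff } unpack("v*", $utf16le));

print $bytes;    # prints "Hi" followed by a newline (3 bytes instead of 6)
```

Six input bytes become three output bytes, which is why the "spaces" disappear.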

Another option with 5.6.1 would be the module Unicode::String:

use Unicode::String qw(utf16le);

$/ = "\n\0";
while (my $line = <>) {
    print utf16le($line)->latin1();
    # or, if you want UTF-8 output:
    # print utf16le($line)->utf8();
}

The problem with Unicode::String is that it doesn't ship with 5.6.1 by default, so you'd somehow have to get hold of it (for v5.6.1!), or build it yourself. OTOH, as Unicode::String is an XS module that needs a working compiler environment set up, etc., I would not recommend the latter (unless you're familiar with the procedure...). It's most likely easier to use the crude hack...

(I tried both approaches with an old perl-5.6.0, so I'm pretty sure they should work with 5.6.1, too)

Re^2: Mysterious Whitespaces between each character in a file (hack for 5.6.x)
by ikegami (Patriarch) on Oct 08, 2009 at 16:43 UTC

    $/="\n\0"; will fail if the file contains character U+0Axx followed by U+yy00 (for any values "xx" and "yy").

    Also, you should replace characters outside iso-latin-1 with some fixed character (such as "?") rather than some random character.

    This fixes both problems:

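    local $/ = "\x0A\x00";
    for ( my $line = ''; defined( $_ = <> ); $line = '' ) {
        $line .= $_;
        # Odd length: the separator matched across two code units, so this is
        # not a real line end. redo does not re-evaluate the loop condition,
        # so read the next chunk here before restarting the block.
        redo if length($line) % 2 != 0 && defined( $_ = <> );
        # Replace anything outside iso-latin-1 with "?" (pack 'C' needs a number).
        print pack 'C*', map { $_ <= 0xFF ? $_ : ord('?') } unpack 'v*', $line;
        # -or-
        # print utf16le($line)->latin1();
    }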

    (Assumes each file in @ARGV is properly formed, i.e. contains an even number of bytes.)
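
    Both problems can be reproduced on a few bytes. The sketch below uses made-up code points (U+0A41, U+4100, U+0141) purely for illustration: it shows the separator bytes turning up at an odd offset, and the difference between masking and substituting.

    ```perl
    use strict;
    use warnings;

    # Hypothetical input: U+0A41, then U+4100, then a real newline, as UTF-16LE.
    my $data = pack("v*", 0x0A41, 0x4100, 0x000A);    # bytes: 41 0A 00 41 0A 00

    # The first 0A 00 pair starts at an odd byte offset: it straddles two code units.
    my $false_hit = index($data, "\x0A\x00");                    # 1
    # The real line terminator starts at an even offset.
    my $real_hit  = index($data, "\x0A\x00", $false_hit + 1);    # 4

    # A character outside iso-latin-1: masking yields an unrelated character,
    # substituting yields a fixed placeholder.
    my $cp          = 0x0141;
    my $masked      = chr($cp & 0xFF);                           # "A"
    my $substituted = chr($cp <= 0xFF ? $cp : ord('?'));         # "?"

    print "false hit at $false_hit, real terminator at $real_hit\n";
    print "masked: $masked, substituted: $substituted\n";
    ```

    A chunk that ends at the false hit has an odd length, which is exactly what the length check in the loop above detects.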

      $/="\n\0"; will fail if the file contains character U+0Axx followed by U+yy00

      This is correct (same holds for "\x0A\x00", btw).  However, as U+0Axx is Gurmukhi/Gujarati, this is rather unlikely to happen in the OP's case... (Also, a characteristic of a "crude hack" (as I called it) is, that it would work in most practical cases, but isn't failsafe, theoretically).

        same holds for "\x0A\x00", btw

        Yes. It's not the (purely æsthetic) switch from \n to \x0A that solves the problem, it's the redo.

        a characteristic of a "crude hack" (as I called it) is, that it would work in most practical cases

        Yes, which is why I wouldn't have mentioned it if I hadn't already been replying about map $_ & 0xff, being worthless (harmful?).