Re: Mysterious Whitespaces between each character in a file (hack for 5.6.x)

(Presuming the file actually is in UCS-2le or UTF-16le encoding (which is likely) ...)

If you need/want to stick with 5.6.1, you could use the following crude hack:

$/="\n\0";

while (my $line = <>) {
    print pack("C*", map $_ & 0xff, unpack("v*",$line));
}

This simply drops the high byte of each 16-bit character. What appears as extra "spaces" are actually those high bytes, which are zero for all chars with ordinal value <= 0xff. As the sample text you've shown only seems to contain plain ASCII characters, this approach should work pretty well.
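To make the byte-level effect concrete, here is a minimal sketch that applies the same unpack/pack step to a hand-built UTF-16le string. The sub name `strip_high_bytes` and the sample string are made up for illustration. They are not part of the hack itself.

```perl
use strict;
use warnings;

# Hypothetical helper: the same transformation as in the loop,
# wrapped in a sub so it can be tried on an in-memory string.
sub strip_high_bytes {
    my ($utf16le) = @_;

    # "v*" reads unsigned 16-bit little-endian values, "& 0xff" keeps
    # only the low byte, "C*" writes each value back as a single byte.
    return pack("C*", map { $_ & 0xff } unpack("v*", $utf16le));
}

# "Hi\n" encoded as UTF-16le: every ASCII byte is followed by a zero byte.
my $sample = "H\0i\0\n\0";
my $plain  = strip_high_bytes($sample);

printf "%d bytes in, %d bytes out\n", length($sample), length($plain);
print $plain;
```

Note that the last two bytes of the sample are exactly the `"\n\0"` that `$/` is set to, which is why the record separator has to be two bytes long here.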

Another option with 5.6.1 would be the module Unicode::String:

use Unicode::String qw(utf16le);

$/="\n\0";

while (my $line = <>) {
    print utf16le($line)->latin1();

    # or, if you want UTF-8 output:
    # print utf16le($line)->utf8();
}

The problem with Unicode::String is that it doesn't ship with 5.6.1 by default, so you'd somehow have to get hold of it (for v5.6.1!), or build it yourself. OTOH, as Unicode::String is an XS module that needs a working compiler environment set up, etc., I would not recommend the latter (unless you're familiar with the procedure...). It's most likely easier to use the crude hack...

(I tried both approaches with an old perl-5.6.0, so I'm pretty sure they should work with 5.6.1, too)

Replies are listed 'Best First'.
Re^2: Mysterious Whitespaces between each character in a file (hack for 5.6.x) by ikegami (Patriarch) on Oct 08, 2009 at 16:43 UTC
`$/="\n\0";` will fail if the file contains a character U+0Axx followed by U+yy00 (for any values "xx" and "yy"). Also, you should replace characters outside iso-latin-1 with some fixed character (such as "?") rather than some random character. This fixes both problems:

local $/ = "\x0A\x00";

for ( my $line = ''; defined( $_ = <> ); $line = '' ) {
    $line .= $_;

    # Odd length: the separator matched across a character boundary.
    # redo does not re-evaluate the loop condition, so read the next
    # chunk explicitly before restarting the block.
    if ( length($line) % 2 != 0 ) {
        last unless defined( $_ = <> );
        redo;
    }

    print pack 'C*', map { $_ <= 0xFF ? $_ : ord('?') } unpack 'v*', $line;

    # -or-
    # print utf16le($line)->latin1();
}

(Assumes each file in @ARGV is properly formed, i.e. contains an even number of bytes.)
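A small sketch of the failure case described above, for anyone who wants to see it at the byte level. The sub name `first_record_length` and the two code points are chosen purely for illustration. The sub only imitates where `<>` would end the first record and does no file I/O.

```perl
use strict;
use warnings;

# Hypothetical helper: length of the first record that a read with
# $/ set to $sep would return for the byte string $data.
sub first_record_length {
    my ($data, $sep) = @_;
    my $pos = index($data, $sep);
    return $pos < 0 ? length($data) : $pos + length($sep);
}

# U+0A05 followed by U+4100, encoded as UTF-16le (low byte first).
# The resulting bytes are 05 0A 00 41, and the text contains no newline.
my $data = pack("v*", 0x0A05, 0x4100);

my $len = first_record_length($data, "\x0A\x00");

# The separator is found at byte offset 1, in the middle of the first
# character, so the "line" is 3 bytes long.
printf "record length %d (%s)\n", $len, $len % 2 ? "odd" : "even";
```

An odd record length is the sign that the split happened inside a character, which is what the length check in the loop tests for.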
Re^3: Mysterious Whitespaces between each character in a file (hack for 5.6.x) by almut (Canon) on Oct 08, 2009 at 17:08 UTC
> `$/="\n\0";` will fail if the file contains character U+0Axx followed by U+yy00

This is correct (the same holds for `"\x0A\x00"`, btw). However, as `U+0Axx` is Gurmukhi/Gujarati, this is rather unlikely to happen in the OP's case... (Also, a characteristic of a "crude hack" (as I called it) is that it works in most practical cases, but isn't failsafe, theoretically.)
Re^4: Mysterious Whitespaces between each character in a file (hack for 5.6.x) by ikegami (Patriarch) on Oct 08, 2009 at 17:16 UTC
> same holds for "\x0A\x00", btw

Yes. It's not the (purely æsthetic) switch from \n to \x0A that solves the problem, it's the `redo`.

> a characteristic of a "crude hack" (as I called it) is, that it would work in most practical cases

Yes, which is why I wouldn't have mentioned it if I hadn't already been replying about `map $_ & 0xff,` being worthless (harmful?).
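The difference between masking and replacing can be seen in a short sketch. The sub names `mask_low_byte` and `replace_wide` and the sample characters are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers contrasting the two approaches discussed in the thread.
sub mask_low_byte {
    my ($utf16le) = @_;
    # Keeps only the low byte, whatever the high byte was.
    return pack("C*", map { $_ & 0xff } unpack("v*", $utf16le));
}

sub replace_wide {
    my ($utf16le) = @_;
    # Characters outside iso-latin-1 become a fixed "?".
    return pack("C*", map { $_ <= 0xFF ? $_ : ord("?") } unpack("v*", $utf16le));
}

# U+0141 (outside iso-latin-1) followed by a plain "a", as UTF-16le.
my $text = pack("v*", 0x0141, ord("a"));

print mask_low_byte($text), "\n";   # prints "Aa": U+0141 silently turns into "A"
print replace_wide($text),  "\n";   # prints "?a"
```

With masking, U+0141 comes out as an unrelated but plausible-looking "A". With replacement, the reader at least sees that something could not be represented.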
Re^5: Mysterious Whitespaces between each character in a file (hack for 5.6.x) by almut (Canon) on Oct 08, 2009 at 17:29 UTC
Re^6: Mysterious Whitespaces between each character in a file (hack for 5.6.x) by ikegami (Patriarch) on Oct 08, 2009 at 17:36 UTC