Mysterious Whitespaces between each character in a file

1wax has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Mysterious Whitespaces between each character in a file (hack for 5.6.x) by almut (Canon) on Oct 08, 2009 at 15:57 UTC
(Presuming the file actually is in UCS-2LE or UTF-16LE encoding, which is likely ...) If you need/want to stick with 5.6.1, you could use the following crude hack: `$/ = "\n\0"; while (my $line = <>) { print pack("C*", map { $_ & 0xff } unpack("v*", $line)); }` This simply removes all the high bytes (what appears as extra "spaces"; those are actually zero bytes for all chars with ordinal value <= 0xff). As the sample text you've shown only seems to contain plain ASCII characters, this approach should work pretty well. Another option with 5.6.1 would be the module Unicode::String: `use Unicode::String qw(utf16le); $/ = "\n\0"; while (my $line = <>) { print utf16le($line)->latin1(); }` (if you want UTF-8 output, use `print utf16le($line)->utf8();` instead). The problem with Unicode::String is that it doesn't ship with 5.6.1 by default, so you'd somehow have to get hold of it (for v5.6.1!), or build it yourself. OTOH, as Unicode::String is an XS module that needs a working compiler environment set up, etc., I would not recommend building it yourself (unless you're familiar with the procedure...). It's most likely easier to use the crude hack... (I tried both approaches with an old perl-5.6.0, so I'm pretty sure they should work with 5.6.1, too.)
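To make the "zero bytes" point concrete, here is a minimal sketch that applies the same pack/unpack idea to an in-memory string instead of a file. The helper name `utf16le_to_latin1_crude` and the sample bytes are made up for illustration; nothing here comes from the OP's data.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): keep only the low byte of
# every 16-bit little-endian code unit, as the crude hack does.
sub utf16le_to_latin1_crude {
    my ($bytes) = @_;
    return pack("C*", map { $_ & 0xff } unpack("v*", $bytes));
}

# "Hi\n" as UTF-16LE: every ASCII character is followed by a zero byte.
my $sample = "H\0i\0\n\0";

# Hex view of the raw bytes, the way a hex editor would show them.
my $hex = join ' ', map { sprintf '%02X', $_ } unpack("C*", $sample);
print "$hex\n";                              # 48 00 69 00 0A 00

# After the conversion only the low bytes remain.
print utf16le_to_latin1_crude($sample);      # Hi
```

The zero bytes are what a terminal or editor renders as the "mysterious whitespace" between characters.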
Re^2: Mysterious Whitespaces between each character in a file (hack for 5.6.x) by ikegami (Patriarch) on Oct 08, 2009 at 16:43 UTC
`$/="\n\0";` will fail if the file contains a character U+0Axx followed by U+yy00 (for any values "xx" and "yy"), because the bytes 0A 00 then straddle two characters. Also, you should replace characters outside iso-latin-1 with some fixed character (such as "?") rather than some random character. This fixes both problems: `local $/ = "\x0A\x00"; my $line = ''; while ( defined( my $chunk = <> ) ) { $line .= $chunk; next if length($line) % 2 != 0; print pack 'C*', map { $_ <= 0xFF ? $_ : ord('?') } unpack 'v*', $line; $line = ''; }` (-or- use `print utf16le($line)->latin1();` in place of the `print pack ...` statement.) (Assumes each file in @ARGV is properly formed, i.e. contains an even number of bytes.)
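The replacement rule can be sketched on its own, away from the line-reading loop. This is only an illustration under stated assumptions: the helper name `utf16le_to_latin1` is hypothetical, the input is treated as a sequence of 16-bit code units, and a character outside the BMP (a surrogate pair) would therefore come out as two "?" bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: convert UTF-16LE bytes to Latin-1, mapping every
# code unit above 0xFF to a literal '?' instead of truncating it.
sub utf16le_to_latin1 {
    my ($bytes) = @_;
    die "odd number of bytes\n" if length($bytes) % 2;
    return pack 'C*', map { $_ <= 0xFF ? $_ : ord('?') } unpack 'v*', $bytes;
}

# "A", e-acute (U+00E9), a Gurmukhi sign (U+0A41), newline.
my $in  = pack 'v*', 0x41, 0xE9, 0x0A41, 0x0A;
my $out = utf16le_to_latin1($in);

# Show the result as hex so the '?' (3F) is visible.
print join(' ', map { sprintf '%02X', $_ } unpack 'C*', $out), "\n";  # 41 E9 3F 0A
```

Note that `ord('?')` is needed because `pack 'C*'` expects numbers; passing the string '?' would be treated as 0.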
Re^3: Mysterious Whitespaces between each character in a file (hack for 5.6.x) by almut (Canon) on Oct 08, 2009 at 17:08 UTC
Quoting: "`$/="\n\0";` will fail if the file contains character U+0Axx followed by U+yy00". This is correct (the same holds for `"\x0A\x00"`, btw). However, as `U+0Axx` is Gurmukhi/Gujarati, this is rather unlikely to happen in the OP's case... (Also, a characteristic of a "crude hack", as I called it, is that it works in most practical cases but isn't failsafe in theory.)
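The failure case being discussed can be reproduced with three code units. The sketch below is illustrative only: the helper `aligned_newline_offsets` is a hypothetical name, and U+0A41 followed by U+4100 is just one arbitrary pair of the form U+0Axx, U+yy00.

```perl
use strict;
use warnings;

# Hypothetical helper: byte offsets where "\x0A\x00" starts on an even
# boundary, i.e. where it really is the code unit U+000A.
sub aligned_newline_offsets {
    my ($bytes) = @_;
    my @offsets;
    my $pos = 0;
    while ((my $i = index($bytes, "\x0A\x00", $pos)) >= 0) {
        push @offsets, $i if $i % 2 == 0;
        $pos = $i + 1;
    }
    return @offsets;
}

# U+0A41, U+4100, U+000A in UTF-16LE: bytes 41 0A 00 41 0A 00.
my $data = pack("v*", 0x0A41, 0x4100, 0x000A);

my $first = index($data, "\x0A\x00");        # what a plain $/ match finds
my @real  = aligned_newline_offsets($data);  # what is actually a newline

print "first raw match at byte $first\n";          # 1 (straddles two chars)
print "aligned newline(s) at byte(s) @real\n";     # 4
```

A record split at byte 1 has an odd length, which is exactly the condition the even-length check in the reply above tests for.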
Re^4: Mysterious Whitespaces between each character in a file (hack for 5.6.x) by ikegami (Patriarch) on Oct 08, 2009 at 17:16 UTC
Re^5: Mysterious Whitespaces between each character in a file (hack for 5.6.x) by almut (Canon) on Oct 08, 2009 at 17:29 UTC
Some notes below your chosen depth have not been shown here
Re: Mysterious Whitespaces between each character in a file by ikegami (Patriarch) on Oct 08, 2009 at 14:07 UTC
The file is encoded using UCS-2LE. `open(my $fh, '<:encoding(UCS-2LE)', $fn) or die "Can't open $fn: $!";` You'll need 5.8 or higher for the above command. Perl 5.6 didn't support Unicode and encodings well. Keep in mind that 5.6.1 is 8.5 years old, 5.8 is no longer maintained and 5.10.1 is out. Sorry, I can't help you with a 5.6 solution. Update: Added last paragraph
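For readers on Perl 5.8 or later, a round trip through the encoding layer might look like the sketch below. It assumes a Unix-like system (no CRLF translation) and writes its own sample file with File::Temp; the file contents and variable names are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UCS-2LE sample file (hypothetical test data): each
# character of the text becomes one 16-bit little-endian code unit.
my ($tmp, $fn) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} pack('v*', map { ord } split //, "one\ntwo\n");
close $tmp or die "close: $!";

# Read it back through the encoding layer; readline then works on
# decoded characters, so no special $/ is needed.
open(my $fh, '<:encoding(UCS-2LE)', $fn) or die "Can't open $fn: $!";
my @lines = <$fh>;
close $fh;

print scalar(@lines), " lines\n";
print $lines[0];
```

Because decoding happens in the layer, the strings in `@lines` contain ordinary characters with no interleaved zero bytes.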
Re: Mysterious Whitespaces between each character in a file by Unforgiven (Hermit) on Oct 08, 2009 at 14:07 UTC
Have you tried looking at the file with a hex editor? Knowing exactly what that character is may help; then you could try tracking down where it's coming from (or just regex it out).
Re^2: Mysterious Whitespaces between each character in a file by shawnhcorey (Friar) on Oct 08, 2009 at 16:18 UTC
Agreed. One possibility is that it contains a non-breaking space (code 0xA0 in Latin-1, which is outside ASCII). On a plain byte string, /\s/ does not match this. Looking at the data with a hex editor will tell you if this is so.
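If no hex editor is at hand, a few lines of Perl can show the first bytes and check for a byte order mark. This is a rough sketch: both helper names are hypothetical, and a missing BOM proves nothing about the encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: hex-dump the first $n bytes of a string.
sub hex_head {
    my ($bytes, $n) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', substr($bytes, 0, $n);
}

# Hypothetical helper: guess the encoding from a leading BOM, if any.
# (A UTF-32LE file also starts with FF FE; that case is ignored here.)
sub guess_by_bom {
    my ($bytes) = @_;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'unknown';
}

my $utf16 = "\xFF\xFEH\0i\0";   # BOM followed by "Hi" in UTF-16LE
my $nbsp  = "a\xA0b";           # Latin-1 text with a non-breaking space

print hex_head($utf16, 8), " => ", guess_by_bom($utf16), "\n";  # FF FE 48 00 69 00 => UTF-16LE
print hex_head($nbsp, 8),  " => ", guess_by_bom($nbsp),  "\n";  # 61 A0 62 => unknown
```

To use it on a real file, read the first few bytes with `binmode` set on the handle and pass them to both helpers.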
Re: Mysterious Whitespaces between each character in a file by Anonymous Monk on Oct 08, 2009 at 13:41 UTC
It is probably a UCS-2 file.