SamCG has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I have a fairly simple loop of the type
open FILE, $_; while (<FILE>) { my @row = split "\t"; print @row; ...}
which has suddenly gone horribly wrong. Instead of separating my data into a nice neat little array, like

@row = qw(This is tab data)

It started hacking the heck out of it!

@row = qw(T h i s i s t a b d a t a)

Looking at the file that comes to me suggests that it's Unicode, and that my split is for some reason failing to see any tabs in the data. I'd really like to use the file and process it automatically, rather than having to resave it as a normal tab-delimited file. Surely someone knows how to work around this issue and split these files in some relatively straightforward manner?

Re: Recognizing tabs in unicode files
by Celada (Monk) on Dec 08, 2005 at 23:14 UTC

    When you say Unicode, what encoding do you mean? The most popular and convenient Unicode encoding, UTF-8, and the one Perl likes to use, should be robust against this. Since you are testing for an ASCII character, even if there is non-ASCII Unicode data in UTF-8, it would be passed right through.

    If you are dealing with another character encoding altogether, you should use a PerlIO layer to tell Perl what character encoding the data coming in are in.

    # If your data are in UTF-32, which is not ASCII-compatible at all!
    binmode FILE, ":encoding(UTF-32)";
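
    For a fuller picture, here is a minimal sketch of the same idea with the encoding layer given directly in a three-argument open. It assumes the input is UTF-16 with a byte order mark; the temporary file and its sample line are made up purely so the example is self-contained, and your real file name would go in place of $path.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical sample data: one tab-delimited line written as raw
# UTF-16 bytes (Encode's "UTF-16" adds a byte order mark on output).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':raw';
print {$tmp} encode('UTF-16', "This\tis\ttab\tdata\n");
close $tmp or die "Cannot close $path: $!";

# Decode on the way in, so split sees characters rather than raw bytes.
open my $fh, '<:raw:encoding(UTF-16)', $path
    or die "Cannot open $path: $!";

my @rows;
while (my $line = <$fh>) {
    chomp $line;
    push @rows, [ split /\t/, $line ];
}
close $fh;

print join('|', @{ $rows[0] }), "\n";
```

    If the file has no byte order mark, naming the byte order explicitly (UTF-16LE or UTF-16BE) is probably what you want instead.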
Re: Recognizing tabs in unicode files
by BorgCopyeditor (Friar) on Dec 08, 2005 at 23:28 UTC
    One space between each character suggests UTF-16 being treated as ASCII. As the above poster recommends, you might try a different file mode.

    BCE
    --Your punctuation skills are insufficient!
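
    The UTF-16 diagnosis above can be checked by looking at the raw bytes. The following is only a sketch: the sample string is invented, and it assumes little-endian UTF-16 without a byte order mark. Each ASCII character is followed by a NUL byte, which many terminals show as a blank.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample line, encoded as UTF-16LE (no byte order mark).
my $bytes = encode('UTF-16LE', "tab\tdata");

# Hex dump: every ASCII character is followed by a 00 byte.
my $hex = join ' ', unpack '(H2)*', $bytes;
print "$hex\n";

# Replace the NULs with spaces to mimic how the raw bytes tend to display.
(my $visible = $bytes) =~ tr/\0/ /;
print "$visible\n";

# Decoding first gives split the characters it expects.
my @row = split /\t/, decode('UTF-16LE', $bytes);
print join('|', @row), "\n";
```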

Re: Recognizing tabs in unicode files
by SamCG (Hermit) on Dec 08, 2005 at 23:40 UTC
    binmode FILE, ":encoding(UTF-16)"

    Works great! Thanks!