Recognizing tabs in unicode files

SamCG has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I have a fairly simple loop of the type

    open FILE, $_;
    while (<FILE>) {
        my @row = split "\t";
        print @row;
    ...}

which has suddenly gone horribly wrong. Instead of separating my data into a nice neat little array, like

@row = qw(This is tab data)

It started hacking the heck out of it!

@row = qw(T h i s i s t a b d a t a)

Looking at the file that comes to me suggests that it's Unicode, and that my split is for some reason failing to see any tabs in the data. I'd really like to use the file and process it automatically, rather than having to resave it as a normal tab-delimited file. Surely someone knows how to work around this issue and split these files in some relatively straightforward manner?

Replies are listed 'Best First'.
Re: Recognizing tabs in unicode files by Celada (Monk) on Dec 08, 2005 at 23:14 UTC
When you say Unicode, what encoding do you mean? The most popular and convenient Unicode encoding, UTF-8, which is also the one Perl likes to use, should be robust against this. Since you are splitting on an ASCII character, any non-ASCII Unicode data in UTF-8 would be passed right through. If you are dealing with another character encoding altogether, you should use a PerlIO layer to tell Perl which character encoding the incoming data are in.

    # If your data are in UTF-32, which is not ASCII compatible at all!
    binmode FILE, ":encoding(UTF-32)";
Re: Recognizing tabs in unicode files by BorgCopyeditor (Friar) on Dec 08, 2005 at 23:28 UTC
One space between each character suggests UTF-16 being treated as ASCII. As the above poster recommends, you might try a different file mode.
BCE
--Your punctuation skills are insufficient!
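To illustrate why UTF-16 read as ASCII looks "spaced out", here is a small sketch using the core Encode module. It assumes little-endian UTF-16 without a byte-order mark and purely ASCII content; the sample string and variable names are made up for the illustration, and the original file may well differ.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "This\tis\ttab\tdata";
my $bytes = encode('UTF-16LE', $text);   # assumed byte order, no BOM

# Hex dump of the first ten bytes: every ASCII character is followed by 00.
print join(' ', unpack('(H2)*', substr($bytes, 0, 10))), "\n";
# prints: 54 00 68 00 69 00 73 00 09 00

# Splitting the undecoded bytes leaves the NUL bytes inside the fields ...
my @raw = split /\t/, $bytes;

# ... while decoding first gives the four clean fields.
my @row = split /\t/, decode('UTF-16LE', $bytes);
print join('|', @row), "\n";
# prints: This|is|tab|data
```

Many terminals and editors render the NUL bytes as blanks, which would be consistent with the one-space-per-character output described in the question.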
Re: Recognizing tabs in unicode files by SamCG (Hermit) on Dec 08, 2005 at 23:40 UTC
    binmode FILE, ":encoding(UTF-16)"
Works great! Thanks!
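
For later readers, the same layer can also be given directly in a three-argument open instead of a separate binmode call. What follows is only a sketch under stated assumptions: the temporary sample file, the lexical handle $fh, and the @rows array are invented for the illustration. It also assumes the file begins with a byte-order mark, which the plain UTF-16 encoding needs in order to pick the byte order; for files without one, UTF-16LE or UTF-16BE would have to be named explicitly.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small UTF-16 sample file so the sketch is self-contained.
# Writing through :encoding(UTF-16) puts a byte-order mark at the start.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw:encoding(UTF-16)';
print {$tmp_fh} "This\tis\ttab\tdata\n";
close $tmp_fh or die "Cannot close $path: $!";

# Read it back, decoding UTF-16 as the data comes in.
open my $fh, '<:encoding(UTF-16)', $path or die "Cannot open $path: $!";
my @rows;
while (my $line = <$fh>) {
    chomp $line;
    push @rows, [ split /\t/, $line ];   # one array ref per input line
}
close $fh;

print join('|', @$_), "\n" for @rows;
```

Naming the layer in open means the very first read is already decoded, and the lexical handle is closed automatically if it goes out of scope.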