Unicode and text files

dirtdart has asked for the wisdom of the Perl Monks concerning the following question:

I am running the latest release of ActivePerl on Windows Server 2003 and have run into a problem that has me completely confounded. I have a plain text file that I know is Unicode. I need to open this file and see if one of the lines contains certain data. So, I:

open FILE, $myfile;
while(<FILE>) {
    if(/Information Store/i) {
        $found = 1;
        last;
    }
}

Not surprisingly, it doesn't work. So I do some reading, peruse perluniintro, perlunicode and various other documentation and find that what I REALLY need to do is:

open FILE, "<:utf8", $myfile;
while(<FILE>) {
    if(/Information Store/i) {
        $found = 1;
        last;
    }
}

Still doesn't work. Inside the while loop, I place a print "$_\n" and find that each line is printed with what appear to be spaces between each letter. This is the case whether I use the first open method or the second.

So, just to see what Perl thinks, I insert utf8::is_utf8($_) in the while loop. It returns 1 every time, no matter how I open the file. Obviously Perl believes this to be a Unicode string. I try using utf8::decode on the string to see what it turns into, only to find that although the function returns 1 for success, the string is completely unchanged. So, in desperation, I attempt to turn the string into an array of bytes using unpack. Both unpack "U*" and unpack "C*" return the exact same character array, and in both, the "space" between characters is reported as having a value of 48. According to the ASCII character charts, this should print as a 0 (zero).

Now I'm completely dumbfounded. I don't care if Perl thinks this is a Unicode string, ASCII string, EBCDIC string, or anything else it wants. I just want to be able to use a regular expression to find certain data within this string, and I can't. Does anyone have any advice on how to get this accomplished?


Replies are listed 'Best First'.
Re: Unicode and text files by davorg (Chancellor) on Oct 12, 2006 at 13:35 UTC
It sounds like what you have there isn't UTF-8, but UTF-16. -- <http://dave.org.uk> "The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg
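To illustrate why UTF-16 data would look "spaced out" when read as bytes, here is a minimal sketch. It assumes the little-endian variant (UTF-16LE) and uses the search phrase from the question as sample text; the actual file contents are not known.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample text only; the real file contents are an assumption.
my $text  = "Information Store";
my $bytes = encode('UTF-16LE', $text);

# Each ASCII character becomes two bytes: the character code, then a zero byte.
my @codes = unpack 'C*', $bytes;
print join(' ', @codes[0 .. 5]), "\n";    # 73 0 110 0 102 0

# The raw bytes no longer contain the contiguous ASCII phrase,
# so a plain regex on the undecoded data cannot match.
my $raw_match = $bytes =~ /Information Store/i ? 1 : 0;
print "raw match: $raw_match\n";          # raw match: 0
```

The zero bytes are what a console tends to render as gaps between letters.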
Re^2: Unicode and text files by dirtdart (Beadle) on Oct 12, 2006 at 13:49 UTC
`open FILE, "<:encoding(utf16)", $myfile; while(<FILE>)` produces "UTF-16:Unrecognised BOM 4a00 at backup.pl line 68." Line 68 is the while(<FILE>) line, so I'm guessing it's not UTF-16. The file is produced by NTBackup, if that helps any.
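For what it's worth, the "4a00" in that message is consistent with UTF-16 text that simply has no byte order mark. A small sketch, assuming the message shows the first two bytes of the file in file order (4a, then 00):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: these are the first two bytes of the file, as reported in the error.
my $first_two = "\x4a\x00";

my $as_le = decode('UTF-16LE', $first_two);
my $as_be = decode('UTF-16BE', $first_two);

printf "LE: U+%04X\n", ord $as_le;    # U+004A, the letter J
printf "BE: U+%04X\n", ord $as_be;    # U+4A00
```

Read as little-endian, the bytes are a plain "J", which suggests the plain `utf16` layer failed only because it expects a BOM to tell it the byte order.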
Re^3: Unicode and text files by Hue-Bond (Priest) on Oct 12, 2006 at 14:02 UTC
Try `utf16le` or `utf16be`. -- David Serrano
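A sketch of how that suggestion might look in full, assuming the file turns out to be UTF-16LE. The temporary file and its contents are made up here to stand in for the NTBackup log, and the lexical filehandle is a stylistic choice rather than something the original code used.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file standing in for the NTBackup log.
my ($out, $myfile) = tempfile();
binmode $out, ':raw:encoding(UTF-16LE)';
print {$out} "Backup started\nMicrosoft Information Store\nDone\n";
close $out or die "close: $!";

my $found = 0;

# :raw first, so no CRLF translation is applied to the undecoded bytes.
open my $fh, '<:raw:encoding(UTF-16LE)', $myfile
    or die "Cannot open $myfile: $!";
while (my $line = <$fh>) {
    if ($line =~ /Information Store/i) {
        $found = 1;
        last;
    }
}
close $fh;
unlink $myfile;

print "found = $found\n";
```

If the real file does begin with a BOM, an explicit `UTF-16LE` layer leaves it in the data as U+FEFF at the start of the first line, and Windows line endings leave a trailing "\r" on each line. Neither affects this particular match.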
Re^4: Unicode and text files by dirtdart (Beadle) on Oct 12, 2006 at 14:05 UTC
Re^5: Unicode and text files by Hue-Bond (Priest) on Oct 12, 2006 at 17:44 UTC
Some notes below your chosen depth have not been shown here
Re^5: Unicode and text files by davorg (Chancellor) on Oct 13, 2006 at 07:46 UTC
Re^5: Unicode and text files by graff (Chancellor) on Oct 12, 2006 at 21:14 UTC
Re: Unicode and text files by Melly (Chaplain) on Oct 12, 2006 at 13:32 UTC
Did you add "use utf8;" to the top of your script? From the perlunicode help, under "use utf8 still needed to enable UTF-8/UTF-EBCDIC in scripts": "As a compatibility measure, the use utf8 pragma must be explicitly included to enable recognition of UTF-8 in the Perl scripts themselves (in string or regular expression literals, or in identifier names) on ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based machines. These are the only times when an explicit use utf8 is needed." <update>Which doesn't work for me... :(</update> Tom Melly, tom@tomandlu.co.uk
Re^2: Unicode and text files by dirtdart (Beadle) on Oct 12, 2006 at 13:41 UTC
I hadn't been using utf8 because I didn't think that I really needed it with Perl 5.8. However, I just tried the same code with use utf8; added and nothing changed. The regular expression still doesn't match the appropriate line in the file and the line still prints as what appears to be a string of characters separated by spaces. This happens when opening the file with or without utf8 specified in the open statement.
Re: Unicode and text files by Errto (Vicar) on Oct 13, 2006 at 02:51 UTC
I think you got the answer to your question already, but I just want to add for general information that in the Windows world, when you open a text file whose encoding is "Unicode", that probably means UTF-16LE. Microsoft has an unfortunate habit of referring to this as "Unicode" even though it's not; it's just one of several possible encodings of Unicode. An easy way to determine the encoding of a text file in Windows, at least if it's one of ANSI (aka CP1252 in US-English Windows), UTF-8, UTF-16BE or UTF-16LE, is to open it in Notepad and look at the File->Save As dialog.
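The same kind of check can be roughed out in Perl by looking for a byte order mark at the start of the file. This is only a sketch: `guess_encoding_from_bom` is a hypothetical helper name, and it cannot say anything about files written without a BOM, which is what the error message earlier in this thread suggests here.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding name from the first bytes of a file.
# Returns undef when no known BOM is present.
sub guess_encoding_from_bom {
    my ($head) = @_;
    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return;
}

# Sample byte strings; in practice these would come from read() on a
# filehandle opened with the :raw layer.
for my $head ("\xFF\xFEJ\x00", "\xFE\xFF\x00J", "\xEF\xBB\xBFJ", "J\x00") {
    my $guess = guess_encoding_from_bom($head);
    print defined $guess ? $guess : 'no BOM', "\n";
}
```

A UTF-32LE BOM also begins with FF FE, so this simple version would report such a file as UTF-16LE.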