dirtdart has asked for the wisdom of the Perl Monks concerning the following question:

I am running the latest release of ActivePerl on Windows Server 2003 and have run into a problem that has me completely confounded. I have a plain text file that I know is Unicode. I need to open this file and see if one of the lines contains certain data. So, I:

    open FILE, $myfile;
    while(<FILE>) {
        if(/Information Store/i) {
            $found = 1;
            last;
        }
    }

Not surprisingly, it doesn't work. So I do some reading, peruse perluniintro, perlunicode and various other documentation and find that what I REALLY need to do is:

    open FILE, "<:utf8", $myfile;
    while(<FILE>) {
        if(/Information Store/i) {
            $found = 1;
            last;
        }
    }

Still doesn't work. Inside the while loop, I place a print "$_\n" and find that each line is printed with what appear to be spaces between each letter. This is the case whether I use the first open method or the second.

So, just to see what Perl thinks, I insert utf8::is_utf8($_) in the while loop. It returns 1 every time, no matter how I open the file. Obviously Perl believes this to be a Unicode string. I try using utf8::decode on the string to see what it turns into, only to find that although the function returns 1 for success, the string is completely unchanged. So, in desperation, I attempt to turn the string into an array of bytes using unpack. Both unpack "U*" and unpack "C*" return the exact same array of character codes, and in both, the "space" between characters is reported as having a value of 48. According to the ASCII character charts, this should print as a 0 (zero).

Now I'm completely dumbfounded. I don't care if Perl thinks this is a Unicode string, ASCII string, EBCDIC string, or anything else it wants. I just want to be able to use a regular expression to find certain data within this string and I can't. Does anyone have any advice on how to get this accomplished?
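
For readers following along, here is a minimal sketch of why the match could fail, assuming the file is UTF-16LE (the encoding suggested in the replies). The sample string is made up for illustration. In that encoding every ASCII character is followed by a zero byte, so the raw bytes never contain the phrase as a contiguous run, while the decoded string does.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Encode a sample line the way a Windows "Unicode" text file would store it
# (assumption: UTF-16LE; the sample text is hypothetical).
my $text  = "Information Store";
my $bytes = encode('UTF-16LE', $text);

# Every ASCII character is followed by a zero byte.
my @codes = unpack 'C*', $bytes;
print join(' ', @codes[0 .. 5]), "\n";

# The raw bytes do not contain the contiguous ASCII phrase...
my $raw_match = $bytes =~ /Information Store/i ? 1 : 0;

# ...but the decoded character string does.
my $decoded       = decode('UTF-16LE', $bytes);
my $decoded_match = $decoded =~ /Information Store/i ? 1 : 0;

print "raw: $raw_match, decoded: $decoded_match\n";
```

The zero bytes are most likely what shows up as "spaces" when the undecoded line is printed to a console.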

Replies are listed 'Best First'.
Re: Unicode and text files
by davorg (Chancellor) on Oct 12, 2006 at 13:35 UTC

    It sounds like what you have there isn't UTF-8, but UTF-16.

    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      open FILE, "<:encoding(utf16)", $myfile;
      while(<FILE>)

      Produces "UTF-16:Unrecognised BOM 4a00 at backup.pl line 68." Line 68 being the while(<FILE>) line. So I'm guessing it's not UTF16.

      The file is produced by NTBackup if that helps any.
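
One possible reading of that error, sketched below: the "utf16" decoder expects a byte order mark at the start of the file, and the value 4a00 looks like the file's first two bytes (0x4A 0x00) read as a big-endian 16-bit number. That this is how Encode builds the message is an assumption here. Interpreted as little-endian, the same two bytes are simply the letter "J", which would point to UTF-16LE text with no BOM.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes reported in the error message, in file order.
my $first_two = "\x4a\x00";

# Decode them under both byte orders and compare the code points.
my $as_le = decode('UTF-16LE', $first_two);   # the letter "J"
my $as_be = decode('UTF-16BE', $first_two);   # code point U+4A00

printf "LE: U+%04X, BE: U+%04X\n", ord $as_le, ord $as_be;
```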

        Try utf16le or utf16be

        --
        David Serrano
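
A minimal sketch of that suggestion, assuming the file turns out to be UTF-16LE without a BOM. The sample data is a hypothetical stand-in for the NTBackup log, and an in-memory filehandle is used only to keep the example self-contained; a real script would open $myfile instead.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for the log file: UTF-16LE bytes, no BOM.
my $data = encode('UTF-16LE',
    "Job started\nMicrosoft Information Store\nDone\n");

# Read through an explicit UTF-16LE decoding layer.
# A real script would pass $myfile in place of \$data.
open my $fh, '<:encoding(UTF-16LE)', \$data
    or die "Cannot open: $!";

my $found = 0;
while (my $line = <$fh>) {
    # $line is now a decoded character string, so the plain regex works.
    if ($line =~ /Information Store/i) {
        $found = 1;
        last;
    }
}
close $fh;

print "found: $found\n";
```

On Windows the default :crlf layer can interfere with the two-byte line endings of a UTF-16 file. A commonly suggested layer stack is "<:raw:encoding(UTF-16LE):crlf", but that is worth verifying against your own Perl build.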

Re: Unicode and text files
by Melly (Chaplain) on Oct 12, 2006 at 13:32 UTC

    Did you add "use utf8;" to the top of your script?

    From the perlunicode help:
    "use utf8 still needed to enable UTF-8/UTF-EBCDIC in scripts: As a compatibility measure, the use utf8 pragma must be explicitly included to enable recognition of UTF-8 in the Perl scripts themselves (in string or regular expression literals, or in identifier names) on ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based machines. These are the only times when an explicit use utf8 is needed."

    <update>Which doesn't work for me... :(</update>

    Tom Melly, tom@tomandlu.co.uk

      I hadn't been using utf8 because I didn't think that I really needed it with Perl 5.8. However, I just tried the same code with use utf8; added and nothing changed. The regular expression still doesn't match the appropriate line in the file and the line still prints as what appears to be a string of characters separated by spaces. This happens when opening the file with or without utf8 specified in the open statement.

Re: Unicode and text files
by Errto (Vicar) on Oct 13, 2006 at 02:51 UTC

    I think you got the answer to your question already, but I just want to add for general information that in the Windows world, when you open a text file whose encoding is "Unicode", that probably means UTF16LE. Microsoft has an unfortunate habit of referring to this as "Unicode" even though it's not; it's just one of several possible encodings of Unicode.

    An easy way to determine the encoding of a text file in Windows, at least if it's one of ANSI (aka CP1252 in US-English Windows), UTF-8, UTF16BE or UTF16LE, is to open it in Notepad and look at the File->Save As window.
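
The same check can be approximated from Perl by looking for a byte order mark at the start of the file. This is only a sketch: guess_bom_encoding is a hypothetical helper, it covers just the three BOMs listed, and a file with no BOM (as the NTBackup log appears to be) yields no answer and has to be identified some other way.

```perl
use strict;
use warnings;

# Guess an encoding from the leading bytes of a file's content.
# Returns an Encode-style name, or undef when no known BOM is present.
# Note: a UTF-32LE BOM also starts with FF FE and is not distinguished here.
sub guess_bom_encoding {
    my ($head) = @_;
    return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
    return;
}

# Example inputs: a UTF-16LE BOM followed by "J", and the same text without a BOM.
my $with_bom    = guess_bom_encoding("\xFF\xFEJ\x00");
my $without_bom = guess_bom_encoding("J\x00o\x00");

print defined $with_bom    ? $with_bom    : 'no BOM', "\n";
print defined $without_bom ? $without_bom : 'no BOM', "\n";
```

In a real script the leading bytes would come from opening the file with the "<:raw" layer and reading its first few bytes.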