in reply to Getting Data from an Excel File

Not exactly sure I understood how you extracted the textual data from the Excel file... but for converting Windows Unicode (UTF-16) plain-text files to UTF-8, the following should do the trick:

use strict;
use warnings;

# input is UTF-16 little-endian, output is to be UTF-8
open my $in,  '<:encoding(UTF-16LE)', 'test.utf16le' or die $!;
open my $out, '>:encoding(UTF-8)',    'test.utf8'    or die $!;

while (my $line = <$in>) {
    print $out $line;
}

close $in;
close $out;

The idea is essentially to tell Perl what the existing input encoding and the desired output encoding are, and to let Perl do the rest.

Update: BTW, if the input file contains a BOM (which it almost always does on Windows), it would have been sufficient to specify :encoding(UTF-16). In that case, Perl can figure out by itself that the file is in little-endian format. Interestingly, though, the output file does not contain a UTF-8 BOM when you do it that way; I never really understood the reasoning behind that behaviour.  (When you convert it as shown above, however, the output file will have a BOM (presumably because it's then converted just like any other codepoint), which is recommended on Windows.)
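
Here's a minimal sketch of that BOM-based variant (using the same file names as above). Since :encoding(UTF-16) consumes the BOM on input, you have to write one out explicitly if you want the UTF-8 output to be tagged:

use strict;
use warnings;

# Perl detects the endianness from the BOM and strips it on input
open my $in,  '<:encoding(UTF-16)', 'test.utf16le' or die $!;
open my $out, '>:encoding(UTF-8)',  'test.utf8'    or die $!;

# the BOM was consumed above, so write one explicitly if desired
print $out "\x{FEFF}";

print $out $_ while <$in>;

close $in;
close $out;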

Re^2: Getting Data from an Excel File
by Jim (Curate) on Feb 27, 2008 at 19:19 UTC
    ++ Excellent reply.

    When you convert it as shown above, however, the output file will have a BOM (presumably because it's then converted just like any other codepoint), which is recommended on Windows.

    Recommended by whom? Microsoft Corp.?

    I don't like BOMs in UTF-8 files on any platform. A BOM in a text file that is otherwise all ASCII kills its backward compatibility with so-called "legacy" software, which is a big part of the raison d'être of the UTF-8 encoding form. In my experience, most modern applications that understand Unicode will figure out the UTF-8-ness of a BOM-less text file, whereas almost no legacy software will tolerate a BOM in an ASCII file.
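
    To make the compatibility point concrete, here's a quick sketch (assuming the test.utf8 file from the example above): the UTF-8 BOM is the three bytes EF BB BF, which a legacy tool expecting pure ASCII sees as garbage at the start of the file:

        # peek at the first three raw bytes of the file
        open my $fh, '<:raw', 'test.utf8' or die $!;
        read $fh, my $head, 3;
        printf "%v02X\n", $head;   # prints "EF.BB.BF" if a BOM is present
        close $fh;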

    See this entry and following ones in the Unicode UTF/BOM FAQ.

    Jim

      Recommended by whom? Microsoft Corp.?

      Not sure what Microsoft's official recommendation is in this regard (if anyone knows, please share). My "is recommended" statement is just my own conclusion from personal experience, in particular from having worked in Japanese Windows environments for a couple of months.

      My impression there was that, overall, you'll run into the fewest problems if you always tag Unicode files as such using a BOM (be they UTF-8, UTF-16 or UCS-2). Some programs will try auto-detection (with varying success), but many simply assume the file is in the default legacy encoding if not told otherwise. YMMV of course, depending on which applications you're primarily working with, so please take this with a grain of salt.
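
      As a sketch of what such auto-detection amounts to (the helper name sniff_encoding and the cp1252 fallback are just assumptions for illustration), sniffing the BOM in Perl might look like this:

          # hypothetical helper: pick an encoding layer based on the BOM,
          # falling back to an assumed legacy codepage (cp1252) if none is found
          sub sniff_encoding {
              my ($file) = @_;
              open my $fh, '<:raw', $file or die $!;
              read $fh, my $head, 3;
              close $fh;
              return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
              return 'UTF-16LE' if $head =~ /^\xFF\xFE/;
              return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
              return 'cp1252';
          }

          # e.g.:  open my $in, '<:encoding(' . sniff_encoding('test.txt') . ')', 'test.txt' or die $!;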

      I don't like BOMs in UTF-8 files on any platform...

      I personally don't like them either, in particular on Unix platforms, where they tend to create more problems than they solve. OTOH, I've gotten used to the fact that different platforms have different approaches and philosophies. After all, with Perl in my handbag, this isn't too much of an issue anyway...

Re^2: Getting Data from an Excel File
by mrguy123 (Hermit) on Feb 27, 2008 at 12:40 UTC
    Thanks!!
    Worked like a charm!!