mrguy123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I am building an application that loads data from an Excel file into an Oracle database. The idea is to save the file as tab-delimited text and simply parse it with Perl. All the data I work with is in UTF-8. This works fine unless the data in the Excel file contains special characters (in my case, Korean characters). For example:

Korean Studies Information Service System/한국학술정보학술지원문데이터베이스

You can open an Excel file and paste this example if you want to reproduce the problem (not placed in code tags because the characters are encoded). If I save this as a text file, the special characters all turn into '?'. Another option is to save the file as Unicode text, but then the file is encoded in UTF-16 rather than UTF-8, and I can't load it into the DB.
I tried to convert it to UTF-8 using Encode, but with no success for the Korean characters (although with partial success for Czech characters, so I think I'm on the right track).
This is the code I used (test is the Unicode file, test_utf is the UTF-8 encoded file):
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

open IN,  "<", "test"     or die;
open OUT, ">", "test_utf" or die;

while (my $line = <IN>) {
    ## unicode = utf-16 I think
    $line = decode('unicode', $line);
    $line = encode('utf-8', $line);
    print OUT $line;
}

close IN;
close OUT;
Any idea what might work?
Also, this isn't strictly Perl, but if anybody has an idea how to save an Excel file as UTF-8 without losing special characters, I will be extremely grateful.
Thanks,
Guy

Man is the only animal that can remain on friendly terms with the victims he intends to eat until he eats them.
- Samuel Butler

Re: Getting Data from an Excel File
by almut (Canon) on Feb 27, 2008 at 12:25 UTC

    Not exactly sure I understood how you extracted the textual data from the Excel file... but for converting Windows Unicode (UTF-16) plain-text files to UTF-8, the following should do the trick:

    use strict;
    use warnings;

    open IN,  "<:encoding(utf16le)", "test.utf16le" or die $!;
    open OUT, ">:encoding(utf8)",    "test.utf8"    or die $!;

    while (my $line = <IN>) {
        print OUT $line;
    }

    close IN;
    close OUT;

    The idea is essentially to tell Perl what your existing input encoding and your desired output encoding are, and to let Perl do the rest.

    Update: BTW, if the input file contains a BOM (which it almost always does on Windows), it would have been sufficient to specify :encoding(utf16). Perl can then figure out by itself that the file is in little-endian format. Interestingly, though, the output file does not contain a UTF-8 BOM when doing it that way; I never really understood the reasoning behind that behaviour...  (When you convert it as shown above, however, the output file will have a BOM (presumably because the BOM is then converted just like any other codepoint), which is recommended on Windows.)
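    For illustration, a minimal sketch of that BOM-based variant (file names are placeholders):

    use strict;
    use warnings;

    # :encoding(utf16) reads the BOM and determines the endianness by itself
    open IN,  "<:encoding(utf16)", "test.utf16" or die $!;
    open OUT, ">:encoding(utf8)",  "test.utf8"  or die $!;

    while (my $line = <IN>) {
        print OUT $line;
    }

    close IN;
    close OUT;

    As noted above, the output then ends up without a UTF-8 BOM, because the decoder consumes the input BOM instead of passing U+FEFF through to the output layer.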

      ++ Excellent reply.

      When you convert it as shown above, however, the output file will have a BOM (presumably because it's then converted just like any other codepoint), which is recommended on Windows.

      Recommended by whom? Microsoft Corp.?

      I don't like BOMs in UTF-8 files on any platform. A BOM in a text file that is otherwise all ASCII kills its backward compatibility with so-called "legacy" software, which is a big part of the raison d'être of the UTF-8 encoding form. In my experience, most modern applications that understand Unicode will figure out the UTF-8-ness of a BOM-less text file, whereas almost no legacy software will tolerate a BOM in an ASCII file.

      See this entry and following ones in the Unicode UTF/BOM FAQ.

      Jim

        Recommended by whom? Microsoft Corp.?

        Not sure what Microsoft's official recommendation is in this regard (if anyone knows, please share). My "is recommended" statement is just my own summary of personal experience, in particular from having worked in Japanese Windows environments for a couple of months.

        My impression there was that, overall, you'll run into the fewest problems if you always tag Unicode files as such using a BOM (be they UTF-8, UTF-16 or UCS-2). Some programs will try auto-detection (with varying success), but many simply assume the file is in the default legacy encoding if not told otherwise. YMMV of course, depending on which applications you're primarily working with, so please take this with a grain of salt.
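        If you do want to tag a UTF-8 file with a BOM from Perl, one way is simply to print U+FEFF as the first character yourself. A minimal sketch (the file name is a placeholder):

        use strict;
        use warnings;

        open OUT, ">:encoding(utf8)", "tagged.utf8" or die $!;
        print OUT "\x{FEFF}";      # U+FEFF goes through the encoding layer as the UTF-8 BOM (EF BB BF)
        print OUT "some text\n";
        close OUT;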

        I don't like BOMs in UTF-8 files on any platform...

        I personally don't like them either, in particular on Unix platforms, where they tend to create more problems than they solve. OTOH, I've gotten used to the situation that different platforms have different approaches and philosophies.  After all, with Perl in my handbag, this isn't too much of an issue anyway...

      Thanks!!
      Worked like a charm!!
Re: Getting Data from an Excel File
by szabgab (Priest) on Mar 02, 2008 at 19:52 UTC
    I am not sure it will work for you, but have you tried Spreadsheet::ParseExcel? If you are on Windows, have you tried the OLE automation that I think Win32::OLE can provide?
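    For what it's worth, a rough sketch of the Spreadsheet::ParseExcel route, dumping each sheet as UTF-8 tab-separated text on STDOUT (untested; the file name is a placeholder, and the parse()/error() interface assumes a recent version of the module):

    use strict;
    use warnings;
    use Spreadsheet::ParseExcel;

    binmode STDOUT, ':encoding(utf8)';

    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse('data.xls')
        or die $parser->error(), "\n";

    for my $worksheet ($workbook->worksheets()) {
        my ($row_min, $row_max) = $worksheet->row_range();
        my ($col_min, $col_max) = $worksheet->col_range();

        for my $row ($row_min .. $row_max) {
            my @fields;
            for my $col ($col_min .. $col_max) {
                my $cell = $worksheet->get_cell($row, $col);
                push @fields, defined $cell ? $cell->value() : '';
            }
            print join("\t", @fields), "\n";
        }
    }

    Reading the .xls directly like this sidesteps the "save as tab-delimited" step in Excel entirely, which is where the Korean characters were being lost.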