I think you'll want to try Spreadsheet::ParseExcel instead of DBI. When you drill down to individual cell contents, you can check whether the cell's "Code" attribute is set to "ucs2" and, if so, use the decode() function from Encode to convert the value from UTF-16LE to utf8. (M$ Excel switches between ucs2 and "native" encodings on a cell-by-cell basis.)
Interestingly, when I run the "dmpExR.pl" sample script that comes with that module, it seems to automagically convert the characters to utf8 on my Mac OS X box.
Or, if you prefer to stick with DBI, just do something like this for each cell value:
use Encode;
...
if ( $cellValue =~ /(?:.\x06)+/ ) {
    # UTF-16LE puts the low byte first, so the 0x06 high byte comes second
    $cellValue = decode( "UTF-16LE", $cellValue );
    # now it's utf8 (a perl character string)
}
...
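The "\x06" would work if the ucs2 content is Arabic, because the basic Arabic block is the range U+0600 - U+06FF, so in UTF-16LE the second byte of each character's byte pair is 0x06. (It's only a heuristic: a cell in some "native" encoding that happens to contain a 0x06 byte would match too.)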
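To make that byte-pattern check concrete, here is a minimal, self-contained sketch that runs without any spreadsheet. It assumes perl 5.8 or later with the core Encode module. The helper name decode_if_arabic_utf16le and the sample strings are made up for illustration and are not part of DBI or Spreadsheet::ParseExcel. The /s flag is my addition so that "." also matches a 0x0A low byte.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from DBI or Spreadsheet::ParseExcel): decode a raw
# cell value only if it looks like UTF-16LE Arabic, i.e. it contains at least
# one byte pair whose second (high) byte is 0x06.
sub decode_if_arabic_utf16le {
    my ($raw) = @_;
    return $raw unless defined $raw && $raw =~ /(?:.\x06)+/s;
    return decode( 'UTF-16LE', $raw );
}

# Simulated cell contents: four Arabic letters (U+0633 U+0644 U+0627 U+0645)
# as UTF-16LE bytes, plus a plain ASCII value that should pass through as-is.
my $arabic_bytes = encode( 'UTF-16LE', "\x{0633}\x{0644}\x{0627}\x{0645}" );
my $ascii_bytes  = 'hello';

my $arabic_text = decode_if_arabic_utf16le($arabic_bytes);
my $ascii_text  = decode_if_arabic_utf16le($ascii_bytes);

# Encode on output so the decoded characters are written as utf8.
binmode STDOUT, ':encoding(UTF-8)';
print length($arabic_bytes), " bytes -> ", length($arabic_text), " characters\n";
print "$ascii_text\n";
```

In real use you would call the helper on each value coming back from your fetch loop. Values that don't match the pattern are returned untouched, so the "native" cells are left alone.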
UPDATE: Sorry -- I just noticed that you are still using perl 5.6; you really seriously should consider upgrading (5.8.8 is current at the moment). Working with Arabic or other Unicode stuff in 5.6 strikes me as a bad idea. BTW, regarding those letters you quoted in your sample data: there probably are "\x06" bytes next to them, as well as pairs of "\x06" followed by "some other non-displayable byte value".