pg09 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to read Arabic text from an MS Word document and write the extracted text to Excel. I'm using the following code, but it prints 'boxes' instead of Arabic text. Any help will be appreciated.
use Unicode::Map();
use Spreadsheet::WriteExcel;
use Win32::OLE::Const 'Microsoft Word';
my $workbook  = Spreadsheet::WriteExcel->new("word.xls");
my $worksheet = $workbook->addworksheet();
my $word      = Win32::OLE->new('Word.Application', 'Quit');
# $doc and $format2 are set up elsewhere; that part is not shown in this post.
my $map   = Unicode::Map->new("ISO-8859-6");
my $utf16 = $map->to_unicode($doc->Words->Item(1)->Text);
$worksheet->write(0, 0, $utf16, $format2);

Re: writing Arabic text in Excel
by jmcnamara (Monsignor) on Apr 03, 2006 at 21:00 UTC
    The Spreadsheet::WriteExcel write() method expects Unicode data to be in UTF-8 format (in perl 5.8).

    You indicate that the data you are reading from Word is ISO-8859-6 but it may be UTF-16LE, which most Windows applications use internally.

    Either way you should try to convert it to UTF-8 instead of UTF-16 if you are using write().

    You can also write UTF-16BE and UTF-16LE data using the (poorly named) write_unicode() and write_unicode_le() methods.

    --
    John.

      Thanks for your response! However, I'm not able to install Unicode::Map8 through 'ppm' on my Windows 2000 machine. It complains: "Searching for 'Unicode::Map8' returned no results. Try a broader search first." I'm however sure the module is present in the 'C:\Perl\lib' directory. Please advise.
        I haven't gone further than the following, but googling for "perl unicode map8 ppm" leads to an alternative PPM repository which seems to have a package built on Aug 1, 2003. Elsewhere(?) I noticed several PPMs that are built for 5.6.

        Google's cache of http://apache.hoxt.com/perl/win32-bin/ppmpackages/ as retrieved on Feb 26, 2006 09:28:24 GMT.
        "I'm not able to install Unicode::Map8 through 'ppm' on my Windows 2000 machine."

        If you are using Perl 5.8 you can use the core Encode module to convert between encodings.

        See also the perluniintro and perlunicode manpages for more information.

        --
        John.
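
The Encode suggestion above could look roughly like the following. This is a minimal sketch, not the code anyone in the thread actually ran. The sample bytes are an assumption chosen for illustration (the Arabic letters alef and beh); real data would come from Word. It decodes the same two letters from each of the candidate source encodings mentioned earlier, ISO-8859-6 and UTF-16LE, into a Perl character string, which is the form to hand to write() under perl 5.8 as far as I know.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative input only: alef + beh in two candidate source encodings.
my $iso_bytes   = "\xC7\xC8";            # ISO-8859-6 bytes
my $utf16_bytes = "\x27\x06\x28\x06";    # UTF-16LE bytes

# decode() turns raw bytes into a Perl character string.
my $from_iso   = decode('iso-8859-6', $iso_bytes);
my $from_utf16 = decode('UTF-16LE',   $utf16_bytes);

# Both inputs decode to the same two code points.
my $code_points = join ' ', map { sprintf 'U+%04X', ord } split //, $from_iso;
print "$code_points\n";    # prints "U+0627 U+0628"

# encode() goes the other way, e.g. when raw UTF-8 bytes are needed for a file.
my $utf8_bytes = encode('UTF-8', $from_iso);
```

If the decoded text still shows up as boxes, trying the other encoding name in decode() is a cheap way to find out which one Word is really handing back.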

Re: writing Arabic text in Excel
by vkon (Curate) on Apr 04, 2006 at 15:42 UTC
    First of all, you should instruct Win32::OLE to use Unicode, with the following 2 lines:
    use Win32::OLE qw(CP_UTF8);
    Win32::OLE->Option(CP=>CP_UTF8);
    Secondly, it is not a good idea to use the obsolete Unicode::Map module. It was used when Unicode support in Perl was weak; now you should go the other, more robust way and use the Unicode support of perl 5.8.x.

    Thirdly, the boxes are probably characters that are missing from the font in use.

    BR,
    Vadim.

      Thanks, this helped! The Arabic text now prints fine, but English-language symbols such as (), [], ..., etc. show up in the wrong places. That is, these symbols are laid out left to right (as in English) rather than right to left (as in Arabic). Following is my code:
      use Win32::OLE qw(CP_UTF8);
      Win32::OLE->Option(CP=>CP_UTF8);
      use Win32::OLE::Const 'Microsoft Word';
      use Spreadsheet::WriteExcel;
      my $workbook = Spreadsheet::WriteExcel->new("word.xls");
      my $worksheet = $workbook->addworksheet();
      my $word = Win32::OLE->new('Word.Application', 'Quit');
      my $doc = $word->Documents->Open("C:\\file.doc");
      my $string = $doc->Words->Item(1)->Text;
      $worksheet->write(0, 0, $string);
        I believe your problem is with mixed right-to-left and left-to-right text...
        I can't advise much here, but I believe Word is quite good at this, so it probably deserves some respect for it :):):)
Re: writing Arabic text in Excel
by pg09 (Acolyte) on Apr 10, 2006 at 20:20 UTC
    I'm extracting 'highlighted' Arabic text from MS Word and outputting it to an Excel file. To do this I'm iterating over each word in the document and checking whether it is highlighted. However, this code takes way too long to finish. Is there a better way to do this? The following code is similar to what I'm using:
    use Win32::OLE qw(CP_UTF8);
    Win32::OLE->Option(CP=>CP_UTF8);
    use Win32::OLE::Const 'Microsoft Word';
    use Spreadsheet::WriteExcel;
    my $word = Win32::OLE->new('Word.Application', 'Quit');
    my $doc = $word->Documents->Open($file);
    my $workbook = Spreadsheet::WriteExcel->new($out_file);
    my $worksheet = $workbook->addworksheet();
    my $row = 0;
    my $col = 0;
    for (my $i = 1; $i <= $doc->Words->Count; $i++) {
        if ($doc->Words->Item($i)->HighlightColorIndex > 0) {
            $worksheet->write($row++, $col, $doc->Words->Item($i)->Text);
        }
    }
    Thank you for any help in advance!

      I don't know if it'd be significantly quicker, but you could try using Word's Find object to search for the highlighted text, then loop through whatever it returns.
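
The Find-object idea above might be sketched like this. It is an untested outline, assuming Word's Find object behaves from Perl as it does from VBA: with an empty search text and Highlight set, each successful Execute moves the range onto the next highlighted run. The file paths are hypothetical placeholders, and the loop guard is a precaution against Find reporting the same final match repeatedly. Note that this yields contiguous highlighted runs rather than individual words, so one row may hold several words.

```perl
use strict;
use warnings;
use Win32::OLE qw(CP_UTF8);
use Win32::OLE::Const 'Microsoft Word';
use Spreadsheet::WriteExcel;

Win32::OLE->Option(CP => CP_UTF8);

# Hypothetical paths, for illustration only.
my $file     = 'C:\\file.doc';
my $out_file = 'highlighted.xls';

my $word = Win32::OLE->new('Word.Application', 'Quit')
    or die "Cannot start Word: " . Win32::OLE->LastError;
my $doc = $word->Documents->Open($file)
    or die "Cannot open $file: " . Win32::OLE->LastError;

my $workbook  = Spreadsheet::WriteExcel->new($out_file);
my $worksheet = $workbook->addworksheet();

# Search the whole document for any highlighted text.
my $range = $doc->Content;
my $find  = $range->Find;
$find->ClearFormatting;
$find->{Text}      = '';           # match on formatting only
$find->{Highlight} = 1;
$find->{Format}    = 1;
$find->{Forward}   = 1;
$find->{Wrap}      = wdFindStop;   # stop at the end of the document

my $row      = 0;
my $last_end = -1;
while ($find->Execute) {
    # After a hit, $range covers the matched run. Bail out if it did not advance.
    last if $range->{End} <= $last_end;
    $last_end = $range->{End};
    $worksheet->write($row++, 0, $range->{Text});
}

$doc->Close(0);    # 0 = do not save changes
$workbook->close;
```

The likely gain, if any, is that Word does the scanning internally, so the script makes one OLE call per highlighted run instead of several per word.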