demis has asked for the wisdom of the Perl Monks concerning the following question:

I am reading from two files, one an excel->csv converted and a pure excel one and want to compare two cells. These are in greek and they truly **ARE** the same. By this I mean
if($cellvalue1 eq $cellvalue2){print "are the same\n";} else {die "not + the same\n";}
and it says they are the same Yes, when I do
$worksheet{1}_>write($row,$column1,$cellvalue1,$format); $worksheet{1}_>write($row,$column2,$cellvalue2,$format);
one prints correctly and the other one gives strange chavaters. What can be wrong here? The version is 2.20 and it's perl 5.8.9 on XP(I know, not my choice) Activestate build 826

Replies are listed 'Best First'.
Re: WriteExcel unicode question
by jmcnamara (Monsignor) on Oct 15, 2009 at 08:49 UTC

    The Spreadsheet::WriteExcel write() method will handle Unicode strings if they have the utf8 flag set.

    This is generally the case if you read in a file and specify the encoding of the file when you open it. See for example the unicode_*.pl examples in the S::WE distro and in particular unicode_8859_7.pl for Greek text.

    So, I'd guess that the strings that you are having problems with don't have the utf8 flag set. You can either fix this using the Encode module or if you let us know how you are reading this strings then we might be able to suggest a better, or at least alternative, solution.

    --
    John.

      Thanks, but I'm not sure that does it The unicode_8859_7.pl scirpt specifies how one opens an ascii file. In my case the file I get the problematic cells is an excel file itself. I should add it displays fine with the Devel:ptkdb debugging/diagnostic module I use. I use
      $cellvalue2=encode("iso08859-7",$cellvalue2);
      it makes NO difference. I even tried
      if($Cell->{Code} eq 'ucs2'){$cellvalue2=Encode::decode ('UCS-2BE',$cel +lvalue2);
      which gave again junk(though more chinese-looking junk) I read these strings by
      $oexcel=new Spreadsheet::ParseExcel;#version 0.32 from #activestate $oBook=$oexcel->Parse($myfile); for my ($iSheet=0; $iSheet<$oBook->{SheetCount};$iSheet++;} $oWkS=$oBook->{Worksehhet}[$iSheet]; for (my $iR=$oWkS=>{MinRow};defined $oWkS->{MaxRow|&& $iR <=$oWkS->{Ma +xRow};$iR++;){ for(my $iC=$oWkS->{MinCol};defined $oWkS->{MaxCol &&$iC <$oWkS->{MaxCo +l}; $iC++; my $cellvalue2=''; my $oWkC=$oWkS->{Cells}[$iR][$iC]; if($oWk){$cellvalue2=$oWkC->Value; $cellvalue2=$oWkC->{'Val'} if ($cel +lvalue2 eq 'GENERAL';} #and it IS GENERAL ....
      Like I said above, it makes no difference if you try to post-process the $cellvalue2 with  encode("iso-8859-7",$cellvalue2);

        Okay, the part we were missing is that you were reading the data with Spreadsheet::ParseExcel.

        You should be able to read and write from an Excel file without intervention. Here is a working example using the Greek file generated from the example file above:

        #!/usr/bin/perl use strict; use warnings; use Spreadsheet::WriteExcel; use Spreadsheet::ParseExcel; my $parser = Spreadsheet::ParseExcel->new(); my $in_workbook = $parser->Parse('unicode_8859_7.xls'); my $out_workbook = Spreadsheet::WriteExcel->new('newfile.xls'); my $out_worksheet = $out_workbook->add_worksheet(); for my $in_worksheet ( $in_workbook->worksheets() ) { my ( $row_min, $row_max ) = $in_worksheet->row_range(); my ( $col_min, $col_max ) = $in_worksheet->col_range(); for my $row ( $row_min .. $row_max ) { for my $col ( $col_min .. $col_max ) { my $cell = $in_worksheet->get_cell( $row, $col ); next unless $cell; $out_worksheet->write( $row, $col, $cell->value() ); } } }
        The version of Spreadsheet::ParseExcel that you are using, 0.32, is quite old. Try upgrading to the latest, 0.55.

        If that doesn't work you could try specifying an alternative parsing formatter such as S::PE::FmtUnicode or S::PE::FmtJapan (despite the name it also handles general Unicode via Encode):

        ... use Spreadsheet::ParseExcel; use Spreadsheet::ParseExcel::FmtJapan; my $parser = Spreadsheet::ParseExcel->new(); my $formatter = Spreadsheet::ParseExcel::FmtJapan->new(); my $in_workbook = $parser->Parse('unicode_8859_7.xls', $formatter) +; ...
        --
        John.

        $cellvalue2=Encode::decode ('UCS-2BE',$cellvalue2)

        Not sure whether it will solve your problem, but I would have tried 'UCS-2LE'  (I have yet to see a big-endian Windows...)

Re: WriteExcel unicode question
by biohisham (Priest) on Oct 15, 2009 at 09:48 UTC
    What if you try to read both the files in some variable, while you're opening the files you specify the Greek encoding for them in the open argument?
    Something like;
    #this is Japanese... my $file = 'unicode_8859_11.txt'; open FH, '<:encoding(iso-2022-jp)', $file or die "$!\n";
    I know you are dealing with .csv and .xls but I hope my suggestion can offer some light. Excel files are not easy to handle with the Spreadsheet::WriteExcel because they're tab-delimited, a representation of the tab can trickle in the output, this might be why the excel file is giving you troubles, I'd rather convert the file to .csv or convert both files to .txt so handling them becomes easier and such unexpected behaviors kept in check. Make sure you save a copy of the original file.

    In addition check this, they got some links to SpreadSheet examples for dealing with foreign languages in Unicode to the left of the page, it might be of use. I wish you luck...

    Another point, using the OLE module is recommended over Spreadsheet::WriteExcel for more reliability working with Excel on Windows. Update after reply from jmcnamara

    Update:Added a brief idea about Excel tab-delimiters and also about the OLE module


    Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.

      Another point, using the OLE module is recommended over Spreadsheet::WriteExcel for more reliability working with Windows files.

      "Citation needed".

      AFAIK, there aren't any reliability problems with Spreadsheet::WriteExcel. Also, it doesn't work with "Windows files". It creates an Excel file, which is also a file format used on Mac and Linux, on any platform that Perl works on.

      --
      John.

      Same as the unicode_8859_7.pl script: The "cell" I read from csv is always fine. It's the cell from the excel file that gives me problems. Unless I am missing something and now one can open excel files with the ParseExcel module using ...filehandles???
      This was also what I was thinking about, though I did not want to give up that easily: Convert the xls file to csv. I still do not understand why although INTERNALLY the computer looks at two strings and sees them as equal, then come printing, one prints fine and the other as junk.