JackVanRamars has asked for the wisdom of the Perl Monks concerning the following question:

In a simple script I open a MS Excel table with Slovenian characters and read them into an array. When I write text the Slovenian characters are uncorrect (c with caron turns into e with grave, other two become little outlined square).

I read data (Win32::OLE) from different Excel tables writing them into a single table comparing them and writing (Spreadsheet::WriteExcel) data comparision statuses in different colors and other attributes.

Characters from excel table are correctly read into array. When I open output table they are uncorrect.

The same characters are written correctly into output table if I define them as strings within script.

Windows XP use the cp1250, the characters are 0xc8, 0xe8, 0x8a, 0x9a, 0x8e and 0x9e (c, s, z letters with caron).

I'll appreciate any help.

Thanks.

Replies are listed 'Best First'.
Re: Slovenian characters problem
by jmcnamara (Monsignor) on Aug 12, 2010 at 15:02 UTC

    Spreadsheet::WriteExcel requires Unicode characters to be in Perl's UTF8 format. So you will need to convert them from the encoding that you are reading them in to UTF8 (perhaps using Encode) before writing them back to a WriteExcel file.

    I don't have an example for cp1250 but here is an example with cp1251.

    See also the Unicode section of the Spreadsheet::WriteExcel docs.

    --
    John.

Re: Slovenian characters problem
by graff (Chancellor) on Aug 13, 2010 at 01:09 UTC
    It sounds like most of the tools on your system are assuming a single-byte encoding (cp1250) for Slovenian text. If your perl script is somehow turning your spreadsheet data to utf8 strings, you would need to make sure it gets converted back to cp1250 so that the data will be coherent in the other tools that use (and display) the data.

    You haven't shown any code, so I'm not sure what to recommend, but somewhere you're likely going to want to do something like this:

    use Encode; ... # assuming $data is a scalar containing a utf8 string: $data = encode( "cp1250", $data ); # now it contains a cp1250 string # (utf8 flag is off, and characters are single-byte)
    (updated to show that you have to "use Encode" in order to apply the "decode()" function)

    ... or maybe something like this:

    open( OUT, ">:encoding(cp1250)", "output.txt" ) or die "output.txt: $! +\n"; # assuming $data contains a utf8 string: print OUT $data; # the string gets converted to cp1250 before being written to the file # (no need to "use Encode" in this case)
      Thanks for your hint, unfortunatelly my interpretation of it wasn't succesful.

      I apologize, I'm new to this editor so the code was "lost".

      #!/usr/bin/perl
      use strict;
      use Pod::Usage;
      use Getopt::Std;
      use Win32::OLE;
      use Spreadsheet::WriteExcel;
      use encoding "cp1250";
      use Encode;

      my $Excel;
      my $Book;
      my $Sheet;
      my @Test;
      my $Test;

      @Test=encode("cp1250",@Test);

      #Excel file FileToRead.xls contains the following Slovenian characters:
      # Line 1 Column 1: Čč (capital and small letter C with caron)
      # Line 1 Column 2: Šš (capital and small letter S with caron)
      # Line 1 Column 3: Žž (capital and small letter Z with caron)

      $Excel=Win32::OLE->GetActiveObject('Excel.Application') || Win32::OLE->new('Excel.Application', 'Quit');
      $Book=$Excel->Workbooks->Open('c:\batch\Pisarna\Clouseau\FileToRead.xls');
      $Sheet=$Book->Worksheets(1);

      # characters read into @Test are correct push(@Test,{
      Col1=>$Sheet->Cells(1,1)->{'Value'},
      Col2=>$Sheet->Cells(1,2)->{'Value'},
      Col3=>$Sheet->Cells(1,3)->{'Value'},
      });

      $Book->Close;
      $Excel->Close;

      my $BookOut=Spreadsheet::WriteExcel->new('c:\batch\Pisarna\Clouseau\FileWritten.xls');
      my $SheetOut=$BookOut->add_worksheet('test');

      # writing Slovenian characters directly
      $SheetOut->write(0,0,"Čč");
      $SheetOut->write(0,1,"Šš");
      $SheetOut->write(0,2,"Žž");

      # read from xls file through @Test
      $SheetOut->write(1,0,"$Test1->{Col1}");
      $SheetOut->write(1,1,$Test1->{Col2});
      $SheetOut->write(1,2,$Test1->{Col3});

      # strings defined in above lines 41, 42 and 43 are written into FileWritten.xls correct:
      # Line 1 Column 1: Čč (capital and small letter C with caron)
      # Line 1 Column 2: Šš (capital and small letter S with caron)
      # Line 1 Column 3: Žž (capital and small letter Z with caron)

      # strings read into @Test are written into FileWritten.xls incorrect
      # Line 2 Column 1: capital and small letter C with grave (can't be written)
      # Line 2 Column 2: small outlined square (can't be written)
      # Line 2 Column 3: small outlined square (can't be written)

      _END_

      In short, Slovenian characters written directly are correct but those read from xls file and written through @Test aren't correct. Something weird happens to them on their way from array through Spreadsheet::WriteExcel.

      (I work in Windows XP Professional version 2002, I use Perl 5.8). Thanks for your hint.
        When you post perl code at the Monastery, put <code> at the beginning, and put </code> at the end; this will "do the right thing" to make your code look right in the post, without having to do anything else special to the text of the perl code itself. This is explained at Markup in the Monastery (and this link is provided on the page where you submit your post).

        Regarding your code, this first use of "decode()" does nothing, because there is nothing in the array yet:

        ... my @Test; my $Test; @Test=encode("cp1250",@Test); ...
        So you must have misunderstood what I meant. Anyway, since your script includes use encoding "cp1250"; I gather that you wrote your script with a text editor that saves the file in that encoding. That should be fine, but it means that the quoted strings with accented characters are being treated internally in perl as utf8 strings (because that's what use encoding is supposed to do -- read the output of perldoc encoding).

        So if you want these characters to be stored in the Excel file as cp1250 characters, I think you need to do your "write" calls like this:

        use Encode; ... $SheetOut->write(0,0, encode( "cp1250", "&#268;&#269;" ); $SheetOut->write(0,1, encode( "cp1250", "Šš" ); $SheetOut->write(0,2, encode( "cp1250", "Žž" ); ...
        What happens in that case is: (1) your text editor saves the script as a cp1250-encoded text file, (2) when perl.exe reads the script to execute it, it sees use encoding "cp1250" and converts the special characters to its normal internal utf8 encoding (so that "character semantics" will work in the normal way), (3) then when those "write() functions are called, the Encode::encode function turns the utf8 strings back into cp1250 for storage in the Excel file.

        At least, I think that's what should happen. Give it a try.

        (update: the snippet I posted above is showing numeric character entities for some of the characters -- that was not intentional, but I'm not going to try to fix it -- you know which characters are supposed to be there.)