merrymonk has asked for the wisdom of the Perl Monks concerning the following question:

I am processing an Excel spreadsheet which has some cells with Chinese characters in them.

The lines below show what is written to an MSDOS screen.

Wide character in print at ....system_lib.pm line 2790, <CFGIN> line 162.
Õ?ÅÕÅÀ
Wide character in print at ....system_lib.pm line 2790, <CFGIN> line 162.
XÕÅûµò?õ©¡Úù?ÕÇ?
Wide character in print at ....system_lib.pm line 2790, <CFGIN> line 162.
YÕÅûµò?õ©¡Úù?ÕÇ?
Wide character in print at ....system_lib.pm line 2790, <CFGIN> line 162.
XÕ?ÇÕ¡ö
Wide character in print at ....system_lib.pm line 2790, <CFGIN> line 162.
YÕ?ÇÕ¡ö
Wide character in print at ....system_lib.pm line 2790, <CFGIN> line 162.
Þí¿ÚØóÕñäþÉå

I want to test the cell data to see if there are any Chinese characters in the cell.

What is the best way of doing this?

Re: Testing for Chinese Characters
by graff (Chancellor) on Jun 15, 2016 at 21:35 UTC
    Here's a command-line script I posted a long time ago: xls2tsv -- it's so old it still uses Spreadsheet::ParseExcel (i.e. it assumes the old "xls" format rather than "xlsx"), but apparently, you are already using a module that handles your particular Excel spreadsheets, so the basic point that is relevant here is:
    use Spreadsheet::ParseExcel;
    use Encode qw( decode );

    my $xl = Spreadsheet::ParseExcel->new;  # or whatever module/version works
    my $wb = $xl->Parse( $filepath ) or die "$filepath: $!\n";
    for my $sheet ( @{$wb->{Worksheet}} ) {
        $sheet->{MaxRow} ||= $sheet->{MinRow};
        for my $row ( $sheet->{MinRow} .. $sheet->{MaxRow} ) {
            $sheet->{MaxCol} ||= $sheet->{MinCol};
            for my $col ( $sheet->{MinCol} .. $sheet->{MaxCol} ) {
                my $cell = $sheet->{Cells}[$row][$col];
                my $val  = $cell->{Val};
                if ( $cell->{Code} eq 'ucs2' ) {
                    # ucs2 cells hold raw UTF-16BE bytes; decode them into characters
                    $val = decode( "UTF-16BE", $val );
                    if ( $val =~ /\p{Han}/ ) {
                        # this cell contains Chinese characters
                    }
                    # NB: there may be non-ASCII Unicode characters that are not Chinese
                }
            }
        }
    }
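
    The \p{Han} test above can be tried on its own, without a spreadsheet. The following is only a minimal sketch: has_han is a hypothetical helper name, the sample strings are made up for illustration, and it assumes the cell value has already been decoded into Perl characters. Note that \p{Han} matches the Han script, which is also used for Japanese kanji and Korean hanja, so a match means "contains Han characters" rather than strictly "is Chinese".

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the decoded string contains
# at least one character in the Han script, 0 otherwise.
sub has_han {
    my ($text) = @_;
    return 0 unless defined $text;
    return $text =~ /\p{Han}/ ? 1 : 0;
}

# Sample strings use \x{...} escapes so this file itself stays plain ASCII.
my @samples = (
    [ 'ASCII only'     => 'Total' ],
    [ 'Latin accented' => "caf\x{E9}" ],
    [ 'Hiragana'       => "\x{3042}\x{3044}" ],
    [ 'Han'            => "\x{4E2D}\x{6587}" ],
    [ 'mixed'          => "X\x{5750}\x{6807}" ],
);

# Only ASCII labels are printed, so no output encoding layer is needed here.
for my $sample (@samples) {
    my ( $label, $text ) = @$sample;
    printf "%-15s %s\n", $label, has_han($text) ? 'has Han' : 'no Han';
}
```

    Only the last two samples should be reported as containing Han characters; the accented Latin and Hiragana strings are non-ASCII but not Han.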
Re: Testing for Chinese Characters
by ikegami (Patriarch) on Jun 15, 2016 at 19:58 UTC

    First of all, what you call "MSDOS screen" is actually called a "Windows console", possibly running "the Windows command shell".


    The message you are getting indicates that you are printing non-bytes (characters with code points above 255) to a handle that expects bytes. In this case, you are printing decoded text without telling Perl to encode it. You can use the following to do that:

    use Win32 qw( );

    BEGIN {
        binmode(STDIN,  ':encoding(cp' . Win32::GetConsoleCP()       . ')');
        binmode(STDOUT, ':encoding(cp' . Win32::GetConsoleOutputCP() . ')');
        binmode(STDERR, ':encoding(cp' . Win32::GetConsoleOutputCP() . ')');
    }
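
    To see what an :encoding layer does without involving the console at all, here is a rough sketch that prints decoded text to an in-memory handle and inspects the resulting bytes. The choice of UTF-8 and the variable names are assumptions made for illustration; on a real console the layer would name the console's code page as in the snippet above.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Decoded text: two characters, both with code points above 255.
my $text = "\x{4E2D}\x{6587}";

# In-memory file handle with an encoding layer attached.
# print() hands characters to the layer, which writes encoded bytes.
my $buffer = '';
open( my $fh, '>:encoding(UTF-8)', \$buffer ) or die "open: $!";
print {$fh} $text;
close($fh) or die "close: $!";

# Show the bytes as hex so the output stays ASCII.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $buffer;
print "characters: ", length($text), "\n";
print "bytes:      ", length($buffer), " ($hex)\n";

# The layer produces the same bytes as an explicit encode() call.
print "same as encode(): ", ( $buffer eq encode( 'UTF-8', $text ) ? 'yes' : 'no' ), "\n";
```

    Without the layer, Perl has no instruction for turning those two characters into bytes, which is what the "Wide character in print" warning is reporting.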

    That said, I have doubts about the ability of your console to display Chinese characters. You might need to switch the console's code page to 65001 (by using chcp 65001) and change the console's font. That's not without issues.