AALB has asked for the wisdom of the Perl Monks concerning the following question:

I'm a BI developer working with perl scripts as my ETL - I receive data over email, take the file, parse it and push it into the DB. Most of the files are CSV, but occasionally I have an XLSX file.

I've been using Spreadsheet::XLSX to convert, but I've noticed that the CSV output comes out with the wrong encoding (needs to be UTF8, because accents and foreign languages).

That's the sub I'm using ($input_file is an Excel file), but I keep getting the data with the wrong characters.

WHAT am I missing?
Thanks a lot all!

sub convert_to_csv { my $input_file = $_[0]; my ( $filename, $extension ) = split( '\.', $input_file ); open( format_file, ">:**encoding(utf-8)**", "$filename.csv" ) or d +ie "could not open out file $!\n"; my $excel = Spreadsheet::XLSX->new($input_file); my $line; foreach my $sheet ( @{ $excel->{Worksheet} } ) { #printf( "Sheet: %s\n", $sheet->{Name} ); $sheet->{MaxRow} ||= $sheet->{MinRow}; foreach my $row ( $sheet->{MinRow} .. $sheet->{MaxRow} ) { $sheet->{MaxCol} ||= $sheet->{MinCol}; foreach my $col ( $sheet->{MinCol} .. $sheet->{MaxCol} ) { my $cell = $sheet->{Cells}[$row][$col]; if ($cell) { my $trimcell; $trimcell = $cell->value(); print STDERR "cell: $trimcell\n"; ## Just for the +tests so I don't have to open the file to see if it's ok $trimcell =~ s/^\s+|\s+$//g; ## Just to make sure + I don't have extra spaces $line .= "\"" . $trimcell . "\","; } } chomp($line); if ($line =~ /Grand Total/){} ##customized for the files else { print format_file "$line\n"; $line = ''; } } } close format_file; }

Replies are listed 'Best First'.
Re: Converting XLSX to CSV with Perl while maintaining the encoding
by haukex (Archbishop) on Feb 25, 2020 at 08:05 UTC
    I've been using Spreadsheet::XLSX to convert

    Note that according to Spreadsheet::Read, that module is strongly discouraged in favor of Spreadsheet::ParseXLSX. I'd suggest you switch to Spreadsheet::Read, it also includes an example xls2csv and xlsx2csv.

    In this case, at least on a test file I have, the issue is that the strings returned from $cell->value() are undecoded UTF-8, and the quick fix is to add utf8::decode($trimcell); right after my $trimcell = $cell->value();.

Re: Converting XLSX to CSV with Perl while maintaining the encoding
by Tux (Canon) on Feb 25, 2020 at 13:39 UTC

    Other than the already given advice to use Spreadsheet::ParseXLSX, your code is wrong in that it generated invalid CSV if any of your cells contains " characters.

    Consider using Text::CSV_XS and/or Text::CSV for the generation of valid CSV.

    Even if your current speadsheeft does not hit this problem, your approach is likely to generate invalid CSV in other cases too.


    Enjoy, Have FUN! H.Merijn
Re: Converting XLSX to CSV with Perl while maintaining the encoding
by IB2017 (Pilgrim) on Feb 25, 2020 at 09:52 UTC

    As an alternative, I use Spreadsheet::ParseXLSX with no problem with UTF-8, using simply open(my $fh, '>:encoding(UTF-8)', $file).