apu has asked for the wisdom of the Perl Monks concerning the following question:

My input file is an Excel workbook (one worksheet) with names, phone numbers, etc. I'm trying to process this into a CSV file.

xls2csv keeps the high ASCII characters (Nicolás), so I know this is possible, but I need to manipulate the data before generating the CSV (for example: if Preferred Name is provided, use that, otherwise use First Name, or ignore some columns). However, if I process the XLSX file myself, I'm loosing the high ASCII characters.

Comparing my script to xls2csv, I'm at a loss for what I missed. Can a Monk help?

#!/usr/bin/perl use Spreadsheet::Read qw(ReadData); my $inputFile = 'foo.xlsx'; use Text::CSV; my $outputFile = 'bar.csv'; my $csv = Text::CSV->new ( { binary => 1, eol => "\n" } ) or die "Cannot use CSV: ".Text::CSV->error_diag (); my @writeRows; my $book = ReadData ($inputFile); my @readRows = Spreadsheet::Read::rows($book->[1]); foreach my $i (1 .. scalar @readRows) { my @thisRow; my ($id, $last, $first, $pref) = @{$readRows[$i-1]}; if ($pref) { push @thisRow, $pref; } else { push @thisRow, $first; } push @thisRow, $last, "Static", "Text"; push @writeRows, [@thisRow]; } open $fh, ">:encoding(utf-8)", $outputFile or die "$outputFile: $!"; $csv->print ($fh, $_) for @writeRows; close $fh or die "$outputFile: $!"; exit;

Replies are listed 'Best First'.
Re: XLSX to CSV with high ASCII characters
by poj (Abbot) on Aug 26, 2017 at 18:44 UTC

    Does this work ?

    #!/usr/bin/perl use strict; use Text::CSV; my $outputFile = 'bar.csv'; my $csv = Text::CSV->new ( { binary => 1, eol => "\n" } ) or die "Cannot use CSV: ".Text::CSV->error_diag (); open my $fh, ">:encoding(utf-8)", $outputFile or die "$outputFile: $!"; $csv->print ($fh, ['Nicolás',123]); close $fh or die "$outputFile: $!";
    poj
      That outputs
      Nicolás,123

        It that what it shows in Excel ?

        poj
Re: XLSX to CSV with high ASCII characters
by 1nickt (Canon) on Aug 27, 2017 at 18:18 UTC

    Hi, I found a little more time to look at your problem. I was recently dealing with a related issue.

    There's a well-known bug in MS Excel's handling of character encodings in CSV files. You can read all about in on StackOverflow and elsewhere. The short answer is that you need to encode your CSV file not as UTF-8 but as UTF-16LE in order for Excel to recognize it properly.

    Here is a cleaned-up version of your script. For input I used an .xlsx file containing the data you showed below. This creates a CSV file encoded in UTF-16, so you won't be able to read it as a text file in your terminal if the terminal is set to UTF-8. But Excel will read the file and display the contents correctly.

    use strict; use warnings; use autodie; use Spreadsheet::Read qw/ ReadData rows /; use Text::CSV; my $outputFile = '1198074.csv'; my $inputFile = '1198074.xlsx'; my $book = ReadData( $inputFile ); my @writeRows; for my $row ( rows( $book->[1] ) ) { my ( $id, $last, $first, $pref ) = @{ $row }; push @writeRows, [ $pref || $first, $last, 'Static', 'Text' ]; } open my $fh, '>:encoding(UTF-16LE)', $outputFile; my $csv = Text::CSV->new( { binary => 1, eol => "\n", auto_diag => 1 } + ); $csv->print( $fh, $_ ) for @writeRows; close $fh; __END__

    Hope this helps!


    The way forward always starts with a minimal test.

      Thanks, 1nickt. Thinking this through to see if I can use that new info to my benefit. If not, its definitely useful info for the future. That said, technically Excel is not involved. The source XLSX file comes from one website. I have to process it into a CSV file for upload to a different website. So the UTF-8 CSV is fine. I just can't figure out why the encoding in my output file gets munged when the same Perl modules produce good UTF-8 output in another use case.

      If I have to, I'll just pre-process all the XLSX files with xls2csv and then use the CSV files as the source data from my script. I was just trying to avoid that extra step. Since xls2csv is just a Perl script itself, I should be able to do all the work inside one script.

        Just glancing at xls2csv, I see that it uses Spreadsheet::ParseExcel. Your script uses Spreadsheet::Read, which uses Spreadsheet::ParseXLSX, which is a front end for Spreadsheet::ParseExcel. Have you tried using ParseExcel directly without all the middlemen, like xls2csv does?

        Hi again, I'm a bit lost now, or maybe you are.

        When you say "the UTF-8 CSV is fine" you mean, I assume, that you can open the CSV file in a text editor and the characters display correctly. But as I understood it your goal was to open the CSV file in Excel and that was failing to display the characters correctly. The script I provided does that -- takes input in UTF-8, converts it to Perl internal, works on it, and outputs it in UTF-16 so Excel will display it right.

        If you want something else, I've missed it.

        Of course, you could always write out an .xls or .xlsx file in UTF-8 instead of CSV in UTF-16, if it's going to be used in Excel anyway, and avoid the encoding issue.


        The way forward always starts with a minimal test.
No such thing as high ASCII characters
by Your Mother (Archbishop) on Aug 27, 2017 at 20:51 UTC

    I don’t have help, looks like you got some that might work for you. I do have a couple of observations. The xls2csv is not using ReadData. Perhaps you should manually read.

    There is no such thing as high ASCII. It’s ASCII or not and ASCII has no support for diacritics. So you’re almost certainly working with Latin-1, CP-1252, UTF-8, or perhaps should be trying to use UTF-16LE as 1nickt suggests.

    All that stuff about “plain text = ascii = characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

      The xls2csv the OP refers to either executes xlscat (which uses ReadData) or uses ReadData directly when using the -a option.

      Maybe the OP just wants the --utf-8 (or -U) option to either of the two tools. (xls2csv passes on the options to xlscat)


      Enjoy, Have FUN! H.Merijn
Re: XLSX to CSV with high ASCII characters
by Anonymous Monk on Aug 27, 2017 at 00:56 UTC
    If you're reading binary you should be writing binary ; you dont encode binary; you dont encode output without decoding input first

      I appreciate that. I'm just not sure what I'm missing. Doesn't Spreadsheet::ReadData take care of the binary read? What I'm missing eludes me as I continue to compare my code to xls2csv which does read and write the accented characters properly.