in reply to Re: XLSX to CSV with high ASCII characters
in thread XLSX to CSV with high ASCII characters

Thanks, 1nickt. Thinking this through to see if I can use that new info to my benefit. If not, its definitely useful info for the future. That said, technically Excel is not involved. The source XLSX file comes from one website. I have to process it into a CSV file for upload to a different website. So the UTF-8 CSV is fine. I just can't figure out why the encoding in my output file gets munged when the same Perl modules produce good UTF-8 output in another use case.

If I have to, I'll just pre-process all the XLSX files with xls2csv and then use the CSV files as the source data from my script. I was just trying to avoid that extra step. Since xls2csv is just a Perl script itself, I should be able to do all the work inside one script.

  • Comment on Re^2: XLSX to CSV with high ASCII characters

Replies are listed 'Best First'.
Re^3: XLSX to CSV with high ASCII characters
by Anonymous Monk on Aug 27, 2017 at 21:49 UTC
    Just glancing at xls2csv, I see that it uses Spreadsheet::ParseExcel. Your script uses Spreadsheet::Read, which uses Spreadsheet::ParseXLSX, which is a front end for Spreadsheet::ParseExcel. Have you tried using ParseExcel directly without all the middlemen, like xls2csv does?

      I have not tried the other modules but I'm willing to give it a try. For what it's worth, though, the xls2csv I have does use Spreadsheet::Read. I believe it was installed when Spreadsheet::Read was installed via CPAN. (It uses Text::CSV_XS instead of Text::CSV and \r\n instead of \n but I did try those already with no change in my output.)

      # xls2csv: Convert Microsoft Excel spreadsheet to CSV # (m)'17 [07-06-2017] Copyright H.M.Brand 2008-2017 [...] our $VERSION = "3.3"; [...] use Text::CSV_XS; use Spreadsheet::Read qw( ReadData row );
        Ok, so you're running the example script that comes with Spreadsheet::Read, not xls2csv. Now, are you using the --all option? Without that option, xls2csv is simply a front end for xlscat.
Re^3: XLSX to CSV with high ASCII characters
by 1nickt (Canon) on Aug 27, 2017 at 22:30 UTC

    Hi again, I'm a bit lost now, or maybe you are.

    When you say "the UTF-8 CSV is fine" you mean, I assume, that you can open the CSV file in a text editor and the characters display correctly. But as I understood it your goal was to open the CSV file in Excel and that was failing to display the characters correctly. The script I provided does that -- takes input in UTF-8, converts it to Perl internal, works on it, and outputs it in UTF-16 so Excel will display it right.

    If you want something else, I've missed it.

    Of course, you could always write out an .xls or .xlsx file in UTF-8 instead of CSV in UTF-16, if it's going to be used in Excel anyway, and avoid the encoding issue.


    The way forward always starts with a minimal test.
      Sorry for the misunderstandings.
      When you say "the UTF-8 CSV is fine" you mean, I assume, that you can open the CSV file in a text editor and the characters display correctly.

      The CSV file from xls2csv looks fine, whether I just "cat" it in a terminal window, open it in a text editor or open it in Excel. The CSV from my script comes out with the wrong encodings in all three situations. The reference to opening the output in Excel, in reply to poj's sample script, was only to see if another program (besides 'cat' or a text editor) could make sense of the data.

      The source data is an Excel workbook or, more precisely, a Microsoft Excel Open XML Format Spreadsheet (XLSX) file created by a third-party website. I doubt an actual instance of Excel is being used since the file is dynamically created by a database query. After I process it, my output CSV is going to be uploaded to a different third-party website. There is no Microsoft Excel processing the output. The goal is to eliminate humans opening the spreadsheet in Excel and manipulating it at all.

      Aside: I unzip'ed the .xlsx file and looked at the raw XML. Everything in the source file is listed as UTF-8 encoding. So, I'm starting with UTF-8 and trying to end with UTF-8. (Or utf8 -- I tried both for my output as there is a difference.)