in reply to Re^2: XLSX to CSV with high ASCII characters
in thread XLSX to CSV with high ASCII characters

Hi again, I'm a bit lost now, or maybe you are.

When you say "the UTF-8 CSV is fine" you mean, I assume, that you can open the CSV file in a text editor and the characters display correctly. But as I understood it, your goal was to open the CSV file in Excel, and that was failing to display the characters correctly. The script I provided handles that: it reads the input as UTF-8, decodes it into Perl's internal character strings, works on it, and writes the output as UTF-16 so Excel will display it correctly.
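
For readers following along, the decode/process/encode flow described above might look roughly like the sketch below. This is not the script from the earlier reply; the helper name utf8_to_utf16le_with_bom is made up for illustration, and the choice of little-endian UTF-16 with a leading byte order mark is an assumption about what Excel expects.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the original script): takes a string of
# UTF-8 bytes and returns UTF-16LE bytes with a leading byte order mark.
sub utf8_to_utf16le_with_bom {
    my ($utf8_bytes) = @_;

    # Decode the raw bytes into Perl's internal character string.
    my $chars = decode('UTF-8', $utf8_bytes);

    # ... any processing on characters would happen here ...

    # Encode for output; U+FEFF at the front becomes the BOM bytes.
    return encode('UTF-16LE', "\x{FEFF}" . $chars);
}

my $in  = "caf\xC3\xA9\n";    # "cafe" with an acute accent, as UTF-8 bytes
my $out = utf8_to_utf16le_with_bom($in);
printf "%d bytes in, %d bytes out\n", length $in, length $out;
```

The point of the sketch is only that decoding and encoding happen at the edges, and everything in between works on characters rather than bytes.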

If you want something else, I've missed it.

Of course, you could always write out an .xls or .xlsx file in UTF-8 instead of CSV in UTF-16, if it's going to be used in Excel anyway, and avoid the encoding issue.


The way forward always starts with a minimal test.

Re^4: XLSX to CSV with high ASCII characters
by apu (Sexton) on Aug 27, 2017 at 23:21 UTC
    Sorry for the misunderstandings.
    When you say "the UTF-8 CSV is fine" you mean, I assume, that you can open the CSV file in a text editor and the characters display correctly.

    The CSV file from xls2csv looks fine, whether I just "cat" it in a terminal window, open it in a text editor or open it in Excel. The CSV from my script comes out with the wrong encodings in all three situations. The reference to opening the output in Excel, in reply to poj's sample script, was only to see if another program (besides 'cat' or a text editor) could make sense of the data.

    The source data is an Excel workbook or, more precisely, a Microsoft Excel Open XML Format Spreadsheet (XLSX) file created by a third-party website. I doubt an actual instance of Excel is being used since the file is dynamically created by a database query. After I process it, my output CSV is going to be uploaded to a different third-party website. There is no Microsoft Excel processing the output. The goal is to eliminate humans opening the spreadsheet in Excel and manipulating it at all.

    Aside: I unzipped the .xlsx file and looked at the raw XML. Everything in the source file is declared as UTF-8 encoded. So, I'm starting with UTF-8 and trying to end with UTF-8. (Or utf8 -- I tried both encoding names for my output, since Perl treats them differently.)
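
On the UTF-8 versus utf8 point, a small sketch may help show where the two names differ and where they do not. It assumes only the core Encode module; the roundtrip helper and the sample string are invented for illustration and are not taken from the script under discussion. In Encode, 'UTF-8' names the strict encoding and 'utf8' the lax one, but for well-formed text both should produce the same bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode find_encoding);

# Hypothetical helper: decode UTF-8 bytes, then re-encode under the given
# Encode name ('UTF-8' is the strict variant, 'utf8' the lax one).
sub roundtrip {
    my ($bytes, $encoding_name) = @_;
    my $chars = decode('UTF-8', $bytes);
    return encode($encoding_name, $chars);
}

my $sample = "na\xC3\xAFve";    # "naive" with a diaeresis, as UTF-8 bytes

for my $name ('UTF-8', 'utf8') {
    # Ask Encode which canonical encoding each name resolves to.
    my $canonical = find_encoding($name)->name;
    my $status = roundtrip($sample, $name) eq $sample ? 'unchanged' : 'changed';
    print "$name resolves to $canonical; sample bytes $status\n";
}
```

If the bytes survive a round trip like this but the real output is still garbled, the choice between the two names is probably not the cause, and the problem more likely lies in a missing or doubled encoding step elsewhere.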