bulrush has asked for the wisdom of the Perl Monks concerning the following question:

Config

The problem

I tried using Spreadsheet::XLSX::Reader::LibXML to read the spreadsheet, but it takes 2.5 hours to read the whole thing. So I decided to manually convert the XLSX file to a tab-separated file (TSV) and read the TSV instead. Reading the TSV is faster, but it doesn't work either, because some cells contain embedded carriage returns (which Excel allows), and those break the TSV file: a single row in the XLSX file appears as multiple rows in the TSV file. Like this:
col1<tab>col2<tab>col3 with embedded
returns<tab>col4<tab>col5
So when I read the TSV file I read it directly into an array like this:
# Open file here
@csv = <$CSVFILE>;
chomp(@csv);
I simply don't have time to go through 100,000 lines each time I get a new XLSX file to fix broken lines with embedded CRLF.

Questions

  1. Is there a way to speed up reading the XLSX file directly?
  2. Or is there a Linux util that can convert an XLSX file to a TSV while removing embedded carriage returns from each cell?
  3. When reading the XLSX file I'm currently reading each row cell by cell using $cell=$worksheet->get_cell($row,$col) (a rough sketch of this loop is below the list). I didn't see a method to return a whole row at a time.
  4. I'm currently researching how to replace all chr(10) and chr(13) in Excel with blanks. Then I'll convert the XLSX to TSV.
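Roughly, that read loop is something like this (a simplified sketch -- the worksheet lookup, the row_range/col_range bounds, and the ->value call follow the module's ParseExcel-style interface and may differ from my real code; the file and sheet names are placeholders):

use strict;
use warnings;
use Spreadsheet::XLSX::Reader::LibXML;

my $parser    = Spreadsheet::XLSX::Reader::LibXML->new;
my $workbook  = $parser->parse('big_file.xlsx') or die $parser->error;
my $worksheet = $workbook->worksheet('Sheet1');

my ( $row_min, $row_max ) = $worksheet->row_range;
my ( $col_min, $col_max ) = $worksheet->col_range;
for my $row ( $row_min .. $row_max ) {
    for my $col ( $col_min .. $col_max ) {
        my $cell = $worksheet->get_cell( $row, $col );
        next unless defined $cell;
        my $value = $cell->value;    # one method call per cell -- slow over 100,000 rows
        # ...use $value here...
    }
}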
Thank you!

Replies are listed 'Best First'.
Re: Convert XLSX to TSV and remove CRLF in cells
by MidLifeXis (Monsignor) on Jun 16, 2015 at 12:37 UTC

    If I remember correctly, the spreadsheet reading modules load the data into memory by default. The data structure used to do this can be a fairly large hash, depending on the options used. It appears that this particular module unzips the xlsx package into a temporary directory and then parses it into its own internal structure. Depending on the complexity of the formatting, the parameters passed to the parser, the amount of data in each row, and so on, this can be expected behavior. If you do not need any of the formatting, you may want to try the ':just_the_data' flag from Spreadsheet::XLSX::Reader::LibXML, which (at least in my experience, and it may not be yours in this case) should speed things up significantly.
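    A rough sketch of what that could look like, assuming the flag is passed on the use line (see the documentation link in the last reply below for the authoritative form; the file name is a placeholder):

    use strict;
    use warnings;
    # ':just_the_data' is meant to skip formatting work for a faster, data-only read
    use Spreadsheet::XLSX::Reader::LibXML qw( :just_the_data );

    my $parser   = Spreadsheet::XLSX::Reader::LibXML->new;
    my $workbook = $parser->parse('big_file.xlsx') or die $parser->error;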

    If you are getting this twice a month, how much time do you expect to shave off from the task per month, and is that time in an automated process critical? If you (not the machine, you) are spending the 2.5 hours on this task, let's run some numbers (based on xkcd://is it worth the time):

    • Shave off 2 hours per execution, 4 hours per month
    • You will be running this for the next 5 years
    You can spend about 7 days to trim down the time you spend on this task.

    If it is not your time being used, then you really should consider if this automation must run as fast as possible, or if it will be sufficient to spend machine and clock cycles to remove the manual process. Typically your time is worth much more than the computer's time.

    Update: Also, profile the various parts of your application. Where is the hangup? Compilation, reading, converting the data, writing, memory allocation, disk I/O? Unless you know what is taking the most time, you are just throwing a handful of parts into the machine and hoping it fixes itself.

    --MidLifeXis

Re: Convert XLSX to TSV and remove CRLF in cells
by RichardK (Parson) on Jun 16, 2015 at 14:29 UTC

    It might be worth having a look at Text::CSV and/or Text::CSV_XS to convert your TSV/CSV. They say that they can cope with embedded newlines in fields -- how to do it is explained in the help.
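    For example, here is a sketch of a TSV clean-up pass with Text::CSV_XS. It assumes the exported file quotes fields that contain line breaks (as Excel does when saving tab-delimited text), and the file name is a placeholder:

    use strict;
    use warnings;
    use Text::CSV_XS;

    # binary => 1 lets fields carry embedded CR/LF; sep_char switches to tabs
    my $tsv = Text::CSV_XS->new( { sep_char => "\t", binary => 1, auto_diag => 1 } );

    open my $in, '<', 'data.tsv' or die "data.tsv: $!";
    while ( my $row = $tsv->getline($in) ) {
        s/[\r\n]+/ /g for @$row;             # flatten embedded line breaks
        print join( "\t", @$row ), "\n";     # one clean physical line per record
    }
    close $in;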

Re: Convert XLSX to TSV and remove CRLF in cells
by pme (Monsignor) on Jun 16, 2015 at 11:52 UTC
    The embedded carriage returns can be replaced with '\n' using this simple script on the Red Hat box:
    use strict;
    use warnings;

    while (<>) {
        chomp;
        if (/^M$/) {        # line still ends in a carriage return: a real row boundary
            print "$_\n";
        }
        else {              # the chomped newline was an embedded break inside a cell
            print "$_\\n";  # re-join with the next line, writing a literal '\n' marker
        }
    }
    '^M' is a single character, entered by pressing 'ctrl-v enter'.

    Update: The direct conversion does not seem to be hopeless. You can simply omit the $converter if you do not need encoding conversion. I created an xlsx file with ~1,000,000 rows and only two columns (file size ~10 MB), and it was converted to CSV in 140 seconds.
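    For example, using Spreadsheet::XLSX (a module whose constructor takes an optional Text::Iconv converter, which can be omitted as noted above), a direct XLSX-to-TSV conversion that also strips embedded line breaks could be sketched like this (first worksheet only; the file name is a placeholder):

    use strict;
    use warnings;
    use Spreadsheet::XLSX;

    my $excel = Spreadsheet::XLSX->new('input.xlsx');    # no converter needed here
    my ($sheet) = @{ $excel->{Worksheet} };              # first worksheet only

    $sheet->{MaxRow} ||= $sheet->{MinRow};
    $sheet->{MaxCol} ||= $sheet->{MinCol};
    for my $row ( $sheet->{MinRow} .. $sheet->{MaxRow} ) {
        my @fields;
        for my $col ( $sheet->{MinCol} .. $sheet->{MaxCol} ) {
            my $cell = $sheet->{Cells}[$row][$col];
            my $val  = defined $cell ? $cell->{Val} : '';
            $val =~ s/[\r\n]+/ /g;                       # strip embedded CR/LF
            push @fields, $val;
        }
        print join( "\t", @fields ), "\n";
    }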

      '^M' is a single character, entered by pressing 'ctrl-v enter'
      in some editors. In others it may end up displaying as some sort of a line feed. [emphasis added]

      A better, more portable way of encoding this is \015, \o{015}, \cM, \x0d, or some other encoded form that won't potentially be messed up by an editor, printer, code pretty-printer, ….

      --MidLifeXis

Re: Convert XLSX to TSV and remove CRLF in cells
by planetscape (Chancellor) on Jun 17, 2015 at 15:18 UTC
Re: Convert XLSX to TSV and remove CRLF in cells
by jandrew (Chaplain) on Jun 17, 2015 at 18:10 UTC

    As the author of Spreadsheet::XLSX::Reader::LibXML I can confirm that the package was not written to be the fastest (skip to the last paragraph for those packages). I would, however, like to clarify an element of your initial question: you list Text::Iconv as one of the modules that you use, but that is an option for Spreadsheet::XLSX and not for Spreadsheet::XLSX::Reader::LibXML. Which spreadsheet parser are you actually using?

    MidLifeXis++ (xkcd++) already gave you the simple optimization: set group_return_type to 'unformatted' (fastest) or 'value' (a little less fast).

    You mentioned a desire to get a whole row (#3). Use the fetchrow_arrayref method if you wish. The 'fetchrow_array' and 'fetchrow_hashref' methods are also documented in the Worksheet pod.
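    A rough, untested sketch of a row-at-a-time read that combines the two suggestions above (passing group_return_type to new() and the placeholder file/sheet names are assumptions on my part -- check the Worksheet pod for the exact forms):

    use strict;
    use warnings;
    use Spreadsheet::XLSX::Reader::LibXML;

    my $parser = Spreadsheet::XLSX::Reader::LibXML->new(
        group_return_type => 'unformatted',    # fastest, as noted above
    );
    my $workbook  = $parser->parse('big_file.xlsx') or die $parser->error;
    my $worksheet = $workbook->worksheet('Sheet1');

    while ( my $row_ref = $worksheet->fetchrow_arrayref ) {
        # $row_ref is an array ref of the cell values for one row
        print join( "\t", map { defined $_ ? $_ : '' } @$row_ref ), "\n";
    }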

    Even with those elements this parser may still not be as fast as you want. I am always interested in improving my parser. My preference for improvement requests is for you to open an issue in my github repo so you can attach any files you are allowed or are willing to share for testing. I would be happy to see if there are speed optimizations available.

    With all that said, Spreadsheet::XLSX and Spreadsheet::ParseXLSX are both faster XLSX parsers by design.

    update:s/ParseExcel/ParseXLSX/

      Please do not promote Spreadsheet::XLSX (I deliberately omit the link so it is harder to click). It is deprecated and buggy. DO use Spreadsheet::ParseXLSX!


      Enjoy, Have FUN! H.Merijn

        Tux, I agree that Spreadsheet::XLSX is buggy. However, it is my experience that XML::Twig segfaults due to a perl bug in Windows perls prior to 5.15. Since Spreadsheet::ParseXLSX is built on XML::Twig, that makes both of these packages buggy for a certain population of users. (Which is partly why I wrote my package on XML::LibXML.)

        On the other hand I have run into a lot of implementations of both of these packages where people are quite happy with them. Additionally for small spreadsheets where you are only extracting data and not formats, Spreadsheet::XLSX tends to be faster.

        update: I think Spreadsheet::ParseXLSX also fails to open Excel files that contain dedicated chartsheets (not worksheets). The XML::Twig RT queue is a bit daunting, and the current release has open failures on CPAN Testers. Otherwise I agree that Spreadsheet::ParseXLSX is a really great module.
Re: Convert XLSX to TSV and remove CRLF in cells
by jandrew (Chaplain) on Jun 30, 2015 at 02:26 UTC

    For archival purposes I am posting to let interested parties know that Spreadsheet::XLSX::Reader::LibXML v0.38.4 was posted to CPAN with caching for the shared strings file, which appears to be the worst culprit for speed hits on large files containing lots of text. Caching defaults to on but can be turned back off with the cache_positions attribute. I continue to welcome any specific feedback or test files for optimization on github.
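    To turn it back off (for example, to compare timings), a small sketch, assuming the attribute is accepted by new() like the module's other attributes:

    use strict;
    use warnings;
    use Spreadsheet::XLSX::Reader::LibXML;

    # cache_positions defaults to on in v0.38.4+; 0 disables the shared-strings caching
    my $parser = Spreadsheet::XLSX::Reader::LibXML->new( cache_positions => 0 );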

    And here is a link to the documentation for the implementation of the :just_the_data flag mentioned by MidLifeXis++