Re: Convert XLSX to TSV and remove CRLF in cells
by MidLifeXis (Monsignor) on Jun 16, 2015 at 12:37 UTC
|
If I remember correctly, the spreadsheet reading modules load the data into memory by default. The data structure used to do this can be a fairly large hash, based on the options used. It appears that this particular module unzips the xlsx package into a temporary directory and then parses it into its own internal structure. Depending on the complexity of the formatting, parameters passed to the parser, amount of data in each row, and so on, this can be an expected behavior. If you are not in need of any of the formatting, you may want to try the ':just_the_data' flags from Spreadsheet::XLSX::Reader::LibXML, which (at least from my experience - and it may not be yours in this case) should speed things up significantly.
If you are getting this twice a month, how much time do you expect to shave off from the task per month, and is that time in an automated process critical? If you (not the machine, you) are spending the 2.5 hours on this task, let's run some numbers (based on xkcd://is it worth the time):
- Shave off 2 hours per execution, 4 hours per month
- You will be running this for the next 5 years
You can spend about 7 days to trim down the time you spend on this task.
If it is not your time being used, then you really should consider if this automation must run as fast as possible, or if it will be sufficient to spend machine and clock cycles to remove the manual process. Typically your time is worth much more than the computer's time.
Update: Also, profile the various parts of your application. Where is the hangup? Compilation, reading, converting the data, writing, memory allocation, disk I/O? Unless you know what is taking the most time, you are just throwing a handful of parts into the machine and hoping it fixes itself.
| [reply] |
Re: Convert XLSX to TSV and remove CRLF in cells
by RichardK (Parson) on Jun 16, 2015 at 14:29 UTC
|
It might be worth having a look at Text::CSV and/or Text::CSV_XS to convert your TSV/CSV. They say that they can cope with embedded newlines in fields -- how to do it is explained in the help.
| [reply] |
Re: Convert XLSX to TSV and remove CRLF in cells
by pme (Monsignor) on Jun 16, 2015 at 11:52 UTC
|
The embedded carriage returns can be replaced with '\n' using this simple script on the Red Hat box:
use strict;
use warnings;
while (<>) {
chomp;
if (/^M$/) {
print "$_\n";
} else {
print "$_\\n";
}
}
'^M' is single character, entered pressing 'ctrl-v enter'.
Update: The direct conversion does not seem to be hopeless. You can simply omit the $converter if you do not need encoding conversion. I created an xlsx file with ~1.000.000 rows and with only two columns (file size ~10Mb) and it was converted to csv in 140sec. | [reply] [d/l] |
|
'^M' is single character, entered pressing 'ctrl-v enter' in some editors. In others is may end up displaying as some sort of a line feed. [emphasis added]
A better, more portable way of encoding this is \015, \o{015}, \cM, \x0d, or some other encoded form that won't potentially be messed up by an editor, printer, code pretty-printer, ….
| [reply] [d/l] [select] |
Re: Convert XLSX to TSV and remove CRLF in cells
by planetscape (Chancellor) on Jun 17, 2015 at 15:18 UTC
|
| [reply] |
Re: Convert XLSX to TSV and remove CRLF in cells
by jandrew (Chaplain) on Jun 17, 2015 at 18:10 UTC
|
As the author of Spreadsheet::XLSX::Reader::LibXML I can confirm that the package was not written to be the fastest (skip to the last paragraph for those packages). I would however, like to certify an element of your initial question. You list Text::Iconv as one of the modules that you use. Since that is an option for Spreadsheet::XLSX and not For Spreadsheet::XLSX::Reader::LibXML which spreadsheet parser are you using?
MidLifeXis++ (xkcd++) already gave you the simple optimization to set group_return_type to 'unformatted' (fastest) or 'value' (a little less fast)
You mentioned a desire to get a whole row (#3). Use the fetchrow_arrayref command if you wish. The 'fetchrow_array' and 'fetchrow_hashref' commands are also documented in the Worksheet pod
Even with those elements this parser may still not be as fast as you want. I am always interested in improving my parser. My preference for improvement requests is for you to open an issue in my github repo so you can attach any files you are allowed or are willing to share for testing. I would be happy to see if there are speed optimizations available.
With all that said, Spreadsheet::XLSX and Spreadsheet::ParseXLSX are both faster XLSX parsers by design.
update:s/ParseExcel/ParseXLSX/
| [reply] |
|
| [reply] |
|
Tux I agree that Spreadsheet::XLSX is buggy. However, it is my experience that XML::Twig segfaults due to a perl bug in Windows perls prior to 5.15. Since Spreadsheet::ParseXLSX is built on that it makes both of these packages buggy for a certain population of users. (Which is partly why I wrote my package on XML::LibXML.)
On the other hand I have run into a lot of implementations of both of these packages where people are quite happy with them. Additionally for small spreadsheets where you are only extracting data and not formats, Spreadsheet::XLSX tends to be faster.
update:I think Spreadsheet::ParseXLSX also fails to open Excel sheets that contain dedicated chartsheets (not worksheets). The XML::Twig RT que is a bit daunting, and the current release on CPAN testers has open fails. Otherwise I agree that Spreadsheet::ParseXLSX is a really great module.
| [reply] |
|
|
Re: Convert XLSX to TSV and remove CRLF in cells
by jandrew (Chaplain) on Jun 30, 2015 at 02:26 UTC
|
For archival purposes I am posting to let interested parties know that Spreadsheet::XLSX::Reader::LibXML v0.38.4 was posted to CPAN with caching for the shared strings file which appears to possibly be the worst culprit for speed hits on large files with lots of text. It is set to default on but can be turned back off with the attribute cache_positions. I continue to welcome any specific feedback or test files for optimization on github
And here is a link to the documentation for the implementation of the :just_the_data flag mentioned by MidLifeXis++
| [reply] |