ETLTCHFIG has asked for the wisdom of the Perl Monks concerning the following question:

We are using the Spreadsheet::XLSX module in our Production environment on our AIX box to read Excel files and transform them to CSV files. We were running fine until we received XLSX files greater than 40MB. We are now failing with an 'Out of memory' error. I know this is because we are reading the whole file at once, but I don't know if there is a way to reduce the memory usage (as in Spreadsheet::ParseExcel - I remember John had provided the cell handler function that reduced the size of the cell object). Is there something similar available for Spreadsheet::XLSX?
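For reference, the Spreadsheet::ParseExcel mechanism mentioned above is the CellHandler/NotSetCell pair, which hands each cell to a callback and then discards it instead of keeping the whole workbook in memory. A minimal sketch (the file name is illustrative):

    use strict;
    use warnings;
    use Spreadsheet::ParseExcel;

    # Called once per cell; with NotSetCell the cell is not stored,
    # so memory use stays flat regardless of workbook size.
    sub cell_handler {
        my ( $workbook, $sheet_index, $row, $col, $cell ) = @_;
        print $cell->value(), "\n";
    }

    my $parser = Spreadsheet::ParseExcel->new(
        CellHandler => \&cell_handler,
        NotSetCell  => 1,
    );
    my $workbook = $parser->parse('file.xls');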

Your help is greatly appreciated!

Also, here is our Perl environment information to aid your review. Summary of my perl5 (revision 5 version 10 subversion 1) configuration:
  Platform:
    osname=aix, osvers=5.3.0.0, archname=aix
    uname='aix laatdm62 3 5 00c396bc4c00 '
    config_args='-de -Dcc=cc_r'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc_r -q32', ccflags ='-D_ALL_SOURCE -D_ANSI_C_SOURCE -D_POSIX_SOURCE -qmaxmem=-1 -qnoansialias -DUSE_NATIVE_DLOPEN -q32 -D_LARGE_FILES -qlonglong',
    optimize='-O',
    cppflags='-D_ALL_SOURCE -D_ANSI_C_SOURCE -D_POSIX_SOURCE -qmaxmem=-1 -qnoansialias -DUSE_NATIVE_DLOPEN'
    ccversion='', gccversion='', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=4321
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=8
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='ld', ldflags =' -brtl -bdynamic -b32 -bmaxdata:0x80000000'
    libpth=/lib /usr/lib /usr/ccs/lib
    libs=-lbind -lnsl -ldbm -ldl -lld -lm -lcrypt -lc
    perllibs=-lbind -lnsl -ldl -lld -lm -lcrypt -lc
    libc=/lib/libc.a, so=a, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_aix.xs, dlext=so, d_dlsymun=undef, ccdlflags=' -bE:/usr/local/lib/perl5/5.10.1/aix/CORE/perl.exp'
    cccdlflags=' ', lddlflags=' -bhalt:4 -G -bI:$(PERL_INC)/perl.exp -bE:$(BASEEXT).exp -bnoentry -lc -lm'

Characteristics of this binary (from libperl):
  Compile-time options: PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP USE_LARGE_FILES USE_PERLIO
  Built under aix
  Compiled at Feb 26 2011 02:13:00
  @INC:
    /usr/local/lib/perl5/5.10.1/aix
    /usr/local/lib/perl5/5.10.1
    /usr/local/lib/perl5/site_perl/5.10.1/aix
    /usr/local/lib/perl5/site_perl/5.10.1

Re: Out of Memory
by Corion (Patriarch) on May 05, 2011 at 06:44 UTC

    If all else fails, you might have to use XML::Twig and do the XML parsing of the XLSX sheet yourself. This will mean taking much of the extraction code of Spreadsheet::XLSX and remunging it into a callback-based interface (like Spreadsheet::ParseExcel)...
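
    For illustration, a minimal sketch of such a callback-based approach with XML::Twig, assuming the worksheet XML (e.g. xl/worksheets/sheet1.xml) has already been extracted from the archive to disk, and assuming inline cell values (shared strings would still need a lookup against xl/sharedStrings.xml):

        use strict;
        use warnings;
        use XML::Twig;

        my $twig = XML::Twig->new(
            twig_handlers => {
                row => sub {
                    my ( $t, $row ) = @_;
                    # 'c' elements are cells; 'v' holds the raw value.
                    my @values = map { $_->first_child_text('v') }
                                 $row->children('c');
                    print join( ',', @values ), "\n";
                    $t->purge;    # free the parsed row
                },
            },
        );
        $twig->parsefile('sheet1.xml');

    Each row is parsed, handled and purged in turn, so only one row's worth of XML is ever held in memory.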

      First up, thanks a lot for the ideas! So there is no way around rewriting the code to overcome this out-of-memory error? I mean, as a quick fix?

      Not trying to be lazy; I ask because this is holding up Production already. If I could do something to get this piece going in the interim, I could take the time to rewrite, test and promote the rewritten version.

      Any ideas??

      ulimit -a shows the following -
      fdrread@laatdm62:/home/fdrread> ulimit -a
      time(seconds)        unlimited
      file(blocks)         unlimited
      data(kbytes)         unlimited
      stack(kbytes)        4194304
      memory(kbytes)       unlimited
      coredump(blocks)     unlimited
      nofiles(descriptors) unlimited
      Again, THANKS A TON!
Re: Out of Memory
by vkon (Curate) on May 05, 2011 at 07:35 UTC
    Well, looking at the Spreadsheet::XLSX module reveals that it uses regular expressions to parse the XLSX:

        $mstr =~ s/<t\/>/<t><\/t>/gsm; # this handles an empty t tag in the xml <t/>

    and this is widely accepted as a bad approach.

    Actually, such regular expressions applied to huge strings usually cause out-of-memory errors.

    I cannot think of any better advice than updating this module yourself and then proposing the changes to the author.

    Regards,
    Vadim.

      While parsing general XML with regular expressions can be fragile, parsing very specific XML can work well with regular expressions. But as the parser still seems to generate the complete data structure in memory instead of nibbling at columns and calling a callback whenever data is available, an event-based rewrite is still necessary.

        I agree.

        Let me share my experience with the kind of problem the OP raised.
        Usually at work, when I need some transformation, I write a small, quick program that slurps the entire file and uses regexps to produce the result.
        This works, and I use such a program from time to time.
        Then one day the program gets fed a file of tens or hundreds of megabytes (no longer rare nowadays), and I go back to my initial program and optimize it so it can process such files.

        Actually, I think this is quite a common situation.

        There's no need to write a highly optimized, highly structured program from the very beginning - a "dirty hack" suffices as a first step, and incremental improvements can be made later, if needed - the point being, they may never be needed :)
        Fortunately, Perl is powerful enough to allow such an approach.

      OK - I am looking at Spreadsheet::XLSX to see if I can make changes to remove stuff I don't need - since all I am trying to do in the production code is read the cell values and write them out to a CSV file.

      Can I eliminate all that Excel style processing?
      my $member_styles = $self->{zip}->memberNamed('xl/styles.xml');

      my @styles     = ();
      my %style_info = ();

      if ($member_styles) {
          foreach my $t ( $member_styles->contents =~
                          /xf\ numFmtId="([^"]*)"(?!.*\/cellStyleXfs)/gsm ) { #"
              # $t = $converter->convert($t) if $converter;
              push @styles, $t;
          }
          my $default = $1 || '';

          foreach my $t1 (@styles) {
              $member_styles->contents =~ /numFmtId="$t1" formatCode="([^"]*)/;
              my $formatCode = $1 || '';
              if ( $formatCode eq $default || not($formatCode) ) {
                  if    ( $t1 == 9 || $t1 == 10 ) { $formatCode = "0.00000%"; }
                  elsif ( $t1 == 14 )             { $formatCode = "m-d-yy"; }
                  else                            { $formatCode = ""; }
              }
              $style_info{$t1} = $formatCode;
              $default = $1 || '';
          }
      }

      Also, I am considering reducing the cell object size by removing the Format and Type hash keys. So change:

      }
      my $cell = Spreadsheet::ParseExcel::Cell->new(
          Val    => $v,
          Format => $thisstyle,
          Type   => $type,
      );
      TO
      }
      my $cell = Spreadsheet::ParseExcel::Cell->new( Val => $v );
      Please let me know what you think.
        I think this way.

        Given that all you need is to get the XLSX data out - probably to load some kind of DB values - then just do not use this CPAN module; extract your data yourself,
        and be your-application-centric.

        But before doing that, drop a letter to the CPAN author - it could be that he is responsive and will provide you with a solution soon.
        Otherwise, just reuse his code for "unzipping" the content, and then use your own regular expressions.

        Better than that - feed your resulting XML string to a properly constructed XPath expression, and feed that XPath expression to XML::LibXML - it is very efficient with XPath expressions, though other modules dealing with XPath will also suffice.
        (I personally have had good experience with the one mentioned, and TIMTOWTDI.)

        This would be the best way out of this situation - that is how I feel it.
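
        A minimal sketch of that XPath route, assuming the sheet XML is already extracted to a file (SpreadsheetML puts everything in a default namespace, so a prefix has to be registered):

            use strict;
            use warnings;
            use XML::LibXML;

            my $doc = XML::LibXML->load_xml( location => 'sheet1.xml' );
            my $xpc = XML::LibXML::XPathContext->new($doc);
            $xpc->registerNs( 's',
                'http://schemas.openxmlformats.org/spreadsheetml/2006/main' );

            # Walk the rows and print the raw cell values as CSV.
            for my $row ( $xpc->findnodes('//s:sheetData/s:row') ) {
                my @values = map { $_->textContent }
                             $xpc->findnodes( 's:c/s:v', $row );
                print join( ',', @values ), "\n";
            }

        One caveat for this thread's problem: load_xml still builds the whole DOM in memory, so for a million-row sheet the streaming XML::LibXML::Reader interface would be the safer choice.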

        Is 40MB the size of the zipped XLSX, or the size after unpacking?

Re: Out of Memory
by ETLTCHFIG (Novice) on May 05, 2011 at 00:25 UTC

    Also wanted to share - I can see the Perl process RSS grow to 2GB by the time the 'Out of memory' is issued -

    fdrread@laatdm62:/data/fdr/scripts/irp> ps v 790630
       PID TTY STAT  TIME PGIN    SIZE     RSS LIM TSIZ  TRS %CPU %MEM COMMAND
    790630 pts/0  A  0:18    0 1450840 1452752  xx 1052 1912  9.0  8.0 perl ./
    fdrread@laatdm62:/data/fdr/scripts/irp> ps v 790630
       PID TTY STAT  TIME PGIN    SIZE     RSS LIM TSIZ  TRS %CPU %MEM COMMAND
    790630 pts/0  A  0:20    0 1475716 1477628  xx 1052 1912  9.1  8.0 perl ./
    fdrread@laatdm62:/data/fdr/scripts/irp> ps v 790630
       PID TTY STAT  TIME PGIN    SIZE     RSS LIM TSIZ  TRS %CPU %MEM COMMAND
    790630 pts/0  A  0:39    0 2050400 2052312  xx 1052 1912  9.5 11.0 perl ./
    fdrread@laatdm62:/data/fdr/scripts/irp> Out of memory!
Re: Out of Memory
by runrig (Abbot) on Jun 07, 2012 at 23:19 UTC
    I remember John had provided the cell handler function that reduced the size of the cell object). Is there something similar available for Spreadsheet::XLSX?

    Patch posted to RT. Cheers! (Now, does anyone know if the author is still around and maintaining the library?).

Re: Out of Memory
by ETLTCHFIG (Novice) on May 05, 2011 at 20:46 UTC

    I was hoping this would help your review in providing a fix. I noticed the module fails trying to read the entire XML worksheet content of the biggest worksheet in my spreadsheet, which has 1 million rows in it -

        foreach ($member_sheet->contents =~ /(\<.*?\/?\>|.*?(?=\<))/g) {

    Can we change this to go line-by-line? Or in chunks? I'm not sure I totally understand how this works.

      I am in pretty bad shape, gurus ... This is production being held up at our shop, and I am on the hook for it (going with Perl) ... could you please help me figure a way out here? I would truly appreciate it. I feel like I have hit the ceiling of my understanding - I can't figure out how to improve the following code, which is trying to cache the whole unzipped worksheet in XML form into an array -

          foreach ($member_sheet->contents =~ /(\<.*?\/?\>|.*?(?=\<))/g) {
      This is the point where the program runs out of memory trying to read a 1-million-record file.

      Leon
Re: Out of Memory
by ETLTCHFIG (Novice) on May 06, 2011 at 01:35 UTC
    Any chance I could "disk" the array $member_sheet->contents???
Re: Out of Memory
by ETLTCHFIG (Novice) on May 06, 2011 at 01:51 UTC
    I realize I can't disk the array: as I understand it, the foreach loop has not transformed $member_sheet->contents into an array, but is repeatedly performing regex matches on it with /g - so $member_sheet->contents is really being used as a scalar ... that's too much data in one variable, given that it holds the 1-million-record sheet in XML form!

    The only other solution would be to start using XML::Twig, right, and start reading the file without loading it all into memory? Could someone help me with this? I realize I am asking to be 'flat out carried' ... but I really am technically challenged here. I honestly am even ready to pay for support now! I am hoping Dmitry (the author) responds with a solution - I am pretty much screwed otherwise!
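
    A minimal sketch of that streaming route, assuming the usual member name xl/worksheets/sheet1.xml: first extract the worksheet member to disk with Archive::Zip (the same module Spreadsheet::XLSX already uses for unzipping), then stream it with a pull parser such as XML::Twig (see the sketch under Corion's reply above) instead of slurping contents() into one huge scalar:

        use strict;
        use warnings;
        use Archive::Zip qw( :ERROR_CODES );

        my $zip = Archive::Zip->new();
        $zip->read('big.xlsx') == AZ_OK
            or die "cannot read big.xlsx";

        # Write the worksheet XML to disk so a streaming parser can
        # read it, rather than holding the whole decompressed string
        # in memory via contents().
        $zip->extractMember( 'xl/worksheets/sheet1.xml', 'sheet1.xml' ) == AZ_OK
            or die "cannot extract worksheet member";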
Re: Out of Memory
by oxone (Friar) on May 06, 2011 at 06:01 UTC

    For what it's worth, we hit exactly the same problem yesterday: i.e. realising that Spreadsheet::XLSX just isn't fit for purpose with large files.

    On looking at the XLSX.pm code, it became clear it's all a bit of a hack (as noted in various posts above): the XML is not parsed properly, regexes are used inefficiently, and there are apparent problems with extracting special chars. Even if all that worked OK, the module then tries to build the entire spreadsheet in memory as a Spreadsheet::ParseExcel::Workbook object, which would itself slurp up huge amounts of memory, if it ever got that far.

    To be fair to the author, he does describe the module as "quick and dirty", and I guess that's the down-side of CPAN: production-quality is not guaranteed. Lesson we've learnt: when assessing that module originally, we should've tested it on a large file up front. Doh.

    So, first we're going to test 'SimpleXlsx' (which is also on CPAN) to see if that's any better: not yet tried, but it does claim to parse the XML properly. And if that's no good, we're going to have to roll our own.

      For the record: 'SimpleXlsx' now checked, and no better on big files. So, it seems there is no module on CPAN which can parse a large XLSX file :-(

      We'll need to write a better version then, unless the OP has already found a solution?

Re: Out of Memory
by LanX (Saint) on May 05, 2011 at 00:38 UTC
    Unfortunately your post is not easily readable; please update it and insert code tags <c>...</c> and maybe some paragraph tags <p>.

    Cheers Rolf

      Sorry for the poor formatting! Hopefully this helps - please let me know if there is a way out. SOS!!!!!