I've been trying to see if Spreadsheet::XLSX can be made to not consume so much memory, and I've implemented the CellHandler and NotSetCell attributes from Spreadsheet::ParseExcel, and this helps somewhat, but part of the problem is the extraction of each file in the zip archive to an in memory variable (an xlsx file is a zip archive of many xml files). E.g., one of the worksheet xml files is about 8MB, and the code that parses it like so:
foreach ($member_sheet -> contents =~ /(\<.*?\/?\>|.*?(?=\<))/g) { ...

$member_sheet is an Archive::Zip object for the worksheet xml file and contents() returns the entire contents of the file. Then the entire file is parsed into an array of tags and text that the foreach loop processes. In trying to save memory, I first was just trying to see if I could process the contents with a while loop like so:

my $buffer = $member_sheet -> contents(); #while ($buffer =~ /\G(<[^<]*>)/scg or $buffer =~ /\G([^<]*(?=<))/scg) + { while ($buffer =~ /(<[^>]+>|[^<]+(?=<))/sg) { $_ = $1; ...

While this seems to work, it's about 100 times slower. I don't know for sure why it's 100 times slower, but I've tried to make a benchmark that shows it should only be about 50% slower:

use Benchmark qw(cmpthese); open(my $fh, "<", 'sheet3.xml') or die "Err: $!"; my $str = do { local $/; <$fh>}; cmpthese(-10, { FOR => sub { pos($str) = 0; for ( $str =~ /(\<.*?\/?\>|.*?(?=\<))/g ) { #print "$_\n"; } }, WHILE_1 => sub { pos($str) = 0; while ( $str =~ /\G(<.*?>|.*?(?=<))/scg ) { $_ = $1; #print "$_\n"; } }, WHILE_2 => sub { pos($str) = 0; while ( $str =~ /\G(<.*?>)/scg or $str =~ /\G(.*?(?=<))/scg ) { $_ = $1; #print "$_\n"; } }, WHILE_3 => sub { pos($str) = 0; while ( $str =~ /\G(<[^>]*>)/scg or $str =~ /\G([^<]*(?=<))/scg ) +{ $_ = $1; #print "$_\n"; } }, }); # Results: s/iter FOR WHILE_3 WHILE_2 WHILE_1 FOR 1.66 -- -27% -31% -35% WHILE_3 1.21 37% -- -6% -11% WHILE_2 1.14 46% 6% -- -5% WHILE_1 1.08 53% 12% 5% --

Is there something wrong with my benchmark (or just something wrong with trying to benchmark this)? Something else going on in Spreadsheet::XLSX? Anyone with enough tuits to look at or comment on this?

TIA for any insights


In reply to Spreadsheet::XLSX memory and speed by runrig

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.