Re: memory usage Spreadsheet::ParseExcel
by ww (Archbishop) on May 13, 2012 at 12:24 UTC
Actually, it's inherent that Perl releases the memory once it's no longer in use (ref count = 0)...
But, it doesn't release the memory (deallocate it) back to the OS; it holds it for reuse.
So, code and data (small samples, please) to illustrate your problem would be helpful. (Of course, jmcnamara may have an answer without that....)
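A minimal sketch of that ref-count behaviour (the Tracker package and names are made up for illustration): DESTROY fires the moment the count hits zero, but a circular reference keeps the count above zero, so the structure is never destroyed until interpreter shutdown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical package: counts how many of its objects have been destroyed.
package Tracker;
our $destroyed = 0;
sub new     { my $class = shift; bless { @_ }, $class }
sub DESTROY { $destroyed++ }

package main;
{
    my $obj = Tracker->new( name => 'scoped' );
}                               # ref count hits zero here: DESTROY runs

my $cyclic = Tracker->new( name => 'cycle' );
$cyclic->{self} = $cyclic;      # circular reference
undef $cyclic;                  # count never reaches zero: NOT destroyed

print "destroyed so far: $Tracker::destroyed\n";   # prints 1, not 2
```

And even for the object that is destroyed, the memory goes back to perl's own allocator for reuse, not back to the OS.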
I looked at jmcnamara but no help!
NOTE: I am hashing one sheet at a time because nowhere have I found it stated that the rows and columns in the xls passed through the cell_handler are guaranteed to arrive in sequential order.
NOTE: When executing the code below, where @workbook_zip_FNS is a list of xls files, I get an out-of-memory error. But when I run the script on each xls file individually (from a batch file that passes each xls into the script, replacing the for (my $i=0; $i < scalar(@workbook_zip_FNS); $i++) iterator), each run goes to completion without error.
my $cell_outputfh;
my $xls_parser = Spreadsheet::ParseExcel->new(
    CellHandler => \&HH_cell_handler,
    NotSetCell  => 1
);
for ( my $i = 0; $i < scalar(@workbook_zip_FNS); $i++ ) {
    open( $cell_outputfh, ">:encoding(iso-8859-1)", $ribbon_srcgeo[$i].".csv" )
        or die " open failed on $ribbon_srcgeo[$i].csv $!";
    my $workbook = $xls_parser->parse($unzipped_xls_FNS[$i]);
    if ( !defined $workbook ) { die $xls_parser->error(), ".\n"; }
    print "\n";
    close $cell_outputfh;
    for my $worksheet ( $workbook->worksheets() ) {
        my $wsname = $worksheet->get_name();
        print $wsname."*";
        open( $cell_outputfh, "<:encoding(iso-8859-1)", $ribbon_srcgeo[$i].".csv" )
            or die " open failed on $ribbon_srcgeo[$i].csv $!";
        my $HofA_sheet_rows;
        while (<$cell_outputfh>) {
            my @fields = split /","/, $_;
            $fields[0] =~ s/^"//;
            $fields[3] =~ s/"$//;
            if ( $fields[0] eq $wsname ) {
                $HofA_sheet_rows{$wsname}[$fields[1]][$fields[2]] = $fields[3];
            }
        }
        undef $HofA_sheet_rows;
        close $cell_outputfh;
    }
    undef $workbook;
}

sub HH_cell_handler {
    my $workbook    = $_[0];
    my $sheet_index = $_[1];
    my $row         = $_[2];
    my $col         = $_[3];
    my $cell        = $_[4];
    my $worksheet = $workbook->worksheet($sheet_index);
    my $wsname    = $worksheet->get_name();
    my $value = '"'.$wsname.'"'.','.'"'.$row.'"'.','.'"'.$col.'"'.','.'"'.$cell->value().'"'."\n";
    print $cell_outputfh $value;
}
I have been experimenting with a number of different ways to solve this out of memory problem.
The code snippet above has a bug: my $HofA_sheet_rows; should be my %HofA_sheet_rows; and undef $HofA_sheet_rows; should be undef %HofA_sheet_rows;.
I am rerunning to see if the problem raises its head again. I will post my results when the run is complete.
Re: memory usage Spreadsheet::ParseExcel
by runrig (Abbot) on May 13, 2012 at 21:43 UTC
It's a known issue that S::PE objects have cyclic references, so objects never get completely freed. The simplest solution is to create each object in a separate process (i.e. fork, create object, process, end child process). The other issue of parsing the entire document to memory as you've found is solved by using the CellHandler and NotSetCell attributes (or with Spreadsheet::ParseExcel::Stream), but it is not guaranteed that the cells are parsed "in order." They seem to be parsed in order when the spreadsheet is created by Excel, but not, e.g., when the spreadsheet is created by Spreadsheet::WriteExcel without using 'compatibility_mode' and the rows/cells are written out of order.
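A sketch of that fork-per-workbook idea, assuming each file can be processed independently (the sub name and handler are hypothetical, not part of the module's API):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: parse each workbook in a short-lived child process so the
# parser's cyclic structures die with the child and the memory goes
# back to the OS. parse_in_child() is a made-up name for illustration.
sub parse_in_child {
    my ( $xls_file, $handler ) = @_;
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ( $pid == 0 ) {                        # --- child ---
        require Spreadsheet::ParseExcel;      # load only where it is used
        my $parser   = Spreadsheet::ParseExcel->new;
        my $workbook = $parser->parse($xls_file)
            or die $parser->error(), "\n";
        $handler->($workbook);                # do the real per-file work
        exit 0;                               # child exit frees everything
    }
    waitpid( $pid, 0 );                       # --- parent ---
    return $? >> 8;                           # child's exit status
}

# hypothetical usage:
# parse_in_child( $_, \&process_workbook ) for @unzipped_xls_FNS;
```

The parent never loads a workbook at all, so its footprint stays flat no matter how many files are in the list.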
Thanks for the input. Spreadsheet::ParseExcel is a good module and gets the job done. Thanks for that, too.
For those that follow: I experimented with writing to a DBM::Deep file, but it is slow; DBI->connect is faster with AutoCommit => 0, but that uses memory, too.
The strategy I followed was to write files of sheet cells from the cell_handler and then read them back into a hash to get the cells sequential by row and column. A 131MB xls parse plus the sheet hash would fit into 4GB of memory, barely. If the xls or its sheets were much bigger, I would have to fork the parsing to a separate child process in order to free up the memory used by the parser. A 131MB xls consisting of one sheet ain't going to fit in 4GB in one process.
Now the simplest solution is to increase the computer's available memory and fight the out-of-memory battle with $27. But alas, the liberals and the marxists in the Whitehouse have gotten this economy so screwed up with dreams of green energy that I cannot afford to buy a new computer (mine is at its 4GB limit) and still pay for food, shelter, clothing, and taxes. I will of course proceed to proceed!
Did you try parsing using the CellHandler and NotSetCell attributes, and find that they were parsed out of order? You don't say whether you tried or not, you just seem to fear that they might be parsed out of order. You should try it before you reject the notion. As I said earlier, they will likely be parsed in order unless the spreadsheet was created by something like Spreadsheet::WriteExcel and the cells were written out of order.
Re: memory usage Spreadsheet::ParseExcel
by BrowserUk (Patriarch) on May 14, 2012 at 12:51 UTC
Do the Monks have any idea how to free the memory used ...
If, as appears to be the case from the follow-on discussion, the problem is circular references preventing the data structures from being destroyed, then running a depth-first function to free them might resolve the problem.
See how you get on with this:
sub recursiveFree {
my $ref = shift;
if( ref $ref eq 'ARRAY' ) {
recursiveFree( $ref->[ $_ ] ) for 0 .. $#{ $ref };
}
elsif( ref $ref eq 'HASH' ) {
recursiveFree( $ref->{ $_ } ) for keys %{ $ref };
}
elsif( ref $ref and ref $ref ne 'SCALAR' ) {
warn "Unhandled reftype: ", ref( $ref );
}
undef $ref;
return;
}
recursiveFree( $xls );
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
Tried it, no help, but great thanks anyway.
"If the referenced object has been blessed into a package, then that package name is returned instead."
I added the Spreadsheet::ParseExcel packages
::Workbook
::FmtDefault
::Format
::Font
::Worksheet
as refs for recursion. None seemed to free memory, and ::Worksheet warned on deep recursion and never came back up out of the stack.
"If the referenced object has been blessed into a package, then that package name is returned instead."
Switching from ref to Scalar::Util::reftype() would fix that.
and ::Worksheet warned on deep recursion and never came back up out of the stack
Using Scalar::Util::refaddr() with a %seen hash could prevent it recursing up its own bum.
The code I posted was by way of example. Just a blind tweak of a depth-first recursive structure traversal I had kicking around.
I'd have a go at making the above tweaks if I had a suitable example .xls file.
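For what it's worth, the two tweaks together might look like this (a sketch only, not tested against an S::PE object; the blessed package name in the demo is made up):

```perl
use strict;
use warnings;
use Scalar::Util qw( reftype refaddr );

# reftype() sees through blessed packages, and a %seen hash keyed on
# refaddr() stops the deep recursion on cycles.
sub recursiveFree {
    my ( $ref, $seen ) = @_;
    $seen //= {};
    return unless ref $ref;
    return if $seen->{ refaddr $ref }++;      # been here before: a cycle
    my $type = reftype $ref;
    if ( $type eq 'ARRAY' ) {
        recursiveFree( $ref->[ $_ ], $seen ) for 0 .. $#{ $ref };
        @{ $ref } = ();
    }
    elsif ( $type eq 'HASH' ) {
        recursiveFree( $ref->{ $_ }, $seen ) for keys %{ $ref };
        %{ $ref } = ();
    }
    elsif ( $type ne 'SCALAR' and $type ne 'CODE' ) {
        warn "Unhandled reftype: $type";
    }
    return;
}

# A blessed, self-referential structure: plain ref() would miss the HASH,
# and without %seen the cycle would recurse forever.
my $node = bless { name => 'a' }, 'My::Hypothetical::Node';
$node->{self} = $node;
recursiveFree( $node );
print scalar( keys %$node ), "\n";            # prints 0: emptied, no loop
```

Emptying each container in place (rather than just letting $ref go out of scope) is what actually breaks the cycles so the ref counts can fall to zero.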