http://qs1969.pair.com?node_id=379743

A simple technique to help reduce the memory used by Spreadsheet::ParseExcel.

Spreadsheet::ParseExcel processes Excel files in two stages. In the first stage it extracts the Excel binary stream from the OLE container file using OLE::Storage_Lite. In the second stage it parses the binary stream to read workbook, worksheet and cell data which it then stores in memory. Of the second stage storage by far the largest amount is taken up by the cell data.

As each cell is encountered a cell handling function creates a relatively large nested cell object that contains the cell value and all of the data that relates to the cell formatting. For large files (a 10MB Excel file on a 256MB system) this overhead can cause the system to grind to a halt.

However, in a lot of cases the only information that is required is the cell value. In these cases it is possible to avoid most of the memory overhead by specifying your own cell handling function and by telling Spreadsheet::ParseExcel not to store the parsed cell data. Here is an example.

#!/usr/bin/perl -w use strict; use Spreadsheet::ParseExcel; my $parse_excel = Spreadsheet::ParseExcel->new( CellHandler => \&cell_ +handler, NotSetCell => 1 ); my $workbook = $parse_excel->Parse('file.xls'); sub cell_handler { my $workbook = $_[0]; my $sheet_index = $_[1]; my $row = $_[2]; my $col = $_[3]; my $cell = $_[4]; # Do something useful with the formatted cell value print $cell->{_Value}, "\n"; }
The user specified cell handler is passed as a code reference to new() along with the parameter NotSetCell which tells Spreadsheet::ParseExcel not to store the parsed cell.

The cell handler is passed 5 arguments. The first, $workbook, is a reference to the Spreadsheet::ParseExcel::Workbook object that represent the parsed workbook. This can be used to access any of the parsed workbook global data. The second $sheet_index is the zero-based index of the worksheet being parsed. The third and fourth are the zero-based row and column number of the cell. The fifth, $cell, is a reference to the Spreadsheet::ParseExcel::Cell object. This can be used extract the data from the cell.

Each cell contains an unformatted value, $cell->{Val} and a formatted (by Spreadsheet::ParseExcel) value, $cell->{_Value}. In a majority of cases the latter value is the one that is required.

If you don't want all of the data in the spreadsheet you can add some checks to the cell handler. The following example only prints the first 10 rows of the first two worksheets in the parsed workbook.

#!/usr/bin/perl -w use strict; use Spreadsheet::ParseExcel; my $parse_excel = Spreadsheet::ParseExcel->new( CellHandler => \&cell_ +handler, NotSetCell => 1 ); my $workbook = $parse_excel->Parse('file.xls'); sub cell_handler { my $workbook = $_[0]; my $sheet_index = $_[1]; my $row = $_[2]; my $col = $_[3]; my $cell = $_[4]; # Skip some worksheets and rows (inefficiently). return if $sheet_index >= 3; return if $row >= 10; # Do something with the formatted cell value print $cell->{_Value}, "\n"; }
However, this still processes the entire workbook. If you wish to save some time you can abort the parsing after you have read the data that you want:
#!/usr/bin/perl -w use strict; use Spreadsheet::ParseExcel; my $parse_excel = Spreadsheet::ParseExcel->new( CellHandler => \&cell_ +handler, NotSetCell => 1 ); my $workbook = $parse_excel->Parse('file.xls'); sub cell_handler { my $workbook = $_[0]; my $sheet_index = $_[1]; my $row = $_[2]; my $col = $_[3]; my $cell = $_[4]; # Skip some worksheets and rows (more efficiently). if ($sheet_index >= 1 and $row >= 10) { $workbook->ParseAbort(1); return; } # Do something with the formatted cell value print $cell->{_Value}, "\n"; }

In general this technique is useful if you are writing an Excel to database filter since you can put your DB calls in the cell handler.