Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
A simple technique to help reduce the memory used by Spreadsheet::ParseExcel.

Spreadsheet::ParseExcel processes Excel files in two stages. In the first stage it extracts the Excel binary stream from the OLE container file using OLE::Storage_Lite. In the second stage it parses the binary stream to read workbook, worksheet and cell data which it then stores in memory. Of the second stage storage by far the largest amount is taken up by the cell data.

As each cell is encountered a cell handling function creates a relatively large nested cell object that contains the cell value and all of the data that relates to the cell formatting. For large files (a 10MB Excel file on a 256MB system) this overhead can cause the system to grind to a halt.

However, in a lot of cases the only information that is required is the cell value. In these cases it is possible to avoid most of the memory overhead by specifying your own cell handling function and by telling Spreadsheet::ParseExcel not to store the parsed cell data. Here is an example.

#!/usr/bin/perl -w use strict; use Spreadsheet::ParseExcel; my $parse_excel = Spreadsheet::ParseExcel->new( CellHandler => \&cell_ +handler, NotSetCell => 1 ); my $workbook = $parse_excel->Parse('file.xls'); sub cell_handler { my $workbook = $_[0]; my $sheet_index = $_[1]; my $row = $_[2]; my $col = $_[3]; my $cell = $_[4]; # Do something useful with the formatted cell value print $cell->{_Value}, "\n"; }
The user specified cell handler is passed as a code reference to new() along with the parameter NotSetCell which tells Spreadsheet::ParseExcel not to store the parsed cell.

The cell handler is passed 5 arguments. The first, $workbook, is a reference to the Spreadsheet::ParseExcel::Workbook object that represent the parsed workbook. This can be used to access any of the parsed workbook global data. The second $sheet_index is the zero-based index of the worksheet being parsed. The third and fourth are the zero-based row and column number of the cell. The fifth, $cell, is a reference to the Spreadsheet::ParseExcel::Cell object. This can be used extract the data from the cell.

Each cell contains an unformatted value, $cell->{Val} and a formatted (by Spreadsheet::ParseExcel) value, $cell->{_Value}. In a majority of cases the latter value is the one that is required.

If you don't want all of the data in the spreadsheet you can add some checks to the cell handler. The following example only prints the first 10 rows of the first two worksheets in the parsed workbook.

#!/usr/bin/perl -w use strict; use Spreadsheet::ParseExcel; my $parse_excel = Spreadsheet::ParseExcel->new( CellHandler => \&cell_ +handler, NotSetCell => 1 ); my $workbook = $parse_excel->Parse('file.xls'); sub cell_handler { my $workbook = $_[0]; my $sheet_index = $_[1]; my $row = $_[2]; my $col = $_[3]; my $cell = $_[4]; # Skip some worksheets and rows (inefficiently). return if $sheet_index >= 3; return if $row >= 10; # Do something with the formatted cell value print $cell->{_Value}, "\n"; }
However, this still processes the entire workbook. If you wish to save some time you can abort the parsing after you have read the data that you want:
#!/usr/bin/perl -w use strict; use Spreadsheet::ParseExcel; my $parse_excel = Spreadsheet::ParseExcel->new( CellHandler => \&cell_ +handler, NotSetCell => 1 ); my $workbook = $parse_excel->Parse('file.xls'); sub cell_handler { my $workbook = $_[0]; my $sheet_index = $_[1]; my $row = $_[2]; my $col = $_[3]; my $cell = $_[4]; # Skip some worksheets and rows (more efficiently). if ($sheet_index >= 1 and $row >= 10) { $workbook->ParseAbort(1); return; } # Do something with the formatted cell value print $cell->{_Value}, "\n"; }

In general this technique is useful if you are writing an Excel to database filter since you can put your DB calls in the cell handler.


In reply to Reducing the memory usage of Spreadsheet::ParseExcel by jmcnamara

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (9)
As of 2024-03-28 12:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found