Warning: Bugs lurking in code below. Updates at Spreadsheet-ParseExcel-Stream

Bugs not fixed in code below (but AFAICT, are fixed on CPAN): Since the code doesn't transfer back to the main thread when the parsing is done, the program exits. After fixing that, the code leaked memory when parsing multiple documents, so that is fixed also.

I like Spreadsheet::ParseExcel::Simple, which has up until now served 99.9% of my spreadsheet parsing needs, but it (and Spreadsheet::ParseExcel by default) parses the entire spreadsheet document into memory. The Spreadsheet::ParseExcel docs have a section on reducing memory, but it's through a callback, which makes it difficult to retain the API of Spreadsheet::ParseExcel::Simple. But through the magic of Coro, here is an interface similar to Spreadsheet::ParseExcel::Simple, but without the memory overhead of parsing the entire document into memory:

package Spreadsheet::ParseExcel::Stream; use strict; use warnings; use Spreadsheet::ParseExcel; use Scalar::Util qw(weaken); use Coro; sub new { my ($class, $file) = @_; my $main = Coro::State->new(); my ($xls,$parser); my ($wb, $idx, $row, $col, $cell); my $tmp = my $handler = sub { ($wb, $idx, $row, $col, $cell) = @_; $parser->transfer($main); }; # Break the reference cycle weaken($handler); $parser = Coro::State->new(sub { $xls->Parse($file) }); $xls = Spreadsheet::ParseExcel->new( CellHandler => $handler, NotSetCell => 1, ); my $generator = sub { $main->transfer($parser); return [ $wb, $idx, $row, $col, $cell ]; }; my $nxt_cell = $generator->(); bless { NEXT_CELL => $nxt_cell, SUB => $generator, }, $class . '::Sheet'; } package Spreadsheet::ParseExcel::Stream::Sheet; sub sheet { my $self = shift; return unless $self->{NEXT_CELL}; return $self; } sub worksheet { my $self = shift; my $row = $self->{NEXT_CELL}; my $wb = $row->[0]; return $wb->worksheet($row->[1]); } sub name { my $self = shift; return $self->worksheet()->{Name}; } sub next_row { my ($self, $current) = @_; return $self->{CURR_ROW} if $current; return $self->{NEW_WS} = 0 if $self->{NEW_WS}; # Save original cell so we can detect change in worksheet my $orig_cell = my $curr_cell = $self->{NEXT_CELL}; my $f = $self->{SUB}; # Initialize row with first cell my @row = ($curr_cell); my $nxt_cell = $f->(); # Collect current row on current worksheet while ( $nxt_cell->[1] == $curr_cell->[1] && $nxt_cell->[2] == $curr +_cell->[2] ) { $curr_cell = $nxt_cell; push @row, $curr_cell; $nxt_cell = $f->(); } $self->{NEXT_CELL} = $nxt_cell; $self->{NEW_WS}++ if $orig_cell->[1] != $nxt_cell->[1]; return $self->{CURR_ROW} = \@row; } sub row { my ($self, $current) = @_; unless ($current) { my $row = $self->next_row(); return unless $row; } return [ map { $_->[4]->value() } @{$self->{CURR_ROW}} ]; } sub row_unformatted { my ($self, $current) = @_; unless ($current) { my $row = $self->next_row(); return unless $row; } return [ map { $_->[4]->unformatted() } @{$self->{CURR_ROW}} ]; } 1; __END__ =head1 NAME Spreadsheet::ParseExcel::Stream - Simple interface to Excel data with +less memory overhead =head1 SYNOPSIS my $xls = Spreadsheet::ParseExcel::Stream->new($xls_file); while ( my $sheet = $xls->sheet() ) { while ( my $row = $sheet->row ) { my @data = @$row; } } =head1 DESCRIPTION A simple iterative interface to L<Spreadsheet::ParseExcel>, similar to + L<Spreadsheet::ParseExcel::Simple>, but does not parse the entire document to memory. Uses the hints provi +ded in the L<Spreadsheet::ParseExcel> docs to reduce memory usage, and returns the data row by row and sheet + by sheet. =head1 METHODS =head2 new my $xls = Spreadsheet::ParseExcel::Stream->new($xls_file); Opens the spreadsheet and returns an object to iterate through the dat +a. =head2 sheet Returns the sheet of the next cell of the spreadsheet. =head2 row Returns the next row of data from the current spreadsheet. The data is + the formatted contents of each cell as returned by the $cell->value() method of Spre +adsheet::ParseExcel. If a true argument is passed in, returns the current row of data witho +ut advancing to the next row. =head2 unformatted Returns the next row of data from the current spreadsheet as returned by the $cell->unformatted() method of Spreadsheet::ParseExcel. If a true argument is passed in, returns the current row of data witho +ut advancing to the next row. =head2 next_row Returns the next row of cells from the current spreadsheet as Spreadsh +eet::ParseExcel cell objects. If a true argument is passed in, returns the current row without advan +cing to the next row. =head2 name Returns the name of the next cell of the spreadsheet. =head2 worksheet Returns the worksheet containing the next cell of data as a Spreadshee +t::ParseExcel object. =head1 AUTHOR runrig =head1 COPYRIGHT This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. =head1 SEE ALSO L<Spreadsheet::ParseExcel>, L<Spreadsheet::ParseExcel::Simple> =cut

Replies are listed 'Best First'.
Re: RFC: Spreadsheet::ParseExcel::Stream
by jmcnamara (Monsignor) on Apr 26, 2011 at 09:01 UTC
    Excellent. ++

    It would be great to have a similar implementation for Spreadsheet::ParseExcel.

    Would you care to tackle that? :-)

    In which case it might to better to call this module Spreadsheet::ParseExcel::Stream::Simple (or ::Simple::Stream) to save the Spreadsheet::ParseExcel::Stream namespace for a Spreadsheet::ParseExcel implementation.

    --
    John.

      I'm not sure what you'd want out of such a thing, but I'm sure it's more than I can take on at the moment. And this was more of a demonstration of the difficulty in turning a callback-based interface into an iterative one (and a learning experience) :-)
Re: RFC: Spreadsheet::ParseExcel::Stream
by runrig (Abbot) on Jul 20, 2011 at 18:30 UTC
    The main drawback to the Spreadsheet::ParseExcel callback interface is that it gets called for every cell. It would be nice if the callbacks could process the data row by row (when desired). So here is an implementation that does that. And it has the advantage over the OP that no Coro is necessary:
    package Spreadsheet::ParseExcel::CB; use strict; use warnings; use Spreadsheet::ParseExcel; sub new { my $class = shift; my %args = @_; my $sheet_f = $args{Sheet}; my $row_f = $args{Row}; bless { Sheet => $sheet_f, Row => $row_f, }, $class; } sub parse { my ($self,$file) = @_; my $row_f = $self->{Row}; my $sheet_f = $self->{Sheet}; my @cells; my $sheet = ''; my ($wb, $idx, $row, $col, $cell); my ($prv_idx, $prv_row) = (-1,-1); my $handler = sub { ($wb, $idx, $row, $col, $cell) = @_; if ( $prv_row >= 0 and ( $row > $prv_row or $idx > $prv_idx ) ) { $row_f->(@cells) if $row_f; @cells = (); } if ($idx > $prv_idx) { if ($sheet_f) { my $ws = $wb->worksheet($idx); $sheet_f->($ws->{Name}, $ws, $wb); } } push @cells, $cell; ($prv_idx, $prv_row) = ($idx, $row); }; my $xls = Spreadsheet::ParseExcel->new( CellHandler => $handler, NotSetCell => 1, ); $xls->Parse($file); # Last row $row_f->(@cells) if $row_f; } 1; ## And some sample code: #!/usr/bin/perl use strict; use warnings; use Spreadsheet::ParseExcel::CB; my $name; my $process_row = sub { my @vals = map $_->value(), @_; print "$name: @vals\n"; }; my $process_sheet = sub { $name = shift; print "Sheet: $name\n"; }; my $xls = PM::ParseExcel::CB->new( Row => $process_row, Sheet => $process_sheet, ); $xls->parse('spreadsheet.xls')