Two preliminary points: 1) OpenOffice files are Zip archives, and 2) the document content is stored in the archive in a file called 'content.xml'.
The meat of the code is the statement that starts "my %tables", which is best read from the bottom up. The bottom line pulls out all the tables from the document as an array of tables. The grep above it keeps only those tables whose rows are an array. This is necessary because XML::Simple parses an empty table to a hash rather than an array, and that breaks all the subsequent code.
Above that, the code pulls out each table's rows (an array ref) and the cells in each row (also an array ref). Each cell array ref holds hashes, from which we only want the 'text:p' element. We then need to turn the result back into an array ref (i.e. an array of cells) and put the whole lot into another array ref (i.e. an array of rows) before turning it all into a hash of array refs. The hash keys are the table names. The function returns a hash ref of array refs of array refs. This makes further processing (e.g. printing out address labels) very easy.
I haven't tested this code with nested tables so it might not behave as expected, but it meets my current needs.
Hopefully others might also find it useful.
UPDATE: amended reference to Text::CSV_XS
    package OOXMLSimple;

    use strict;
    use warnings;
    use Archive::Zip qw( :ERROR_CODES :CONSTANTS );
    use XML::Simple;
    use base 'Exporter';

    our @EXPORT = qw/parse_tables/;

    sub parse_tables {
        my $file = shift;

        # An OpenOffice file is a Zip archive; the document body is content.xml
        my $zip = Archive::Zip->new();
        die "Can't open $file" unless $zip->read( $file ) == AZ_OK;
        my $content = $zip->contents('content.xml');

        # Read bottom-up: all tables -> tables with rows -> rows -> cells -> text
        my %tables =
            map  { $_->{'table:name'} => [
                       map { [ map { $_->{'text:p'} } @$_ ] }
                       map { $_->{'table:table-cell'} }
                       @{ $_->{'table:table-row'} }
                   ] }
            grep { ref $_->{'table:table-row'} eq 'ARRAY' }
            @{ XMLin($content)->{'office:body'}->{'table:table'} };

        return \%tables;
    }

    1;
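To make the shape of the result concrete, here is a minimal sketch that runs the same map/grep chain over a hand-built structure instead of a real file. The sample data (`$parsed_tables`, the sheet names and the cell values) is hypothetical and only stands in for what XMLin might return for the 'table:table' element; real documents may well parse to something messier.

```perl
use strict;
use warnings;

# Hypothetical stand-in for XMLin($content)->{'office:body'}->{'table:table'}.
my $parsed_tables = [
    {
        'table:name'      => 'Sheet1',
        'table:table-row' => [
            { 'table:table-cell' => [ { 'text:p' => 'Alice' }, { 'text:p' => '1 High St' } ] },
            { 'table:table-cell' => [ { 'text:p' => 'Bob' },   { 'text:p' => '2 Low Rd' } ] },
        ],
    },
    # Assumed shape of an empty table: rows come back as a hash, not an array.
    { 'table:name' => 'Sheet2', 'table:table-row' => { 'table:table-cell' => {} } },
];

# Same chain as in parse_tables, read from the bottom up.
my %tables = map {
    ( $_->{'table:name'} => [
        map { [ map { $_->{'text:p'} } @$_ ] }    # cell hashes -> cell text
        map { $_->{'table:table-cell'} }          # row -> array ref of cells
        @{ $_->{'table:table-row'} }              # table -> its rows
    ] )
}
grep { ref $_->{'table:table-row'} eq 'ARRAY' }   # drop tables without a row array
@$parsed_tables;

# Further processing: one comma-separated line per row.
for my $name ( sort keys %tables ) {
    for my $row ( @{ $tables{$name} } ) {
        print join( ', ', @$row ), "\n";
    }
}
```

With this sample data the loop prints one line per row of Sheet1, and Sheet2 never makes it into the hash because the grep filters it out.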