Parsing OpenOffice Spreadsheets

I needed to pull some data out of an OpenOffice spreadsheet for formatting. My first attempt was to save the spreadsheets as CSV and used Text::CSV_XS. Unfortunately, Text::CSV_XS chokes on both 8859-1 and utf8 text - at least it does if you forget to set the binary attribute to true :). So, I figured that parsing the OO spreadsheet directly would be the better solution. Now, I'm no XML head, but after reading all about it, I came up with two methods of doing this. One using XML::Twig which is, IMHO, more elegant, and this one using XML::Simple which is far, far terser.

Two preliminary points to note are 1) OpenOffice files are Zip archives 2) the content is contained in the archive in a file called 'content.xml'.

The meat of the code is in the line which starts "my %tables". Working from the bottom. The bottomline pulls out all the tables from the document. This is an array of tables. The text line makes sure that we have an 'array'. This is necessary because if we have an empty table XML::Simple parses it to a hash rather than an array and that breaks all the subsequent code.

The next line pulls out the tables rows (which are an array ref) and the cells in each row (also an array ref). This latter is an array ref of hashes. We only want the 'text:p' element. We then need to turn it back to an array ref (i.e. an array of cells) and put the whole lot back into another array ref (i.e. an array of rows) before turning into a hash of array refs. The hash keys are the table names. The function returns a hash ref of array refs of array refs. This makes further processing (e.g. printing out address labels) very easy.

I haven't tested this code with nested tables so it might not behave as expected, but it meets my current needs.

Hopefully others might also find it useful.

UPDATE: amended reference to Text::CSV_XS

package OOXMLSimple;

use Archive::Zip qw( :ERROR_CODES :CONSTANTS );
use XML::Simple;

use base 'Exporter';

@EXPORT = qw/parse_tables/;

sub parse_tables {
  my $file = shift;
  my $zip = Archive::Zip->new();
  die "Can't open $file"  unless $zip->read( $file ) == AZ_OK;
  my $content = $zip->contents('content.xml');

  my %tables = map {$_->{'table:name'} => [
         map{ [ map { $_->{'text:p'}  } @$_ ] }
         map {$_->{'table:table-cell'} } @{  $_->{'table:table-row'}} 
+] }
         grep {ref $_->{'table:table-row'} eq 'ARRAY' }
         @{  XMLin($content)->{'office:body'}->{'table:table'} };




  return \%tables;
}

1;
[download]

Comment on Parsing OpenOffice Spreadsheets Download Code

Replies are listed 'Best First'.
Re: Parsing OpenOffice Spreadsheets by jmcnamara (Monsignor) on Nov 17, 2005 at 11:26 UTC
It may also be worth mentioning that there are some CPAN modules for parsing OOo spreadsheets: Spreadsheet::ReadSXC, OpenOffice::OOSheets and Spreadsheet::Read. -- John.	[reply] [d/l]
Re: Parsing OpenOffice Spreadsheets by jZed (Prior) on Nov 17, 2005 at 15:48 UTC
Unfortunately, Text::CSV_XS chokes on both 8859-1 and utf8 text. Did you set binary => 1, which is required to use extended character sets? From the docs: "Allowable characters within a CSV field include 0x09 (tab) and the inclusive range of 0x20 (space) through 0x7E (tilde). In binary mode all characters are accepted, at least in quoted fields" Yes, I know "binary" does not mean what one thinks it means here (inconceivable!) but the docs are pretty clear on this point otherwise.	[reply]
Re^2: Parsing OpenOffice Spreadsheets by Nomad (Pilgrim) on Nov 17, 2005 at 15:58 UTC
Did you set binary => 1, which is required to use extended character sets? No, I didn't. Partly because I didn't understand that 'binary' in Text::CSV_XS meant non-ascii characters, but then I didn't look into CSV as deeply as perhaps I might have because I thought I'd better get to grips with the XML and use a pure UTF-8 solution. The beauty of openoffice is that it uses an open document format - I thought it was about high-time I got my head around it and made some way to understanding it. The next task is, of course, to do the same thing with gnumeric, but last time I looked the document format wasn't as well documented.	[reply]
Re^3: Parsing OpenOffice Spreadsheets by jZed (Prior) on Nov 17, 2005 at 16:02 UTC
Yep, I agree with your choice. Text::CSV_XS is not the right tool for your job. I'm just correcting the FUD in case someone is thinking of using it for a different task :-).	[reply]