I needed to pull some data out of an OpenOffice spreadsheet for formatting. My first attempt was to save the spreadsheets as CSV and used Text::CSV_XS. Unfortunately, Text::CSV_XS chokes on both 8859-1 and utf8 text - at least it does if you forget to set the binary attribute to true :). So, I figured that parsing the OO spreadsheet directly would be the better solution. Now, I'm no XML head, but after reading all about it, I came up with two methods of doing this. One using XML::Twig which is, IMHO, more elegant, and this one using XML::Simple which is far, far terser.

Two preliminary points to note are 1) OpenOffice files are Zip archives 2) the content is contained in the archive in a file called 'content.xml'.

The meat of the code is in the line which starts "my %tables". Working from the bottom. The bottomline pulls out all the tables from the document. This is an array of tables. The text line makes sure that we have an 'array'. This is necessary because if we have an empty table XML::Simple parses it to a hash rather than an array and that breaks all the subsequent code.

The next line pulls out the tables rows (which are an array ref) and the cells in each row (also an array ref). This latter is an array ref of hashes. We only want the 'text:p' element. We then need to turn it back to an array ref (i.e. an array of cells) and put the whole lot back into another array ref (i.e. an array of rows) before turning into a hash of array refs. The hash keys are the table names. The function returns a hash ref of array refs of array refs. This makes further processing (e.g. printing out address labels) very easy.

I haven't tested this code with nested tables so it might not behave as expected, but it meets my current needs.

Hopefully others might also find it useful.

UPDATE: amended reference to Text::CSV_XS

package OOXMLSimple; use Archive::Zip qw( :ERROR_CODES :CONSTANTS ); use XML::Simple; use base 'Exporter'; @EXPORT = qw/parse_tables/; sub parse_tables { my $file = shift; my $zip = Archive::Zip->new(); die "Can't open $file" unless $zip->read( $file ) == AZ_OK; my $content = $zip->contents('content.xml'); my %tables = map {$_->{'table:name'} => [ map{ [ map { $_->{'text:p'} } @$_ ] } map {$_->{'table:table-cell'} } @{ $_->{'table:table-row'}} +] } grep {ref $_->{'table:table-row'} eq 'ARRAY' } @{ XMLin($content)->{'office:body'}->{'table:table'} }; return \%tables; } 1;

In reply to Parsing OpenOffice Spreadsheets by Nomad

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.