jinnicky has asked for the wisdom of the Perl Monks concerning the following question:

I have a couple of hundred .odf files. I would like to get the title and author from each of them.

I've been working with OpenOffice::OODoc. I see hints that there is some way to search the document, but no good description.

They all have paragraphs with a style of 'Title' or 'Author', but internally these may be stored under automatic style names of the form Pnn.

  • Comment on Search OpenOffice document for title and author [Solved]

Re: Search OpenOffice document for title and author
by kcott (Archbishop) on Feb 25, 2016 at 05:31 UTC

    G'day jinnicky,

    The module you'll want for that is OpenOffice::OODoc::Meta.

    I haven't previously used this. I did install OpenOffice::OODoc some time ago, and OpenOffice::OODoc::Meta, along with the other OpenOffice::OODoc::* modules, appears to be bundled with it (see OpenOffice-OODoc distribution details). So, if you have OpenOffice::OODoc, you probably also have the related modules.

    The documentation looks good and usage seems straightforward.

    I created a very basic text document for testing (pm_1156099_test.odt) and added a title ("PM 1156099 Test Document") via the Properties menu item. I then created this test script:

    #!/usr/bin/env perl -l
    use strict;
    use warnings;
    use OpenOffice::OODoc::Meta;

    my $meta = OpenOffice::OODoc::Meta::->new(file => 'pm_1156099_test.odt');

    print 'Author: ', $meta->creator();
    print 'Title: ', $meta->title();
    print 'Created: ', $meta->date();

    This produced this output:

    Author: Ken Cotterill
    Title: PM 1156099 Test Document
    Created: 2016-02-25T15:23:56

    There's lots of other metadata you can access if you want.

    [I do recall hearing something about OpenOffice::OODoc being superseded by ODF::lpOD. Both sets of modules are by the same author, Jean-Marie Gouarné, and the ODF::lpOD distribution is more recent. I looked around for some definitive information on this but was unsuccessful, so that remains unconfirmed: perhaps another monk can provide something more substantial on this matter.]

    — Ken

      The metadata doesn't show the title, which is a paragraph with the style 'Title' or Psomething.

      I installed ODF::lpOD once I figured out that lpOD starts with a lower-case L. Its documentation is voluminous and not much better than OpenOffice::OODoc's.

      However I did get it to work!

      #!/usr/bin/perl -w
      use strict;
      use ODF::lpOD;

      my $file = $ARGV[0];
      die "You must supply an odf file name\n" unless $file;
      my $doc = odf_document->get($file) or die "Can't load $file\n$!\n";
      my $context = $doc->body;
      my $meta = $doc->meta;

      # Doesn't do what I wanted
      my $title = $meta->get_title;
      print "Title: $title\n" if $title; # shows Title: c if at all
      my $p = $context->get_paragraph(style=>'Title', content=>'ODF',position=>0);
      if ($p) {print $p->get_text()."\n";}
      else {print "Not found\n";} # prints Not found

      #this works
      my ($i,$ps,$style,$pStyle);
      for ($i = 0; $i < 6; $i++) {
          $p = $context->get_paragraph(position=>$i);
          if ($p) {
              print "Paragraph $i";
              $ps = $p->get_style();
              if ($ps) {
                  if ($ps =~ m/^P\d+$/) { # Check for internal styles
                      $style = $doc->get_style('paragraph',$ps) || '';
                      $pStyle = $style->get_parent_style if $style;
                      $ps = $pStyle if $pStyle; # This gets the real name
                  }
                  print " Style: $ps\n";
              }
              else {print "No Style\n";}
              my $data = $p->get_text(recursive=>1);
              if ($data) {print "$data\n";}
              else {print "--No data\n";}
          }
          else {
              print "Paragraph $i not found\n";
          }
      }

      Thanks Ken

      —Bob

        The ODF::lpOD modules worked on .odf files but choked on .sxw (version 1.1) files.

        Ken's suggestion about the similarity between those modules and the OpenOffice::OODoc modules helped me to go back to them and come up with this code:

        #!/usr/bin/perl
        use warnings;
        use strict;
        use OpenOffice::OODoc;

        # setup file and styles
        my $file = $ARGV[0];
        die "You must supply an odf file name\n" unless $file;
        print "File: $file\n";
        my $container = odfContainer($file)
            or die "Can't get document $file\n";
        my $doc = odfDocument(container => $container, part => 'content')
            or die "Can't get content in $file\n";

        my ($i,$element,$text,$style);
        for ($i = 0; $i < 10; $i++) {
            $element = $doc->getElement('//text:p',$i);
            if ($element) {
                $text = $doc->getText($element);
                if ($text) {
                    $style = $doc->getAttribute($element,'style name')||'';
                    if ($style && ($style =~ m/^P\d+$/)) {
                        $style = $doc->getAncestorStyle($style);
                    }
                    print "Paragraph $i: ";
                    print "($style) " if $style;
                    print "$text" if $text;
                    print "\n";
                }
            }
        }
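
        The script above prints every paragraph along with its style. To get only the title and author, the remaining step is to keep the first paragraph whose resolved style is 'Title' or 'Author'. Below is a minimal sketch of that selection logic in plain Perl, with no ODF modules involved. The names resolve_style and pick_by_style, the %parent_of lookup table, and the sample paragraphs are all hypothetical and only for illustration; in a real run the style and text would come from the getAttribute, getAncestorStyle, and getText calls shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Resolve an automatic style name (P1, P2, ...) to its parent style.
# $parent_of is a hypothetical lookup table standing in for the
# library call that returns the parent of an automatic style.
sub resolve_style {
    my ($style, $parent_of) = @_;
    return $style unless defined $style && $style =~ /^P\d+$/;
    return $parent_of->{$style} // $style;   # fall back to the Pnn name
}

# Given paragraphs as [style, text] pairs, return a hash reference with
# the text of the first paragraph found for each wanted style name.
sub pick_by_style {
    my ($paragraphs, $parent_of, @wanted) = @_;
    my %want = map { $_ => 1 } @wanted;
    my %found;
    for my $para (@$paragraphs) {
        my ($style, $text) = @$para;
        my $name = resolve_style($style, $parent_of);
        next unless defined $name && $want{$name};
        $found{$name} //= $text;             # keep only the first match
    }
    return \%found;
}

# Made-up sample data standing in for the paragraphs of one document.
my %parent_of  = (P1 => 'Title', P2 => 'Author');
my @paragraphs = (
    [ 'P1',        'A Sample Title' ],
    [ 'P2',        'A. N. Author' ],
    [ 'Text body', 'First paragraph of the text.' ],
);

my $found = pick_by_style(\@paragraphs, \%parent_of, qw(Title Author));
print "Title: $found->{Title}\n"   if defined $found->{Title};
print "Author: $found->{Author}\n" if defined $found->{Author};
```

        Keeping the selection separate from the document access means the same routine could be fed from either module, and a loop over a couple of hundred files only has to build the paragraph list for each one.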

        —Bob