psini has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I'm stuck on a problem that I know how to solve but I'm sure that it can be done much better.

The task: use OpenOffice to generate a doc that can be used as a template for generating other docs, using Perl. Think of it as a fill-in form, but with parts that can be removed (conditionally), repeated (iterating over an array or hash), or both. I don't need to programmatically edit the style of the doc, only the text.

The problem is that I want the template to be editable with OO, so its content has to be a well-formed XML file and every piece of metadata or command has to be in the text part. So, if I want a row in a table (say, the header) removed, I need to specify the conditional command in the text, but the template processor has to remove the entire <table-row> element from the XML stream.

I *could* do it in Perl: open the zipped doc, extract and parse the XML content, and process it with XML::Twig. I don't want to reinvent the wheel, but I don't see how this can be done using TT or any other template processor I've heard of.

Any ideas?

Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

Replies are listed 'Best First'.
Re: OpenOffice, XML and templates
by Your Mother (Archbishop) on Jun 29, 2009 at 21:33 UTC

    I'm in the same boat (or will be before year's end) and I've been looking at this for a while without diving in. I've seen some successful examples of extending the package(s) with helpers like this one for working with Impress docs (I think I klept it from some Japanese Perl hacker's blog a couple of months back):

    package OpenOffice::OODoc::Document;

    # Copy the page named "page$source" and append it as "page$dest".
    sub clone_page {
        my $self = shift;
        my ($source, $dest) = @_;
        my $p  = $self->getElement(page_xpath($source)) or die;
        my $p2 = $p->copy();
        $p2->paste(last_child => $self->getElement('//office:presentation'));
        $self->setAttributes($p2, 'draw:name' => "page$dest");
    }

    # Build the XPath that selects a page by its draw:name attribute.
    sub page_xpath {
        my ($page) = @_;
        sprintf('//draw:page[@draw:name="page%d"]', $page);
    }

    The lesson there being that it's XML::Twig underneath so you could use that directly with the OOD objects. You don't need to do any wrapping outside, just get into the guts of the objects directly with what's there already. Please submit patches back to the OOD author if you add anything generally functional.

    A TT approach would be quite doable, perhaps by breaking pieces out into BLOCKs and MACROs so that a given doc type could show all its possible function/content at a glance. Any text munging can be done with TT if you come at it correctly. I like OOD though, so the approach you (and I eventually) take should be based on the likelihood of OOD growing and getting better as more of us move away from MSFT-dependent packages... though a TT solution is certainly an interesting idea.

      I'm afraid I don't know TT well enough, so it may be that there is a way to do it, but I can't see it.

      As I see it, the problem arises when you want to cut (or replicate) a block. Say you have the following fragment:

      <document>
        ...
        <para>paragraph #1</para>
        <para>paragraph #2</para>
        <para>paragraph #3</para>
        ...
      </document>

      If you want to programmatically cut away the second paragraph you have to surround it with TT commands, but this breaks XML integrity and, worse, it is not editable from OO writer.

      But if you put the command inside the <para> tag you can delete the text but end up with an empty paragraph.

      What I would need is a template language allowing a sort of look-ahead and look-behind (perhaps look-around is the right term?) so I can say "remove this block and the surrounding <para> element too". I don't know if TT, or another template processor, can do this.
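A minimal sketch of that "remove the text and its enclosing element" idea, assuming XML::Twig is used directly on the XML. The `[% IF name %]` marker syntax, the `apply_conditions` helper and the `show_second` key are made up for illustration; they are not TT syntax or part of any existing module. The marker lives in the paragraph text, so the template stays well-formed and editable.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: for each <para> whose text starts a condition with
# "[% IF key %]", keep the paragraph (minus the marker) when $data->{key}
# is true, otherwise delete the whole <para> element.
sub apply_conditions {
    my ($xml, $data) = @_;
    my $twig = XML::Twig->new();
    $twig->parse($xml);

    for my $para ($twig->root->descendants('para')) {
        my $text = $para->text;                  # text of the element and its children
        next unless $text =~ /\[% IF (\w+) %\]/;
        if ($data->{$1}) {
            $text =~ s/\[% IF \w+ %\]\s*//;
            $para->set_text($text);              # note: replaces any inline children
        }
        else {
            $para->delete;                       # removes the element, tags included
        }
    }
    return $twig->sprint;
}

my $xml = '<document><para>paragraph #1</para>'
        . '<para>[% IF show_second %]paragraph #2</para>'
        . '<para>paragraph #3</para></document>';

print apply_conditions($xml, { show_second => 0 }), "\n";
```

In a real ODT the element would be text:p or table:table-row rather than para, and set_text would flatten any spans inside the paragraph, so this is only a starting point.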


        Maybe you want Petal then - it embeds its templating language into the XML making up the document, so you can eliminate whole elements and their children. I never found it too pleasing, as it's only suitable for well-formed XML documents, but that might be a plus in your situation. The theoretical advantage is that you can edit the "sample" content within the templates and OOo will still output the attributes that make up the (Pe)Tal language. I haven't tried this in practice though.

        Back when I had to do templates of Word documents, I channelled most data through Microsoft Office Document Properties, but the templates didn't have a need for fancy tables with a variable amount of rows.

        I assume you've looked at using LaTeX to produce your output already - it's quite powerful but I'm not aware of whether the WYSIWYG editors have improved, as I'm content with the plain text editing.

      I found nothing adequate to my needs, so I started working on it.

      The main idea is of a module that uses OpenOffice::OODoc::Document to manage the odt file, XML::Twig to manipulate the content, a HoAoH structured data block, and a minimal scripting language to describe the actions to be taken by the parser.

      I just wrote a draft describing the language to implement and uploaded it to psini's scratchpad. If you (or anyone) are interested in this project, any suggestion or criticism is welcome: I designed it around my specific needs at present, so I may have forgotten some possible use, and maybe the structure of the language is too simple or too complex. Last but not least, I know that my English is awful, so any grammar or stylistic correction is welcome too.
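As a rough illustration of the "XML::Twig plus a structured data block" combination described above, here is a sketch that repeats one template row per record. The `<table>`/`<row>`/`<cell>` element names, the `{name}` placeholder syntax and the `repeat_rows` helper are all assumptions invented for this example, not the language from the scratchpad draft.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: copy the first <row> once per record, fill in
# {key} placeholders from that record, then drop the template row.
sub repeat_rows {
    my ($xml, $records) = @_;
    my $twig = XML::Twig->new();
    $twig->parse($xml);

    my $template = $twig->root->first_descendant('row')
        or die "no <row> element found";

    for my $rec (@$records) {
        my $row = $template->copy;               # deep copy of the template row
        for my $t ($row->descendants('#PCDATA')) {
            my $s = $t->pcdata;
            $s =~ s!\{(\w+)\}!$rec->{$1} // ''!ge;   # unknown keys become ''
            $t->set_pcdata($s);
        }
        $row->paste(before => $template);        # keeps the records in order
    }
    $template->delete;
    return $twig->sprint;
}

my $xml  = '<table><row><cell>{name}</cell><cell>{qty}</cell></row></table>';
my $data = [ { name => 'apple', qty => 3 }, { name => 'pear', qty => 5 } ];

print repeat_rows($xml, $data), "\n";
```

One caveat: OpenOffice may split a placeholder across several text:span elements while you edit, in which case matching per text node as done here would miss it.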


Re: OpenOffice, XML and templates
by zwon (Abbot) on Jun 29, 2009 at 19:48 UTC

    I haven't actually worked with it myself, but maybe OpenOffice::OODoc would be useful for you.

      Yes, I forgot to mention it, but it was the other choice for parsing/editing the ODT file.

      Really my question was how to avoid reinventing (yet another) scripting/templating language.


Re: OpenOffice, XML and templates
by grantm (Parson) on Jul 01, 2009 at 02:16 UTC

    As you no doubt realise, the OpenOffice document format is just a zip file containing (amongst other things) a file called content.xml which is your document. So you can open it up with Archive::Zip and then use whatever XML manipulation tool you like on it.

    I would probably tend toward using XML::LibXML. If you edit the document in OpenOffice and assign a unique style to each block of text that you might want to replace/remove, then you can find the document nodes using an XPath expression to match the style. Then you have the DOM manipulation methods at your disposal to edit the nodes.

    In this example, I've skipped the Archive::Zip step and included the content.xml directly in the __DATA__ section but it illustrates finding a paragraph by matching on its style (in this case I used a style called 'VariableTextSurname'):

    #!/usr/bin/perl

    use strict;
    use warnings;

    use XML::LibXML;
    use XML::LibXML::XPathContext;

    my $parser = XML::LibXML->new();
    my $doc    = $parser->parse_fh(\*DATA);

    my $xc = XML::LibXML::XPathContext->new( $doc->documentElement() );
    $xc->registerNs( text => 'urn:oasis:names:tc:opendocument:xmlns:text:1.0' );

    my $xpath = q{//text:p[@text:style-name="VariableTextSurname"]};

    foreach my $p ($xc->findnodes($xpath)) {
        print "Found a variable para\n " . $p->to_literal . "\n";
        # could do e.g.: $p->parentNode->removeChild($p);
    }

    # After manipulations, serialise back to XML with:
    # my $xml = $doc->toString();

    exit;

    __DATA__
    <?xml version="1.0" encoding="UTF-8"?>
    <office:document-content
      xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
      xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
      xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
      xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
      xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
      xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
      xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0"
      xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
      xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0"
      xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0"
      xmlns:math="http://www.w3.org/1998/Math/MathML"
      xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0"
      xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0"
      xmlns:ooo="http://openoffice.org/2004/office"
      xmlns:ooow="http://openoffice.org/2004/writer"
      xmlns:oooc="http://openoffice.org/2004/calc"
      xmlns:dom="http://www.w3.org/2001/xml-events"
      xmlns:xforms="http://www.w3.org/2002/xforms"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:rpt="http://openoffice.org/2005/report"
      xmlns:of="urn:oasis:names:tc:opendocument:xmlns:of:1.2"
      xmlns:rdfa="http://docs.oasis-open.org/opendocument/meta/rdfa#"
      xmlns:field="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:field:1.0"
      xmlns:formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0"
      office:version="1.2"
    ><office:scripts
    /><office:font-face-decls
    ><style:font-face style:name="Times New Roman" svg:font-family="&apos;Times New Roman&apos;" style:font-family-generic="roman" style:font-pitch="variable"
    /><style:font-face style:name="Arial" svg:font-family="Arial" style:font-family-generic="swiss" style:font-pitch="variable"
    /><style:font-face style:name="DejaVu Sans" svg:font-family="&apos;DejaVu Sans&apos;" style:font-family-generic="system" style:font-pitch="variable"
    /></office:font-face-decls
    ><office:automatic-styles
    /><office:body
    ><office:text
    ><text:sequence-decls
    ><text:sequence-decl text:display-outline-level="0" text:name="Illustration"
    /><text:sequence-decl text:display-outline-level="0" text:name="Table"
    /><text:sequence-decl text:display-outline-level="0" text:name="Text"
    /><text:sequence-decl text:display-outline-level="0" text:name="Drawing"
    /></text:sequence-decls
    ><text:p text:style-name="Standard"
    >Paragraph One</text:p
    ><text:p text:style-name="VariableTextSurname"
    >Paragraph Two</text:p
    ><text:p text:style-name="Standard"
    >Paragraph Three</text:p
    ></office:text
    ></office:body
    ></office:document-content
    >

    (I did add some extra whitespace into the XML for readability).

    Although you can add your own attributes to the XML, they seem to disappear if you edit the document using OpenOffice.
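For completeness, the Archive::Zip step that the example above skips might look roughly like this. It is a sketch only: the helper names `read_content_xml` and `write_content_xml` and the file names in the usage comment are hypothetical, and it assumes the edited XML is passed back as an encoded byte string (which is what `$doc->toString()` returns for a document with a declared encoding).

```perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);

# Hypothetical helper: open an ODF file and return the archive object
# together with the raw bytes of its content.xml member.
sub read_content_xml {
    my ($path) = @_;
    my $zip = Archive::Zip->new();
    $zip->read($path) == AZ_OK
        or die "Cannot read zip file '$path'";
    my $xml = $zip->contents('content.xml');
    defined $xml
        or die "No content.xml found in '$path'";
    return ($zip, $xml);
}

# Hypothetical helper: replace content.xml in the archive and write the
# result to a NEW file (not the file the archive was read from).
sub write_content_xml {
    my ($zip, $xml, $out_path) = @_;
    $zip->contents('content.xml', $xml);
    $zip->writeToFileNamed($out_path) == AZ_OK
        or die "Cannot write zip file '$out_path'";
    return;
}

# Usage (file names are placeholders):
#   my ($zip, $xml) = read_content_xml('template.odt');
#   my $doc = XML::LibXML->new->parse_string($xml);
#   ... edit $doc as in the example above ...
#   write_content_xml($zip, $doc->toString(), 'output.odt');
```

Writing to a separate output file keeps the template untouched, which suits the "one template, many generated docs" use described in the original question.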