As you no doubt realise, the OpenOffice document format is just a zip file containing (amongst other things) a file called content.xml which is your document. So you can open it up with Archive::Zip and then use whatever XML manipulation tool you like on it.

I would probably tend toward using XML::LibXML. If you edit the document in OpenOffice and assign a unique style to each block of text that you might want to replace/remove, then you can find the document nodes using an XPath expression to match the style. Then you have the DOM manipulation methods at your disposal to edit the nodes.

In this example, I've skipped the Archive::Zip step and included the content.xml directly in the __DATA__ section but it illustrates finding a paragraph by matching on its style (in this case I used a style called 'VariableTextSurname'):

#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML;
use XML::LibXML::XPathContext;

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_fh(\*DATA);

my $xc = XML::LibXML::XPathContext->new( $doc->documentElement() );
$xc->registerNs( text => 'urn:oasis:names:tc:opendocument:xmlns:text:1.0' );

my $xpath = q{//text:p[@text:style-name="VariableTextSurname"]};

foreach my $p ($xc->findnodes($xpath)) {
    print "Found a variable para\n " . $p->to_literal . "\n";
    # could do e.g.: $p->parentNode->removeChild($p);
}

# After manipulations, serialise back to XML with:
# my $xml = $doc->toString();

exit;

__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
  xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
  xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
  xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
  xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
  xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
  xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
  xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0"
  xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
  xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0"
  xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0"
  xmlns:math="http://www.w3.org/1998/Math/MathML"
  xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0"
  xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0"
  xmlns:ooo="http://openoffice.org/2004/office"
  xmlns:ooow="http://openoffice.org/2004/writer"
  xmlns:oooc="http://openoffice.org/2004/calc"
  xmlns:dom="http://www.w3.org/2001/xml-events"
  xmlns:xforms="http://www.w3.org/2002/xforms"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:rpt="http://openoffice.org/2005/report"
  xmlns:of="urn:oasis:names:tc:opendocument:xmlns:of:1.2"
  xmlns:rdfa="http://docs.oasis-open.org/opendocument/meta/rdfa#"
  xmlns:field="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:field:1.0"
  xmlns:formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0"
  office:version="1.2"
><office:scripts
/><office:font-face-decls
><style:font-face style:name="Times New Roman" svg:font-family="&apos;Times New Roman&apos;" style:font-family-generic="roman" style:font-pitch="variable"
/><style:font-face style:name="Arial" svg:font-family="Arial" style:font-family-generic="swiss" style:font-pitch="variable"
/><style:font-face style:name="DejaVu Sans" svg:font-family="&apos;DejaVu Sans&apos;" style:font-family-generic="system" style:font-pitch="variable"
/></office:font-face-decls
><office:automatic-styles
/><office:body
><office:text
><text:sequence-decls
><text:sequence-decl text:display-outline-level="0" text:name="Illustration"
/><text:sequence-decl text:display-outline-level="0" text:name="Table"
/><text:sequence-decl text:display-outline-level="0" text:name="Text"
/><text:sequence-decl text:display-outline-level="0" text:name="Drawing"
/></text:sequence-decls
><text:p text:style-name="Standard"
>Paragraph One</text:p
><text:p text:style-name="VariableTextSurname"
>Paragraph Two</text:p
><text:p text:style-name="Standard"
>Paragraph Three</text:p
></office:text
></office:body
></office:document-content
>

(I did add some extra whitespace into the XML for readability).
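If you want to replace the text rather than remove the paragraph, something along these lines might do. It is only a sketch: the helper name fill_style, the cut-down sample XML and the value 'Smith & Jones' are made up for illustration, and it assumes the style name contains no double quotes since it is interpolated straight into the XPath expression.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down stand-in for content.xml; only the text namespace is declared.
my $xml = <<'XML';
<doc xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
<text:p text:style-name="Standard">Dear</text:p>
<text:p text:style-name="VariableTextSurname">Paragraph Two</text:p>
</doc>
XML

# Hypothetical helper: replace the text of every paragraph that uses the
# given style. Returns the number of paragraphs changed.
sub fill_style {
    my ($doc, $style, $value) = @_;

    my $xc = XML::LibXML::XPathContext->new( $doc->documentElement() );
    $xc->registerNs( text => 'urn:oasis:names:tc:opendocument:xmlns:text:1.0' );

    my @paras = $xc->findnodes(qq{//text:p[\@text:style-name="$style"]});
    foreach my $p (@paras) {
        $p->removeChildNodes();    # drop the placeholder text
        $p->appendText($value);    # special characters are escaped on output
    }
    return scalar @paras;
}

my $doc     = XML::LibXML->load_xml( string => $xml );
my $changed = fill_style($doc, 'VariableTextSurname', 'Smith & Jones');

print "Changed $changed paragraph(s)\n";
print $doc->toString();
```

Note that removeChildNodes() throws away any spans or other markup inside the paragraph, so this only suits placeholders that are plain text.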

Although you can add your own attributes to the XML, they seem to disappear if you edit the document using OpenOffice.
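As for the Archive::Zip step that the example skips, a rough sketch of the round trip could look like this. The sub name edit_odt and the file names given on the command line are my own inventions; the callback just applies the removeChild idea from the example above. I haven't run this against a wide range of documents, so treat it as a starting point.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);
use XML::LibXML;

# Hypothetical helper: read content.xml from $in, hand the parsed DOM to
# the $edit callback, then write the whole archive out to $out.
sub edit_odt {
    my ($in, $out, $edit) = @_;

    my $zip = Archive::Zip->new();
    $zip->read($in) == AZ_OK or die "Cannot read $in\n";

    my $xml = $zip->contents('content.xml');
    defined $xml or die "No content.xml in $in\n";

    my $doc = XML::LibXML->load_xml( string => $xml );
    $edit->($doc);

    # Put the serialised document back into the archive member
    $zip->contents('content.xml', $doc->toString());
    $zip->writeToFileNamed($out) == AZ_OK or die "Cannot write $out\n";
    return;
}

# Usage: perl edit_odt.pl in.odt out.odt
if (@ARGV == 2) {
    edit_odt(@ARGV, sub {
        my ($doc) = @_;
        my $xc = XML::LibXML::XPathContext->new( $doc->documentElement() );
        $xc->registerNs( text => 'urn:oasis:names:tc:opendocument:xmlns:text:1.0' );
        # Remove every paragraph carrying the marker style
        foreach my $p ($xc->findnodes(q{//text:p[@text:style-name="VariableTextSurname"]})) {
            $p->parentNode->removeChild($p);
        }
    });
}
```

Writing to a separate output file keeps the original template untouched, which is handy if you are generating many documents from it.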


In reply to Re: OpenOffice, XML and templates by grantm
in thread OpenOffice, XML and templates by psini
