melutovich has asked for the wisdom of the Perl Monks concerning the following question:

We have a system that generates ODT files from a template plus a client's responses to many questions.

Line breaks get added at various locations in the generated ODT file; however, they make the resulting document look poorly formatted.

I've been tasked with writing Perl code that searches the ODT content for the line breaks and converts each one into a close of the parent tag followed by a new parent tag of the same type and style.

Originally I tried to just use a few regexes to split the content.xml (extracted using Archive::Zip), which worked on a few ODT files but fails on more complicated XML.

I was replacing a <text:line-break/> with a </text:p><text:p text:style-name="XXX">

however my solution fails when it encounters

<text:span text:style-name="T71"><text:s/><text:line-break/></text:span>

which became
<text:span text:style-name="T71"><text:s/></text:p><text:p text:style-name="P104"></text:span>

That is, the </text:p><text:p ...> gets inserted before the <text:span> has been closed, so the result is no longer well-formed.
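A minimal sketch that reproduces the failure, using only core Perl. The one-paragraph input is a simplified assumption modelled on the fragment above, and `naive_split` and `first_mismatch` are hypothetical helper names; the tag check is a crude stack walk for illustration, not a real XML parser.

```perl
use strict;
use warnings;

# Simplified, assumed input: the problem span wrapped in one paragraph.
my $xml = '<text:p text:style-name="P104">'
        . '<text:span text:style-name="T71"><text:s/><text:line-break/></text:span>'
        . '</text:p>';

# The naive approach: close and reopen the paragraph at every line break.
sub naive_split {
    my ($in, $style) = @_;
    (my $out = $in)
        =~ s{<text:line-break/>}{</text:p><text:p text:style-name="$style">}g;
    return $out;
}

# Crude nesting check: push opening tags, pop on closing tags, and
# describe the first mismatch (undef means the nesting is balanced).
sub first_mismatch {
    my ($in) = @_;
    my @open;
    while ($in =~ m{<(/?)([\w:-]+)[^>]*?(/?)>}g) {
        my ($closing, $name, $empty) = ($1, $2, $3);
        next if $empty;                      # <text:s/> opens nothing
        if ($closing) {
            my $expected = pop @open;
            return 'expected </' . ($expected // '(nothing)')
                 . "> but found </$name>"
                if !defined $expected || $expected ne $name;
        }
        else {
            push @open, $name;
        }
    }
    return @open ? "unclosed <$open[-1]>" : undef;
}

my $broken = naive_split($xml, 'P104');
print "$broken\n";
print first_mismatch($broken) // 'well-formed', "\n";
```

The substitution has no idea that a <text:span> is still open at the point of the line break, which is exactly the information a parser-based approach keeps track of.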

A regex-based solution is probably too simple to handle the complex XML that might exist, so I will probably need a module that understands ODT and/or XML and that lets me do this from Perl code.

Suggestions?

Replies are listed 'Best First'.
Re: fix ODT files with line breaks looking poor
by haukex (Archbishop) on Apr 07, 2019 at 15:40 UTC

    Using LibreOffice I whipped up the following minimal example, edited down to the minimum needed:

    <?xml version="1.0"?>
    <office:document-content office:version="1.2"
        xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
        xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
    <office:body><office:text>
    <text:p text:style-name="P1">
        Fo<text:span text:style-name="T1">o<text:line-break/>
        B</text:span><text:span text:style-name="T3">a</text:span>
        <text:span text:style-name="T5">r<text:line-break/></text:span>
    </text:p>
    </office:text></office:body>
    </office:document-content>

    The transformation the XML document needs is something like this: <r><p>a<x>b<s/>c</x>d</p></r>

    <r>
    `-- <p>
        `+- a
         +- <x>
         |  `+- b
         |   +- <s>
         |   `- c
         `- d

    To this: <r><p>a<x>b</x></p><p><x>c</x>d</p></r>

    <r>
    `+- <p>
     |  `+- a
     |   `- <x>
     |      `-- b
     `- <p>
        `+- <x>
         |  `-- c
         `- d

    Where I think the elements surrounding <s/>, represented here as <x>, could even be nested more than one level, i.e. maybe <r><p>a<x>b<y>c<x>d<s/>e</x>f</y>g</x>h</p></r>

    A very interesting problem; unfortunately, I don't have enough time to invest at the moment. You definitely shouldn't try to do this with regexes. If the files aren't too big, I'd probably approach this with XML::LibXML...

    Update 2: On second thought, a stream-based parser might be easier in this case... hmmm...

      > If the files aren't too big, I'd probably approach this with XML::LibXML...

      And if they are too big, you can probably still use it, because it provides XML::LibXML::Reader.
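A minimal sketch of what the streaming interface looks like, assuming all you want at first is to find each line break and see which elements are open around it. It only detects; it does not do the split. `line_break_paths` and the sample document are hypothetical names made up for this illustration.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

my $TEXT_NS = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0';

# Hypothetical helper: for each <text:line-break/>, record the names of
# the elements that are open at that point, outermost first.
sub line_break_paths {
    my ($xml) = @_;
    my $reader = XML::LibXML::Reader->new(string => $xml)
        or die "cannot create reader\n";
    my (@open, @paths);
    while ($reader->read == 1) {
        my $type = $reader->nodeType;
        if ($type == XML_READER_TYPE_ELEMENT) {
            push @paths, [@open]
                if $reader->localName eq 'line-break'
                && ($reader->namespaceURI // '') eq $TEXT_NS;
            # Empty elements such as <text:s/> produce no end-element event.
            push @open, $reader->name unless $reader->isEmptyElement;
        }
        elsif ($type == XML_READER_TYPE_END_ELEMENT) {
            pop @open;
        }
    }
    return @paths;
}

# Assumed sample input, cut down from the example in this thread.
my $sample = <<"END_XML";
<r xmlns:text="$TEXT_NS"><text:p text:style-name="P1">Fo<text:span text:style-name="T1">o<text:line-break/>B</text:span></text:p></r>
END_XML

print join(' > ', @$_), "\n" for line_break_paths($sample);
```

The reader never holds more than the current node in memory, so the stack of open element names is the only state needed to know what would have to be closed and reopened at each break.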

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      Took a stab at Option 1; need to test it more but looks promising.
      use warnings;
      use strict;
      use XML::LibXML;

      my $dom = XML::LibXML->load_xml(string => <<'END_XML');
      <?xml version="1.0"?>
      <office:document-content office:version="1.2"
          xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
          xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
      <office:body><office:text>
      <text:p text:style-name="P1">
          Fo<text:span text:style-name="T1">o<text:line-break/>
          B</text:span><text:span text:style-name="T3">a</text:span>
          <text:span text:style-name="T5">r<text:line-break/></text:span>
        </text:p>
      <text:p text:style-name="P3">
        TEST is a &apos; real test.
        </text:p>
      </office:text></office:body>
      </office:document-content>
      END_XML

      my $xpc = XML::LibXML::XPathContext->new($dom);
      $xpc->registerNs('office', 'urn:oasis:names:tc:opendocument:xmlns:office:1.0');
      $xpc->registerNs('text', 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');

      # Handle one line break per pass until none are left.
      while (1) {
          my ($lb) = $xpc->findnodes('//text:line-break') or last;
          die "can't handle <text:line-break> with children: $lb"
              if $lb->hasChildNodes;
          # $a is the element directly enclosing the break, $a_a is its parent.
          my ($a) = $xpc->findnodes('ancestor::text:*[1]',$lb)
              or die "failed to find ancestor of $lb";
          my ($a_a) = $xpc->findnodes('ancestor::*[1]',$a)
              or die "failed to find ancestor of ancestor of $a";
          # Shallow clone: same tag and attributes, no children.
          my $clone_a = $a->cloneNode(0);
          # Move everything after the break into the clone.
          my $nextSibling = $lb->nextSibling();
          while ( $nextSibling ) {
              my $currentSibling = $nextSibling;
              $nextSibling = $currentSibling->nextSibling();
              $currentSibling = $a->removeChild($currentSibling);
              $clone_a->addChild($currentSibling);
          }
          $a->removeChild($lb);
          $a_a->insertAfter($clone_a,$a);
      }
      print $dom;
      Which produces
      <?xml version="1.0"?>
      <office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" office:version="1.2">
      <office:body><office:text>
      <text:p text:style-name="P1">
          Fo<text:span text:style-name="T1">o</text:span><text:span text:style-name="T1">
          B</text:span><text:span text:style-name="T3">a</text:span>
          <text:span text:style-name="T5">r</text:span><text:span text:style-name="T5"/>
        </text:p>
      <text:p text:style-name="P3">
        TEST is a ' real test.
        </text:p>
      </office:text></office:body>
      </office:document-content>
      

        Here's my solution, it also handles the more deeply nested cases like <r><p>a<x>b<y>c<x>d<s/>e</x>f</y>g</x>h</p></r>:

        use warnings;
        use strict;
        use XML::LibXML;

        my $dom = XML::LibXML->load_xml(string => <<'END_XML');
        <?xml version="1.0"?>
        <office:document-content office:version="1.2"
            xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
            xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
        <office:body><office:text>
        <text:p text:style-name="P1">
            Fo<text:span text:style-name="T1">o<text:line-break/>B</text:span>
            <text:span text:style-name="T3">a</text:span>
            <text:span text:style-name="T5">r<text:line-break/></text:span>
        </text:p>
        </office:text></office:body>
        </office:document-content>
        END_XML

        my $xpc = XML::LibXML::XPathContext->new($dom);
        $xpc->registerNs('office', 'urn:oasis:names:tc:opendocument:xmlns:office:1.0');
        $xpc->registerNs('text', 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');

        my $breakat = 'text:line-break';
        my $ancestor = 'text:p';
        while (1) {
            my ($br1) = $xpc->findnodes("//$ancestor//${breakat}[1]") or last;
            die "can't handle <$breakat> with children: $br1"
                if $br1->hasChildNodes;
            my ($an1) = $xpc->findnodes("ancestor::${ancestor}[1]",$br1)
                or die "failed to find <$ancestor> ancestor of $br1";
            my $an2 = $an1->cloneNode(1);
            my ($br2) = $xpc->findnodes(".//${breakat}[1]", $an2)
                or die "internal error: failed to find <$breakat> in $an2";
            for ( my $cur = $br1; $cur!=$an1 ; $cur = $cur->parentNode ) {
                while ( my $s = $cur->nextSibling ) { $s->unbindNode }
            }
            $br1->unbindNode;
            for ( my $cur = $br2; $cur!=$an2 ; $cur = $cur->parentNode ) {
                while ( my $s = $cur->previousSibling ) { $s->unbindNode }
            }
            $br2->unbindNode;
            $an1->parentNode->insertAfter($an2, $an1);
        }
        print $dom;

        __END__
        <?xml version="1.0"?>
        <office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" office:version="1.2">
        <office:body><office:text>
        <text:p text:style-name="P1">
            Fo<text:span text:style-name="T1">o</text:span></text:p><text:p text:style-name="P1"><text:span text:style-name="T1">B</text:span>
            <text:span text:style-name="T3">a</text:span>
            <text:span text:style-name="T5">r</text:span></text:p><text:p text:style-name="P1"><text:span text:style-name="T5"/>
        </text:p>
        </office:text></office:body>
        </office:document-content>
Re: fix ODT files with line breaks looking poor
by roboticus (Chancellor) on Apr 07, 2019 at 14:49 UTC

    melutovich:

    I don't do a lot of XML processing, but if I had to do what you're talking about, I think I'd reach for XML::Twig. It will handle the XML parsing for you, and you can add handlers that fire on particular tags and edit the XML there. You might be able to handle your task by adding handlers for <text:p> and <text:line-break> that detect the line breaks and break the content into multiple paragraphs with the correct style.

    Another option might be XML::XSLT to write transformation rules to alter the document. I've used XSLT in projects before, and it worked well. The difficulty I had with it is that it's essentially another language, and since I didn't use it often, each project with it was a learning experience. If you're going to do a lot of transformations it may be worth your while to learn it.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      I'll give this some thought.

      As for your suggestion of XML::Twig, unfortunately it would not be that simple: I discovered that the <text:line-break> can occur inside various parent/enclosing tags. If I pursue this, I'll have to see whether XML::Twig can tell me the parent/enclosing tag from within the <text:line-break> handler...
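For reference, a minimal sketch suggesting that a handler can see its enclosing element through the element's `parent` method. It only reports the parent's tag and style; it does not attempt the split. `line_break_parents` and the sample input are hypothetical, and the sketch assumes XML::Twig's default behaviour of matching on the prefixed name as written (no `map_xmlns`).

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: return "tag style-name" for the element directly
# enclosing each <text:line-break/>.
sub line_break_parents {
    my ($xml) = @_;
    my @seen;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # The handler fires once the element has been fully parsed.
            'text:line-break' => sub {
                my ($t, $elt) = @_;
                my $parent = $elt->parent;
                push @seen, join ' ', $parent->tag,
                    $parent->att('text:style-name') // '(no style)';
                return 1;
            },
        },
    );
    $twig->parse($xml);
    return @seen;
}

# Assumed sample input built around the span from the original question.
my $sample = <<'END_XML';
<r xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"><text:p text:style-name="P104"><text:span text:style-name="T71"><text:s/><text:line-break/></text:span></text:p></r>
END_XML

print "$_\n" for line_break_parents($sample);
```

Calling `parent` again on the result walks further up, so a handler could keep climbing until it reaches the <text:p>.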

      I see that there is a module OpenOffice::OODoc which uses XML::Twig, however the last release is from 2010.

      I'll give XML::XSLT a quick look also.

      Thanks for a quick reply

Re: fix ODT files with line breaks looking poor
by Jenda (Abbot) on Apr 16, 2019 at 23:32 UTC

    I know nothing of ODT and the example presented by haukex fails to open so I can't test whether I broke something, but here's a possible solution using XML::Rules.

    use strict;
    use XML::Rules;

    my $filter = XML::Rules->new(
        style => 'filter',
        namespaces => {
            'urn:oasis:names:tc:opendocument:xmlns:text:1.0' => 'text',
            'urn:oasis:names:tc:opendocument:xmlns:office:1.0' => 'office'
        },
        rules => {
            _default => 'raw', # we do not care what's inside the tags,
                # we just want to preserve everything
            'text:p' => sub { return $_[0] => $_[1] }, # this doesn't seem to do anything,
                # but it's necessary. The filter mode sends everything outside tags
                # with special rules directly to output
            'text:line-break' => sub {
                my ($tag, $attrs, $parents, $parentAttrs, $parser) = @_;

                my $idx = $#$parents; # find the <text:p> tag enclosing this one
                $idx-- while ($idx >=0 && $parents->[$idx] ne 'text:p');
                return $tag => $attrs if ($parents->[$idx] ne 'text:p');
                    # line break outside paragraph, leave alone
                my $level = $#$parents - $idx + 1;

                print { $parser->{FH} } $parser->parentsToXML( $level);
                    # output the <text:p> and everything inside we read so far
                print { $parser->{FH} } $parser->closeParentsToXML( $level);
                    # close the opened tags all the way to the <text:p>
                print { $parser->{FH} } "\n";

                foreach my $i ($idx .. $#$parents) { # remove the printed content
                    delete $parentAttrs->[$i]->{_content}; # leaves the attributes intact
                }
                return; # remove the <text:line-break/>
            }
        }
    );
    $filter->filter( \*DATA, \*STDOUT);

    __DATA__
    <?xml version="1.0"?>
    <office:document-content office:version="1.2"
        xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
        xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
    <office:body><office:text>
    <text:p text:style-name="P1">
        Fo<text:span text:style-name="T1">o<text:line-break/>
        B</text:span><text:span text:style-name="T3">a</text:span>
        <text:span text:style-name="T5">r<text:line-break/></text:span>
    </text:p>
    </office:text></office:body>
    </office:document-content>

    The code will work correctly (provided I understood the requirements right) no matter how many tags are open within the <text:p>.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      the example presented by haukex fails to open

      An ODT file is basically a ZIP file that contains a bunch of other files, one of them being content.xml, which I extracted and edited down to what I considered a minimal but representative example, which is what I showed. I cared more about the structure of the XML, and I also tested my code on the <r><p>a<x>b<y>c<x>d<s/>e</x>f</y>g</x>h</p></r> --> <r><p>a<x>b<y>c<x>d</x></y></x></p><p><x><y><x>e</x>f</y>g</x>h</p></r> example.
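A minimal sketch of that round trip with Archive::Zip, assuming the transformation itself is passed in as a code reference that takes the XML as a string and returns the new XML. `rewrite_content_xml` is a hypothetical name; only the member name content.xml comes from the thread.

```perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);

# Hypothetical helper: read an ODT, run $transform over content.xml,
# and write the result to a new file, leaving the other members as-is.
sub rewrite_content_xml {
    my ($odt_in, $odt_out, $transform) = @_;

    my $zip = Archive::Zip->new();
    $zip->read($odt_in) == AZ_OK
        or die "cannot read $odt_in\n";

    my $xml = $zip->contents('content.xml');
    defined $xml
        or die "no content.xml in $odt_in\n";

    # Replace the member's contents in place.
    $zip->contents('content.xml', $transform->($xml));

    $zip->writeToFileNamed($odt_out) == AZ_OK
        or die "cannot write $odt_out\n";
    return;
}
```

A call might look like `rewrite_content_xml('in.odt', 'out.odt', \&split_line_breaks)`, where `split_line_breaks` stands in for whichever of the solutions in this thread you wrap in a sub. Writing to a separate output file keeps the original around for comparison.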

        I see. :-) I just saved it with .odt extension and tried to force Word to open it. It was two in the morning.

        The result is valid XML and looks right according to how I understand the task so let's hope it helps anyone. I think the code is kinda neat.

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.