fix ODT files with line breaks looking poor

melutovich has asked for the wisdom of the Perl Monks concerning the following question:

Re: fix ODT files with line breaks looking poor by haukex (Archbishop) on Apr 07, 2019 at 15:40 UTC
Using LibreOffice I whipped up the following example, edited down to the minimum needed:

    <?xml version="1.0"?>
    <office:document-content office:version="1.2"
      xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
      xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
    <office:body><office:text>
    <text:p text:style-name="P1">
    Fo<text:span text:style-name="T1">o<text:line-break/>
    B</text:span><text:span text:style-name="T3">a</text:span>
    <text:span text:style-name="T5">r<text:line-break/></text:span>
    </text:p>
    </office:text></office:body>
    </office:document-content>

The transformation the XML document needs is something like this:

    <r><p>a<x>b<s/>c</x>d</p></r>

    <r>
    `- <p>
       +- a
       +- <x>
       |  +- b
       |  +- <s>
       |  `- c
       `- d

To this:

    <r><p>a<x>b</x></p><p><x>c</x>d</p></r>

    <r>
    +- <p>
    |  +- a
    |  `- <x>
    |     `- b
    `- <p>
       +- <x>
       |  `- c
       `- d

I think the elements surrounding `<s/>`, represented here as `<x>`, could even be nested more than one level deep, i.e. maybe `<r><p>a<x>b<y>c<x>d<s/>e</x>f</y>g</x>h</p></r>`.

A very interesting problem; unfortunately, I don't have enough time to invest at the moment. You definitely shouldn't try to do this with regexes. If the files aren't too big, I'd probably approach this with XML::LibXML...

Update 2: On second thought, a stream-based parser might be easier in this case... hmmm...
Re^2: fix ODT files with line breaks looking poor by choroba (Cardinal) on Apr 07, 2019 at 20:45 UTC
> If the files aren't too big, I'd probably approach this with XML::LibXML...

And if they are too big, you can probably still use it, because it provides XML::LibXML::Reader.

`map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
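For readers who have not used the pull parser, here is a minimal sketch of what a streaming pass with XML::LibXML::Reader could look like. It only locates the `<text:line-break>` elements and records their nesting depth; it does not perform the split. The variable names (`$TEXT_NS`, `@breaks`) are made up for illustration, and the sample document is the one from haukex's post.

```perl
use strict;
use warnings;
use XML::LibXML::Reader qw(:types);   # part of the XML::LibXML distribution

my $TEXT_NS = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0';

my $xml = <<'END_XML';
<?xml version="1.0"?>
<office:document-content office:version="1.2"
  xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
  xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
<office:body><office:text>
<text:p text:style-name="P1">
Fo<text:span text:style-name="T1">o<text:line-break/>
B</text:span><text:span text:style-name="T3">a</text:span>
<text:span text:style-name="T5">r<text:line-break/></text:span>
</text:p>
</office:text></office:body>
</office:document-content>
END_XML

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader";

my @breaks;   # nesting depth of every <text:line-break> encountered
while ($reader->read) {
    # Only element start events are of interest here.
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    # Compare by namespace URI so the prefix used in the file does not matter.
    next unless ($reader->namespaceURI // '') eq $TEXT_NS;
    push @breaks, $reader->depth if $reader->localName eq 'line-break';
}

print "found ", scalar(@breaks), " line break(s)\n";
print "  at depth $_\n" for @breaks;
```

Because the reader only holds the current node, memory use stays flat for large files; the price is that the split itself would need extra bookkeeping (a stack of the currently open elements).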
Re^2: fix ODT files with line breaks looking poor by melutovich (Acolyte) on Apr 07, 2019 at 21:55 UTC
Took a stab at Option 1; I need to test it more, but it looks promising.

    use warnings;
    use strict;
    use XML::LibXML;

    my $dom = XML::LibXML->load_xml(string => <<'END_XML');
    <?xml version="1.0"?>
    <office:document-content office:version="1.2"
      xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
      xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
    <office:body><office:text>
    <text:p text:style-name="P1">
    Fo<text:span text:style-name="T1">o<text:line-break/>
    B</text:span><text:span text:style-name="T3">a</text:span>
    <text:span text:style-name="T5">r<text:line-break/></text:span>
    </text:p>
    <text:p text:style-name="P3">
    TEST is a ' real test.
    </text:p>
    </office:text></office:body>
    </office:document-content>
    END_XML
    my $xpc = XML::LibXML::XPathContext->new($dom);
    $xpc->registerNs('office', 'urn:oasis:names:tc:opendocument:xmlns:office:1.0');
    $xpc->registerNs('text', 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');

    while (1) {
        my ($lb) = $xpc->findnodes('//text:line-break') or last;
        die "can't handle <text:line-break> with children: $lb"
            if $lb->hasChildNodes;
        # nearest enclosing element in the text: namespace
        my ($a) = $xpc->findnodes('ancestor::text:*[1]', $lb)
            or die "failed to find ancestor of $lb";
        # the element enclosing that one
        my ($a_a) = $xpc->findnodes('ancestor::*[1]', $a)
            or die "failed to find ancestor of ancestor of $a";
        # shallow clone: same tag and attributes, no children
        my $clone_a = $a->cloneNode(0);
        # move everything after the line break into the clone
        my $nextSibling = $lb->nextSibling();
        while ( $nextSibling ) {
            my $currentSibling = $nextSibling;
            $nextSibling = $currentSibling->nextSibling();
            $currentSibling = $a->removeChild($currentSibling);
            $clone_a->addChild($currentSibling);
        }
        $a->removeChild($lb);
        $a_a->insertAfter($clone_a, $a);
    }
    print $dom;

Which produces

    <?xml version="1.0"?>
    <office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" office:version="1.2">
    <office:body><office:text>
    <text:p text:style-name="P1">
    Fo<text:span text:style-name="T1">o</text:span><text:span text:style-name="T1">
    B</text:span><text:span text:style-name="T3">a</text:span>
    <text:span text:style-name="T5">r</text:span><text:span text:style-name="T5"/>
    </text:p>
    <text:p text:style-name="P3">
    TEST is a ' real test.
    </text:p>
    </office:text></office:body>
    </office:document-content>
Re^3: fix ODT files with line breaks looking poor by haukex (Archbishop) on Apr 08, 2019 at 19:59 UTC
Here's my solution; it also handles the more deeply nested cases like `<r><p>a<x>b<y>c<x>d<s/>e</x>f</y>g</x>h</p></r>`:

    use warnings;
    use strict;
    use XML::LibXML;

    my $dom = XML::LibXML->load_xml(string => <<'END_XML');
    <?xml version="1.0"?>
    <office:document-content office:version="1.2"
      xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
      xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
    <office:body><office:text>
    <text:p text:style-name="P1">
    Fo<text:span text:style-name="T1">o<text:line-break/>B</text:span>
    <text:span text:style-name="T3">a</text:span>
    <text:span text:style-name="T5">r<text:line-break/></text:span>
    </text:p>
    </office:text></office:body>
    </office:document-content>
    END_XML
    my $xpc = XML::LibXML::XPathContext->new($dom);
    $xpc->registerNs('office', 'urn:oasis:names:tc:opendocument:xmlns:office:1.0');
    $xpc->registerNs('text', 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');

    my $breakat  = 'text:line-break';
    my $ancestor = 'text:p';
    while (1) {
        my ($br1) = $xpc->findnodes("//$ancestor//${breakat}[1]") or last;
        die "can't handle <$breakat> with children: $br1"
            if $br1->hasChildNodes;
        my ($an1) = $xpc->findnodes("ancestor::${ancestor}[1]", $br1)
            or die "failed to find <$ancestor> ancestor of $br1";
        # deep-copy the paragraph, then locate the same break in the copy
        my $an2 = $an1->cloneNode(1);
        my ($br2) = $xpc->findnodes(".//${breakat}[1]", $an2)
            or die "internal error: failed to find <$breakat> in $an2";
        # original: drop everything after the break, at every level
        for ( my $cur = $br1; $cur != $an1; $cur = $cur->parentNode ) {
            while ( my $s = $cur->nextSibling ) { $s->unbindNode }
        }
        $br1->unbindNode;
        # copy: drop everything before the break, at every level
        for ( my $cur = $br2; $cur != $an2; $cur = $cur->parentNode ) {
            while ( my $s = $cur->previousSibling ) { $s->unbindNode }
        }
        $br2->unbindNode;
        $an1->parentNode->insertAfter($an2, $an1);
    }
    print $dom;

    __END__

    <?xml version="1.0"?>
    <office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" office:version="1.2">
    <office:body><office:text>
    <text:p text:style-name="P1">
    Fo<text:span text:style-name="T1">o</text:span></text:p><text:p text:style-name="P1"><text:span text:style-name="T1">B</text:span>
    <text:span text:style-name="T3">a</text:span>
    <text:span text:style-name="T5">r</text:span></text:p><text:p text:style-name="P1"><text:span text:style-name="T5"/>
    </text:p>
    </office:text></office:body>
    </office:document-content>
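To make the nested case concrete, the same clone-and-prune idea can be run against the abstract `<r><p>…<s/>…</p></r>` example. The following is a sketch, not haukex's exact code: `split_at` is a hypothetical helper name, the document has no namespaces so plain `findnodes` is used instead of an XPathContext, and nodes are compared with `isSameNode` rather than the overloaded `!=`.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: split every <$ancestor> element in two at each
# <$breakat> element found anywhere inside it.
sub split_at {
    my ($dom, $breakat, $ancestor) = @_;
    while (1) {
        my ($br1) = $dom->findnodes("//$ancestor//${breakat}[1]") or last;
        my ($an1) = $br1->findnodes("ancestor::${ancestor}[1]")
            or die "no <$ancestor> ancestor found";
        my $an2   = $an1->cloneNode(1);                 # deep copy
        my ($br2) = $an2->findnodes(".//${breakat}[1]")
            or die "no <$breakat> in the copy";

        # Original keeps what precedes the break: walk up from the break
        # and drop all following siblings at each level.
        for (my $cur = $br1; !$cur->isSameNode($an1); $cur = $cur->parentNode) {
            while (my $s = $cur->nextSibling) { $s->unbindNode }
        }
        $br1->unbindNode;

        # Copy keeps what follows the break: same walk, preceding siblings.
        for (my $cur = $br2; !$cur->isSameNode($an2); $cur = $cur->parentNode) {
            while (my $s = $cur->previousSibling) { $s->unbindNode }
        }
        $br2->unbindNode;

        $an1->parentNode->insertAfter($an2, $an1);
    }
    return $dom;
}

my $dom = XML::LibXML->load_xml(
    string => '<r><p>a<x>b<y>c<x>d<s/>e</x>f</y>g</x>h</p></r>');
split_at($dom, 's', 'p');

my $out = $dom->documentElement->toString;
print "$out\n";
# prints <r><p>a<x>b<y>c<x>d</x></y></x></p><p><x><y><x>e</x>f</y>g</x>h</p></r>
```

Each pass removes one break element, so the loop ends once no `<s/>` is left under any `<p>`.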
Re: fix ODT files with line breaks looking poor by roboticus (Chancellor) on Apr 07, 2019 at 14:49 UTC
melutovich:

I don't do a lot of XML processing, but if I had to do what you're talking about, I think I'd reach for XML::Twig. It will handle the XML parsing for you, and you can add handlers that recognize particular tags and let you edit the XML there. You might be able to handle your task by adding handlers for `<text:p>` and `<text:line-break>` that detect the line breaks and split the content into multiple paragraphs with the correct style.

Another option might be XML::XSLT, which lets you write transformation rules to alter the document. I've used XSLT in projects before, and it worked well. The difficulty I had with it is that it's essentially another language, and since I didn't use it often, each project with it was a learning experience. If you're going to do a lot of transformations, it may be worth your while to learn it.

...roboticus

When your only tool is a hammer, all problems look like your thumb.
Re^2: fix ODT files with line breaks looking poor by melutovich (Acolyte) on Apr 07, 2019 at 15:20 UTC
I'll give this some thought. Unfortunately, your XML::Twig suggestion won't be quite that simple: I discovered that `<text:line-break>` can occur inside various parent/enclosing tags, so if I pursue this I'll have to see whether XML::Twig will tell me the parent/enclosing tag during the `<text:line-break>` handler... I see that there is a module, OpenOffice::OODoc, which uses XML::Twig; however, its last release is from 2010. I'll give XML::XSLT a quick look also. Thanks for the quick reply.
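On the question of whether a handler can see its enclosing tags: XML::Twig elements have `parent` and `ancestors` methods that can be called inside a handler. Below is a minimal sketch that only reports the enclosing elements and does not rewrite anything; the `@seen` array is a made-up name, and the sample is the document from haukex's post.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = <<'END_XML';
<?xml version="1.0"?>
<office:document-content office:version="1.2"
  xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
  xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
<office:body><office:text>
<text:p text:style-name="P1">
Fo<text:span text:style-name="T1">o<text:line-break/>
B</text:span><text:span text:style-name="T3">a</text:span>
<text:span text:style-name="T5">r<text:line-break/></text:span>
</text:p>
</office:text></office:body>
</office:document-content>
END_XML

my @seen;   # one [parent tag, paragraph style] pair per line break
my $twig = XML::Twig->new(
    twig_handlers => {
        # By default XML::Twig matches tag names with the prefix used in the file.
        'text:line-break' => sub {
            my ($t, $lb) = @_;
            my $parent = $lb->parent;                # immediate enclosing element
            my ($para) = $lb->ancestors('text:p');   # innermost enclosing paragraph
            push @seen, [
                $parent->tag,
                $para ? $para->att('text:style-name') : undef,
            ];
        },
    },
);
$twig->parse($xml);

printf "line break inside <%s>, paragraph style %s\n",
    $_->[0], $_->[1] // '(none)' for @seen;
```

The handler fires when the line-break element has been fully parsed, at which point its still-open ancestors are already reachable from it.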
Re: fix ODT files with line breaks looking poor by Jenda (Abbot) on Apr 16, 2019 at 23:32 UTC
I know nothing of ODT, and the example presented by haukex fails to open, so I can't test whether I broke something, but here's a possible solution using XML::Rules.

    use strict;
    use XML::Rules;

    my $filter = XML::Rules->new(
        style => 'filter',
        namespaces => {
            'urn:oasis:names:tc:opendocument:xmlns:text:1.0' => 'text',
            'urn:oasis:names:tc:opendocument:xmlns:office:1.0' => 'office'
        },
        rules => {
            _default => 'raw', # we do not care what's inside the tags,
                # we just want to preserve everything
            'text:p' => sub { return $_[0] => $_[1] }, # this doesn't seem to do anything,
                # but it's necessary. The filter mode sends everything outside tags
                # with special rules directly to output
            'text:line-break' => sub {
                my ($tag, $attrs, $parents, $parentAttrs, $parser) = @_;

                my $idx = $#$parents; # find the <text:p> tag enclosing this one
                $idx-- while ($idx >= 0 && $parents->[$idx] ne 'text:p');
                return $tag => $attrs if ($parents->[$idx] ne 'text:p');
                    # line break outside paragraph, leave alone

                my $level = $#$parents - $idx + 1;
                print { $parser->{FH} } $parser->parentsToXML($level);
                    # output the <text:p> and everything inside we read so far
                print { $parser->{FH} } $parser->closeParentsToXML($level);
                    # close the opened tags all the way to the <text:p>
                print { $parser->{FH} } "\n";

                foreach my $i ($idx .. $#$parents) { # remove the printed content
                    delete $parentAttrs->[$i]->{_content}; # leaves the attributes intact
                }
                return; # remove the <text:line-break/>
            }
        }
    );

    $filter->filter(\*DATA, \*STDOUT);

    __DATA__
    <?xml version="1.0"?>
    <office:document-content office:version="1.2"
      xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
      xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
    <office:body><office:text>
    <text:p text:style-name="P1">
    Fo<text:span text:style-name="T1">o<text:line-break/>
    B</text:span><text:span text:style-name="T3">a</text:span>
    <text:span text:style-name="T5">r<text:line-break/></text:span>
    </text:p>
    </office:text></office:body>
    </office:document-content>

The code will work correctly (provided I understood the requirements right) no matter how many tags are open within the `<text:p>`.

Jenda
Enoch was right! Enjoy the last years of Rome.
Re^2: fix ODT files with line breaks looking poor by haukex (Archbishop) on Apr 17, 2019 at 19:48 UTC
> the example presented by haukex fails to open

An ODT file is basically a ZIP file that contains a bunch of other files, one of them being `content.xml`, which I extracted and edited down to what I considered a minimal but representative example, which is what I showed. I cared more about the structure of the XML, and I also tested my code on this example: `<r><p>a<x>b<y>c<x>d<s/>e</x>f</y>g</x>h</p></r>` `-->` `<r><p>a<x>b<y>c<x>d</x></y></x></p><p><x><y><x>e</x>f</y>g</x>h</p></r>`
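Since the thread never shows the extraction step, here is a minimal sketch of pulling `content.xml` out of a ZIP container using only core modules (IO::Compress::Zip and IO::Uncompress::Unzip). To stay self-contained it first builds a stand-in archive in a temporary file; that archive is an assumption for illustration and is not a valid ODT, which would contain further members such as a manifest and styles.

```perl
use strict;
use warnings;
use File::Temp            qw(tempfile);
use IO::Compress::Zip     qw(zip $ZipError);
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Stand-in for a real document: a ZIP archive holding only content.xml.
my $sample = '<r><p>a<x>b<s/>c</x>d</p></r>';
my ($fh, $odt) = tempfile(SUFFIX => '.odt', UNLINK => 1);
close $fh;
zip \$sample => $odt, Name => 'content.xml'
    or die "zip failed: $ZipError";

# Read just the content.xml member of the archive into a scalar.
unzip $odt => \my $content, Name => 'content.xml'
    or die "unzip failed: $UnzipError";

print "$content\n";
```

Writing the modified XML back into a real ODT takes more care than this, because the other archive members have to be preserved; a module that can update archives in place, such as Archive::Zip, is probably the more practical choice for that direction.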
Re^3: fix ODT files with line breaks looking poor by Jenda (Abbot) on Apr 18, 2019 at 13:11 UTC
I see. :-) I just saved it with an .odt extension and tried to force Word to open it. It was two in the morning. The result is valid XML and looks right according to how I understand the task, so let's hope it helps someone. I think the code is kinda neat.

Jenda
Enoch was right! Enjoy the last years of Rome.