Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Moving a tag within text with XML::Twig

by skillet-thief (Friar)
on Sep 25, 2005 at 08:25 UTC ( [id://494887]=perlquestion: print w/replies, xml ) Need Help??

skillet-thief has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks --

I am working with OpenOffice::OODoc and therefore XML::Twig to modify a Writer document. I am trying to move footnotes in relation to the surrounding text. The text representation of the XML would be something like this:

Sample text:

this, that and the other thing.<text:span text:style-name="footnote reference"> ...rest of footnote code </text: +span> maybe more text here.

I need to be able to move the <text:span></text:span> element to the left or to the right in the text. (More specifically, this is designed to move the footnote before any punctuation, in accordance with local typographical rules.)

Here is my best shot so far, which works but seems a bit too much like a hack, mostly due to use of the subs_text XML::Twig method in an unexpressive way, making me suspect that there might be a more robust solution:

#!/usr/local/bin/perl use strict; use warnings; use OpenOffice::OODoc; my $document = ooDocument( file => "testdoc.sxw"); ## search recursively for footnote elements ## (I think both XML::Twig and OpenOffice::OODoc ## have methods for doing this, but this is working for me now... sub foot_find { my ( $elem, $arrayref ) = @_; unless ( ref $elem ) { return; } if ( $elem->att("text:style-name") && ($elem->att("text:style-name") eq "footnote reference" )) + { push( @$arrayref, $elem ); } my @entities = $elem->children("text:span"); foreach ( @entities ) { foot_find( $_ , $arrayref ); } } # This returns a list of XML::Twig::Elt objects my @all = $document->getTextElementList; # build list of footnote elements my @foots; foreach ( @all ) { foot_find( $_, \@foots ); } #select footnotes with punctuation problems grep { $_->prev_sibling_text =~ m<([,.])\s*$> } @foots; foreach ( @foots ) { my $pre = $_->sibling(-1); my $post = $_->sibling(1); unless ( $pre && $post ) { print "problem with pre or post\n"; next; } ### this is a problem, because sometimes ### the next element does not contain text unless ( $pre->text && $post->text ){ print "No text around before and-or after\n"; next; } ### this is the crux of the whole thing, and is what I would ### like to improve. The subs_text method seems like a hack in ### this case. my $replace; my $simple_punct = qr/([,.])\s*$/; if ( $pre->text =~ $simple_punct ) { $replace = $1; } print $replace."\n" if $replace; $pre->subs_text( $simple_punct, '' ); $post->subs_text( qr/^/, "$replace" ); } $document->save("output.sxw");

I am having trouble putting my head around this. It seems like there should be a way to say more clearly that I just want to move the tag up or down within its parent text.

Advance thanks for any insights.

s-t

sub jf { print substr($_[0], -1); jf( substr($_[0], 0, length($_[0])-1)) if length $_[0] > 1; } jf('gro.alubaf@yehaf');

Replies are listed 'Best First'.
Re: Moving a tag within text with XML::Twig
by Tanktalus (Canon) on Sep 25, 2005 at 15:08 UTC

    First off, to find all of those tags, you should check get_xpath:

    my @foots = $document->get_xpath('//text:span[@text:style-name="footno +te reference"]');
    Way simpler than rolling your own ;-).

    Rather than using subs_text, you may just want to use set_text to set the text to whatever you've already figured out. But I think your idea is sound - you want to remove some text from one side, and add it to the other.

Re: Moving a tag within text with XML::Twig
by mirod (Canon) on Sep 25, 2005 at 23:03 UTC

    Here is an example of how I would do this: I would use get_xpath to find the footnotes (in the proper context if need be), the mark method to create an element holding the punctuation, and then prefix to add the text to the next element (in the case where there is no next text element I effectively create one).

    Hopefully the code below can be adapted to your problem. Note that it is not perfect, for example it would trip on i.e. and e.g., to mention a problem that seems to have caused some concer on p5p.;--)

    #!/usr/bin/perl use strict; use warnings; use Test::More tests => 1; use XML::Twig; $/="\n\n"; my $doc= <DATA>; my $expected=<DATA>; my $t= my_twig->new( pretty_print => 'indented', elt_class=> 'my_elt', # so I can write $elt +->fix_style ) ->parse( $doc) ->fix_style( 'p', 'ft') ; is( $t->sprint, $expected, 'all tests'); package my_twig; use base 'XML::Twig'; sub fix_style { my( $t, $para, $ftnote)= @_; # the get_xpath here gets all footnaotes in the proper context foreach my $elt ($t->get_xpath( "$para/$ftnote")) { $elt->fix_style(); } return $t; } package my_elt; use base 'XML::Twig::Elt'; sub fix_style { my( $ftnote)= @_; if( my $prev_text= $ftnote->prev_sibling_is( '#PCDATA')) { # mark will put the text matched in a new elt # and return the list of newly created elements if( my @new_elt= $prev_text->mark( qr/(?<![A-Z])([.,?:!]+)$/)) { my $punct= shift @new_elt; # punt is a 'p' element if( my $next_elt= $ftnote->next_sibling_is( '#PCDATA')) { # there is text after the ftnore, prefix it with the p +unctuation $next_elt->prefix( $punct->text); } else { # no text after the ftnote, move the text (#PCDATA) of + the # newly created element after the ftnote $punct->first_child( '#PCDATA')->move( after => $ftnot +e); } # no need to keep the new element around, it's been used $punct->delete; } } } package main; __END__ <doc> <p>text.<ft/>more text</p> <p>nothing <ft/> to see here</p> <p>this.<ft/></p> <p>text,<ft/>more text</p> <p>text...<ft/>more text</p> <p>text T.E.X.T.<ft/>more text</p> </doc> <doc> <p>text<ft/>.more text</p> <p>nothing <ft/> to see here</p> <p>this<ft/>.</p> <p>text<ft/>,more text</p> <p>text<ft/>...more text</p> <p>text T.E.X.T.<ft/>more text</p> </doc>
Re: Moving a tag within text with XML::Twig
by graff (Chancellor) on Sep 25, 2005 at 15:42 UTC
    Your use of grep in a void context looks like a mistake. You realize that this does not change the content of @foots, don't you? I expect you want the list returned by grep to be the iterator in the following for loop:
    for ( grep { $_->prev_sibling_text =~ m<([,.])\s*$> } @foots ) { ...
    As for cases where the following sibling has no text (note: there are "prev_sibling()" and "next_sibling()" methods for elt objects), it may be a matter of understanding your document structure to determine whether you need to insert a new elt as a next sibling in order to move the punctuation text into that, or whether you can just assign text content (the punctuation) to the next sibling, whatever it may be.

    (I suspect that inserting a new elt as the next sibling will be the way to go -- it covers both cases of "no next sibling" and "next sibling has no text". Or maybe you should review the "local typographical rules" to see if maybe the footnote does not need to be placed before punctuation when no text follows. Frankly, I think the "rule" you stated is suspect -- having footnotes after punctuation seems like a common practice.)

    In any case, are you sure that looking just for final period or comma is sufficient? What about question/exclamation marks, quotation marks, parens, and any combination thereof?

      Your use of grep in a void context looks like a mistake. You realize that this does not change the content of @foots, don't you? I expect you want the list returned by grep to be the iterator in the following for loop:
      + for ( grep { $_->prev_sibling_text =~ m<([,.])\s*$> } @foots ) {

      A mistake, or another rung on the learning curve. Or both. Thanks.

      As for cases where the following sibling has no text (note: there are "prev_sibling()" and "next_sibling()" methods for elt objects), it may be a matter of understanding your document structure to determine whether you need to insert a new elt as a next sibling in order to move the punctuation text into that, or whether you can just assign text content (the punctuation) to the next sibling, whatever it may be.

      This is my problem, and this is what I don't quite understand. Suppose I had data like this:

      <a>Text text <b>Footnote code</b></a>

      How, with XML::Twig do I add text between the </b> and </a> tags? Maybe I don't understand XML terminology well enough, but I kind of thought that adding an element woud mean doing something like this:

      <a>Text text <b>Footnote code</b><a>The new text</a></a>

      In other words, how do I specify that the child I want to add to the original <a> element is just a text element?

      Or maybe you should review the "local typographical rules" to see if maybe the footnote does not need to be placed before punctuation when no text follows. Frankly, I think the "rule" you stated is suspect -- having footnotes after punctuation seems like a common practice.)

      This is all in French, and French typography rules want the footnote reference before the punctuation. This is a little-known rule, even amongst French authors, which is partly why I'm building this filter.

      In any case, are you sure that looking just for final period or comma is sufficient? What about question/ exclamation marks, quotation marks, parens, and any combination thereof?

      Exactly. This is phase one of figuring out how to do this. Next comes all the fun dealing with the different punctuation possibilities. Quotations marks (»), question marks, colons, semicolons, and exclamation points all must be preceeded by a non-breaking space.

      sub jf { print substr($_[0], -1); jf( substr($_[0], 0, length($_[0])-1)) if length $_[0] > 1; } jf('gro.alubaf@yehaf');
Re: Moving a tag within text with XML::Twig
by graff (Chancellor) on Sep 25, 2005 at 23:38 UTC
    In case mirod's suggestion doesn't pan out for you, here's an alternative that more or less follows your original idea of relocating the punctuation text relative to the footnote element.

    (I also had some trouble getting my head around this -- realizing that the tree created by XML::Twig includes both objects that represent whole markup units and objects that hold the text content ((P)CDATA), if any, that precede and/or follow each whole markup unit. <update> For that matter, I still don't quite grok "get_xpath" -- but it looks like I got it working for this case.</update>)

    I hope the example will be clear/easy enough for you to adapt to your particular needs.

    #!/usr/bin/perl use strict; use XML::Twig; my @xmltst = ("<doc>Doc 1<a>Some text.<b>footnote text</b></a></doc>", "<doc>Doc 2<a>(More text.)<b>next footnote</b> QED</a></ +doc>", "<doc>Doc 3<a>\"More text?\"<b>3rd footnote</b></a><a> Q +ED</a></doc>", "<doc>Doc 4<a>('More text!')<b>4th note</b> QED.</a><a>F +inis.</a></doc>", ); for my $doc ( @xmltst ) { my $twig = XML::Twig->new(); $twig->parse( $doc ); my @b_elts = $twig->get_xpath('//b'); print "\ndoc contains ",scalar @b_elts," footnotes:\n"; $twig->print; for my $b ( @b_elts ) { my $prev = $b->prev_sibling; my $prev_text = $prev->text; if ( $prev_text =~ /([.,?!\)\"\']+)$/ ) { my $punct = $1; my $offset = length( $prev_text ) - length( $punct ); if ( $b->next_sibling ) { my $next = $b->next_sibling; $prev->set_text( substr( $prev_text, 0, $offset )); $next->set_text( $punct . $next->text ); } else { my $next = $prev->split_at( $offset ); $next->cut; print "\n created new elt containing: ".$next->text; $next->paste( 'after', $b ); } } } print "\n AFTER EDITING:\n"; $twig->print; $twig->dispose; print "\n"; } __OUTPUT__ doc contains 1 footnotes: <doc>Doc 1<a>Some text.<b>footnote text</b></a></doc> created new elt containing: . AFTER EDITING: <doc>Doc 1<a>Some text<b>footnote text</b>.</a></doc> doc contains 1 footnotes: <doc>Doc 2<a>(More text.)<b>next footnote</b> QED</a></doc> AFTER EDITING: <doc>Doc 2<a>(More text<b>next footnote</b>.) QED</a></doc> doc contains 1 footnotes: <doc>Doc 3<a>"More text?"<b>3rd footnote</b></a><a> QED</a></doc> created new elt containing: ?" AFTER EDITING: <doc>Doc 3<a>"More text<b>3rd footnote</b>?"</a><a> QED</a></doc> doc contains 1 footnotes: <doc>Doc 4<a>('More text!')<b>4th note</b> QED.</a><a>Finis.</a></doc> AFTER EDITING: <doc>Doc 4<a>('More text<b>4th note</b>!') QED.</a><a>Finis.</a></doc>

    (update: I realized, after posting this code, that it's bad form to use "$b" as a lexically-scoped scalar like this -- no harm done in this example, since I'm not using "sort", so I'll leave it as-is. But I should have known better.)

      (I also had some trouble getting my head around this -- realizing that the tree created by XML::Twig includes both objects that represent whole markup units and objects that hold the text content ((P)CDATA), if any, that precede and/or follow each whole markup unit.

      Just a comment on this: all complete models of XML trees (like the DOM) store the text itself in separate nodes. Otherwise you can't really process mixed content very well. Tools like XML::Simple do not cope very well with mixed content.

      In a lot of models, text nodes and element nodes are in 2 different classes, but for simplicity's sake I chose to make them just 1 class, which might have been a mistake.

      Does it make sense?

      Many thanks to all of you. This is great!

      These are exactly the kinds of insights that I was hoping for. It will take me some time to work them out. I'll put whatever I come up with on my scratchpad.

      s-t

      sub jf { print substr($_[0], -1); jf( substr($_[0], 0, length($_[0])-1)) if length $_[0] > 1; } jf('gro.alubaf@yehaf');

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://494887]
Approved by ysth
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (5)
As of 2024-03-29 06:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found