mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, PerlMonks.

I have run into a puzzle with XML::Twig where I am looking for specific elements and then need to find the text adjacent to those elements. The code snippet below illustrates the puzzle with one such sought after element and adjacent text in its __DATA__ section.

#!/usr/bin/perl
use XML::Twig;
use strict;
use warnings;
my $xml = XML::Twig->new( twig_handlers => { 'text:bookmark' => \&handler_bookmark } );
$xml->parse(\*DATA);
print qq(\n\n);
$xml->print;
exit(0);
sub handler_bookmark {
    my ($twig, $bookmark)= @_;
    $bookmark->parent->print;
}
__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<text:h text:style-name="P900" text:outline-level="3">
<text:bookmark text:name="_asdfqwerzxcv"/>Foo bar
</text:h>

The two outputs should be identical but are not. Specifically, the string "Foo bar" is missing from the first output, which comes from the handler_bookmark subroutine. I would expect ->parent to still contain the text it started with, but it does not. Using ->parent->text does not retrieve the string, and neither does ->parent_text.

What can be done using XML::Twig to find the text "next to" an element?

Replies are listed 'Best First'.
Re: XML::Twig not finding an element's parent's text
by choroba (Cardinal) on May 18, 2025 at 17:49 UTC
    When Twig is processing the bookmark element, it hasn't yet seen the text. It only knows the part of the parent up to the element itself. That's how SAX-like parsers work. You can try adding text before the bookmark element to verify Twig prints it out.

    You can set the handler expression to text:h[text:bookmark] (i.e. "an h element with a bookmark child") instead and print the h directly instead of the parent.

    If you need more refined navigation, switch to XML::LibXML.
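As a rough illustration of the XML::LibXML route, here is a minimal sketch that looks up each bookmark and reads the text node directly after it. It rests on some assumptions that are not in the original post: the sample data declares the ODF text namespace (the snippet in the question leaves the text: prefix undeclared), and the %text_after hash is only a hypothetical place to collect results.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Assumption: the ODF text namespace, declared so the text: prefix resolves.
my $ns = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0';

my $source = <<"XML";
<?xml version="1.0" encoding="UTF-8"?>
<text:h xmlns:text="$ns" text:style-name="P900" text:outline-level="3">
Bar foo <text:bookmark text:name="_asdfqwerzxcv"/>Foo bar
</text:h>
XML

# The whole document is loaded into a DOM, so siblings on both sides
# of every bookmark exist by the time we query.
my $dom = XML::LibXML->load_xml( string => $source );
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs( text => $ns );

my %text_after;    # hypothetical container: bookmark name => adjacent text

for my $bookmark ( $xpc->findnodes('//text:bookmark') ) {
    my $name = $bookmark->getAttributeNS( $ns, 'name' );

    # Take the node immediately after the bookmark, but only if it is text.
    my $after = $xpc->findvalue(
        'following-sibling::node()[1][self::text()]', $bookmark );
    $after =~ s/^\s+|\s+$//g;    # trim surrounding whitespace

    $text_after{$name} = $after;
    print "Anchor: $name\n";
    print "Text after: $after\n";
}
```

Because the bookmark is matched wherever it occurs, this does not depend on which kind of element contains it. The cost is that the full document sits in memory.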

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      Thanks! That helped.

      > You can try adding text before the bookmark element to verify Twig prints it out.

      That sounded promising but has no effect. Whether before or after the selected element, the text is not available.

      > You can set the handler expression to text:h[text:bookmark] (i.e. "an h element with a bookmark child") instead and print the h directly instead of the parent.

      The thing is that many kinds of elements may contain <text:bookmark text:name="..."/> so the search has to be on text:bookmark as far as I can tell. However, it looks like a slight modification is the way to go: If I select *[text:bookmark] and then go digging deeper from there, that could work:

      #!/usr/bin/perl
      use XML::Twig;
      use strict;
      use warnings;
      my $xml = XML::Twig->new( twig_handlers => { '*[text:bookmark]' => \&handler_bookmark } );
      # twig_handlers => { 'text:bookmark' => \&handler_bookmark } );
      $xml->parse(\*DATA);
      print qq(\n-\n);
      $xml->print;
      exit(0);
      sub handler_bookmark {
          my( $twig, $bookmark)= @_;
          print qq(OK\n);
          print $bookmark->text;
          my @bmk = $bookmark->children('text:bookmark');
          foreach my $b (@bmk) {
              my $anchor = $b->att('text:name');
              print "Anchor: ", $anchor, "\n";
          }
      }
      __DATA__
      <?xml version="1.0" encoding="UTF-8"?>
      <text:h text:style-name="P900" text:outline-level="3">
      Bar foo <text:bookmark text:name="_asdfqwerzxcv"/>Foo bar
      </text:h>

      I'll test and get back in a day or so.

        > That sounded promising but has no effect. Whether before or after the selected element, the text is not available.

        I probably wasn't clear enough. This was not advice on how to solve the problem; it was an attempt to show you how Twig behaves.

        #!/usr/bin/perl
        use warnings;
        use strict;
        use XML::Twig;
        my $xml = XML::Twig->new( twig_handlers => { 'text:bookmark' => \&handler_bookmark } );
        $xml->parse(\*DATA);
        print qq(\n\n);
        # $xml->print;
        exit(0);
        sub handler_bookmark {
            my ($twig, $bookmark)= @_;
            $bookmark->parent->print;
        }
        __DATA__
        <?xml version="1.0" encoding="UTF-8"?>
        <text:h text:style-name="P900" text:outline-level="3">
        BEFORE<text:bookmark text:name="_asdfqwerzxcv"/>Foo bar
        </text:h>
        Output:
        <text:h text:outline-level="3" text:style-name="P900"> BEFORE<text:bookmark text:name="_asdfqwerzxcv"/></text:h>
        See? "BEFORE" is there, while "Foo bar" is not.

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
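The timing behaviour demonstrated above suggests one possible workaround that stays within XML::Twig: skip the handler, let the parse finish, and only then walk to each bookmark's siblings. The following is a sketch under the assumption that the document is small enough to hold in memory; the %around hash and the trim helper are hypothetical names introduced for illustration, not part of the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my $source = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<text:h text:style-name="P900" text:outline-level="3">
BEFORE<text:bookmark text:name="_asdfqwerzxcv"/>Foo bar
</text:h>
XML

# No twig_handlers: the tree is complete once parse() returns, so the
# text that follows each bookmark has already been attached.
my $twig = XML::Twig->new;
$twig->parse($source);

my %around;    # hypothetical container: bookmark name => { before, after }

for my $bookmark ( $twig->root->descendants('text:bookmark') ) {
    my $name = $bookmark->att('text:name');
    my $prev = $bookmark->prev_sibling;
    my $next = $bookmark->next_sibling;

    # Only use a sibling when it is a text node, not another element.
    $around{$name} = {
        before => ( $prev && $prev->is_text ) ? trim( $prev->text ) : '',
        after  => ( $next && $next->is_text ) ? trim( $next->text ) : '',
    };
    print "Anchor: $name\n";
    print "Before: $around{$name}{before}\n";
    print "After: $around{$name}{after}\n";
}

# Hypothetical helper: strip leading and trailing whitespace.
sub trim {
    my ($string) = @_;
    $string =~ s/^\s+|\s+$//g;
    return $string;
}
```

This gives up the streaming benefit of handlers. For large files, a handler on the containing element, as discussed earlier in the thread, may be the better fit.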