unknown_varmit has asked for the wisdom of the Perl Monks concerning the following question:

I'm attempting to extract text from mixed-content XML, modify that text, and then put it back into the XML document at the correct position. The difficulty comes when setting text on an element that contains mixed content: doing so eliminates the child tags.
Example text: <paragraph> Some <bold>text</bold> here which may be any <bold>length</bold> and <bold>contain</bold> a number of child tags.</paragraph>
Here is the code I am developing. The &choiceReplace sub simply searches through the string passed as its argument and replaces matched words (which will almost always alter the length of the string).

use XML::Twig;

my $twig = new XML::Twig(
    TwigHandlers => {
        'bold' => \&bold,
        'p'    => \&paragraph,
        'li'   => \&ordered_list,
    },
    TwigRoots => { body => 1 },
);

sub bold {
    my ($twig, $bold) = @_;
    my $bold_text = $bold->text;
    &choiceReplace($bold_text, $file);
    $bold->set_text($bold_text);
}

sub paragraph {
    my ($twig, $para) = @_;
    my $para_text = $para->text;
    &choiceReplace($para_text, $file);
    # set_text replaces every child, including <bold>, with one plain string
    $para->set_text($para_text);
}

The child text from the paragraph tag may contain replaced search words. When this is true, the set_text command replaces the <bold> tags with just a text string. I understand why this occurs, but is there a better method of cycling through the <paragraph> children text without affecting the <bold> tags? It seems to me that I need to move through each individual child tag by tag. Is this possible without knowing what tags/text I might encounter? Will the mixed content allow me to alter just the <paragraph> text? And still maintain the <bold> tags?

Any guidance is much appreciated.

Re: Twig Mixed Content Child Text Replace Issues
by ajguitarmaniac (Sexton) on Feb 08, 2011 at 09:27 UTC

    I'm not exactly an expert on XML processing, but I did some reading after seeing your question. Here's a post, "XML::Twig tag conversion", similar to yours, which was answered by Michel Rodriguez himself (author of the XML::Twig module). The approach given there should help with your question as well.

Re: Twig Mixed Content Child Text Replace Issues
by mirod (Canon) on Feb 08, 2011 at 10:04 UTC

    I am not sure what you want to do exactly. If you could provide a better test case, that is, code I can run, with input and expected output, then I would have a better chance of giving you an answer. It would essentially convert your explanations into code, so there is much less chance of misinterpretation. I suspect the solution involves either subs_text or mark, but I can't say for sure.

    An example test would be:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Test::More tests => 1;
    use XML::Twig;

    my $doc = '<paragraph> Some <bold>text</bold> here which may be any <bold>length</bold> and <bold>contain</bold> a number of child tags.</paragraph>';
    my $expected = '<paragraph> Some <bold>text</bold> here which <bold>may</bold> be any <bold>length</bold> and <bold>contain</bold> a number of <bold>child</bold> tags.</paragraph>';

    my $t = XML::Twig->new->parse( $doc);
    $t->root->mark( qr/(may|child)/, 'bold');
    is( $t->sprint, $expected, 'simple replace');

      Thanks for the replies.

      So I'm attempting to alter the string text of <paragraph> without affecting the context of the <bold> tags. Right now, the '->text' command retrieves the text of all sub-elements. Hence, when I set the <paragraph> text, the code is replacing the <bold> elements with a string. I lose the <bold> tag.

      I need to maintain the integrity of the <bold> tags while making string substitutions within the <paragraph> element. I will not always know how the text and text children of the <paragraph> element will look. So the content will vary.

      I need to be able to find all of the text children of a mixed content <paragraph> element and substitute within the text string without disturbing other child elements.

      Example:
      Input:
      <paragraph> Some <bold>text</bold> here which may be any <bold>length</bold> and <bold>contain</bold> a number of child tags.</paragraph>

      Expected output:
      <paragraph> Some <bold>text</bold> here which may be any <bold>length</bold> and <bold>contain</bold> a quantity of child tags.</paragraph>

      Code:
      use XML::Twig;

      my $file = '<paragraph> Some <bold>text</bold> here which may be any <bold>length</bold> and <bold>contain</bold> a number of child tags.</paragraph>';
      my $twig = new XML::Twig(TwigHandlers => {'paragraph' => \&paragraph}, TwigRoots => {paragraph => 1});
      $twig->parse($file);
      $twig->print;

      sub paragraph {
          my ($twig, $para,) = @_;
          my $para_text = $para->text;
          &choiceReplace($para,$para_text);
      }

      sub choiceReplace {
          my ($para,$para_text) = @_;
          my $search = "number";    #setting search and replace for example
          my $replace = "quantity";

          #locate each occurrence of search term and prompt user to replace
          foreach ($para_text =~ /$search/) {
              my $new_version = $para_text;
              my $offset = 0;
              my $new_offset = 0;
              my $result = index($para_text, $search);
              my $new_result = index($new_version, $search);
              $offset = $result;
              $new_offset = $new_result;
              my $l = length ($search);

              #loop through string search results while found
              while (($result != -1) && ($new_result != -1)) {
                  #create visuals for user to accept/deny match
                  print "\n\nCurrent Version:\n $para_text";
                  my $replace_match = "**[[$replace]]**";
                  substr($new_version,$new_offset,$l) = $replace_match;
                  print "\n\nMatched Version:\n $new_version";
                  print "\nWould you like to make this change (y or n)? \n";
                  chomp($change=<STDIN>);
                  if ($change eq "n"){
                      my $nm_result = rindex($new_version, $replace_match);
                      my $nm_l = length ($replace_match);
                      my $no_match = "[DENIED]";
                      substr($new_version,$nm_result,$nm_l) = $no_match;
                  }
                  elsif ($change eq "y") {
                      my $nm_result = rindex($new_version, $replace_match);
                      my $nm_l = length ($replace_match);
                      my $no_match = "[CHANGED]";
                      substr($new_version,$nm_result,$nm_l) = $no_match;
                      substr($para_text,$offset,$l) = $replace;
                  }
                  #update search starting point
                  $result = index($para_text, $search, $offset + 1);
                  $offset = $result;
                  $new_result = index($new_version, $search, $new_offset);
                  $new_offset = $new_result;
              }
              #set text for <paragraph>
              $_[0]->set_text($para_text);
          }
      }

        The fact that you want the user input for each substitution makes the problem a _lot_ more tricky. Otherwise you could simply use subs_text as in my previous answer.
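
For the non-interactive case, a minimal sketch of a subs_text call might look like the following. It assumes the sample document from this thread and the same number/quantity word pair; the `$result` variable is introduced only for illustration. subs_text works on the text descendants of the element, so the <bold> children should be left in place.

```perl
use strict;
use warnings;
use XML::Twig;

# Sample document from the thread (assumed input for this sketch)
my $doc = '<paragraph> Some <bold>text</bold> here which may be any <bold>length</bold> and <bold>contain</bold> a number of child tags.</paragraph>';

my $twig = XML::Twig->new->parse($doc);

# subs_text substitutes inside text nodes only; child elements are not flattened
$twig->root->subs_text( qr/\bnumber\b/, 'quantity' );

my $result = $twig->sprint;
print "$result\n";
```

This replaces every match without asking, which is why it does not cover the prompt-per-match requirement.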

        In this case you should do the substitution on the text of the '#TEXT' (or '#PCDATA') children of the paragraph. But then what happens if the text contains the string you want to replace twice, and you want to replace only the second one? The logic becomes quite a bit more complex. Off the top of my head I would modify the regexp, to let it skip the appropriate number of occurrences of the string to replace.
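
One way to replace only a chosen occurrence is sketched below in plain Perl. Note that it swaps in a slightly different technique: instead of changing the regexp itself, it counts matches inside an s///e replacement and substitutes only when the counter hits the wanted position. The helper name `replace_nth` and the sample string are hypothetical, not part of XML::Twig or the original code.

```perl
use strict;
use warnings;

# Replace only the $n-th (1-based) whole-word match of $search in $text.
# replace_nth is a hypothetical helper introduced for this sketch.
sub replace_nth {
    my ($text, $search, $replacement, $n) = @_;
    my $seen = 0;
    # /e evaluates the replacement as code; every other match is put back unchanged
    $text =~ s{\b(\Q$search\E)\b}{ ++$seen == $n ? $replacement : $1 }ge;
    return $text;
}

my $text   = 'a number of tags and a number of words';
my $result = replace_nth($text, 'number', 'quantity', 2);
print "$result\n";    # a number of tags and a quantity of words
```

The same counter could be driven by user answers instead of a fixed position, with each text child of the paragraph passed through it in turn.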

        A non interactive version that you could use as a basis would be:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Test::More tests => 1;
        use XML::Twig;

        my $doc = '<paragraph> Some <bold>text</bold> here which may be any <bold>length</bold> and <bold>contain</bold> a number of child tags.</paragraph>';
        my $exp = '<paragraph> Some <bold>text</bold> here which may be any <bold>length</bold> and <bold>contain</bold> a quantity of child tags.</paragraph>';

        # I got a little fancy here to allow several keywords to replace
        # the keywords are grouped in a regexp, sorted by inverse length so the alternation works properly
        my $replace = { number => 'quantity' };
        my $keywords= join( '|', map { "\Q$_\E" } sort { length $b <=> length $a } keys %$replace);

        my $t=XML::Twig->new( twig_roots => { paragraph => \&subs_word })->parse( $doc);
        is( $t->sprint, $exp, 'one change') ;
        exit;

        sub subs_word {
            my( $t, $para)= @_;
            foreach my $text_elt ($para->children( '#TEXT')) {
                my $text= $text_elt->text;
                if( $text_elt->text=~ m{\b($keywords)\b}) {
                    $text=~ s{\b($keywords)\b}{$replace->{$1}}g;
                    $text_elt->set_text( $text);
                }
            }
        }
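
As a rough sketch of how the per-match confirmation could be layered on the same '#TEXT' loop, the variant below asks a callback before each substitution. Everything new here is an assumption for illustration: the `confirm` sub stands in for the STDIN prompt and simply reads canned answers from `@answers`, and the handler name `subs_word_confirmed` and the shortened sample document are hypothetical.

```perl
use strict;
use warnings;
use XML::Twig;

# Shortened sample document (assumed for this sketch)
my $doc = '<paragraph>A number of <bold>bold</bold> words and a number of tags.</paragraph>';

my %replace  = ( number => 'quantity' );
my $keywords = join '|', map { quotemeta } sort { length $b <=> length $a } keys %replace;

# Hypothetical stand-in for the y/n prompt: decline the first match, accept the second
my @answers = ('n', 'y');
sub confirm {
    my ($old, $new) = @_;
    return ( shift(@answers) // 'n' ) eq 'y';
}

sub subs_word_confirmed {
    my ($t, $para) = @_;
    # Only the direct text children are edited, so <bold> elements are untouched
    foreach my $text_elt ( $para->children('#TEXT') ) {
        my $text = $text_elt->text;
        ( my $new = $text ) =~ s{\b($keywords)\b}{
            my $word = $1;
            confirm( $word, $replace{$word} ) ? $replace{$word} : $word
        }ge;
        $text_elt->set_text($new) if $new ne $text;
    }
}

my $t = XML::Twig->new( twig_handlers => { paragraph => \&subs_word_confirmed } )->parse($doc);
my $result = $t->sprint;
print "$result\n";
```

In a real script `confirm` would print the candidate change and read the answer from STDIN; keeping it as a separate sub makes the substitution logic testable without user input.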