Isanchez has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have to recursively go over XML files (which look very different from each other) in a folder and collect the content of every tag. I have tried XML::Twig code, but it doesn't work because it grabs all tags, including parent tags: it first prints the contents that belong to the parents (i.e. everything) and then the contents again, this time for each daughter node. Can any wise monk give me some idea of what to do? Thanks,

Re: getting the txt of terminal nodes only XML::TWIG
by rob_au (Abbot) on May 01, 2003 at 03:04 UTC
    It would be really useful to see some of your code and how you are approaching this problem, particularly as I believe this could be very easy to fix. In the section of your code where you print node content, you can restrict the output to terminal nodes by including a conditional similar to the following:

    print $node->trimmed_text, "\n" unless $node->children_count;

    This code uses the children_count method of the element object, which returns the number of child nodes of the given node, as the condition for output. See the XML::Twig documentation for further information.
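    As a minimal sketch of that conditional in context (the inline XML string is a made-up sample, not your real file): XML::Twig stores text as #PCDATA child nodes, so if you walk every descendant of the root, the nodes without children are the text nodes, plus any empty elements.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document, for illustration only.
my $xml = '<node><type>perlquestion</type><author>Isanchez</author></node>';

my $twig = XML::Twig->new;
$twig->parse($xml);

# descendants() returns elements and their #PCDATA (text) nodes in
# document order. Text nodes never have children, so the
# children_count test keeps them and skips every parent element.
my @leaf_text;
foreach my $node ( $twig->root->descendants ) {
    next if $node->children_count;
    push @leaf_text, $node->trimmed_text;
}

print "$_\n" for @leaf_text;
```

    With the sample above this prints perlquestion and Isanchez on separate lines, each exactly once.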

     

    perl -le 'print+unpack("N",pack("B32","00000000000000000000001001010101"))'

      Rob, thanks a lot for the response. Below is some code that actually gets the text out of terminal nodes only, but it prints the text of sister nodes run together, which makes it unusable. Can you please help me solve this problem? Thanks a lot.

      A sample input file with the message I sent is below. As you can see when you run it, the text of the sister nodes is printed together: the message ending in "...thanks," runs straight into "reputation". I tried your line of code, but I can't make it work. Thanks again, Ivo

      #!/bin/perl -w
      use strict;
      use XML::Twig;

      my $twig = XML::Twig->new();
      my $file = "message.xml";
      $twig->parsefile( $file );
      my $root = $twig->root;
      my @nodes = $root->children;
      foreach my $node ( @nodes ) {
          my $content = $node->text;
          print "$content ";
      }
      Output:

      perlquestion Isanchez Hi, I have to recursively go over xml files (that look very differently from each other) in a folder and collect every content for every tag. I have tried with TWig code but it doesn't work because it grabs all tags including parent tags and then prints first the contents that belong to the parents i.e. all and then the contents again but this time for each doughter node. Can any wise monk give some idea of what to do ? thanks,"reputation"

      edited: Thu May 1 21:28:03 2003 by jeffa - code tags, formatting

Re: getting the txt of terminal nodes only XML::TWIG
by jkahn (Friar) on May 01, 2003 at 22:26 UTC
    This worked for me, if I understand you correctly:
    #!/usr/bin/perl -w
    use strict;
    use XML::Twig;

    my $twig = XML::Twig->new();
    my $file = "message.xml";
    $twig->parsefile( $file );
    my $root = $twig->root;

    my @all_text = gather_text($root);
    print join ("\n---", @all_text), "\n";

    sub gather_text {
        my $node = shift;
        my (@children) = $node->children();
        if (not @children) {
            # this tag has no children. grab its text data, with no
            # surrounding tag, and return it.
            return $node->sprint('NOTAGS');
        }
        else {
            # recurse into each child
            my (@text);
            foreach (@children) {
                push @text, (gather_text($_));
            }
            return @text;
        }
    }
    When I test it on the XML of your original post, I get the following results:
    [jeremy@serpent pm-test]$ ./term-xml.pl
    perlquestion
    --- Isanchez
    --- Hi, I have to recursively go over xml files (that look very differently from each other) in a folder and collect every content for every tag. I have tried with TWig code but it doesn't work because it grabs all tags including parent tags and then prints first the contents that belong to the parents i.e. all and then the contents again but this time for each doughter node. Can any wise monk give some idea of what to do ? thanks,
    --- 4
    [jeremy@serpent pm-test]$
    That's only once for each tag, rather than once for each tag and then again for its children.
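    Since the original question also mentions going over every XML file in a folder, here is a rough sketch of one way to wrap the same idea with the core File::Find module. The names leaf_texts and collect_dir are made up for this example, and it assumes every file ending in .xml is well-formed (parsefile dies otherwise).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use XML::Twig;

# Hypothetical helper: return the non-empty text of every childless
# node (text nodes and empty elements) in one XML file.
sub leaf_texts {
    my ($file) = @_;
    my $twig = XML::Twig->new;
    $twig->parsefile($file);
    return grep { length }
           map  { $_->trimmed_text }
           grep { !$_->children_count } $twig->root->descendants;
}

# Hypothetical helper: collect leaf text for every *.xml file below
# $dir, keyed by the path File::Find reports for it.
sub collect_dir {
    my ($dir) = @_;
    my %text_for;
    find(
        sub {
            # File::Find chdirs into each directory, so $_ is a plain
            # file name that parsefile can open directly.
            return unless -f $_ && /\.xml\z/i;
            $text_for{$File::Find::name} = [ leaf_texts($_) ];
        },
        $dir
    );
    return \%text_for;
}

# Only run when a directory is given on the command line.
if ( my $dir = shift @ARGV ) {
    my $result = collect_dir($dir);
    for my $path ( sort keys %$result ) {
        print "$path\n";
        print "  $_\n" for @{ $result->{$path} };
    }
}
```

    Run it with the folder as its only argument; it lists each file followed by its terminal-node text, one item per line.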

    Does that help?

    (Nice to see you here, Isanchez!)
    Update: corrected link to XML.