Get Node Value from irregular XML

madbee has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I have the XML document shown below. I need to extract a node's value based on a keyword, i.e. based on the keyword "design", I need to extract the entire string between the header tags.

    <root>
        <part>
            <sect>
                <header>
                    This is a design XZY document for Project
                </header>
            </sect>
        </part>
    </root>

For this, I have the Perl script below:

    my $dom = XML::LibXML->new->parse_file($file);

    my $nodeset = $dom->find('/root/part/sect/header');

    foreach my $node ($nodeset->get_nodelist)
    {
        $node->string_value;

        if ($node =~ m/design/i)
        {
            my $design = $node;
            print $design;
        }
    }

The problem is that I need to do this across multiple XML files, and I noticed that in some of them the string I am looking for is in another part of the document. For example, here it is under root/para:

 
    <root>
      <para>This is a design XZY document for Project</para>
      <part>
         <sect>
           <header>
               This is some header
            </header>
         </sect>
       </part>
     <root>

The value occurring under the root/para tags is an anomaly, but it is valid and I have to accommodate it. Given such irregular XML files, is there a way to handle both scenarios with one generic piece of code? Of course, a roundabout way would be to check the usual node first and, if nothing is found there, fall back to looking under root. But I was wondering if there is a simpler way to do this and was hoping for some help here.

Thanks in advance for your time and apologies if the question is not clear enough.

Regards, Madbee


Replies are listed 'Best First'.
Re: Get Node Value from irregular XML by roboticus (Chancellor) on Jun 29, 2013 at 17:16 UTC
madbee:

Sure thing. You just need to know which paths to check, and then stop checking when you find the desired string. Since you're looking for a string with the word "design", we can do it like this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use autodie;
    use XML::LibXML;
    use Data::Dumper;

    my @Docs = (
    <<EOXML,
    <root>
        <part>
            <sect>
                <header>
                    This is a design XZY document for Project
                </header>
            </sect>
        </part>
    </root>
    EOXML
    <<EOXML
    <root>
        <para>This is a design XZY document for Project</para>
        <part>
            <sect>
                <header>
                    This is some header
                </header>
            </sect>
        </part>
    </root>
    EOXML
    );

    for (my $idx=0; $idx<@Docs; ++$idx) {
        my $XML = $Docs[$idx];
        print "----------- SEARCHING DOCUMENT $idx ---------\n";
        my $dom = XML::LibXML->load_xml( string=> $XML );
    DOCSEARCH:
        for my $search ('/root/part/sect/header', '/root/para') {
            print "----- searching: $search\n";
            my $nodeset = $dom->find($search);
            foreach my $node ($nodeset->get_nodelist) {
                $node->string_value;
                if ($node =~ m/design/i) {
                    my $design = $node;
                    print $design, "\n";
                    last DOCSEARCH;
                }
            }
        }
    }

Here, the outer loop is for each XML document, the middle loop iterates over the different possible search paths, and the inner loop digs out the particular chunk in question. We labelled the middle loop DOCSEARCH, so when we finally find the item, we can use `last DOCSEARCH;` to jump to the end of the middle loop and advance to the next document.

When I run it, I get:

    $ perl 1041480.pl
    ----------- SEARCHING DOCUMENT 0 ---------
    ----- searching: /root/part/sect/header
    <header>
                    This is a design XZY document for Project
                </header>
    ----------- SEARCHING DOCUMENT 1 ---------
    ----- searching: /root/part/sect/header
    ----- searching: /root/para
    <para>This is a design XZY document for Project</para>

Update: Added a "\n" to the print line to clean up the output a little.

...roboticus
When your only tool is a hammer, all problems look like your thumb.
Re^2: Get Node Value from irregular XML by madbee (Acolyte) on Jun 29, 2013 at 18:19 UTC
@roboticus: Thanks so much for your help. I've tried this and it works. I didn't know we could specify multiple search paths. So, the search stops at the first occurrence of "design". The only problem is that it may or may not be the right "design". So I'll have to find out if I can specify any other co-occurring terms with it. Hopefully I won't run into those situations, but if I do, I'll be extracting the wrong value if I just go by "design".

Regards, Madbee
Re^3: Get Node Value from irregular XML by roboticus (Chancellor) on Jun 29, 2013 at 19:25 UTC
madbee:

If multiple searches could yield different results and you have an algorithm to determine which one is "better", then instead of stopping the search when you find the first one, call your function to score the result and stow it away. Then, once you do all the searches, you can choose the best one. Something like:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use autodie;
    use XML::LibXML;
    use Data::Dumper;

    for my $FName (qw(1041480.1 1041480.2)) {
        print "----------- SEARCHING DOCUMENT $FName ---------\n";
        my $dom = XML::LibXML->load_xml(location=>$FName);
        my @hits;
        for my $search ('/root/part/sect/header', '/root/para') {
            print "----- searching: $search\n";
            my $nodeset = $dom->find($search);
            my $text = join('', map { $_->string_value } $nodeset->get_nodelist);
            if ($text =~ /design/i) {
                # found a match! score it and store it
                my $score = goodness_evaluator($text);
                push @hits, [ $score, $text ];
            }
        }
        if (@hits) {
            # ascending sort, so $hits[0] holds the lowest score;
            # swap $a and $b if a higher score should win
            @hits = sort {$a->[0] <=> $b->[0]} @hits;
            my ($score, $text) = @{$hits[0]};
            print "$FName: Best match: score=$score, text=$text\n";
        }
        else {
            print "$FName: no matches found\n";
        }
    }

    # placeholder scorer: sums the character codes of the text
    sub goodness_evaluator {
        my $t = shift;
        my $score = 0;
        $score += ord($_) for $t=~m/(.)/g;
        return $score;
    }

...roboticus
When your only tool is a hammer, all problems look like your thumb.
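To tie the scoring idea back to the "co-occurring terms" concern raised earlier in the thread, here is a minimal sketch of what a scorer might look like in place of the placeholder `goodness_evaluator`. The function name `cooccurrence_score`, the `@TERMS` list, and the sample candidate strings are all hypothetical; they are not from the original posts. It uses only core Perl and, unlike the ascending sort above, treats a higher score as better.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical co-occurring terms; adjust to whatever marks the "right" design string.
my @TERMS = qw(design document project);

# Score = number of listed terms that appear as whole words (case-insensitive).
sub cooccurrence_score {
    my ($text) = @_;
    my $score = 0;
    for my $term (@TERMS) {
        $score++ if $text =~ /\b\Q$term\E\b/i;
    }
    return $score;
}

# Hypothetical candidate strings, standing in for text pulled from different paths.
my @candidates = (
    'This is some design note',
    'This is a design XZY document for Project',
);

# Sort descending so the highest-scoring candidate comes first.
my ($best) = sort { cooccurrence_score($b) <=> cooccurrence_score($a) } @candidates;
print "Best match: $best\n";
```

With these sample inputs the second candidate wins, because it contains all three terms while the first contains only "design".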
Re: Get Node Value from irregular XML by NetWallah (Canon) on Jun 29, 2013 at 17:46 UTC
Late to the party, but here is an alternative using XML::Twig. It is somewhat less code, but handles only one file at a time:

    #!/bin/perl -w
    use strict;
    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => {
            _all_ => \&Piece_handler,   # Call for every element
        }
    );
    $twig->parsefile( shift @ARGV );    # build the twig

    #-------------------------------------------------
    sub Piece_handler {
        my ( $twig, $p ) = @_;          # handler params are always
                                        # the twig and the element
        return unless $p->text && $p->text =~ /design/;
        print $p->tag, ":", $p->text, "\n";
        # Traverse and print ancestor tags
        for (my $ancestor = $p->parent; defined $ancestor; $ancestor = $ancestor->parent()) {
            print "\tParent : ", $ancestor->tag, "\n";
        }
        $p->cut;    # This one is processed, delete from tree ..
    }

Produces the following (after adding a proper closing </root> tag to the second example):

    $ perl xmlt.pl xmldata2.xml
    para:This is a design XZY document for Project
            Parent : root
    $ perl xmlt.pl xmldata.xml
    header:
                    This is a design XZY document for Project
            Parent : sect
            Parent : part
            Parent : root

"The trouble with the Internet is that it's replacing masturbation as a leisure activity." -- Patrick Murray
Re: Get Node Value from irregular XML (xpather.pl) by Anonymous Monk on Jun 30, 2013 at 04:15 UTC
If you run `xpather.pl -a yourfile.xml` you'll see

    # posy
    /root[1]/part[1]/sect[1]/header[1]
    # star
    /*[ name() = "root" and position() = 1 ]
    /*[ name() = "part" and position() = 1 ]
    /*[ name() = "sect" and position() = 1 ]
    /*[ name() = "header" and position() = 1 ]
    # "content"
    This is a design XZY document for Project

Just like position() there is also contains(), so you can use

    //*[
        ( name() = "para" or name() = "header" )
        and ( not(descendant::*) )
        and ( contains( translate( ., "DESIGN", "design" ), "design" ) )
    ]
Re^2: Get Node Value from irregular XML (xpather.pl) by Anonymous Monk on Jun 30, 2013 at 04:41 UTC
Hmm, libxml seems to not like double quotes, weird. This version is tested:

    //*[
        ( name() = 'para' or name() = 'header' )
        and ( not(descendant::*) )
        and ( contains( translate( ., 'DESIGN', 'design' ), 'design' ) )
    ]
Re^3: Get Node Value from irregular XML (xpather.pl) by Anonymous Monk on Jun 30, 2013 at 05:13 UTC
"Hmm, libxml seems to not like double quotes, weird, this version tested"

Nope, it likes double quotes just fine. I got tripped up by win32 system limitations.
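For anyone wanting to try the single-XPath approach from the xpather.pl replies inside a script rather than on the command line, here is a rough sketch. It assumes XML::LibXML is installed and that the node tests in the posted expression were `*` (that is, `//*[ ... ]` and `descendant::*`). The helper name `find_design` and the inline sample documents are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# The two document shapes from the question, inlined for a self-contained demo.
my @docs = (
    '<root><part><sect><header> This is a design XZY document for Project </header></sect></part></root>',
    '<root><para>This is a design XZY document for Project</para>'
        . '<part><sect><header> This is some header </header></sect></part></root>',
);

# Leaf para/header elements whose text contains "design" in any letter case.
my $xpath = q{//*[ ( name() = 'para' or name() = 'header' )
                   and not(descendant::*)
                   and contains( translate( ., 'DESIGN', 'design' ), 'design' ) ]};

# Hypothetical helper: returns "tag: trimmed text" for every matching element.
sub find_design {
    my ($xml) = @_;
    my $dom = XML::LibXML->load_xml( string => $xml );
    my @found;
    for my $node ( $dom->findnodes($xpath) ) {
        my $text = $node->textContent;
        $text =~ s/^\s+|\s+$//g;    # strip leading/trailing whitespace
        push @found, $node->nodeName . ': ' . $text;
    }
    return @found;
}

print "$_\n" for map { find_design($_) } @docs;
```

Because the expression matches either element name anywhere in the tree, one query covers both layouts without a list of paths to try in order. The trade-off is that it returns every match, so picking the "right" one is still up to the caller.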