in reply to Get Node Value from irregular XML

madbee:

Sure thing. You just need to know which paths to check, and then stop checking when you find the desired string. Since you're looking for a string with the word "design", we can do it like this:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;
use XML::LibXML;
use Data::Dumper;

my @Docs = (
<<EOXML,
<root>
  <part>
    <sect>
      <header> This is a design XZY document for Project </header>
    </sect>
  </part>
</root>
EOXML
<<EOXML
<root>
  <para>This is a design XZY document for Project</para>
  <part>
    <sect>
      <header> This is some header </header>
    </sect>
  </part>
</root>
EOXML
);

for (my $idx=0; $idx<@Docs; ++$idx) {
    my $XML = $Docs[$idx];
    print "----------- SEARCHING DOCUMENT $idx ---------\n";
    my $dom = XML::LibXML->load_xml( string => $XML );

    DOCSEARCH:
    for my $search ('/root/part/sect/header', '/root/para') {
        print "----- searching: $search\n";
        my $nodeset = $dom->find($search);
        foreach my $node ($nodeset->get_nodelist) {
            # Match against the node's text only, not its serialized markup
            my $text = $node->string_value;
            if ($text =~ m/design/i) {
                # Printing the node itself serializes it, tags included
                print $node, "\n";
                last DOCSEARCH;
            }
        }
    }
}

Here, the outer loop visits each XML document, the middle loop iterates over the candidate search paths, and the inner loop checks each node that a path returns. The middle loop is labelled DOCSEARCH, so as soon as we find the item, last DOCSEARCH; exits both the inner and middle loops and we move on to the next document.

When I run it, I get:

$ perl 1041480.pl
----------- SEARCHING DOCUMENT 0 ---------
----- searching: /root/part/sect/header
<header> This is a design XZY document for Project </header>
----------- SEARCHING DOCUMENT 1 ---------
----- searching: /root/part/sect/header
----- searching: /root/para
<para>This is a design XZY document for Project</para>

Update: Added a "\n" to the print line to clean up the output a little.

..roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^2: Get Node Value from irregular XML
by madbee (Acolyte) on Jun 29, 2013 at 18:19 UTC

    @roboticus: Thanks so much for your help. I've tried this and it works. I didn't know we could specify multiple search paths. So the search stops at the first occurrence of "design". The only problem is that it may or may not be the right "design", so I'll have to find out whether I can specify other co-occurring terms along with it.

    Hopefully, I won't run into those situations, but if I do, I'll extract the wrong value if I go by "design" alone.

    Regards, Madbee

      madbee:

      If multiple searches could yield different results and you have an algorithm to determine which one is "better", then instead of stopping the search when you find the first one, call your function to score the result and stow it away. Then, once you do all the searches, you can choose the best one. Something like:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use autodie;
      use XML::LibXML;
      use Data::Dumper;

      for my $FName (qw(1041480.1 1041480.2)) {
          print "----------- SEARCHING DOCUMENT $FName ---------\n";
          my $dom = XML::LibXML->load_xml(location=>$FName);
          my @hits;
          for my $search ('/root/part/sect/header', '/root/para') {
              print "----- searching: $search\n";
              my $nodeset = $dom->find($search);
              my $text = join('', map { $_->string_value } $nodeset->get_nodelist);
              if ($text =~ /design/i) {
                  # found a match! score it and store it
                  my $score = goodness_evaluator($text);
                  push @hits, [ $score, $text ];
              }
          }
          if (@hits) {
              # Assuming a higher score means a better match, sort
              # highest-first so the best hit lands in $hits[0]
              @hits = sort {$b->[0] <=> $a->[0]} @hits;
              my ($score, $text) = @{$hits[0]};
              print "$FName: Best match: score=$score, text=$text\n";
          }
          else {
              print "$FName: no matches found\n";
          }
      }

      # Placeholder scoring function: it just sums the character codes.
      # Swap in whatever tells you one match is "better" than another.
      sub goodness_evaluator {
          my $t = shift;
          my $score = 0;
          $score += ord($_) for $t=~m/(.)/g;
          return $score;
      }
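
      The goodness_evaluator in that script is only a stand-in. If the "right" design line tends to appear alongside other words, one possible way to score it is to count co-occurring terms. The sketch below is just that, a sketch: the %WEIGHT table, its terms and its weights are made up for illustration, and it assumes a higher score means a better match. It runs on plain strings, so it needs no XML at all:

      ```perl
      #!/usr/bin/perl
      use strict;
      use warnings;

      # Hypothetical co-occurring terms and weights (assumptions, not taken
      # from the original question); replace them with words that really
      # distinguish the "right" design line in your documents.
      my %WEIGHT = (
          document => 2,
          project  => 2,
          xzy      => 1,
      );

      # Score a candidate string: 0 if it lacks the word "design", otherwise
      # 1 plus the weight of each listed term it contains (counted once each).
      sub goodness_evaluator {
          my ($text) = @_;
          return 0 unless $text =~ /\bdesign\b/i;
          my $score = 1;
          for my $term (keys %WEIGHT) {
              $score += $WEIGHT{$term} if $text =~ /\b\Q$term\E\b/i;
          }
          return $score;
      }

      my @candidates = (
          'This is some header',
          'Interior design tips',
          'This is a design XZY document for Project',
      );

      # Pair each candidate with its score, then sort highest score first.
      my @ranked = sort { $b->[0] <=> $a->[0] }
                   map  { [ goodness_evaluator($_), $_ ] } @candidates;

      printf "%d  %s\n", @$_ for @ranked;
      ```

      With these made-up weights, the string that mentions "document", "Project" and "XZY" next to "design" ranks first, a bare "design" ranks below it, and text without "design" scores zero. To use it in the script above, you would drop this sub in place of the placeholder one.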

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.