Removing duplicate subtrees from XML

matth has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have an input file looking like:


1<species xxx = "sp">
1 <sequence xx = "" xxxxx = "xxxxxxx">
1  <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " ">
1   <gene xx = "xxxxxxxxxxx" xxxxxx = "x">
1    <gene_seq xxxxxxx = ""  xxxxxx =  ""   xxxxxxx = "2"  xxxxxxxxx =
+ ""  xxxxx = "5999"  xxxx = "6318"  xxxxxxx = ""  xxxxxxx = ""  xxxxx
+x = "F">
1    </gene_seq>
1   </gene>
1  </genome_feature>
1 </sequence>
1</species>
2<species xxx = "sp">
2 <sequence xx = "" xxxxx = "xxxxxxx">
2  <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " ">
2   <gene xx = "xxxxxxxxxxx" xxxxxx = "x">
2    <gene_seq xxxxxxx = ""  xxxxxx =  ""   xxxxxxx = "2"  xxxxxxxxx =
+ ""  xxxxx = "5999"  xxxx = "6318"  xxxxxxx = ""  xxxxxxx = ""  xxxxx
+x = "F">
2    </gene_seq>
2   </gene>
2  </genome_feature>
2 </sequence>
2</species>

etc.........................................

(xxxxxxs substitute real words)
[download]

My program goes through this file (above) with a while statement. It seeks to remove duplicate nodes so that it does not go back to the root nodes of species, sequence etc. for each gene tag.But it does not work. I have a subroutine along the lines of: (just add a few more lines dealing with more XML nodes)


sub deal_with_xml_line_by_line($){
    
    $final_out = "new_out_again.txt";
    open (OUTPUT_SLIMED, "+>>$final_out");
    
    my ($XML_line) = @_;
    
    $XML_class_node_X_old = $XML_class_node_X;
    $XML_class_first_node_old = $XML_class_first_node;

    if ($XML_process_line =~ /^(\d{1,10})([\%|\<].{1,1000}\>)/){
 print "\nhereF\n";
 print "\n$1\n";
 #exit;
 $XML_class_node_X = "$1.$2";
 
 if ($XML_class_node_X_old == $XML_class_node_X){
     #do nothing 
 }
 else{
     print OUTPUT_SLIMED "$XML_class_node_X\n";
     return $XML_class_node_X;
 }
    }
    if ($XML_process_line =~ /^(\d{1,10})(\s[\%|\<].{1,1000}\>)/){
 print "\nhereF\n";
 print "\n$1.$2\n";
 #exit;
 $XML_class_first_node = $1.$2; 
 # print ":$XML_class_fist_node\n";
 
 if ($XML_class_first_node_old == $XML_class_first_node){
     #do nothing 
 }
 else{
     print OUTPUT_SLIMED "$XML_class_first_node\n";
     return $XML_class_first_node;
 }
    }
}
[download]

The output produced from this is :

1 <species xxx = "sp">
1 <sequence xx = "" xxxxx = "xxxxxxx">
1  <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " ">
1   <gene xx = "xxxxxxxxxxx" xxxxxx = "x">
1    <gene_seq xxxxxxx = ""  xxxxxx =  ""   xxxxxxx = "2"  xxxxxxxxx =
+ ""  xxxxx = "5999"  xxxx = "6318"  xxxxxxx = ""  xxxxxxx = ""  xxxxx
+x = "F">
2 <species xxx = "sp">
2 <sequence xx = "" xxxxx = "xxxxxxx">
2  <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " ">
2   <gene xx = "xxxxxxxxxxx" xxxxxx = "x">
2    <gene_seq xxxxxxx = ""  xxxxxx =  ""   xxxxxxx = "2"  xxxxxxxxx =
+ ""  xxxxx = "5999"  xxxx = "6318"  xxxxxxx = ""  xxxxxxx = ""  xxxxx
+x = "F">
[download]

This is not what I want. Given the time I expect that I could solve this problem. But I have to go to bed now. Any suggestions?

Comment on Removing duplicate subtrees from XML Select or Download Code

Replies are listed 'Best First'.
Re: Removing duplicate subtrees from XML by dingus (Friar) on Dec 03, 2002 at 08:28 UTC
Two suggestions: 1) if the data is machine generated and you know that it will be identical when the gene sequence etc. is identical, that is to say you know that you will never see the tags presented in a a different order then you can usefully compare whole records at a time by setting the input record seperator appropriately: `$/ = '</species>.$/; my $prev_rec=''; while (<INFILE>) { next if ($_ eq $prev_rec); $prev_rec = $_; # process $_ somehow as it is unique }` [download] Note that in addition to the drawback noted above, this code requires records that are identical to be adjacent. 2) A better way is probably to XML::Twig (or XML::Simple perhaps) the file and compare the resulting data structures. This avoids the tags must be in identical order problem and, depending on your code, may also avoid the identical records must be adjacent problem too. A cunning way to compare two arbitrary data structures is to use Data::Dumper and string comparisons: `use Data::Dumper; if (Dumper(\%struct1) eq Dumper(\%struct2)) { do something; }` [download] The disadvantage is that this second method will be quite a lot slower. Because I quite like processing enormous files quickly I have used methods similar to the first method successfully on records pulled from pubmed. Dingus `Enter any 47-digit prime number to continue.`	[reply] [d/l] [select]
Re: Removing duplicate subtrees from XML by elusion (Curate) on Dec 03, 2002 at 01:55 UTC
I can see right now a BIG problem with this line: `if ($XML_process_line =~ /^(\d{1,10})([\%\|\<].{1,1000}\>)/){` I don't think it does what you want. First of all, this: `[\%\|\<]`. You use `[]` and `\|`, I think you want one or the other. If you want to alternate between `%` and `<`, use `[\%<]`. Second, and more important, this: `.{1,1000}>`. Perl's regexes are greedy, that means that if you do this: `"<one></one>" =~ /<(.{1,1000})>/;print $1;`, you're going to get `one></one` printed, because it matches as many characters as possible before stopping. Instead, you'd want to use `[\%\|\<][^>]{1,1000}>`, which uses a negative character class. That being said, this is hard to do and even harder to do right, so you should use a module. I would suggest XML::Twig, but there plenty of others as well. elusion : http://matt.diephouse.com Update: I also noticed that you use two variables for your line. You assign to `$XML_line`, but use your regex on `$XML_process_line`. Remember to use -w and strict.	[reply] [d/l] [select]
Re: Re: Removing duplicate subtrees from XML by matth (Monk) on Dec 03, 2002 at 11:27 UTC
reply to Update. Well spotted. I hand edited that variable in an attempt make the variable names more meaningfull, prior to pasting. Thanks for all the advice.	[reply]
Re: Removing duplicate subtrees from XML by dakkar (Hermit) on Dec 03, 2002 at 11:51 UTC
I think it would be better if you used some of the XML modules out there. Parsing XML by hand is (nearly) always the wrong choice. For example, I wrote a simple XSLT solution. It's not Perl, and it's not pretty, but it seems to work. Read more... (4 kB) -- dakkar - Mobilis in mobile	[reply] [d/l] [select]
Re: Re: Removing duplicate subtrees from XML by matth (Monk) on Dec 03, 2002 at 12:49 UTC
XSLT look like a logical way of representing the data structure. However, it also looks like I would have to spend some time learning about this system. I may come back to it in the future. Is there first class perl module support for XSLT?	[reply]
Re: Re: Re: Removing duplicate subtrees from XML by mirod (Canon) on Dec 03, 2002 at 12:58 UTC
XSLT is a language based on very different principles than Perl. It is very popular for XML transformation (although I could never quite figure out why ;--). See Recommend XSL module for a list of Perl modules that support it (usually by interfacing with external XSLT engines).	[reply]
Re: Removing duplicate subtrees from XML by Zaxo (Archbishop) on Dec 03, 2002 at 02:15 UTC
If I understand correctly, you want to rewrite the xml file to consolodate like roots in a common root subtree. I made a stab at it, but your sample data is too ill-formed to test. Please post some correct data. Schematic or sketchy data is ok, but at least validate it. After Compline, Zaxo	[reply]
Re: Removing duplicate subtrees from XML by mirod (Canon) on Dec 03, 2002 at 12:41 UTC
Your requirements are a bit sketchy and I can see 2 things you might want to do in this case (once you get proper XML of course ;--): get the minimun document with properly nested container elements, each "low-level" element being in the proper hierarchy. In general, unless the low-level elements are already ordered (in which case see below), you will need to load the entire data set in memory, which might be a problem if you are dealing with the Human Genome, just try to merge adjacent nodes, that is if a node can be merged with the previous one, then be it, otherwise don't try to merge it with nodes that are further apart. This can be performed by keeping just parts of the data in memory, so it can be performed on much bigger data sets. This should give the same result as the previous option if the nodes are sorted. Here is a piece of code using the usual suspect that does both (on a made up XML data set, but adapting it to your data should be really easy): #!/usr/bin/perl -w use strict; use XML::Twig; $/="\n\n"; # tag => attribute we are interested in # you could avoid having this global by putting it in the twig my %att= ( elt => 'elt_class', subelt => 'subelt_class', ); my $doc = <DATA>; # the original data set my $expected_sorted_doc = <DATA>; # sorted result my $expected_merged_doc = <DATA>; # merged result my $sorted_doc= sort_doc( $doc); if( compact( $sorted_doc) eq compact( $expected_sorted_doc)) { print "sorted doc generation OK\n"; } else { print "sorted doc generation NOK: \n", "expected:\n$expected_sorted_doc\n", "found:\n$sorted_doc\n"; } my $merged_doc= merge_doc( $doc); if( compact( $merged_doc) eq compact( $expected_merged_doc)) { print "merged doc generation OK\n"; } else { print "merged doc generation NOK: \n", "expected:\n$expected_merged_doc\n", "found:\n$merged_doc\n"; } # sort: for each relevant node (elt or subelt in this case) generate a + location # key and move the content if the location already exists sub sort_doc { my( $doc)= @_; my $location={}; # location key => existing subelt element with t +his location key my $t= XML::Twig->new( twig_handlers => { elt => sub { sort_nod +e( $location, @_); }, subelt => sub { sort_nod +e( $location, @_); }, }, pretty_print => 'indented', # makes debuggi +ng easier ); $t->parse( $doc); return $t->sprint; } sub sort_node { my( $location, $t, $node)= @_; # compute the location key, which must describe uniquely the node +category my $location_key= location( $node); # now see if we need to move the content if( my $new_parent= $location->{$location_key}) { # there is already an element with this location key # move all content's there foreach my $content ($node->children) { $content->move( last_child => $new_parent); } # no need to keep the empty shell $node->delete unless( $node->has_child); } else { # first time we see the location key, store the element in $lo +cation $location->{$location_key}= $node; } } # the location describes a node category, nodes with the same location + should have the same parent sub location { my( $node)= @_; # a compact way to just join the values of the proper attributes o +f the ancestors of the node my $location= join( '-', grep {$_} map { $_->att( $att{$_->tag}) \| +\| '' } (@{[$node->ancestors]}, $node)); #warn "location: $location\n"; return $location; } # merge doc sub merge_doc { my( $doc)= @_; my $t= XML::Twig->new( twig_handlers => { elt => \&merge_node, subelt => \&merge_node, }, pretty_print => 'indented', ); $t->parse( $doc); return $t->sprint; } sub merge_node { my( $t, $node)= @_; my $potential_merger= $node->prev_elt( $node->tag) or return; # re +turn if this is the first node of this type if( location( $node) eq location( $potential_merger)) { # bingo! we can merge the contents foreach my $content ($node->children) { $content->move( last_child => $potential_merger); } } else { # this branch is not used for this test as we are working in m +emory # but this is where you could free the memory by dumping the p +art of # the tree that will no longer need to be updated # $t->flush_up_to( $potential_merger); } $node->delete unless( $node->has_child); } sub compact { my( $doc)= @_; $doc=~ s{^\s+}{}; # trim at the begining $doc=~ s{\s+$}{}; # trim the end $doc=~ s{>\s*<}{><}g; # trim spaces between tags return $doc; } __DATA__ <doc> <elt elt_class="class1"> <subelt subelt_class="sclass1"><content id="content1"/></subelt> </elt> <elt elt_class="class1"> <subelt subelt_class="sclass1"><content id="content2"/></subelt> </elt> <elt elt_class="class1"> <subelt subelt_class="sclass2"><content id="content3"/></subelt> </elt> <elt elt_class="class2"> <subelt subelt_class="sclass3"><content id="content4"/></subelt> </elt> <elt elt_class="class1"> <subelt subelt_class="sclass1"><content id="content5"/></subelt> </elt> </doc> <doc> <elt elt_class="class1"> <subelt subelt_class="sclass1"> <content id="content1"/> <content id="content2"/> <content id="content5"/> </subelt> <subelt subelt_class="sclass2"> <content id="content3"/> </subelt> </elt> <elt elt_class="class2"> <subelt subelt_class="sclass3"> <content id="content4"/> </subelt> </elt> </doc> <doc> <elt elt_class="class1"> <subelt subelt_class="sclass1"> <content id="content1"/> <content id="content2"/> </subelt> <subelt subelt_class="sclass2"> <content id="content3"/> </subelt> </elt> <elt elt_class="class2"> <subelt subelt_class="sclass3"> <content id="content4"/> </subelt> </elt> <elt elt_class="class1"> <subelt subelt_class="sclass1"> <content id="content5"/> </subelt> </elt> </doc> [download]	[reply] [d/l]
Re: Re: Removing duplicate subtrees from XML by matth (Monk) on Dec 09, 2002 at 16:16 UTC
Thanks for this code. It works very well with your data. Unfortunately it does not seem to work when attribute data is missing. Is there an easy way to overcome this?	[reply]