Hi all, I have an input file looking like:


1<species xxx = "sp">
1 <sequence xx = "" xxxxx = "xxxxxxx">
1  <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " ">
1   <gene xx = "xxxxxxxxxxx" xxxxxx = "x">
1    <gene_seq xxxxxxx = ""  xxxxxx =  ""   xxxxxxx = "2"  xxxxxxxxx = ""  xxxxx = "5999"  xxxx = "6318"  xxxxxxx = ""  xxxxxxx = ""  xxxxxx = "F">
1    </gene_seq>
1   </gene>
1  </genome_feature>
1 </sequence>
1</species>
2<species xxx = "sp">
2 <sequence xx = "" xxxxx = "xxxxxxx">
2  <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " ">
2   <gene xx = "xxxxxxxxxxx" xxxxxx = "x">
2    <gene_seq xxxxxxx = ""  xxxxxx =  ""   xxxxxxx = "2"  xxxxxxxxx = ""  xxxxx = "5999"  xxxx = "6318"  xxxxxxx = ""  xxxxxxx = ""  xxxxxx = "F">
2    </gene_seq>
2   </gene>
2  </genome_feature>
2 </sequence>
2</species>

etc.........................................

(the runs of x's stand in for real words)

My program goes through this file (above) with a while loop. It is meant to remove duplicate nodes, so that the output does not go back to the root nodes (species, sequence, etc.) for each gene tag, but it does not work. I have a subroutine along the lines of the following (just add a few more lines dealing with more XML nodes):


sub deal_with_xml_line_by_line($){

    $final_out = "new_out_again.txt";
    open (OUTPUT_SLIMED, "+>>$final_out");

    my ($XML_line) = @_;

    $XML_class_node_X_old = $XML_class_node_X;
    $XML_class_first_node_old = $XML_class_first_node;

    if ($XML_process_line =~ /^(\d{1,10})([\%|\<].{1,1000}\>)/){
        print "\nhereF\n";
        print "\n$1\n";
        #exit;
        $XML_class_node_X = "$1.$2";

        if ($XML_class_node_X_old == $XML_class_node_X){
            #do nothing
        }
        else{
            print OUTPUT_SLIMED "$XML_class_node_X\n";
            return $XML_class_node_X;
        }
    }
    if ($XML_process_line =~ /^(\d{1,10})(\s[\%|\<].{1,1000}\>)/){
        print "\nhereF\n";
        print "\n$1.$2\n";
        #exit;
        $XML_class_first_node = $1.$2;
        # print ":$XML_class_fist_node\n";

        if ($XML_class_first_node_old == $XML_class_first_node){
            #do nothing
        }
        else{
            print OUTPUT_SLIMED "$XML_class_first_node\n";
            return $XML_class_first_node;
        }
    }
}

The output produced from this is :

1 <species xxx = "sp">
1 <sequence xx = "" xxxxx = "xxxxxxx">
1  <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " ">
1   <gene xx = "xxxxxxxxxxx" xxxxxx = "x">
1    <gene_seq xxxxxxx = ""  xxxxxx =  ""   xxxxxxx = "2"  xxxxxxxxx = ""  xxxxx = "5999"  xxxx = "6318"  xxxxxxx = ""  xxxxxxx = ""  xxxxxx = "F">
2 <species xxx = "sp">
2 <sequence xx = "" xxxxx = "xxxxxxx">
2  <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " ">
2   <gene xx = "xxxxxxxxxxx" xxxxxx = "x">
2    <gene_seq xxxxxxx = ""  xxxxxx =  ""   xxxxxxx = "2"  xxxxxxxxx = ""  xxxxx = "5999"  xxxx = "6318"  xxxxxxx = ""  xxxxxxx = ""  xxxxxx = "F">

This is not what I want. Given time, I expect I could solve this problem myself, but I have to go to bed now. Any suggestions?
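For reference, here is a rough sketch of the behaviour I am after, not a tested solution. It assumes that every tag sits on a single line (wrapped lines already joined), that each line starts with the record number, and that the number of spaces after that number gives the nesting depth. The name dedupe_open_tags is made up for this illustration. It compares tags with eq, because == compares numerically and a string such as "1.<species ...>" is then treated as the number 1. Closing tags are simply skipped, and an opening tag is dropped when it repeats the last tag seen at the same depth under unchanged ancestors.

```perl
use strict;
use warnings;

# Hypothetical helper (not from my real program): takes numbered, indented
# lines and returns the opening tags to keep, without the record number.
sub dedupe_open_tags {
    my @lines = @_;
    my @seen;    # $seen[$depth] holds the last opening tag kept at that depth
    my @kept;

    for my $line (@lines) {
        # Record number, indentation, then an opening tag.
        # Closing tags (</...>) and anything else do not match and are skipped.
        next unless $line =~ /^\d+(\s*)(<[^\/].*>)\s*$/;
        my ($depth, $tag) = (length $1, $2);

        # Same tag at the same depth, with no ancestor changed since: duplicate.
        next if defined $seen[$depth] && $seen[$depth] eq $tag;

        # A new tag at this depth: forget this depth and everything deeper,
        # so children are re-emitted under the new parent.
        $#seen = $depth - 1;
        $seen[$depth] = $tag;
        push @kept, (' ' x $depth) . $tag;
    }
    return @kept;
}
```

Called on two records that differ only in their gene line, this would keep the species and sequence tags once and both gene tags. Note that it also drops a gene line that is identical to the previous one under the same parents, which may or may not be wanted.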

In reply to Removing duplicate subtrees from XML by matth
