
I have XML files from which I need to extract certain values, so I wrote some code that simply searches each line for the relevant tags and pulls out the values.

The problem is: how do I print the value of the formal_charge property (e.g. -1)?

An excerpt from my XML file:

    <weight>18.998403205</weight>
    <name>fluoride</name>
    <smiles>[F-]</smiles>
    <accession>W00662</accession>
    <experimental_properties>
    <property>
        <kind>water_solubility</kind>
        <value>0.00169 mg/mL at 25 °C</value>
        <source/>
    </property>
    <predicted_properties>
    <property>
        <kind>formal_charge</kind>
        <value>-1</value>
        <source>ChemAxon</source>
    </property>

The values I want are:

weight (18.9984),

name (fluoride),

accession (W00662),

formal_charge (-1)

My code:

    sub load_files() {
      # get a list of all files in directory; ignore all files beginning with a . and other sub directories
      opendir(my $dh, $dirname) or die "can't opendir $dirname: $!";
      my @files = grep (/^[^\.]/ && -f "$dirname/$_", readdir($dh)); # only keep those not beginning with '.' and are files
      @files = sort(@files); # sort lexically, 'B' comes before 'a', so that output list is always in same order
      closedir $dh;

      my $numfiles = 0;

      foreach my $file (@files) { # loop through the files
        $numfiles++;

        my $accefound = 0;
        my $namefound = 0;
        my $monofound = 0;
        my $chargefound = 0;

        open(my $file_fh, "< $dirname/$file") or die("$$: Error: failed to open file $dirname/$file. $!\n");
        while (<$file_fh>) { # read each line of file
          if (/(<weight>)(.+)(<\/weight>)/ && !$monofound) { # if first encounter with the tag
            $monofound = $2;
            $monofound =~ s/^\s+//; # trim leading whitespace of string
            $monofound =~ s/\s+$//; # trim trailing whitespace of string
          }
          elsif (/(<name>)(.+)(<\/name>)/ && !$namefound) { # if first encounter with the tag
            $namefound = $2;
            $namefound =~ s/^\s+//; # trim leading whitespace of string
            $namefound =~ s/\s+$//; # trim trailing whitespace of string
          }
          elsif (/(<accession>)(.+)(<\/accession>)/ && !$accefound) { # if first encounter with the tag (the tag might not be unique)
            $accefound = $2;
            $accefound =~ s/^\s+//; # trim leading whitespace of string
            $accefound =~ s/\s+$//; # trim trailing whitespace of string
          }
          elsif (/(<formal_charge>)(.+)(<\/formal_charge>)/ && !$chargefound) { # if first encounter with the tag
            $chargefound = $2;
            $chargefound =~ s/^\s+//; # trim leading whitespace of string
            $chargefound =~ s/\s+$//; # trim trailing whitespace of string
          }
        }
        print "$monofound\t$namefound\t$accefound\t$chargefound\n";

        close($file_fh) or die("$$: Error: failed to close file $dirname/$file. $!\n");
      }
    }

    main();

What I get is:

_OUTPUT DATA_
18.998403205    fluoride    W00662    0

The charge comes out as "0" instead of -1. I know I should be matching the "value" tag, but the file contains many "value" tags. How do I match the one that belongs to this property, rather than incorrectly matching some other "value" tag?

    <property>
        <kind>formal_charge</kind>
        <value>-1</value>
        <source>ChemAxon</source>
    </property>

I would prefer not to involve any module. Is searching for the relevant strings sufficient for this?
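For reference, the excerpt has no `<formal_charge>` element: formal_charge is the text of a `<kind>` tag, so the last regex in the code never matches and `$chargefound` keeps its initial 0. One module-free way to handle this, as a minimal sketch, is to remember the most recent `<kind>` and accept a `<value>` only while that kind is formal_charge. The helper name `extract_fields` is made up for illustration, the sample is a trimmed copy of the excerpt with closing tags added, and the sketch assumes each tag sits on its own line as shown above. It needs Perl 5.10 or later for `//=`.

```perl
use strict;
use warnings;

# Hypothetical helper: read XML line by line from a filehandle and return
# a hash ref with the first weight, name and accession, plus the <value>
# of the first property whose <kind> is formal_charge.
sub extract_fields {
    my ($fh) = @_;
    my %found;
    my $kind = '';    # text of the most recent <kind> in the current property

    while (my $line = <$fh>) {
        if ($line =~ m{<(weight|name|accession)>\s*(.+?)\s*</\1>}) {
            $found{$1} //= $2;    # keep only the first occurrence of each tag
        }
        elsif ($line =~ m{<kind>\s*(.+?)\s*</kind>}) {
            $kind = $1;           # remember which property we are inside
        }
        elsif ($line =~ m{<value>\s*(.+?)\s*</value>}) {
            # accept the value only when it belongs to formal_charge
            $found{formal_charge} //= $1 if $kind eq 'formal_charge';
        }
        elsif ($line =~ m{</property>}) {
            $kind = '';           # leaving the property: forget its kind
        }
    }
    return \%found;
}

# Trimmed sample based on the excerpt above (assumed layout).
my $sample = <<'XML';
<weight>18.998403205</weight>
<name>fluoride</name>
<smiles>[F-]</smiles>
<accession>W00662</accession>
<experimental_properties>
<property>
    <kind>water_solubility</kind>
    <value>0.00169 mg/mL</value>
    <source/>
</property>
</experimental_properties>
<predicted_properties>
<property>
    <kind>formal_charge</kind>
    <value>-1</value>
    <source>ChemAxon</source>
</property>
</predicted_properties>
XML

# Read the sample through an in-memory filehandle instead of a real file.
open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
my $found = extract_fields($fh);
close $fh;

print join("\t", map { $found->{$_} // '' } qw(weight name accession formal_charge)), "\n";
```

On this sample the last column should be -1 rather than 0. Line-based matching like this breaks as soon as a tag and its closing tag land on different lines or the same tag name appears nested elsewhere, so a real XML parser is likely the safer choice if the files are not as regular as the excerpt.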

In reply to Getting the value from XML file by hellohello1
