Parse XML and compare with Fasta in Perl

ad23 has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I have multiple XML files in a directory. They have the following information:

<?xml version="1.0"?>
<t_volume>
        <info>
            <info_name>FZGA34177.b1</info_name>
            <center_project>4085729</center_project>
            <base_file>SETARIA_ITALICA/JGI/fasta/FZGA34177.b1.fasta</base_file>
            <qual_file>SETARIA_ITALICA/JGI/qscore/FZGA34177.b1.qscore</qual_file>
            <it_flank_left>AATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCAC</it_flank_left>
            <it_flank_right>GTGGGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTTGAGTATTCTAT</it_flank_right>
            <it_size>35000</it_size>
            <it_stdev>3500</it_stdev>
            <plate_id>357</plate_id>
            <program_id>KB 1.3.0</program_id>
            <seq_lib_id>FZGA</seq_lib_id>
            <ncbi_project_id>32913</ncbi_project_id>
            <ncbi_info_archive>
                <ti>2167749207</ti>
                <taxid>4555</taxid>
                <basecall_length>899</basecall_length>
                <load_date>Nov 26 2008  4:06PM</load_date>
                <state>active</state>
            </ncbi_info_archive>
        </info>
        <info>
            <info_name>FZGA34178.b1</info_name>
            <center_project>4085729</center_project>
            <base_file>SETARIA_ITALICA/JGI/fasta/FZGA34177.b1.fasta</base_file>
            <qual_file>SETARIA_ITALICA/JGI/qscore/FZGA34177.b1.qscore</qual_file>
            <it_flank_left>AATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCAC</it_flank_left>
            <it_flank_right>GTGGGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTTGAGTATTCTAT</it_flank_right>
            <it_size>12000</it_size>
            <it_stdev>1200</it_stdev>
            <plate_id>357</plate_id>
            <program_id>KB 1.3.0</program_id>
            <seq_lib_id>FZGA</seq_lib_id>
            <ncbi_project_id>32913</ncbi_project_id>
            <ncbi_info_archive>
                <ti>2167749207</ti>
                <taxid>4555</taxid>
                <basecall_length>899</basecall_length>
                <load_date>Nov 26 2008  4:06PM</load_date>
                <state>active</state>
            </ncbi_info_archive>
        </info>
        <info>
            <info_name>FZGA34179.b1</info_name>
            <center_project>4085729</center_project>
            <base_file>SETARIA_ITALICA/JGI/fasta/FZGA34177.b1.fasta</base_file>
            <qual_file>SETARIA_ITALICA/JGI/qscore/FZGA34177.b1.qscore</qual_file>
            <it_flank_left>AATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCAC</it_flank_left>
            <it_flank_right>GTGGGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTTGAGTATTCTAT</it_flank_right>
            <it_size>7000</it_size>
            <it_stdev>700</it_stdev>
            <plate_id>357</plate_id>
            <program_id>KB 1.3.0</program_id>
            <seq_lib_id>FZGA</seq_lib_id>
            <ncbi_project_id>32913</ncbi_project_id>
            <ncbi_info_archive>
                <ti>2167749207</ti>
                <taxid>4555</taxid>
                <basecall_length>899</basecall_length>
                <load_date>Nov 26 2008  4:06PM</load_date>
                <state>active</state>
            </ncbi_info_archive>

        </info>
        <info>
            <info_name>FZGA34180.b1</info_name>
            <center_project>4085729</center_project>
            <base_file>SETARIA_ITALICA/JGI/fasta/FZGA34177.b1.fasta</base_file>
            <qual_file>SETARIA_ITALICA/JGI/qscore/FZGA34177.b1.qscore</qual_file>
            <it_flank_left>AATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCAC</it_flank_left>
            <it_flank_right>GTGGGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTTGAGTATTCTAT</it_flank_right>
            <it_size>3000</it_size>
            <it_stdev>300</it_stdev>
            <plate_id>357</plate_id>
            <program_id>KB 1.3.0</program_id>
            <seq_lib_id>FZGA</seq_lib_id>
            <ncbi_project_id>32913</ncbi_project_id>
            <ncbi_info_archive>
                <ti>2167749207</ti>
                <taxid>4555</taxid>
                <basecall_length>899</basecall_length>
                <load_date>Nov 26 2008  4:06PM</load_date>
                <state>active</state>
            </ncbi_info_archive>

        </info>
        <info>
            <info_name>FZGA34181.b1</info_name>
            <center_project>4085729</center_project>
            <base_file>SETARIA_ITALICA/JGI/fasta/FZGA34177.b1.fasta</base_file>
            <qual_file>SETARIA_ITALICA/JGI/qscore/FZGA34177.b1.qscore</qual_file>
            <it_flank_left>AATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCAC</it_flank_left>
            <it_flank_right>GTGGGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTTGAGTATTCTAT</it_flank_right>
            <it_size>7000</it_size>
            <it_stdev>700</it_stdev>
            <plate_id>357</plate_id>
            <program_id>KB 1.3.0</program_id>
            <seq_lib_id>FZGA</seq_lib_id>
            <ncbi_project_id>32913</ncbi_project_id>
            <ncbi_info_archive>
                <ti>2167749207</ti>
                <taxid>4555</taxid>
                <basecall_length>899</basecall_length>
                <load_date>Nov 26 2008  4:06PM</load_date>
                <state>active</state>
            </ncbi_info_archive>
        </info>

I want to retrieve some information from these files as a hash (key and value).

I am trying to parse this XML file as follows:

#!/usr/bin/perl

use XML::Simple;
use Data::Dumper;

$xml = new XML::Simple(KeyAttr=>[]);
$data = $xml -> XMLin("InfoFile.xml");
#print Dumper($info);

print "XML read in\n";

foreach $e (@{$data->{info}})
{    
    print $e->{info_name},"\n";
    print "it Size: ", $e->{it_size}, "\n";
    print "\n";
}

My code reads the XML file, but nothing is printed in the foreach loop. Can someone please suggest where I am going wrong?

Furthermore, I have multiple FASTA files, and I want to compare the XML key information (i.e. <info_name>...</info_name>) with the FASTA header name. An example of a FASTA file is:

>FZGA34177.b1 bg_2167749207  
CATAACAGGAGAGTAAACATGTAACTCCTATAACTCGCGGGGTGTGCTGTTATTACCTCCTTGGTGGAAC
AGGAAACCTGGGAAACGCTTGTTCAGATATTCGTCTGTTTCCCATGTTGCTTCATCTTCAGTGTGGTTTA
>FZGA34178.b1 bg_2167749208  
ACTCTCTTGAGGCATTCACCGGATTGACCGGCGGTGTCCTGGAAGGAGGTGTCCTTCAGGCCTCGTTCAG
TAGCATAGGATTGGCACTAGACCAAATTTTGATCATGGTCAGGATCGAGTGGATCCTGTTTTCTCATTGA
AACTTGGTGACTAATCATTCCTCCCCAGGATCAAAACCATTGATTCAAAAGCAGTGTTTGGCTGGAGAGG
AAAGAAAACAGGGGATCAAATAGAGCTGTACTAGAAAGCAATGAACAGAGCTGGCTAGGATCCAGAGCCA
>FZGA34179.b1 bg_2167749209  
CAGCCTTGGCCGACAGGCCCGGGTAATCTTGGGAAATTTCATCGTGATGGGGATAGATCATTGCAATTGT
TGGTCTTCAACGAGGAATGCCTAGTAAGCGCGAGTCATCAGCTCGCGTTGACTACGTCCCTGCCCTTTGT
ACACACCGCCCGTCGCTCCTACCGATTGAATGGTCCGGTGAAGTGTTCGGATCGCGGCGACGGAGGCGGT

If the header of a FASTA record (e.g. FZGA34177.b1) matches an <info_name> (e.g. <info_name>FZGA34177.b1</info_name>), I want to look up the corresponding hash value (e.g. <it_size>35000</it_size>) and write the FASTA record (both header and sequence) to a new file named after that value (e.g. 35000.fasta.output). There will be one such output file for each distinct "it_size". The difficulty is that there are multiple XML files and multiple FASTA files, so I need to read all of them together in order to find the sequences corresponding to each <it_size>.

Can someone please guide me on how to approach this problem?

Thanks.


Replies are listed 'Best First'.
Re: Parse XML and compare with Fasta in Perl by toolic (Bishop) on Jul 06, 2010 at 20:10 UTC
"My code is reading the XML file, but is not printing in foreach loop."

It prints out the following for me after I add a closing "t_volume" tag to your XML:

XML read in
FZGA34177.b1
it Size: 35000

FZGA34178.b1
it Size: 12000

FZGA34179.b1
it Size: 7000

FZGA34180.b1
it Size: 3000

FZGA34181.b1
it Size: 7000

Download the code you posted, download the XML you posted, fix the XML, then run your code.
Re^2: Parse XML and compare with Fasta in Perl by ad23 (Acolyte) on Jul 06, 2010 at 20:41 UTC
I still cannot print the required output, even though I have copied the XML document and corrected it.
Re^3: Parse XML and compare with Fasta in Perl by toolic (Bishop) on Jul 06, 2010 at 20:50 UTC
Add this to your code:

print Dumper($data);

Then post the results inside readmore tags (see Writeup Formatting Tips).
Re: Parse XML and compare with Fasta in Perl by graff (Chancellor) on Jul 07, 2010 at 02:45 UTC
Ever since I got acquainted with XPath syntax (finally! Why did I wait so long??) and the really excellent libxml2 library (which has a thorough and well-documented Perl wrapper, XML::LibXML), I'm having a lot more fun pulling stuff out of XML streams.

Below is a little Perl script that uses XML::LibXML and its XPath abilities to provide a generic command-line method for extracting any specific content from an XML file, so long as you can provide the XPath syntax for the content you want.

Given that script, the particular task stated in the OP can be accomplished with this command line (assuming the XML data has the required closing tag, as mentioned in a previous reply, and is stored in a file called "test.xml"):

exp -p "//info_name | //it_size" test.xml
# output: FZGA34177.b1 35000 FZGA34178.b1 12000 FZGA34179.b1 7000 FZGA34180.b1 3000 FZGA34181.b1 7000

There's a pretty good reference for XPath usage here: http://www.w3schools.com/XPath/default.asp.

The code for my "exp" utility is pretty simple.
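For readers who want to try the XPath approach directly, here is a minimal sketch of pulling the info_name / it_size pairs into a hash with XML::LibXML. It is not the "exp" utility from this reply; the helper name size_by_name and the trimmed two-record sample are assumptions made for illustration, and the sample already includes the closing t_volume tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not from the thread): map each <info_name>
# to its <it_size> for one XML document passed in as a string.
sub size_by_name {
    my ($xml_text) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml_text);
    my %size_for;
    for my $info ($doc->findnodes('//info')) {
        # XPath relative to the current <info> element
        my $name = $info->findvalue('./info_name');
        my $size = $info->findvalue('./it_size');
        $size_for{$name} = $size if length $name;
    }
    return \%size_for;
}

# Trimmed sample in the shape of the posted data (assumed, shortened),
# with the closing </t_volume> tag in place.
my $sample = <<'XML';
<?xml version="1.0"?>
<t_volume>
  <info><info_name>FZGA34177.b1</info_name><it_size>35000</it_size></info>
  <info><info_name>FZGA34178.b1</info_name><it_size>12000</it_size></info>
</t_volume>
XML

my $size_for = size_by_name($sample);
print "$_\t$size_for->{$_}\n" for sort keys %$size_for;
```

To cover a whole directory, one option would be to loop over glob("*.xml"), parse each file with XML::LibXML->load_xml(location => $file), and merge the resulting pairs into a single hash.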
Re: Parse XML and compare with Fasta in Perl by graff (Chancellor) on Jul 07, 2010 at 03:05 UTC
I forgot to respond to this part of the OP:

"If the header element of fasta (eg: FZGA34177.b1) matches <info_name> (eg: <info_name>FZGA34177.b1</info_name>), it will check the hash value (eg: <it_size>35000</it_size>) and write the fasta sequence (header and sequence both), to a new file (eg: 35000.fasta.output). Similarly, there will be various other files corresponding to "it_size". The issue is these XML files and fasta files are multiple files, and I thus need to read all of them all together in order to find the sequence corresponding to <it_size>."

I'm not sure I follow all that, but it sounds like you want to build an index of your fasta files. Think of it as a hash, keyed by the "info_name" strings in the fasta files, and having the sequence strings as values. Then, as you get the pairs of "info_name" and "it_size" fields from the XML data, you just look up the info_name in the hash index, and do whatever you need to do with the corresponding fasta sequence strings.
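The index idea above could be sketched roughly as follows, using only core Perl. The helper names index_fasta and bucket_by_size, the %size_for hash (standing in for pairs already extracted from the XML), and the shortened sequences are all assumptions made for illustration. The sketch reads FASTA from an in-memory string and prints the groups instead of writing files, so it can be run as is.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read FASTA from a filehandle and return
# { name => full record text (header line plus sequence lines) }.
sub index_fasta {
    my ($fh) = @_;
    my (%record_for, $name);
    while (my $line = <$fh>) {
        if ($line =~ /^>(\S+)/) {      # header: name is the first token after ">"
            $name = $1;
            $record_for{$name} = '';
        }
        next unless defined $name;     # skip anything before the first header
        $record_for{$name} .= $line;
    }
    return \%record_for;
}

# Hypothetical helper: group records by it_size, returning { size => text }.
sub bucket_by_size {
    my ($record_for, $size_for) = @_;
    my %text_for;
    for my $name (sort keys %$record_for) {
        my $size = $size_for->{$name};
        next unless defined $size;     # no XML entry for this read
        $text_for{$size} .= $record_for->{$name};
    }
    return \%text_for;
}

# Assumed input: name => it_size pairs as they would come out of the XML.
my %size_for = ('FZGA34177.b1' => 35000, 'FZGA34178.b1' => 12000);

# Shortened sample records (sequences truncated for the example).
my $fasta = ">FZGA34177.b1 bg_2167749207\nCATAACAGG\nAGGAAACC\n"
          . ">FZGA34178.b1 bg_2167749208\nACTCTCTTG\n"
          . ">FZGA34179.b1 bg_2167749209\nCAGCCTTGG\n";

open my $in, '<', \$fasta or die "cannot open in-memory FASTA: $!";
my $buckets = bucket_by_size(index_fasta($in), \%size_for);
close $in;

for my $size (sort { $a <=> $b } keys %$buckets) {
    print "--- $size.fasta.output ---\n$buckets->{$size}";
}
```

With real data, the print at the end would presumably become an open on "$size.fasta.output" in append mode, and index_fasta would be called once per FASTA file. Reads with no matching info_name (FZGA34179.b1 in this sample) are simply skipped.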