in reply to Re^3: multiple XML fields in one line
in thread multiple XML fields in one line

Yep, here is the first part. After the <Iteration_hits> there are lots of <Hit>s, all with exactly the same structure, so I kept only the first two. The point is to search in the <Hit_def> of these <Hit>s, but in the output we'd like to see the <Hit_num> and <Hsp_identity> values of the matching <Hit>s too.

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_version>BLASTN 2.2.29+</BlastOutput_version>
  <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&amp;auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_query-ID>96197</BlastOutput_query-ID>
  <BlastOutput_query-def>No definition line</BlastOutput_query-def>
  <BlastOutput_query-len>16</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_expect>1000</Parameters_expect>
      <Parameters_sc-match>1</Parameters_sc-match>
      <Parameters_sc-mismatch>-3</Parameters_sc-mismatch>
      <Parameters_gap-open>5</Parameters_gap-open>
      <Parameters_gap-extend>2</Parameters_gap-extend>
      <Parameters_filter>F</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>96197</Iteration_query-ID>
      <Iteration_query-def>No definition line</Iteration_query-def>
      <Iteration_query-len>16</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|410994849|gb|CP003920.1|</Hit_id>
          <Hit_def>Uncultured Sulfuricurvum sp. RIFRC-1, complete genome</Hit_def>
          <Hit_accession>CP003920</Hit_accession>
          <Hit_len>2358861</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>32.2105</Hsp_bit-score>
              <Hsp_score>16</Hsp_score>
              <Hsp_evalue>21.4857</Hsp_evalue>
              <Hsp_query-from>1</Hsp_query-from>
              <Hsp_query-to>16</Hsp_query-to>
              <Hsp_hit-from>1544571</Hsp_hit-from>
              <Hsp_hit-to>1544556</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>-1</Hsp_hit-frame>
              <Hsp_identity>16</Hsp_identity>
              <Hsp_positive>16</Hsp_positive>
              <Hsp_gaps>0</Hsp_gaps>
              <Hsp_align-len>16</Hsp_align-len>
              <Hsp_qseq>ATTCGATCGGTTACTC</Hsp_qseq>
              <Hsp_hseq>ATTCGATCGGTTACTC</Hsp_hseq>
              <Hsp_midline>||||||||||||||||</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>2</Hit_num>
          <Hit_id>gi|119500557|ref|XM_001267035.1|</Hit_id>
          <Hit_def>Neosartorya fischeri NRRL 181 conserved hypothetical protein (NFIA_106270) partial mRNA</Hit_def>
          <Hit_accession>XM_001267035</Hit_accession>
          <Hit_len>1188</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>32.2105</Hsp_bit-score>
              <Hsp_score>16</Hsp_score>
              <Hsp_evalue>21.4857</Hsp_evalue>
              <Hsp_query-from>1</Hsp_query-from>
              <Hsp_query-to>16</Hsp_query-to>
              <Hsp_hit-from>384</Hsp_hit-from>
              <Hsp_hit-to>399</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>1</Hsp_hit-frame>
              <Hsp_identity>16</Hsp_identity>
              <Hsp_positive>16</Hsp_positive>
              <Hsp_gaps>0</Hsp_gaps>
              <Hsp_align-len>16</Hsp_align-len>
              <Hsp_qseq>ATTCGATCGGTTACTC</Hsp_qseq>
              <Hsp_hseq>ATTCGATCGGTTACTC</Hsp_hseq>
              <Hsp_midline>||||||||||||||||</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>21147878</Statistics_db-num>
          <Statistics_db-len>2146527478</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.710602795216363</Statistics_kappa>
          <Statistics_lambda>1.37406312246009</Statistics_lambda>
          <Statistics_entropy>1.30724660390929</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>

Re^5: multiple XML fields in one line
by poj (Abbot) on Aug 08, 2014 at 21:36 UTC

    This test program works against your sample data. Try running it against the complete file:

    #!perl
    use strict;
    use warnings;
    use XML::Simple;
    use Data::Dump 'pp';

    my $blast = XMLin('BLAST1.XML');
    my $hits = $blast->{BlastOutput_iterations}->{Iteration}->{Iteration_hits}->{Hit};
    my $ret;
    #push @ret, $_->{Hit_def} foreach (@{$hits});
    foreach (@{$hits}) {
      push @{$ret},join '|',
        $_->{Hit_def},
        $_->{Hit_num},
        $_->{Hit_hsps}->{Hsp}->{Hsp_identity};
    }
    pp $ret;
    poj

      Ah, it's killing me. I tried your test program with my original XML file. Same error as before: 'Not a HASH reference at line 12' (which is: push @{$ret},join '|',).

      However, I also tried it with the partial file that I sent you. Wow! It works perfectly! I ran the original program again with your modification on the partial XML file. Again, it works perfectly, and I get exactly the results I hoped for.

      So is it related to the input file? Maybe my XML file is somehow messed up. So for testing I generated a few more XML files with the appropriate software, but all of them caused this 'Not a HASH reference' error. I compared the complete XML files with the partial XML I sent you, went over and over them like a thousand times, but I couldn't find any difference, except for the number of 'Hit'-s of course, and consequently, the size. Oh, there was one other thing: In the complete XMLs the lines ended with a single newline character (\n), but in the partial XML the EOL was a carriage return and a newline (\r\n). So I replaced all the \n with \r\n, but I still got the error, so the EOL seems to be irrelevant. And with the partial XML the program still worked correctly even if I replaced every \r\n with \n.

      I also tried to shamelessly hack into your code with my limited Perl knowledge, trying different ways to reference, but it only got worse (as had been expected :))

      So all in all, I am totally clueless. I don't get why it should be a HASH reference in the first place; @{$ret} is an array, right? Not a hash. Then I don't get how the input file influences the reference, especially since line 12 contains nothing related to the input file: it only says that we will push values onto the end of the empty @{$ret} array (and join some of them). And finally, I don't get what the key difference is between the 'good' and 'bad' XML files. Why does only the partial file work? If the program runs properly for 2 hits, why doesn't it for 99 hits?

      Mysterious. So much for today, tomorrow I will start removing the hits from a complete XML file one by one, to see if there is a size limit somewhere, or if it has any effect at all...

      Thank you for your selfless help again!

        Look in the XML file for instances where you have multiple <Hit_hsps> tags within a <Hit> or multiple <Hsp> tags within a <Hit_hsps>.
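
The reason repeated tags matter can be sketched without any XML at all. The snippet below is only an illustration: the two hashes are hand-built stand-ins (an assumption, not real parser output) for what XML::Simple returns with default options, where a single child element is folded into a hash reference and repeated siblings become an array reference. Treating the array reference as a hash is what raises 'Not a HASH reference'; the @{$ret} array is not involved. Perl normally reports the line on which a multi-line statement starts, which would explain why the message points at the push line rather than at the ->{Hsp_identity} line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-built stand-ins (assumption) for default XML::Simple output.
# One <Hsp> inside <Hit_hsps>: the element is folded into a hash reference.
my $hit_one = {
    Hit_num  => 1,
    Hit_hsps => { Hsp => { Hsp_identity => '16' } },
};

# Two <Hsp> inside <Hit_hsps>: the same key now holds an ARRAY reference.
my $hit_many = {
    Hit_num  => 2,
    Hit_hsps => { Hsp => [ { Hsp_identity => '16a' }, { Hsp_identity => '16b' } ] },
};

my @report;
for my $hit ( $hit_one, $hit_many ) {
    # ref() tells us which kind of reference we were handed.
    my $kind = ref $hit->{Hit_hsps}{Hsp};

    # Same access pattern as the test program; eval traps the fatal error
    # so that both hits can be inspected in one run.
    my $identity = eval { $hit->{Hit_hsps}{Hsp}{Hsp_identity} };
    my $error    = defined $identity ? 'ok' : $@;
    chomp $error;

    push @report, { num => $hit->{Hit_num}, kind => $kind, error => $error };
    print "Hit $hit->{Hit_num}: Hsp is a $kind reference, result: $error\n";
}
```

The same thing happens one level up when a <Hit> contains more than one <Hit_hsps>: the Hit_hsps value becomes an array reference, so ->{Hsp} fails in the same way.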

        This test data replicates your error:

        <?xml version="1.0"?>
        <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
        <BlastOutput>
          <BlastOutput_iterations>
            <Iteration>
              <Iteration_hits>
                <Hit>
                  <Hit_num>1</Hit_num>
                  <Hit_def>Uncultured Sulfuricurvum sp. RIFRC-1, complete genome</Hit_def>
                  <Hit_hsps>
                    <Hsp>
                      <Hsp_identity>16</Hsp_identity>
                    </Hsp>
                  </Hit_hsps>
                </Hit>
                <Hit>
                  <Hit_num>2</Hit_num>
                  <Hit_def>Neosartorya fischeri NRRL 181 conserved hypothetical protein (NFIA_106270) partial mRNA</Hit_def>
                  <Hit_hsps>
                    <Hsp>
                      <Hsp_identity>16</Hsp_identity>
                    </Hsp>
                  </Hit_hsps>
                  <Hit_hsps>
                    <Hsp>
                      <Hsp_identity>16a</Hsp_identity>
                    </Hsp>
                    <Hsp>
                      <Hsp_identity>16b</Hsp_identity>
                    </Hsp>
                  </Hit_hsps>
                </Hit>
              </Iteration_hits>
            </Iteration>
          </BlastOutput_iterations>
        </BlastOutput>
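
One possible way to cope with data shaped like this is to normalise every level that may repeat before looping over it. The following is a minimal sketch, not the code from this thread: as_list is a hypothetical helper, and $hits is a hand-built stand-in (an assumption) for what the default XML::Simple folding would produce from the test data, so the snippet runs without any module or input file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a list whether the value is missing,
# a single item, or an array reference of items.
sub as_list {
    my ($value) = @_;
    return ()        unless defined $value;
    return @{$value} if ref $value eq 'ARRAY';
    return ($value);
}

# Stand-in for the parsed test data (assumption); in real use this would
# be the ...->{Iteration_hits}->{Hit} value taken from XMLin's result.
my $hits = [
    {
        Hit_num  => 1,
        Hit_def  => 'Uncultured Sulfuricurvum sp. RIFRC-1, complete genome',
        Hit_hsps => { Hsp => { Hsp_identity => '16' } },
    },
    {
        Hit_num  => 2,
        Hit_def  => 'Neosartorya fischeri NRRL 181 conserved hypothetical protein (NFIA_106270) partial mRNA',
        Hit_hsps => [
            { Hsp => { Hsp_identity => '16' } },
            { Hsp => [ { Hsp_identity => '16a' }, { Hsp_identity => '16b' } ] },
        ],
    },
];

# Emit one row per <Hsp>, whatever mix of single and repeated
# elements each <Hit> contains.
my @rows;
for my $hit ( as_list($hits) ) {
    for my $hsps ( as_list( $hit->{Hit_hsps} ) ) {
        for my $hsp ( as_list( $hsps->{Hsp} ) ) {
            push @rows, join '|', $hit->{Hit_def}, $hit->{Hit_num}, $hsp->{Hsp_identity};
        }
    }
}
print "$_\n" for @rows;
```

XML::Simple also documents a ForceArray option, which can be given a list of element names (here that would be Hit, Hit_hsps and Hsp) so that they are always returned as array references; with that set, the helper should be unnecessary and plain nested loops would do. A file with more than one <Iteration> would need the same treatment at that level.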
        Update: Try this
        poj