nickschurch has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks-of-perliness, I'm having a problem getting the data out of an XML file with XML::Simple.

The xml file contains information in the form:

<project by="company" name="personname">
  <pattern name="company-000001" owner="company" description="Microarray" species_database="d.base">
    <reporter name="A_24_P344666" systematic_name="NM_020341">
      <feature number="1780">
        <position x="0.733234841870825" y="10.033" units="mm" />
      </feature>
      <gene systematic_name="NM_020341" primary_name="PAK7" description="Homo sapiens p21(CDKN1A)-activated kinase 7 (PAK7), transcript variant 1, mRNA [NM_020341]">
        <accession database="ref" id="NM_020341" />
        <accession database="ref" id="NM_177990" />
        <accession database="ens" id="ENST00000378429" />
        <accession database="ens" id="ENST00000378423" />
        <other name="accessions" value="ref|NM_020341|ref|NM_177990|ens|ENST00000378429|ens|ENST00000378423" />
        <other name="chr_coord" value="chr20:9466136-9466077" />
      </gene>
    </reporter>
    <reporter ... </reporter>
    <reporter ... </reporter>
  </pattern>
</project>


I want to pick out the different names (instances of name, systematic_name and primary_name), put them into an array, and then store this array as a value in a hash, with the key being the reporter's name attribute.

If I read this into Perl with XML::Simple and don't specify KeyAttr, then when I try:  my @probekeys = keys %{$data->{pattern}->{reporter}};

it tries to use name as the hash keys and reads these into an array (which is what I want), but it falls over because name turns out not to be unique.

If I read this into Perl with XML::Simple and specify a different attribute as the key attribute using KeyAttr, then when I try:  my @probekeys = keys %{$data->{pattern}->{reporter}};

XML::Simple insists on forcing things into arrays (even if I use ForceArray => 0) and I keep getting told that pseudo-hashes are deprecated. Does anyone have any smart ideas about how I can get hold of this information without just reading the 500k-line file in one line at a time?
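
One way to stay with XML::Simple, offered as a hedged sketch rather than a definitive fix: pass KeyAttr => [] to switch off folding on name, and ForceArray => ['reporter'] so that reporter is always an array reference, even when only one is present. Duplicate names then simply accumulate under the same hash key. The XML below is cut down from the sample above; the second reporter (foo/boo/goo) and the %names_for hash are invented for illustration. XMLin builds the whole tree in memory, so this assumes the real file fits.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Cut-down sample; the second reporter is made up for illustration.
my $xml = <<'XML';
<project by="company" name="personname">
  <pattern name="company-000001" owner="company">
    <reporter name="A_24_P344666" systematic_name="NM_020341">
      <gene systematic_name="NM_020341" primary_name="PAK7" />
    </reporter>
    <reporter name="foo" systematic_name="boo">
      <gene systematic_name="boo" primary_name="goo" />
    </reporter>
  </pattern>
</project>
XML

# KeyAttr => []  : never fold a list of elements into a hash keyed on an attribute.
# ForceArray     : always return reporter as an array ref, even for a single element.
my $data = XMLin($xml, KeyAttr => [], ForceArray => ['reporter']);

# Hypothetical result hash: reporter name => list of the three names.
my %names_for;
for my $rep (@{ $data->{pattern}{reporter} }) {
    push @{ $names_for{ $rep->{name} } },
        $rep->{name}, $rep->{systematic_name}, $rep->{gene}{primary_name};
}

print "$_ => @{ $names_for{$_} }\n" for sort keys %names_for;
```

Because the loop pushes rather than assigns, two reporters sharing a name end up in one longer list instead of overwriting each other.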

Replies are listed 'Best First'.
Re: XML::Simple and pseudo hashes...
by toolic (Bishop) on Jun 19, 2009 at 17:14 UTC
    It is not completely clear to me what output you are trying to achieve (a small sample of the output you expect would make it clear), but perhaps a different approach using a different module (XML::Twig) could help you to avoid this problem. I realize this does not directly answer your XML::Simple question, but it is something you could consider.
    use strict;
    use warnings;
    use XML::Twig;
    use Data::Dumper;

    my $xmlStr = <<XML;
    <project by="company" name="personname">
      <pattern name="company-000001" owner="company" description="Microarray" species_database="d.base">
        <reporter name="A_24_P344666" systematic_name="NM_020341">
          <feature number="1780">
            <position x="0.733234841870825" y="10.033" units="mm" />
          </feature>
          <gene systematic_name="NM_020341" primary_name="PAK7" description="Homo sapiens p21(CDKN1A)-activated kinase 7 (PAK7), transcript variant 1, mRNA [NM_020341]">
            <accession database="ref" id="NM_020341" />
            <accession database="ref" id="NM_177990" />
            <accession database="ens" id="ENST00000378429" />
            <accession database="ens" id="ENST00000378423" />
            <other name="accessions" value="ref|NM_020341|ref|NM_177990|ens|ENST00000378429|ens|ENST00000378423" />
            <other name="chr_coord" value="chr20:9466136-9466077" />
          </gene>
        </reporter>
        <reporter name="foo" systematic_name="boo">
          <feature number="1780">
            <position x="0.733234841870825" y="10.033" units="mm" />
          </feature>
          <gene systematic_name="boo" primary_name="goo" description="Homo sapiens p21(CDKN1A)-activated kinase 7 (PAK7), transcript variant 1, mRNA [NM_020341]">
            <accession database="ref" id="NM_020341" />
            <accession database="ref" id="NM_177990" />
            <accession database="ens" id="ENST00000378429" />
            <accession database="ens" id="ENST00000378423" />
            <other name="accessions" value="ref|NM_020341|ref|NM_177990|ens|ENST00000378429|ens|ENST00000378423" />
            <other name="chr_coord" value="chr20:9466136-9466077" />
          </gene>
        </reporter>
      </pattern>
    </project>
    XML

    my %data;
    my $twig = XML::Twig->new(
        twig_handlers => { reporter => \&reporter }
    );
    $twig->parse($xmlStr);
    print Dumper(\%data);
    exit;

    sub reporter {
        my ($twig, $rep) = @_;
        my $name  = $rep->att('name');
        my $sname = $rep->att('systematic_name');
        my $pname = $rep->first_child('gene')->att('primary_name');
        $data{$name} = [$name, $sname, $pname];
    }

    __END__

    $VAR1 = {
              'A_24_P344666' => [
                                  'A_24_P344666',
                                  'NM_020341',
                                  'PAK7'
                                ],
              'foo' => [
                         'foo',
                         'boo',
                         'goo'
                       ]
            };
      With a bit of playing, XML::Twig has done the trick and given me what I want. I get some weird behaviour after it, though: the script keeps running and everything looks good, but then I get a segmentation fault as the script ends.

      I know it happens as the script ends because everything is fine, including the final print "... fininshed!\n"; line, before it segfaults. Weird. Still, I don't suppose it matters at that point.
Re: XML::Simple and pseudo hashes...
by Jenda (Abbot) on Jun 20, 2009 at 17:46 UTC

    If you use Data::Dumper to print the data structure, you'll probably find that it looks different than you expected. Try XML::Rules instead; it gives you more detailed control over the generated structure.
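
A minimal sketch of that kind of inspection, using made-up, simplified input where <pattern> is the root element (so reporter sits at the top level of the result). It only illustrates how the shape of what XML::Simple returns depends on the options and on how many reporter elements there are; dereferencing the array case as a hash is presumably what triggers the pseudo-hash complaint on older perls.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);
use Data::Dumper;

# Made-up input, much smaller than the real file.
my $two = '<pattern><reporter name="a" systematic_name="x" />'
        . '<reporter name="b" systematic_name="y" /></pattern>';
my $one = '<pattern><reporter name="a" systematic_name="x" /></pattern>';

my $folded   = XMLin($two);                 # default KeyAttr folds the list on name
my $unfolded = XMLin($two, KeyAttr => []);  # folding switched off
my $single   = XMLin($one);                 # one reporter, no ForceArray

# Dump one of the structures to see what is really there.
print Dumper($folded);

# ref() shows at a glance whether keys %{...} is safe to use.
printf "%-24s %s\n", 'two, default options:', ref($folded->{reporter});    # HASH
printf "%-24s %s\n", 'two, KeyAttr => []:',   ref($unfolded->{reporter});  # ARRAY
printf "%-24s %s\n", 'one, default options:', ref($single->{reporter});    # HASH
```

Note that the two HASH cases differ: the first is keyed by reporter name, while the last is keyed by attribute name (name, systematic_name), which is why forcing a consistent shape with ForceArray tends to make the walking code simpler.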

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.