XML data structures and XML::Simple

matth has asked for the wisdom of the Perl Monks concerning the following question:

Re: XML data structures and XML::Simple by grantm (Parson) on Dec 19, 2002 at 01:32 UTC
The key to all of this is the output from Data::Dumper. If you don't understand what it's telling you, then you will not be able to progress, regardless of anything we tell you. I recommend you read the perlreftut document, which is a tutorial on Perl references. When you've read that, the following should make sense...

The output of Dumper looks like this:

`$VAR1 = { 'gene' => { '1' => { 'gene_seq' => { 'startpos' => '5999', 'id' => '1' }, 'label' => 'gene_of_interest' }, '2' => { 'gene_seq' => { 'startpos' => '96819', 'id' => '2' }, 'label' => 'Another_gene_of_interest' } } };`

Now the '$VAR1' bit might be slightly confusing, but when you pass a reference (the value of $xml in your case) to Dumper(), the Dumper function just gets the reference; it doesn't get the variable name, so it makes one up - in this case '$VAR1'. The output that follows is what is in the variable that you know as $xml, but which Dumper() is calling $VAR1.

The next thing you see is that $VAR1 is equal to something in curly braces: $VAR1 = { ... }; That means that $VAR1 is a reference to a hash. If it had been a reference to an array, then we would have seen $VAR1 = [ ... ].

The next thing you can see is that the hash has only one key: 'gene'. Therefore if we want to get anything out of the hash we need to start with:

`$xml->{gene}`

The next thing we see is that the value associated with the 'gene' key is itself a hash (it starts with curly braces). That hash has exactly two keys: '1' and '2', so to get anything out of it we need to do one of these:

`$xml->{gene}->{1}` or `$xml->{gene}->{2}`

The values associated with each of these keys are both hashes as well (remember the curly brackets). Each of the hashes has two keys: 'gene_seq' and 'label'.
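The levels described so far can be checked mechanically rather than by eye. The following is only a sketch: the `$xml` hash is typed in by hand from the Dumper output above as a stand-in for whatever XMLin() actually returned, so XML::Simple itself is not needed to run it.

```perl
use strict;
use warnings;

# Hand-copied from the Dumper output quoted above; a stand-in for the
# value XMLin() returned, not the result of actually parsing anything.
my $xml = {
    gene => {
        1 => {
            gene_seq => { startpos => '5999', id => '1' },
            label    => 'gene_of_interest',
        },
        2 => {
            gene_seq => { startpos => '96819', id => '2' },
            label    => 'Another_gene_of_interest',
        },
    },
};

# ref() reports what kind of thing a reference points at.
my $top_type = ref $xml;

# keys() wants a real hash, so each level is dereferenced with %{ ... }.
my @top_keys  = sort keys %{$xml};
my @gene_ids  = sort keys %{ $xml->{gene} };
my @gene_keys = sort keys %{ $xml->{gene}->{1} };

print "top level is a $top_type with keys: @top_keys\n";
print "gene ids: @gene_ids\n";
print "keys under gene 1: @gene_keys\n";
```

Listing the keys one level at a time like this is the same walk as reading the braces in the Dumper output, just done by Perl.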
So to go further, we will be using one of these forms:

`$xml->{gene}->{1}->{gene_seq}`
`$xml->{gene}->{1}->{label}`
`$xml->{gene}->{2}->{gene_seq}`
`$xml->{gene}->{2}->{label}`

The value associated with each of the 'label' keys is a simple string value (it's in single quotes rather than curly braces or square brackets). So you could just print it out. The value associated with each of the 'gene_seq' keys is another hash (curly brackets) and each hash has two keys: 'startpos' and 'id'. So to access a particular piece of data you'll need one of these forms:

`$xml->{gene}->{1}->{gene_seq}->{startpos}`
`$xml->{gene}->{1}->{gene_seq}->{id}`
`$xml->{gene}->{2}->{gene_seq}->{startpos}`
`$xml->{gene}->{2}->{gene_seq}->{id}`

Note that the '->' between adjacent subscripts is optional (the first one, straight after $xml, is not), so `$xml->{gene}{1}{gene_seq}{startpos}` means the same thing.

I can't be sure what you were trying to do in your code, since you have two different types of 'id' (one in the <gene> element and one in the <gene_seq>) but there's only one $id in your code - I can't see how you ever expected that to work.

Having said all of that, I am quite certain that the data structure we've just examined in painstaking detail is not what you want anyway. I'm not even sure that the XML you gave us looks anything like what you need, but we'll assume it does for now...

It is painfully clear that you have not read the XML::Simple documentation or the XML::Simple::FAQ. If you had, you would have seen some mention of 'array folding'.
It doesn't matter whether you know what array folding is or not, the docs are quite clear: if you don't know what it is, you should explicitly turn it off using the keyattr option like this:

`my $xml = XMLin($data, keyattr => [ ]);`

If you had read the documentation and did understand what array folding was and decided that you wanted it, you would know that you should explicitly turn it on like this:

`my $xml = XMLin($data, keyattr => { gene => 'id', gene_seq => 'id' });`

Furthermore, you would also understand that you should never use 'keyattr' without also using 'forcearray':

`my $xml = XMLin($data, keyattr => { gene => 'id', gene_seq => 'id' }, forcearray => [ 'gene', 'gene_seq' ] );`

But don't just take this code and paste it in as another question here on Perl Monks (like you have with every other code snippet people have offered you so far). I simply don't know whether that combination of options is right for you because I don't know what you're trying to do.

You know. You know how many 'genes' you want. You know how many 'gene_seq's a 'gene' might have. You know whether each 'gene' has a unique identifier. You know whether each 'gene_seq' has a unique identifier. You know what you want to do with the data once you've parsed it. We don't know and therefore we can't write the code for you.

Asking questions on Perl Monks is not a substitute for reading the documentation. If you read it and don't understand it, ask a question here and quote the bit you don't understand.
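For anyone unsure what 'array folding' amounts to, here is a rough hand-written sketch of the idea. It is not XML::Simple's own code and does not load the module. It assumes that, with folding turned off, the two <gene> elements arrive as an array of hashes that each still carry their 'id'; the `$unfolded` literal is typed in by hand on that assumption, using the values from the Dumper output above. `fold_on` is a made-up helper name.

```perl
use strict;
use warnings;

# Assumed unfolded shape: a list of gene records, each still holding its
# own 'id'. Written by hand; values copied from the Dumper output above.
my $unfolded = {
    gene => [
        {
            id       => '1',
            label    => 'gene_of_interest',
            gene_seq => { id => '1', startpos => '5999' },
        },
        {
            id       => '2',
            label    => 'Another_gene_of_interest',
            gene_seq => { id => '2', startpos => '96819' },
        },
    ],
};

# fold_on() is a made-up helper: turn an array of hashes into a hash of
# hashes keyed on one field, dropping that field from each record.
sub fold_on {
    my ($records, $key) = @_;
    my %folded;
    for my $record (@{$records}) {
        my %copy = %{$record};          # shallow copy, input left untouched
        my $id   = delete $copy{$key};
        $folded{$id} = \%copy;
    }
    return \%folded;
}

my $folded = { gene => fold_on($unfolded->{gene}, 'id') };

# Unfolded: a gene is addressed by its position in the array.
my $by_position = $unfolded->{gene}->[0]->{gene_seq}->{startpos};

# Folded: the same gene is addressed by its id.
my $by_id = $folded->{gene}->{1}->{gene_seq}->{startpos};

print "by position: $by_position, by id: $by_id\n";
```

The folded form only makes sense when every record has a unique key, which is one of the things only the person who owns the data can say. The forcearray advice is related: without it, a lone <gene> would come back as a plain hash rather than a one-element array, so the shape of the result would change with the input.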
Re: Re: XML data structures and XML::Simple by matth (Monk) on Dec 19, 2002 at 04:23 UTC
You're right. I'll look over the documentation, and your points here will make that process easier, so thanks for raising them. The code I am working with is working nicely now.
Re: XML data structures and XML::Simple by dempa (Friar) on Dec 19, 2002 at 00:10 UTC
First, there are some issues with that code. You declare $xml twice (using 'my'). In this particular example you can comment out the first declaration, since you don't use the OO version of XMLin anyway.

The answer to your question is in the output from Data::Dumper:

`print $xml->{gene}{$id}{gene_seq}{'startpos'}, "\n";`

If you wanted the keeproot option you could supply it directly to XMLin like this:

`my $xml = XMLin($data, keeproot => 1);`

But that would require the print statement to change to:

`print $xml->{many_genes}{gene}{$id}{gene_seq}{'startpos'}, "\n";`

-- dempa
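A small sketch of the difference between the two access paths, with hand-written hashes standing in for what XMLin would return (XML::Simple is not loaded here). The data is copied from the Dumper output quoted elsewhere in the thread, and wrapping it in `many_genes` follows the root element name used in the print statement above; the variable names are invented for the illustration.

```perl
use strict;
use warnings;

# Gene records typed in by hand from the Dumper output in this thread.
my $genes = {
    1 => {
        gene_seq => { startpos => '5999', id => '1' },
        label    => 'gene_of_interest',
    },
    2 => {
        gene_seq => { startpos => '96819', id => '2' },
        label    => 'Another_gene_of_interest',
    },
};

# Stand-ins for the two shapes: root element discarded vs. kept.
my $xml_default  = { gene => $genes };
my $xml_keeproot = { many_genes => { gene => $genes } };

my @report;
for my $id (sort { $a <=> $b } keys %{ $xml_default->{gene} }) {
    # Same value, reached with and without the extra root level.
    my $start_default  = $xml_default->{gene}{$id}{gene_seq}{startpos};
    my $start_keeproot = $xml_keeproot->{many_genes}{gene}{$id}{gene_seq}{startpos};
    push @report, "$id: $start_default / $start_keeproot";
}

print "$_\n" for @report;
```

Looping over `keys` this way also avoids hard-coding a single $id.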