http://qs1969.pair.com?node_id=211626


in reply to XML::Simple

I know this is an old thread, but it prompted this question in the chatterbox and my response is probably a bit wordy for a chatterbox reply.

In this node it is mentioned that without forcearray the values of the hash produced by XML::Simple will produce arrayrefs in some cases and scalars in other cases... it was mentioned in the node that it did not seem to be a good design decision. What motivated that decision?

I'll start (uncharacteristically) by answering the question: simplicity was the motivation.

I needed an API that made it very easy to work with common forms of XML. For my purposes, the failing of the existing APIs was complexity: complexity born from the need to provide a comprehensive solution covering all possible cases. I felt that for the common cases, a module could 'guess' what you wanted instead of forcing you to specify it in excruciating detail. Here's a little background...

One frequently asked question in the XML world is "should I store my data in attributes or nested elements?". For example, the data content of this XML...

    <person>
      <firstname>Bob</firstname>
      <surname>Smith</surname>
      <dob>18-Aug-1972</dob>
      <hobby>Fishing</hobby>
    </person>

... is equivalent to this XML:

    <person firstname="Bob" surname="Smith" dob="18-Aug-1972" hobby="Fishing" />

Some people prefer the first form and some prefer the second - there is no 'right' answer as long as we assume that there will only ever be one first name, one surname, one date of birth and one hobby. If we list multiple hobbies, then they must be represented as child elements since the rules of XML say an element cannot have two attributes with the same name. So we might end up with something like this:

    <person firstname="Bob" surname="Smith" dob="18-Aug-1972">
      <hobby>Fishing</hobby>
      <hobby>Trainspotting</hobby>
    </person>

To some people, this hybrid form is the obvious and sensible solution. To others, it is ugly and inconsistent. I don't really take a position on that argument and neither does XML::Simple. The XML::Simple API makes it just as easy to access data from nested elements as it is from attributes. It achieves this simplicity by applying simple rules to 'guess' what you want. If you understand the rules then you can provide hints (through options) to ensure the guesses always go your way.

Now to return to our examples, this code

    my $person = XMLin($filename);

will read both the first and second XML documents (above) into a structure like this:

    {
      firstname => "Bob",
      surname   => "Smith",
      dob       => "18-Aug-1972",
      hobby     => "Fishing",
    }

and the third XML document into a structure like this:

    {
      firstname => "Bob",
      surname   => "Smith",
      dob       => "18-Aug-1972",
      hobby     => [ "Fishing", "Trainspotting" ],
    }
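As a quick check that the element form and the attribute form really do come out the same, here is a minimal sketch. It assumes XML::Simple is installed, and it passes the XML as inline strings rather than reading from $filename; the variable names are purely illustrative.

```perl
use strict;
use warnings;
use XML::Simple;

# First form: the data lives in nested elements.
my $as_elements = XMLin(<<'XML');
<person>
  <firstname>Bob</firstname>
  <surname>Smith</surname>
  <dob>18-Aug-1972</dob>
  <hobby>Fishing</hobby>
</person>
XML

# Second form: the same data lives in attributes.
my $as_attributes = XMLin(
    '<person firstname="Bob" surname="Smith" dob="18-Aug-1972" hobby="Fishing" />'
);

# Print both structures side by side, one key per line.
for my $key (sort keys %$as_elements) {
    printf "%-10s %-12s %s\n", $key, $as_elements->{$key}, $as_attributes->{$key};
}
```

Both columns should match line for line, since the calling code cannot tell which form the document used.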

By default, XML::Simple always represents an element as a scalar - unless it encounters more than one of them, in which case the scalar is 'promoted' to an array. Obviously it would be a bad thing for your code to have to check whether an element was a scalar or an arrayref before processing it - so don't do that.
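To make the 'promotion' rule concrete, here is a small standalone sketch of the idea. It is not the module's actual code; add_child is a hypothetical helper written only to illustrate how a scalar turns into an arrayref when a second element of the same name shows up.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Simple): store $value under $name,
# promoting an existing scalar to an arrayref on the second occurrence.
sub add_child {
    my ($node, $name, $value) = @_;

    if (!exists $node->{$name}) {
        $node->{$name} = $value;                      # first occurrence: plain scalar
    }
    elsif (ref($node->{$name}) eq 'ARRAY') {
        push @{ $node->{$name} }, $value;             # third and later: append
    }
    else {
        $node->{$name} = [ $node->{$name}, $value ];  # second occurrence: promote
    }
    return $node;
}

my %person;
add_child(\%person, firstname => 'Bob');
add_child(\%person, hobby     => 'Fishing');
add_child(\%person, hobby     => 'Trainspotting');

# firstname is still a scalar; hobby has become an arrayref.
print "firstname: $person{firstname}\n";
print "hobbies:   @{ $person{hobby} }\n";
```

The shape of each value depends on how many times the element happened to appear, which is exactly why calling code should not rely on the unhinted default.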

One approach to achieving more consistency is to use the 'forcearray' option like this:

    my $person = XMLin($filename, forcearray => 1);

which will read the first XML document into a structure like this:

    {
      firstname => [ "Bob" ],
      surname   => [ "Smith" ],
      dob       => [ "18-Aug-1972" ],
      hobby     => [ "Fishing" ],
    }

and the third XML document into a structure like this:

    {
      firstname => "Bob",
      surname   => "Smith",
      dob       => "18-Aug-1972",
      hobby     => [ "Fishing", "Trainspotting" ],
    }

But a better alternative is to enable forcearray only for the elements that might occur multiple times (i.e. give the guessing process a hint):

    my $person = XMLin($filename, forcearray => [ 'hobby' ]);

which will consistently read any of the example forms into this type of structure regardless of whether there is only one hobby:

    {
      firstname => "Bob",
      surname   => "Smith",
      dob       => "18-Aug-1972",
      hobby     => [ "Fishing", "Trainspotting" ],
    }
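The payoff is that the calling code can treat hobby as a list without checking first. A minimal sketch, assuming XML::Simple is installed and using inline XML strings (trimmed versions of the first and third examples) in place of $filename:

```perl
use strict;
use warnings;
use XML::Simple;

# One <hobby> element: without the hint this would come back as a scalar.
my $one_hobby = XMLin(<<'XML', forcearray => [ 'hobby' ]);
<person>
  <firstname>Bob</firstname>
  <hobby>Fishing</hobby>
</person>
XML

# Two <hobby> elements: an arrayref with or without the hint.
my $two_hobbies = XMLin(<<'XML', forcearray => [ 'hobby' ]);
<person firstname="Bob">
  <hobby>Fishing</hobby>
  <hobby>Trainspotting</hobby>
</person>
XML

# The same loop handles both documents; firstname stays a plain scalar.
for my $person ($one_hobby, $two_hobbies) {
    print "$person->{firstname}: ", join(', ', @{ $person->{hobby} }), "\n";
}
```

Only the element named in the list is forced into an arrayref; everything else is still left to the guessing rules.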

Given the three possible values for the forcearray option ...

  1. 0 (always 'guess')
  2. 1 (always represent child elements as arrayrefs - even if there's only one)
  3. a list of element names (force named elements to arrayrefs, guess for all others)

... you might well ask why I chose the first option. The truth is that I don't know. The third option is clearly the best for most people, but I couldn't use it as the default since I couldn't know in advance what elements people would want to name. The fact that I chose the worse of the two remaining options hopefully means that a few more people have read the documentation and realised option three is the one they want.

The observant reader will have noted that I said I couldn't use a list of element names as a default for the 'forcearray' option, and yet that is precisely what I chose to use as the default value for the 'keyattr' option. I could quote Oscar Wilde at this point ("Consistency is the last refuge of the unimaginative") but the truth is that I didn't think people would go looking for the 'array folding' feature, so I put it somewhere they could trip over it.
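For anyone who hasn't tripped over it yet, here is a minimal sketch of what array folding does. It assumes XML::Simple is installed and relies on the documented default keyattr list (name, key, id); the sample document, including the second person, is invented for illustration.

```perl
use strict;
use warnings;
use XML::Simple;

my $xml = <<'XML';
<people>
  <person name="Bob"   hobby="Fishing" />
  <person name="Alice" hobby="Chess" />
</people>
XML

# Default keyattr: the repeated <person> elements are folded into a hash
# keyed on the value of each element's 'name' attribute.
my $folded = XMLin($xml);
print "Bob's hobby: $folded->{person}{Bob}{hobby}\n";

# An empty keyattr list turns folding off, leaving an array of hashes.
my $unfolded = XMLin($xml, keyattr => []);
print "Second person: $unfolded->{person}[1]{name}\n";
```

Passing keyattr => [] is the usual way to opt out when you want the elements kept in document order.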