TienLung's question in Arrays and Hashes woes, about a way to modify some XML, got me thinking. So did my response, which used XML::Rules to simplify the data structure on reading and then some ugly code to desimplify it so that it could be converted back to XML in the same format. Is there a nicer way to produce XML in a desired format (or schema, if you will) from a data structure that contains all the data you need, but is not structured exactly as the XML, doesn't contain all the tag names, doesn't distinguish between attributes and tag content, etc.?

What would you do if you wanted to transform

[ { 'lname' => 'Krynicky', 'fname' => 'Jenda' }, { 'Site' => 'PerlMonks', 'Nick' => 'Jenda' } ];
to
<root> <page> <field> <ID>fname</ID> <value>Jenda</value> </field> <field> <ID>lname</ID> <value>Krynicky</value> </field> </page> <page> <field> <ID>Site</ID> <value>PerlMonks</value> </field> <field> <ID>Nick</ID> <value>Jenda</value> </field> </page> </root>
? You can assume the order of the <field> tags and the child tags of <field> doesn't matter.

In Re: Arrays and Hashes woes I used two nested map()s with some anonymous arrays and hashes to convert the data structure to one that would end up looking like this if printed by XML::Rules, but it's ... not light reading. What would you use?
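
For comparison, here is a minimal sketch of the nested-map() idea that skips the intermediate structure and builds the XML string directly. It is not the code from the other node; the names xml_escape and aoh_to_xml are made up for illustration, and the keys are sorted only to make the output reproducible (the order of the <field> tags doesn't matter).

```perl
use strict;
use warnings;

# Minimal escaping for text content (assumption: no attributes are emitted).
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: array of hashes -> <root><page><field>...</field></page></root>
sub aoh_to_xml {
    my ($pages) = @_;
    my $body = join '', map {
        my $page = $_;                      # outer map: one <page> per hash
        my $fields = join '', map {
            my $key = $_;                   # inner map: one <field> per key
            '<field><ID>' . xml_escape($key) . '</ID><value>'
                . xml_escape($page->{$key}) . '</value></field>';
        } sort keys %$page;
        "<page>$fields</page>";
    } @$pages;
    return "<root>$body</root>";
}

my $data = [
    { 'lname' => 'Krynicky',  'fname' => 'Jenda' },
    { 'Site'  => 'PerlMonks', 'Nick'  => 'Jenda' },
];
print aoh_to_xml($data), "\n";
```

It works for this one shape, but every new output format means rewriting the maps, which is what made me think about templates.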

I was considering a (yes, yet another) template engine (my second actually, the first one was for RTF) geared at this kind of conversion, without the need for complex XPath expressions etc. Here are some examples of what I think it might look like:

Assuming this data structure:

[
  { PageId => 1, Name => 'Civil name', 'lname' => 'Krynicky', 'fname' => 'Jenda' },
  { PageId => 2, Name => 'Online identity', 'Site' => 'PerlMonks', 'Nick' => 'Jenda' }
];
Each template is followed by its result (reformatted to better fit the table).

Template:
<root xmlns:cmd="http://jenda.krynicky.cz/XML/Rules/Template/Cmd"
      xmlns:set="http://jenda.krynicky.cz/XML/Rules/Template/Set"
      xmlns:opt="http://jenda.krynicky.cz/XML/Rules/Template/Opt">
  <page cmd:foreach>
    <field cmd:foreachkey>
      <ID set:_content="$key"></ID>
      <value set:_content="$value"></value>
    </field>
  </page>
</root>
Result:
<root>
  <page>
    <field><ID>PageId</ID><value>1</value></field>
    <field><ID>Name</ID><value>Civil name</value></field>
    <field><ID>lname</ID><value>Krynicky</value></field>
    <field><ID>fname</ID><value>Jenda</value></field>
  </page>
  <page>
    <field><ID>PageId</ID><value>2</value></field>
    <field><ID>Name</ID><value>Online identity</value></field>
    <field><ID>Site</ID><value>PerlMonks</value></field>
    <field><ID>Nick</ID><value>Jenda</value></field>
  </page>
</root>

Template:
<root xmlns:...>
  <page cmd:foreach set:id="$_->{PageId}">
    <field cmd:foreachkey="! /^PageId$/">
      <ID set:_content="$key"></ID>
      <value set:_content="$value"></value>
    </field>
  </page>
</root>
Result:
<root>
  <page id="1">
    <field><ID>Name</ID><value>Civil name</value></field>
    <field><ID>lname</ID><value>Krynicky</value></field>
    <field><ID>fname</ID><value>Jenda</value></field>
  </page>
  <page id="2">
    <field><ID>Name</ID><value>Online identity</value></field>
    <field><ID>Site</ID><value>PerlMonks</value></field>
    <field><ID>Nick</ID><value>Jenda</value></field>
  </page>
</root>

Template:
<root>
  <page cmd:foreach set:id="$_->{PageId}">
    <field cmd:foreachkey="! /^PageId$/">
      <ID set:_content="$key"></ID>
      <value set:_content="$value" opt:optional></value>
      <!-- this means that the tag is skipped if the $value is undef or empty string -->
    </field>
  </page>
</root>
Result:
Same as above in this case.

Template:
<root>
  <page cmd:foreach set:id="$_->{PageId}" opt:required>
    <field cmd:foreachkey="! /^PageId$/">
      <ID set:_content="$key"></ID>
      <value set:_content="$value" opt:if="defined $value"></value>
      <!-- this means that the tag is skipped only if the value is undef -->
    </field>
  </page>
</root>
Result:
Same as above in this case.

Template:
<root>
  <page cmd:foreach set:id="$_->{PageId}" opt:required>
    <!-- the tag would be required even if the datastructure was empty -->
    <field cmd:foreachkey="! /^PageId$/">
      <cmd:insert set:_tag="$key" set:_content="$value"/>
    </field>
  </page>
</root>
Result:
<root>
  <page id="1">
    <Name>Civil name</Name>
    <lname>Krynicky</lname>
    <fname>Jenda</fname>
  </page>
  <page id="2">
    <Name>Online identity</Name>
    <Site>PerlMonks</Site>
    <Nick>Jenda</Nick>
  </page>
</root>

Template:
<root>
  <page cmd:foreach set:id="$_->{PageId}">
    <name cmd:forkey="Name" set:_content="$_" opt:required/>
    <field cmd:foreachkey="! /^(PageId|Name)$/">
      <ID set:_content="$key"></ID>
      <value set:_content="$value"></value>
    </field>
  </page>
</root>
Result:
<root>
  <page id="1">
    <Name>Civil name</Name>
    <field><ID>lname</ID><value>Krynicky</value></field>
    <field><ID>fname</ID><value>Jenda</value></field>
  </page>
  <page id="2">
    <Name>Online identity</Name>
    <field><ID>Site</ID><value>PerlMonks</value></field>
    <field><ID>Nick</ID><value>Jenda</value></field>
  </page>
</root>

Here's a quick and dirty explanation of the commands and options:

cmd:
(attributes)
foreach = if the $_ is an array ref, repeat that tag and for the children set $_ to the elements of the array
foreachkey = if the $_ is a hash ref, repeat that tag and for the children set $key to the keys of the hash and $value to the values
  - if the attribute has any value, then it's used as a condition and all elements that do not match the condition are skipped
forkey = if the $_ is a hash ref, evaluate the tag for all keys returned by the code in this attribute and set the $_ to the values
  if the $_ is an array ref, evaluate the tag for all elements with IDs returned by the code, set $_ to the elements
forkeys = if the $_ is a hash ref, evaluate the tag for all keys returned by the code in this attribute and set the $key to the keys and $value to the values

(tags)
insert = evaluate as if it was any other tag, but then either print the tag with the name specified by set:_tag or only the content and/or children

set:
(attributes)
_content : evaluate the expression and put the result into content; if the tag has children, then the content is output first
<anything else> : evaluate the expression and set that attribute
  in a cmd:foreach or cmd:foreachkey marked tag evaluate the values for the attributes with $_/$key+$value set as for the child tags

opt:
required - used for tags with cmd:foreach or cmd:foreachkey, the tag will be printed even if there are no (matching) elements
optional - the tag is printed only if at least one set:attribute or set:_content produced some text
if - the tag and its children are printed only if the condition holds
unless - the tag and its children are printed only if the condition doesn't hold
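
To make the intended semantics a bit more concrete, here is a rough sketch of what cmd:foreach and cmd:foreachkey (with a condition) would boil down to in plain Perl. Nothing here is an existing API: foreach_elem and foreachkey are hypothetical stand-ins, the condition is a code ref instead of an attribute string, values are not XML-escaped, and the keys are sorted only so the output is reproducible.

```perl
use strict;
use warnings;

# Hypothetical stand-in for cmd:foreach: run $body once per array element.
sub foreach_elem {
    my ($array, $body) = @_;
    return '' unless ref $array eq 'ARRAY';
    return join '', map { $body->($_) } @$array;
}

# Hypothetical stand-in for cmd:foreachkey: run $body once per key/value pair,
# skipping the keys that do not match the optional condition.
sub foreachkey {
    my ($hash, $cond, $body) = @_;
    return '' unless ref $hash eq 'HASH';
    my @keys = sort keys %$hash;
    @keys = grep { $cond->($_) } @keys if $cond;
    return join '', map { $body->($_, $hash->{$_}) } @keys;
}

my $data = [
    { PageId => 1, Name => 'Civil name',      lname => 'Krynicky',  fname => 'Jenda' },
    { PageId => 2, Name => 'Online identity', Site  => 'PerlMonks', Nick  => 'Jenda' },
];

# Roughly: <page cmd:foreach set:id="$_->{PageId}">
#            <field cmd:foreachkey="! /^PageId$/"> ... </field>
my $xml = '<root>' . foreach_elem($data, sub {
    my ($page) = @_;
    '<page id="' . $page->{PageId} . '">'
        . foreachkey($page, sub { $_[0] !~ /^PageId$/ }, sub {
              my ($key, $value) = @_;
              "<field><ID>$key</ID><value>$value</value></field>";
          })
        . '</page>';
}) . '</root>';

print "$xml\n";
```

The template would let you write the same thing as the XML you want to get, instead of as nested closures.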

Did I go crazy? Or does it remind you of something that's already implemented? Or do you actually think this could be worth implementing? And have some suggestions for additional features ...

Update on Mar 18, 2009 at 02:50 GMT-2: I would like to have a more general solution than just for an AoH as presented in the example. I'd like to support any level of any combination of arrays and hashes, even irregular ones: arrays of hashes that contain some keys whose values are scalars, others pointing to arrayrefs, yet others to hashrefs, ... Assuming the structure is consistent and well known, but possibly complex.
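
Just to show the baseline: walking an arbitrary mix of hashes and arrays is the easy part. The sketch below is a naive recursive dump (to_xml and xml_escape are made-up names, not part of any module), where hash keys become tag names and array elements repeat the enclosing tag. What it can't do is exactly what I'm after: choose the tag names, the attributes and the layout of the result.

```perl
use strict;
use warnings;

# Minimal escaping for text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical naive converter: hash keys become child tags,
# array elements repeat the tag, undef becomes an empty tag.
# Note that nested arrays (AoA) are simply flattened by this approach.
sub to_xml {
    my ($tag, $data) = @_;
    if (ref $data eq 'HASH') {
        my $children = join '', map {
            my $key = $_;
            to_xml($key, $data->{$key});
        } sort keys %$data;
        return "<$tag>$children</$tag>";
    }
    if (ref $data eq 'ARRAY') {
        return join '', map { to_xml($tag, $_) } @$data;
    }
    return "<$tag/>" unless defined $data;
    return "<$tag>" . xml_escape($data) . "</$tag>";
}

print to_xml('root', { page => [ { id => 1, tags => [ 'a', 'b' ] }, { id => 2 } ] }), "\n";
```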

Re: Datastructures to XML
by ELISHEVA (Prior) on Mar 17, 2009 at 18:17 UTC

    Perhaps it is just me, but this seems quite complicated when the goal is to simplify generating XML for record sets. Many people find looping constructs (foreach, etc) hard enough when they are close to the actual data. Abstracting them via a template I feel would only add to the confusion.

    I'm also not so crazy about all of the options to control filtering of hash keys. The problem with templates like this is that you have to anticipate all of the possible key selection behavior. A routine that took a simple filter function, rather than all of those template options, would give you all of the flexibility of grep and let you reuse what you already know about Perl to do the filtering.

    For most use cases there are only a handful of structurally distinct ways (about 6) to convert an array of hashes into a set of XML records. Beyond that most of the variation comes either from tag names, or the need to filter out certain hash members. On that basis, I think you could get away with something much more simple than templates.

    I personally would prefer a routine that I could call like this:

    my $sXML = genRecord($aData);        # default options
    # or
    print genRecord($aData, $hOptions);

    where options let me set something to filter keys, change tag and attribute names, and decide whether field values should be attribute values, or element text. It would be much easier to use (and explain in SOPW replies).

    The following module definition (265 lines of POD and examples, 102 lines of code) would accomplish all that:

    The other advantage of a functional approach like the one above is that it is quite easy to enhance to handle things like complex objects and records nested within records. One would only need to add an entry in the option hash that stored a hash that itself contained option hashes keyed by field name. When a field name was found in that hash, the value would be printed out by calling genRecord(...) recursively.
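
    The recursion described in the paragraph above could look roughly like the sketch that follows. This is not the module mentioned earlier, only an illustration under stated assumptions: the option keys record_tag, filter and nested are invented for this example, values are not XML-escaped, and fields are sorted to keep the output reproducible.

```perl
use strict;
use warnings;

# Hypothetical sketch of a recursive genRecord($aData, $hOptions).
# Assumed option keys (invented here): record_tag, filter, nested.
sub genRecord {
    my ($aData, $hOptions) = @_;
    $hOptions ||= {};
    my $tag    = $hOptions->{record_tag} || 'record';
    my $filter = $hOptions->{filter}     || sub { 1 };
    my $nested = $hOptions->{nested}     || {};   # field name => option hash

    my $xml = '';
    for my $hRecord (@$aData) {
        my @fields = grep { $filter->($_) } sort keys %$hRecord;
        $xml .= "<$tag>";
        for my $field (@fields) {
            my $value = $hRecord->{$field};
            if (my $hSub = $nested->{$field}) {
                # The field has its own option hash: print it as nested record(s).
                my $aSub = ref $value eq 'ARRAY' ? $value : [$value];
                $xml .= genRecord($aSub, { record_tag => $field, %$hSub });
            }
            else {
                $xml .= "<$field>$value</$field>";
            }
        }
        $xml .= "</$tag>";
    }
    return $xml;
}

my $aData = [
    { id => 1, fname => 'Jenda', Details => { Header => 'blah blah' } },
];
print genRecord($aData, { nested => { Details => {} } }), "\n";
```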

    I've also attached a small demo script outputting the data for the specific example that you give:

    Best, beth

      Who said anything about records? Who said the datastructure is gonna be AoH? I would like to support any kind of datastructure. And be able to produce even more complex (read crazy) XML. I think you'd find this style getting quickly out of hand as soon as you attempted to support AoA,HoA,HoH,HoHoH,HoAoH, ... or as soon as you needed to produce things like

      <record id="1"> <foo>Hello</foo> <bar>World</bar> </record>
      OK, so you add a way to specify which "field" is gonna become an attribute of the record tag. Then you find out you sometimes need more. That sometimes the name of the field doesn't match the name of the attribute. ...

      For the simpler task of converting AoH to (more or less) record based XML your solution is probably simpler. Whether it's easier to use, I'm not so sure. The templates, as I see them, let the user specify what the result should look like and then mark what is to be repeated for the A and what for the H, where to put the key and where the value from the hash, specify the static tags or data, etc.

      Thanks for the comment anyway of course, I actually think your module might be a nice little addition to CPAN. Or maybe it could be added to XML::Records or XML::RAX. As a means to go the other direction than what the modules were originally made for.

        Who said anything about records? Who said the datastructure is gonna be AoH?

        Well, your question began: how would you convert this data, and this data was AoH. And you also stated that your inspiration for the meditation was a SOPW that was also about an AoH. Conceptually, many people think of an AoH as an array of records, hence the terminology "record".

        I would like to support any kind of datastructure.

        That is an excellent goal, but it sits in tension with the goal of "making it easy" - as you observe about writing documentation when you responded to zentara. Much of the appeal of templating solutions comes from the fact that it is sort of WYSIWYG - there is less guess work in what the output will be. However, this benefit usually lessens when templating languages try to add more and more features to handle various edge cases. As extra syntax accumulates the template begins to look less and less like the actual output.

        I think you'd find this style getting quickly out of hand as soon as you attempted to support AoA,HoA,HoH,HoHoH,HoAoH, ... or as soon as you needed to produce things like...

        If you want to handle more complex data structures, it can be done without option keys multiplying like rabbits or creating hundreds of format constants. The key is to understand the meta structure of the problem. As with the templating approach, you need to do two things:

        • Provide a way to handle H, A, AoA, and AoH
        • Provide a way to handle array elements and hash values that are non-scalar: references to H, A, AoA, AoH, and blessed objects

        With those two ingredients you can handle data structures of any complexity - whether you are using a functional approach or a templating approach. Handling fields that have non-scalar values was briefly discussed (though perhaps not very clearly) in my note above in the paragraph about adding support for recursion and an option hash key that stored a value representation rule: a hash reference storing option hashes keyed by field name or regex. This is really little more than a change in representation from the templating approach: the regex that appears in your template becomes the hash key; the written out example XML becomes the option hash assigned to the regex.

        As for H, A, AoH, and AoA: AoH is already handled. H is equivalent to AoH with one element, so modifying genRecord(...) to work with H rather than AoH is trivial. A is also equivalent to AoA with one element. So that leaves AoA and objects. To support AoA, you would need to deal with two scenarios: (a) indexes get mapped to names (b) each array element gets mapped to a nested element that differs only in the value assigned to it. (a) can be handled by modifying the filter function to return the field name rather than just a boolean value. Alternatively one could add an option hash key "indexToName" that has an array reference listing the field names in order. (b) can be handled by adding support for two additional format constants: NO_NAME_ATTR_VAL, NO_NAME_TEXT_VAL.

        Blessed objects raise other issues: (a) is the object opaque or can you just extract the data associated with the underlying blessed reference? (b) if the object is opaque, one needs to identify which methods should be used as getter methods. Whether one adopts a functional or templating approach one will still need to find a way to provide the same information: (opaque or not/which methods are getters).

        OK, so you add a way to specify which "field" is gonna become an attribute of the record tag. Then you find out you sometimes need more. That sometimes the name of the field doesn't match the name of the attribute. ...

        Some people would handle things that way, but I wouldn't. The functional approach I described actually does allow one to put "fields" in as attributes already as well as a number of other XML syntax variations. Check the format choices for details. What it didn't allow you to do is rename "fields" or pick and choose which fields are record attributes and which are nested elements.

        However, providing support would require (for the user) no more than a minor modification to the filter function. Instead of returning a simple boolean, the filter function would return:

        • undef if the field should be skipped
        • the name of the field if the field should be included in the record. If only a name is returned the field will be a record attribute or nested element according to the option hash passed into genRecord(...)
        • a reference to an array or hash containing two bits of data: the field name and an option hash that overrides the one passed into genRecord(...) but only for that particular field.
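
        As a rough illustration of the three return values listed above, a caller could normalise them as sketched here. All names are assumptions made for this example: resolve_field is a hypothetical helper, the hash form is assumed to use the keys name and options, and the as option is invented.

```perl
use strict;
use warnings;

# Hypothetical helper: call the filter for one field and normalise the result
# to (name, option hash), or to an empty list if the field is to be skipped.
sub resolve_field {
    my ($filter, $field, $hDefaults) = @_;
    my $result = $filter->($field);
    return unless defined $result;                    # undef: skip the field
    return ($result, $hDefaults) unless ref $result;  # plain name: default options
    my ($name, $hOverride) = ref $result eq 'ARRAY'
        ? @$result
        : @{$result}{qw(name options)};               # assumed hash keys
    my %options = (%$hDefaults, %{ $hOverride || {} });
    return ($name, \%options);
}

# Example filter: drop Details, turn PageId into an "id" attribute, keep the rest.
my $filter = sub {
    my ($field) = @_;
    return undef if $field eq 'Details';
    return [ 'id', { as => 'attribute' } ] if $field eq 'PageId';
    return $field;
};

for my $field (qw(PageId Details fname)) {
    my ($name, $hOpts) = resolve_field($filter, $field, { as => 'element' });
    print defined $name ? "$field -> $name ($hOpts->{as})\n" : "$field -> skipped\n";
}
```

        Since the filter is ordinary code, it can also be a closure over a lookup table or anything else that a regex in a template attribute cannot express.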

        I would consider a souped up filter function a better choice than the regex being used in the template spec because the logic involved in the choosing field names and field placement (attribute/nested element) may not be reducible to a regex.

        For the simpler task of converting AoH to (more or less) record based XML your solution is probably simpler. Whether easier to use I'm not so sure

        Each person has their own style and you (and probably many others) may simply prefer templating. I find templating approaches more limiting and "less easy", mainly because (a) even when there are obvious defaults, one still needs to spell out everything in a template - one could say it lacks Huffman encoding. (b) when I really do need to do fancy things like treat some fields as record attributes and some as nested elements (or rename fields) my logic may not be reducible to a regex. A filter function gives me the full power of Perl, including closures. (c) the implementation is more reusable. I can always layer a template language and parser over the functional approach. But I can also experiment with other ease-of-use interfaces.

        Best, beth

        Update: moved discussion of WYSIWYG and templating to start of post.

Re: Datastructures to XML
by zentara (Cardinal) on Mar 17, 2009 at 16:46 UTC
    Wow, been looking how to approach that for a while. I somehow knew it would take someone like Jenda to figure it out. Nested references in the hash values seem troublesome.

    I'm not really a human, but I play one on earth My Petition to the Great Cosmic Conciousness

      I guess I can take this as a suggestion that it would be worth it to implement this, right? :-)

      You mean if the data look like this:

      [ { 'lname' => 'Krynicky', 'fname' => 'Jenda', 'PageId' => 1, 'Details' => { Header => 'blah blah', Footer => 'bla bla bla', } }, ... ];
      ? You could use a template like this for example:
      <root>
        <page cmd:foreach set:id="$_->{PageId}" opt:required>
          <field cmd:foreachkey="! /^(PageId|Details)$/">
            <cmd:insert set:_tag="$key" set:_content="$value"/>
          </field>
          <details cmd:forkey="Details">
            <cmd:insert cmd:foreachkey set:_tag="$key" set:_content="$value"/>
          </details>
        </page>
      </root>

      That is you make sure the foreachkey in the <field> tag skips both the PageId and Details keys, include the <details> tag to present the values of the 'Details' key and finally foreachkey directly in the <cmd:insert> tag to print the data from the inner hash.

      Is this what you meant? I assume writing the documentation for this will be quite hard. Just like the docs for XML::Rules were. And I don't think I did a perfect job there.

        Great, Ok, Mr. SmartyPants, now make it so those arrays can contain other hashrefs and arrayrefs, which themselves can contain any combo of references.

        I was just thinking.... is there a "recursive glob"? Like a glob to wildcard all subdirs (infinite depth)...but I think I will post a new node for something like that. :-)


Re: Datastructures to XML
by locked_user sundialsvc4 (Abbot) on Mar 18, 2009 at 15:55 UTC

    To my way of thinking, “references” are not a concept that an external data-structure really knows about.

    To see a shining example of what does work, we need look no further than the most-common external data structure of all:   an SQL table database.

    A database, of course, consists of a collection of tables. But the tables do not contain “pointers to” one another... as their hierarchical pointer-based ancestors did. The overall data relationship is described using a set of distinct tables, and where the information in one table is semantically related to the information in another, the relationship is expressed in a commonality of key-values.

    You can quite-easily do the same thing here. Store your information, not as one XML-tree, but several. Attach to each node a “primary key” of some kind... it could be a Data::GUID or some derivative thereof. Then, when one tree needs to refer to another, it does so by mentioning its “foreign key.”

    The software that works with these trees could, if necessary, use a Perl hash to refer to them ... or, if the amount of data is known to be manageable, it could just use XSL queries. You can construct references in your Perl nodes, as long as you take care to make sure that they are all appropriately weak or strong. In either case, the entire “references problem” simply goes away with respect to your external data representation.

    If your data representation is XML, then it is highly-desirable to take the time to make your XML well-formed. You gain a lot of benefits by describing a formal schema and sticking to it. The biggest of these benefits is that both the incoming and the outgoing data are “known good.” If the incoming data validates against its schema, you can rely upon that validation. If you are building a data structure that's supposed to conform to a schema, exceptions can be thrown if it doesn't ... ergo, if exceptions do not occur, the code that builds the data-structure must indeed be building a conformant structure. (Anytime you are looking for bugs in complex code such as this, it helps immensely to know where the bugs are (probably...) not.)

      Re the database example: I would consider this an implementation detail. It doesn't really make a difference whether you know the address of another object (in the general sense) in memory or its ID within some collection. Or (in case of N-N relations) a list of pairs of IDs of objects.

      If I could store the data in any format I wished, it would be much easier, but sometimes I do not control the format. And sometimes even if I do it doesn't map to the data structure that best suits the needs of the task at hand directly.

      I think I should have explained better what I am really after. I'd like to have a "reverse to XML::Rules". That is, with XML::Rules I can tweak the tree structure of the data from an XML so that I can work more easily with the resulting structure, where the original structure may be designed with a different task in mind or just be more general. Then I would like to have some reasonably simple way to "convert" the data structure back to the original format. Or for that matter to a different format, but quite often one that was not designed for this particular task.

        It may be an implementation detail, but it is a very important one that can have a significant impact on a general purpose "data structure to XML" converter.

        Sometimes data structures constructed in memory use pointers in place of ids. When this is converted to a persistent form (XML or otherwise), one must create some sort of id that corresponds to the pointer or reference. One will also need to decide on a name for the tag or attribute that holds the generated id since there is no corresponding array or hash element to "foreach". Otherwise there will be information loss.

        In some cases one can just assign sequential ids. Other formats might require GUID generation. Others might want a registered URI. Still other XML formats require that the id match something in a database or flat file. To get the right id one might need to do a look up on a "soft" id - for example a person's first and last name or their social security or passport number. Or one might need to add a new record to the database and capture the id assigned by the database.
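
        For the simplest of these cases, sequential ids, a sketch might look like the code here. It uses refaddr from Scalar::Util (a core module) to recognise the same reference when it is seen again; make_id_allocator, the sample data and the attribute names addressRef and id are all invented for the illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# Returns a closure that hands out sequential ids, one per distinct reference.
sub make_id_allocator {
    my %id_for;
    my $next = 1;
    return sub {
        my ($ref) = @_;
        return $id_for{ refaddr($ref) } ||= $next++;
    };
}

my $id_of   = make_id_allocator();
my $address = { city => 'Prague' };          # made-up shared object
my @people  = (
    { name => 'A', address => $address },
    { name => 'B', address => $address },    # same reference, same id
);

# The reference becomes a generated foreign key ...
for my $person (@people) {
    printf qq{<person name="%s" addressRef="%d"/>\n},
        $person->{name}, $id_of->($person->{address});
}
# ... and the shared object is written once, under that id.
printf qq{<address id="%d"><city>%s</city></address>\n},
    $id_of->($address), $address->{city};
```

        Note that the ids are only stable while the referenced objects stay alive, since an address can be reused after an object is freed.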

        A second issue that I think sundialsvc4 was getting at was placement of XML elements. Both your template spec (and my functional alternative) assume a part-container model: elements nested within elements.

        But sundialsvc4 is reminding us of an extremely important and common alternative: the relational model. In the relational model, big ugly objects aren't nested. They are replaced by foreign key fields. The XML for the big-ugly-object is defined elsewhere, perhaps even in a different file. The two may be connected either by matching field values (a la a relational DBMS) or by "references" - the value assigned to the id attribute of the big-ugly-object-in-another-file.

        Because part-container models are easier to conceptualize, XML schemas often start life using a part-container model and then migrate over time to one that supports more of a relational model (less duplication of big-ugly-objects). For a readily available open source example, study the history of the XML format used with the ant build tool. Incidentally, the history of DBMS implementation also follows this progression (anybody remember CISC-ISAM databases?)

        Any general purpose tool would be wise to support both (or clearly explain its limits in the CAVEATS section of its POD). Otherwise a company using Mondo::Wonderous::Data::XML might find that they have to throw out, rather than modify, their XML generation code as their XML schemas mature.

        Best, beth

        It is not, strictly speaking, “an implementation detail.” When you are dealing with external data collections (be they SQL tables, or XML files or whatever), the notion of “addresses” (hence: references) does not exist. The notion of “keys,” of whatever format you wish, does.

        If you have ever had the unfortunate experience of dealing with an IMAGE or an IDMS database in any past-life you'd much rather forget, then you will know exactly what I am talking about...   :-D

        (You do not, of course, have to answer that. Many I.S. memories are much better left buried in the past.)