Re: Datastructures to XML
by ELISHEVA (Prior) on Mar 17, 2009 at 18:17 UTC
Perhaps it is just me, but this seems quite complicated when the goal is to simplify generating XML for record sets. Many people find looping constructs (foreach, etc.) hard enough when they are close to the actual data. Abstracting them via a template would, I feel, only add to the confusion.
I'm also not so crazy about all of the options to control filtering of hash keys. The problem with templates like this is that you have to anticipate all of the possible key selection behavior. A routine that took a simple filter function, rather than all of those template options, would give you all of the flexibility of grep and let you reuse what you already know about Perl to do the filtering.
For most use cases there are only a handful of structurally distinct ways (about 6) to convert an array of hashes into a set of XML records. Beyond that, most of the variation comes either from tag names or from the need to filter out certain hash members. On that basis, I think you could get away with something much simpler than templates.
I personally would prefer a routine that I could call like this:
my $sXML = genRecord($aData); #default options
#or
print genRecord($aData, $hOptions);
where the options let me set a key filter, change tag and attribute names, and decide whether field values should become attribute values or element text. It would be much easier to use (and to explain in SoPW replies).
The following module definition (265 lines of POD and examples, 102 lines of code) would accomplish all that:
The other advantage of a functional approach like the one above is that it is quite easy to enhance to handle things like complex objects and records nested within records. One would only need to add an entry in the option hash that stored a hash which itself contained option hashes keyed by field name. When a field name was found in that hash, the value would be printed out by calling genRecord(...) recursively.
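As a rough sketch (reusing the hypothetical genRecord() interface above; the "fields" key name is invented for illustration):
my $hOptions = {
    recordTag => 'page',
    fields    => {                              # hypothetical key name
        Details => { recordTag => 'details' },  # recurse for this field
    },
};
print genRecord($aData, $hOptions);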
I've also attached a small demo script outputting the data for the specific example that you give:
Best, beth
Who said anything about records? Who said the data structure is gonna be an AoH? I would like to support any kind of data structure, and be able to produce even more complex (read: crazy) XML. I think you'd find this style quickly getting out of hand as soon as you attempted to support AoA, HoA, HoH, HoHoH, HoAoH, ... or as soon as you needed to produce things like
<record id="1">
  <foo>Hello</foo>
  <bar>World</bar>
</record>
OK, so you add a way to specify which "field" is gonna become an attribute of the record tag. Then you find out you sometimes need more: sometimes the name of the field doesn't match the name of the attribute. ...
For the simpler task of converting an AoH to (more or less) record-based XML your solution is probably simpler. Whether it is easier to use I'm not so sure. The templates, as I see them, let the user specify how he/she wants the result to look, and then mark what is to be repeated for the A and what for the H, where to put the key and where to put the value from the hash, specify the static tags or data, etc.
Thanks for the comment anyway, of course; I actually think your module might be a nice little addition to CPAN. Or maybe it could be added to XML::Records or XML::RAX, as a means to go in the other direction from what those modules were originally made for.
Who said anything about records? Who said the datastructure is gonna be AoH?
Well, your question began "how would you convert this data", and this data was an AoH. You also stated that your inspiration for the meditation was a SoPW post that was also about an AoH. Conceptually, many people think of an AoH as an array of records, hence the terminology "record".
I would like to support any kind of datastructure.
That is an excellent goal, but it sits in tension with the goal of "making it easy" - as you observed about writing documentation when you responded to zentara. Much of the appeal of templating solutions comes from the fact that they are sort of WYSIWYG - there is less guesswork about what the output will be. However, this benefit usually lessens when templating languages add more and more features to handle various edge cases. As extra syntax accumulates, the template begins to look less and less like the actual output.
I think you'd find this style getting quickly out of hand as soon as you attempted to support AoA,HoA,HoH,HoHoH,HoAoH, ... or as soon as you needed to produce things like...
If you want to handle more complex data structures, it can be done without option keys multiplying like rabbits or creating hundreds of format constants. The key is to understand the meta structure of the problem. As with the templating approach, you need to do two things:
- Provide a way to handle H, A, AoA, and AoH
- Provide a way to handle array elements and hash values that are non-scalar: references to H, A, AoA, AoH, and blessed objects
With those two ingredients you can handle data structures of any complexity - whether you are using a functional approach or a templating approach. Handling fields that have non-scalar values was briefly discussed (though perhaps not very clearly) in my note above, in the paragraph about adding support for recursion via an option hash key that stores a value representation rule: a hash reference storing option hashes keyed by field name or regex. This is really little more than a change in representation from the templating approach: the regex that appears in your template becomes the hash key; the written-out example XML becomes the option hash assigned to the regex.
As for H, A, AoH, and AoA: AoH is already handled. H is equivalent to an AoH with one element, so modifying genRecord(...) to work with H rather than AoH is trivial. A is likewise equivalent to an AoA with one element. So that leaves AoA and objects. To support AoA, you would need to deal with two scenarios: (a) indexes get mapped to names, and (b) each array element gets mapped to a nested element that differs only in the value assigned to it. (a) can be handled by modifying the filter function to return the field name rather than just a boolean value; alternatively, one could add an option hash key "indexToName" holding an array reference that lists the field names in order. (b) can be handled by adding support for two additional format constants: NO_NAME_ATTR_VAL and NO_NAME_TEXT_VAL.
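A sketch of the two AoA options, again using invented names:
# (a) the filter maps array indexes to field names:
my $hOptions = {
    filter => sub { my ($i) = @_; ('fname', 'lname', 'PageId')[$i] },
};
# ... or, alternatively, the proposed "indexToName" key:
# my $hOptions = { indexToName => [ 'fname', 'lname', 'PageId' ] };
# (b) would rely on the proposed format constants, e.g.
# my $hOptions = { format => NO_NAME_TEXT_VAL };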
Blessed objects raise other issues: (a) is the object opaque, or can you just extract the data associated with the underlying blessed reference? (b) if the object is opaque, one needs to identify which methods should be used as getters. Whether one adopts a functional or a templating approach, one will still need to find a way to provide the same information (opaque or not / which methods are getters).
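One way to pass that information, in either approach, would be another per-class option hash; a sketch only, with the class names and keys invented for illustration:
my $hOptions = {
    objects => {
        'My::Person' => {                  # opaque: name its getters
            opaque  => 1,
            getters => [ 'first_name', 'last_name' ],
        },
        'My::Plain'  => { opaque => 0 },   # just unwrap the blessed ref
    },
};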
OK, so you add a way to specify which "field" is gonna become an attribute of the record tag. Then you find out you sometimes need more. That sometimes the name of the field doesn't match the name of the attribute. ...
Some people would handle things that way, but I wouldn't. The functional approach I described already allows one to put "fields" in as attributes, as well as a number of other XML syntax variations; check the format choices for details. What it didn't allow you to do is rename "fields", or pick and choose which fields become record attributes and which become nested elements.
However, providing that support would require (for the user) no more than a minor modification to the filter function. Instead of returning a simple boolean, the filter function would return (see the sketch below):
- undef if the field should be skipped
- the name of the field if the field should be included in the record. If only a name is returned, the field will become a record attribute or a nested element according to the option hash passed into genRecord(...)
- a reference to an array or hash containing two bits of data: the field name and an option hash that overrides the one passed into genRecord(...), but only for that particular field.
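A filter along those lines might look like this (sketch only; the option keys are the invented ones from above):
my $filter = sub {
    my ($sField) = @_;
    return undef if $sField =~ /^_/;              # skip "private" fields
    return 'id'  if $sField eq 'PageId';          # rename this field
    return [ $sField, { format => 'attr' } ]      # per-field option override
        if $sField eq 'lname';
    return $sField;                               # keep everything else
};
my $sXML = genRecord($aData, { filter => $filter });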
I would consider a souped-up filter function a better choice than the regex used in the template spec, because the logic involved in choosing field names and field placement (attribute vs. nested element) may not be reducible to a regex.
For the simpler task of converting AoH to (more or less) record based XML your solution is probably simpler. Whether easier to use I'm not so sure
Each person has their own style, and you (and probably many others) may simply prefer templating. I find templating approaches more limiting and "less easy", mainly because (a) even when there are obvious defaults, one still needs to spell out everything in a template - one could say it lacks Huffman encoding; (b) when I really do need to do fancy things, like treating some fields as record attributes and some as nested elements (or renaming fields), my logic may not be reducible to a regex, whereas a filter function gives me the full power of Perl, including closures; and (c) the implementation is more reusable - I can always layer a template language and parser over the functional approach, but I can also experiment with other ease-of-use interfaces.
Best, beth
Update: moved discussion of WYSIWYG and templating to start of post.
Re: Datastructures to XML
by zentara (Cardinal) on Mar 17, 2009 at 16:46 UTC
Wow, I've been looking at how to approach that for a while. I somehow knew it would take someone like Jenda to figure it out. Nested references in the hash values seem problematic.
I guess I can take this as a suggestion that it would be worth it to implement this, right? :-)
You mean if the data look like this:
[
  { 'lname' => 'Krynicky', 'fname' => 'Jenda',
    'PageId' => 1,
    'Details' => {
      Header => 'blah blah',
      Footer => 'bla bla bla',
    }
  },
  ...
];
? You could use a template like this for example:
<root>
  <page cmd:foreach set:id="$_->{PageId}" opt:required>
    <field cmd:foreachkey="! /^(PageId|Details)$/">
      <cmd:insert set:_tag="$key" set:_content="$value"/>
    </field>
    <details cmd:forkey="Details">
      <cmd:insert cmd:foreachkey set:_tag="$key" set:_content="$value"/>
    </details>
  </page>
</root>
That is, you make sure the foreachkey in the <field> tag skips both the PageId and Details keys, include the <details> tag to present the values of the 'Details' key, and finally use foreachkey directly in the <cmd:insert> tag to print the data from the inner hash.
Is this what you meant? I assume writing the documentation for this will be quite hard, just like the docs for XML::Rules were - and I don't think I did a perfect job there.
Re: Datastructures to XML
by sundialsvc4 (Abbot) on Mar 18, 2009 at 15:55 UTC
To my way of thinking, “references” are not a concept that an external data-structure really knows about.
To see a shining example of what does work, we need look no further than the most common external data structure of all: an SQL table database.
A database, of course, consists of a collection of tables. But the tables do not contain “pointers to” one another... as their hierarchical, pointer-based ancestors did. The overall data relationship is described using a set of distinct tables, and where the information in one table is semantically related to the information in another, the relationship is expressed as a commonality of key values.
You can quite easily do the same thing here. Store your information, not as one XML tree, but as several. Attach to each node a “primary key” of some kind... it could be a Data::GUID or some derivative thereof. Then, when one tree needs to refer to another, it does so by mentioning its “foreign key.”
The software that works with these trees could, if necessary, use a Perl hash to refer to them ... or, if the amount of data is known to be manageable, it could just use XSL queries. You can construct references in your Perl nodes, as long as you take care to make sure that they are all appropriately weak or strong. In either case, the entire “references problem” simply goes away with respect to your external data representation.
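A minimal sketch of that idea (Data::GUID is a real CPAN module; the element and attribute names below are merely illustrative):
use Data::GUID;

# Give each tree its own key, and refer across trees by key instead of
# by nesting or by Perl reference.
my $sPageKey    = Data::GUID->new->as_string;
my $sDetailsKey = Data::GUID->new->as_string;

print <<"XML";
<page id="$sPageKey" details-ref="$sDetailsKey">
  <fname>Jenda</fname>
</page>
<details id="$sDetailsKey">
  <Header>blah blah</Header>
</details>
XML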
If your data representation is XML, then it is highly desirable to take the time to make your XML well-formed. You gain a lot of benefits by describing a formal schema and sticking to it. The biggest of these benefits is that both the incoming and the outgoing data are “known good.” If the incoming data validates against its schema, you can rely upon that validation. If you are building a data structure that is supposed to conform to a schema, exceptions can be thrown if it doesn't ... ergo, if exceptions do not occur, the code that builds the data structure must indeed be building a conformant structure. (Anytime you are looking for bugs in complex code such as this, it helps immensely to know where the bugs are (probably...) not.)
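For example, with XML::LibXML one can validate both incoming and outgoing documents against an XSD; a sketch, with the file names used as placeholders:
use XML::LibXML;

# Validate a document against a schema and let exceptions signal
# non-conformant data.
my $schema = XML::LibXML::Schema->new( location => 'records.xsd' );
my $doc    = XML::LibXML->new->parse_file('records.xml');
$schema->validate($doc);    # dies unless the document conforms
print "records.xml conforms to records.xsd\n";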
Re the database example: I would consider this an implementation detail. It doesn't really make a difference whether you know the address of another object (in the general sense) in memory, or its ID within some collection - or, in the case of N-N relations, a list of pairs of object IDs.
If I could store the data in any format I wished, it would be much easier, but sometimes I do not control the format. And sometimes, even if I do, the format doesn't map directly to the data structure that best suits the needs of the task at hand.
I think I should have explained better what I am really after. I'd like to have a "reverse of XML::Rules". That is, with XML::Rules I can tweak the tree structure of the data from an XML document so that I can work more easily with the resulting structure, where the original structure may have been designed with a different task in mind or may just be more general. Then I would like to have some reasonably simple way to "convert" the data structure back to the original format - or, for that matter, to a different format, quite often one that was not designed for this particular task.
It may be an implementation detail, but it is a very important one that can have a significant impact on a general purpose "data structure to XML" converter.
Sometimes data structures constructed in memory use pointers in place of ids. When this is converted to a persistent form (XML or otherwise), one must create some sort of id that corresponds to the pointer or reference. One will also need to decide on a name for the tag or attribute that holds the generated id since there is no corresponding array or hash element to "foreach". Otherwise there will be information loss.
In some cases one can just assign sequential ids. Other formats might require GUID generation. Others might want a registered URI. Still other XML formats require that the id match something in a database or flat file. To get the right id one might need to do a lookup on a "soft" id - for example, a person's first and last name, or their social security or passport number. Or one might need to add a new record to the database and capture the id assigned by the database.
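As one simple strategy for the sequential-id case (a sketch, not from any particular module), in-memory references can be mapped to ids with Scalar::Util::refaddr as the structure is walked:
use Scalar::Util qw(refaddr);

# Hand out one sequential id per distinct reference, so the same
# in-memory node always serializes to the same id attribute.
my (%id_for, $next_id);
sub id_for_ref {
    my ($ref) = @_;
    return $id_for{ refaddr $ref } ||= ++$next_id;
}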
A second issue that I think sundialsvc4 was getting at is the placement of XML elements. Both your template spec and my functional alternative assume a part-container model: elements nested within elements.
But sundialsvc4 is reminding us of an extremely important and common alternative: the relational model. In the relational model, big ugly objects aren't nested. They are replaced by foreign key fields. The XML for the big-ugly-object is defined elsewhere, perhaps even in a different file. The two may be connected either by matching field values (a la a relational DBMS) or by "references" - the value assigned to the id attribute of the big-ugly-object-in-another-file.
Because part-container models are easier to conceptualize, XML schemas often start life using a part-container model and then migrate over time to one that supports more of a relational model (less duplication of big-ugly-objects). For a readily available open source example, study the history of the XML format used with the ant build tool. Incidentally, the history of DBMS implementation also follows this progression (anybody remember CISC-ISAM databases?)
Any general purpose tool would be wise to support both (or clearly explain its limits in the CAVEATS section of its POD). Otherwise a company using Mondo::Wonderous::Data::XML might find that they have to throw out, rather than modify, their XML generation code as their XML schemas mature.
Best, beth
It is not, strictly speaking, “an implementation detail.” When you are dealing with external data collections (be they SQL tables, XML files, or whatever), the notion of “addresses” (hence: references) does not exist. The notion of “keys,” of whatever format you wish, does.
If you have ever had the unfortunate experience of dealing with an IMAGE or an IDMS database in any past-life you'd much rather forget, then you will know exactly what I am talking about... :-D
(You do not, of course, have to answer that. Many I.S. memories are much better left buried in the past.)