In the course of my programming life I've often had to deal with large codebases where hundreds of classes each need at least one, and sometimes several, persistence formats, and occasionally more than one in-memory format as well.
The need to support multiple formats arises from many directions:
Managing multiple formats for one class of objects can be tedious. Managing multiple formats for entire families of objects can quickly get out of hand.
Frequently there is a good deal of nearly duplicate code with minor changes to suit each format. Making changes in several almost-alike modules practically begs for human error. Human error aside, I am especially concerned about the amount of production and test code involved in supporting so many formats. Anything I can do to reduce it is a major plus, both personally and on the balance sheet. The less code there is, the fewer people are needed to maintain it: even the fastest programmer can only read so fast.
Controlling the amount of code is also an issue for my personal projects. I can only manage so much code effectively. Anything I can do to keep the volume of code down increases the complexity of projects I can manage on my own.
Recently, I needed to add human-editable YAML support to a group of objects. YAML does a nice job of dumping object graphs, and it even has a class, YAML::Marshall, that you can use to add custom dump and load methods to your class. However, the resulting YAML is not pretty: making end users enter !!perl/hash/:... for each object is not my idea of user-friendly. The only way to make it friendly was to unbless the object graph and dump it as a plain set of hashes. Then, when reloading it, I would have to rebless everything.
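To make the unbless/rebless round trip concrete, here is a minimal sketch using only core modules. The names `unbless_graph` and `rebless_graph`, the template keys (`class`, `fields`, `each`), and the `Person`/`Address` classes are all hypothetical and invented for illustration; this is not the Data::Morph API. It assumes hash- and array-based objects with no cycles.

```perl
use strict;
use warnings;
use Scalar::Util qw(reftype);

# Return a deep copy of the graph with all class information dropped.
# Assumption: objects are hash- or array-based and the graph has no cycles.
sub unbless_graph {
    my ($node) = @_;
    my $type = reftype($node) // '';
    if ($type eq 'HASH') {
        my %copy = map { $_ => unbless_graph($node->{$_}) } keys %$node;
        return \%copy;
    }
    if ($type eq 'ARRAY') {
        my @copy = map { unbless_graph($_) } @$node;
        return \@copy;
    }
    return $node;    # plain scalar (or a ref type this sketch ignores)
}

# Rebless plain data IN PLACE, guided by a template (hypothetical shape):
#   class  => class name to bless a hash into
#   fields => { key => sub-template } for hash values
#   each   => sub-template applied to every array element
sub rebless_graph {
    my ($data, $tmpl) = @_;
    return $data unless ref $data && $tmpl;
    if (ref $data eq 'HASH') {
        my $fields = $tmpl->{fields} || {};
        for my $key (keys %$fields) {
            next unless exists $data->{$key};
            $data->{$key} = rebless_graph($data->{$key}, $fields->{$key});
        }
        bless $data, $tmpl->{class} if $tmpl->{class};
    }
    elsif (ref $data eq 'ARRAY') {
        $_ = rebless_graph($_, $tmpl->{each}) for @$data;
    }
    return $data;
}

# Two toy classes, made up for the example.
{ package Address; sub new { my ($class, %args) = @_; return bless { %args }, $class } }
{ package Person;  sub new { my ($class, %args) = @_; return bless { %args }, $class } }

my $person = Person->new(
    name      => 'Ann',
    addresses => [ Address->new(city => 'Haifa') ],
);

# Plain hashes and arrays: this is what would be handed to a YAML dumper.
my $plain = unbless_graph($person);

# The template records the class information that the plain data lost.
my $template = {
    class  => 'Person',
    fields => { addresses => { each => { class => 'Address' } } },
};
my $again = rebless_graph($plain, $template);
```

The plain structure is what a dumper such as YAML's `Dump` would see, so the output contains no class tags. The price is that the class names have to live somewhere else, here in the template, which is exactly the sort of pro-forma navigation code that gets rewritten for every family of objects.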
As I planned the design for YAML persistence, I realized that there was an underlying pattern that was common to almost all of my data transformations, especially those that involve writing things out to an intermediate format. When dumping data, the pattern goes something like this:
When loading the data, we get a similar pattern in reverse:
Now I realize this is a very, very general pattern, but even very, very general patterns need code to manage their flow and in this case, much of this code is pro-forma. What is even more important is that a lot of the code involved in massaging the data to be friendly to serialization tools when dumping (or programmers when loading) is also pro-forma. This is especially true when navigating object graphs. I seem to be writing the same navigation code and checks over and over.
Rather than reinvent the wheel over and over for slightly different data, I began exploring different ways of eliminating pro-forma code. Out of this developed a system of composable data conversion tools. They range in complexity from single strings holding the name of a class, to functions and closures, to composite structures that act as templates defining navigation paths for bottom-up and top-down data conversion.
I've collected the code for managing the load/dump pattern and its various conversion rules into a module that I'm considering for release on CPAN. The module is tentatively named Data::Morph.
I've posted the code and documentation below in a separate node hoping that my fellow monks will be willing to give me some feedback on the concept, name, code, documentation, test strategy, and its suitability for CPAN. This would be my first ever CPAN module, so I especially need the feedback.
Many, many thanks in advance, beth
Update: the remaining portion of this post has been moved to a node below, as per the gracious suggestion of jdporter. The intent was to hide the pod and code so that the request for comment would close this post, but as readmore tags explode when you click through directly, that does not happen for most readers. My apologies to any reader who was overwhelmed by the length of this post.
|
Re: RFC: Managing data in multiple formats
by ELISHEVA (Prior) on Apr 01, 2009 at 00:39 UTC |