RFC: Managing data in multiple formats

In the course of my programming life I've often had to deal with large codebases containing hundreds of classes, each needing at least one and sometimes several persistence formats, and sometimes more than one in-memory format as well.

The need to support multiple formats arises from many directions:

multiple use cases: data dumps for debugging, pretty, user-friendly formats that are easy for end users to edit, and serialization for interprocess communication, to name the most common.
internal format changes: sometimes the internal format of data changes in ways that have no real effect on configuration data. I want to keep the configuration and data files the same, but the internal format no longer matches the data structures implied by the end-user format.
backwards compatibility: configuration and data files have changed to support new features, yet I still want (or need) to support older data file formats that may no longer match the internal data structures.
prototyping: several different formats are being considered, but it isn't clear which one will really work out best for use case X. Clients and developers can have a hard time envisioning how each file would work for them unless they see samples. This is especially true if the data stored in the data files is complex and richly interrelated. However, such complex data is also the hardest to hand-craft load and dump routines for. Thus prototyping is most expensive precisely when we need it most. If there were a cheap and easy way to try out multiple formats, it could take a lot of the guesswork out of these choices.
role-based tools: a lot of serialization functionality happens after the serialization itself (encryption, compression, and interprocess communication being just a few examples). Although these tools don't need multiple formats for their own family of objects, they do benefit from the existence of load and dump tools that can handle a wide range of relationships between in-memory data structures and serialization formats. The more capable a load or dump method is, the wider the range of objects it can load and dump. The wider the range of objects these tools can successfully load and dump, the more valuable these role-based tools become.

Managing multiple formats for one class of objects can be tedious. Managing multiple formats for entire families of objects can quickly get out of hand.

Frequently there is a good deal of nearly duplicate code with minor changes to suit each format. Making changes in several almost-alike modules practically begs for human error. Human error aside, I am especially concerned about the amount of production and test code involved in supporting so many formats. Anything I can do to reduce it is a major plus both personally and on the balance sheet. The less code there is, the fewer people are needed to maintain it: even the fastest programmer can only read so fast.

Controlling the amount of code is also an issue for my personal projects. I can only manage so much code effectively. Anything I can do to keep the volume of code down increases the complexity of projects I can manage on my own.

Recently, I needed to add human-editable YAML support to a group of objects. YAML does a nice job of dumping object graphs, and it even has a class, YAML::Marshall, that you can use to add custom dump and load methods to your class. However, the resulting YAML is not pretty: making end users enter !!perl/hash/:... for each object is not my idea of user-friendly. The only way to make it friendly was to unbless the object graph and dump it as a plain set of hashes. Then, when reloading it, I would have to rebless everything.
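To make the unbless/rebless round trip concrete, here is a minimal sketch using only core modules. It is not the code from the module discussed below: the function names unbless_graph and rebless_graph, the Person and Address classes, and the key-to-class map are all made up for illustration, and the sketch assumes an acyclic graph of hash- and array-based objects.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed reftype);

# Copy an object graph into plain hashes, arrays and scalars.
# Assumption: the graph is acyclic and uses only hash/array-based objects.
sub unbless_graph {
    my ($node) = @_;
    my $type = reftype($node) // '';    # undef for non-references
    if ($type eq 'HASH') {
        my %copy;
        $copy{$_} = unbless_graph($node->{$_}) for keys %$node;
        return \%copy;
    }
    if ($type eq 'ARRAY') {
        my @copy = map { unbless_graph($_) } @$node;
        return \@copy;
    }
    return $node;    # plain scalar, or a reference type this sketch ignores
}

# Rebless a plain hash in place. $class_for is a hypothetical map from
# hash key to the class that the value stored under that key belongs to.
sub rebless_graph {
    my ($data, $class, $class_for) = @_;
    for my $key (keys %$data) {
        my $child_class = $class_for->{$key} or next;
        rebless_graph($data->{$key}, $child_class, $class_for)
            if ref($data->{$key}) eq 'HASH';
    }
    return bless $data, $class;
}

# Usage: a two-object graph goes out as plain hashes and comes back blessed.
my $person = bless {
    name    => 'Ann',
    address => bless({ city => 'Oslo' }, 'Address'),
}, 'Person';

my $plain = unbless_graph($person);    # safe to hand to a YAML dumper
my $again = rebless_graph($plain, 'Person', { address => 'Address' });
```

Even for two classes, the navigation code outweighs the code that is specific to Person or Address, and it would look much the same for any other family of objects.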

As I planned the design for YAML persistence, I realized that there was an underlying pattern that was common to almost all of my data transformations, especially those that involve writing things out to an intermediate format. When dumping data, the pattern goes something like this:

Take an in-memory, programmer-friendly data structure.
Choose a serialization format (and CPAN module).
Convert the in-memory data into a format that is easy for the serialization module to use.
Serialize the data using the module.

When loading the data, we get a similar pattern in reverse:

Identify the serialization format and choose a CPAN module to support it.
Use the module to deserialize the data.
Convert the data into a programmer-friendly form.
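The dump and load steps above can be sketched as a small format registry. This is only an illustration of the pattern: the %FORMAT table, the dump_as and load_as functions, and the Person class are hypothetical names, and JSON::PP (a core module) stands in for whatever serialization module a real project would choose.

```perl
use strict;
use warnings;
use JSON::PP ();

my $json = JSON::PP->new->canonical(1);    # sorted keys, stable output

# Hypothetical registry: each format pairs a serializer with the
# conversions to and from the shape that serializer is happy with.
my %FORMAT = (
    json => {
        # By default JSON::PP refuses blessed objects, so make a plain copy.
        to_plain    => sub { my ($obj) = @_; my %copy = %$obj; return \%copy },
        serialize   => sub { return $json->encode($_[0]) },
        deserialize => sub { return $json->decode($_[0]) },
        from_plain  => sub { my ($data, $class) = @_; return bless $data, $class },
    },
);

# Dump: pick the format, convert to a serializer-friendly shape, serialize.
sub dump_as {
    my ($obj, $format) = @_;
    my $f = $FORMAT{$format} or die "unknown format '$format'";
    return $f->{serialize}->( $f->{to_plain}->($obj) );
}

# Load: pick the format, deserialize, convert to a programmer-friendly form.
sub load_as {
    my ($text, $format, $class) = @_;
    my $f = $FORMAT{$format} or die "unknown format '$format'";
    return $f->{from_plain}->( $f->{deserialize}->($text), $class );
}

# Usage: round-trip a flat object through the 'json' entry.
my $person = bless { name => 'Ann', city => 'Oslo' }, 'Person';
my $text   = dump_as($person, 'json');
my $copy   = load_as($text, 'json', 'Person');
```

Adding a second format means adding a second entry to the table; dump_as and load_as themselves do not change, which is the part of the flow that would otherwise be rewritten for every format.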

Now I realize this is a very, very general pattern, but even very, very general patterns need code to manage their flow, and in this case much of that code is pro-forma. Even more important, a lot of the code involved in massaging the data to be friendly to serialization tools when dumping (or to programmers when loading) is also pro-forma. This is especially true when navigating object graphs. I seem to be writing the same navigation code and checks over and over.

Rather than reinvent the wheel over and over for slightly different data, I began exploring different ways of eliminating pro-forma code. Out of this developed a system of composable data conversion tools. They range in complexity from single strings holding the name of a class, to functions and closures, to composite structures that act as templates defining navigation paths for bottom-up and top-down data conversion.
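To give a feel for what composable conversion rules of this kind might look like, here is a rough sketch. It is emphatically not the module's actual interface: apply_rule and its rule shapes are invented for this example, under the assumption that a string names a class, a code reference is a conversion function, a hash reference is a template keyed by member name, and an array reference applies several rules in sequence.

```perl
use strict;
use warnings;

# Hypothetical rule dispatcher (illustration only):
#   'Class'          bless the data into that class
#   sub { ... }      call the function with the data and use its result
#   { key => rule }  template: convert each named member of a hash
#   [ rule, rule ]   apply the rules one after another
sub apply_rule {
    my ($data, $rule) = @_;
    my $kind = ref $rule;
    return bless($data, $rule) if $kind eq '';
    return $rule->($data)      if $kind eq 'CODE';
    if ($kind eq 'HASH') {
        for my $key (keys %$rule) {
            next unless exists $data->{$key};
            $data->{$key} = apply_rule($data->{$key}, $rule->{$key});
        }
        return $data;
    }
    if ($kind eq 'ARRAY') {
        $data = apply_rule($data, $_) for @$rule;
        return $data;
    }
    die "unsupported rule type '$kind'";
}

# Usage: a bottom-up conversion. Members are converted first, then the
# enclosing hash is blessed. Person and Address are made-up class names.
my $rule = [
    { address => 'Address', name => sub { ucfirst $_[0] } },
    'Person',
];
my $person = apply_rule({ name => 'ann', address => { city => 'Oslo' } }, $rule);
```

Reversing the order of the two entries in the array reference would bless the parent before visiting its members, which is one way to read the top-down case.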

I've collected the code for managing the load/dump pattern and its various conversion rules into a module that I'm considering for release on CPAN. The module is tentatively named: Data::Morph.

I've posted the code and documentation below in a separate node hoping that my fellow monks will be willing to give me some feedback on the concept, name, code, documentation, test strategy, and its suitability for CPAN. This would be my first ever CPAN module, so I especially need the feedback.

Many, many thanks in advance, beth

Update: the remaining portion of this post has been moved to a node below, as per the gracious suggestion of jdporter. The intent was to hide the pod and code so that the request for comment would close this post, but as readmore tags explode when you click through directly, that does not happen for most readers. My apologies to any reader who was overwhelmed by the length of this post.

Re: RFC: Managing data in multiple formats by ELISHEVA (Prior) on Apr 01, 2009 at 00:39 UTC
The text below was moved from the original post so it doesn't obscure the request for comment above.
Existing CPAN modules and test plan (2 kB)
The POD documentation, Data/Morph.pod (37 kB)
The module code, Data/Morph.pm (15 kB)