in reply to RFC: Managing data in multiple formats
The text below was moved from the original post so it doesn't obscure the request for comment above.
Existing CPAN modules and test plan:
Modules currently on CPAN

CPAN has many, many modules dedicated to data serialization and transformation. The closest module I've found to this one is Data::Serializer. That module provides a common interface for dumping and loading various serialization formats, but it offers no systematic support for situations where the internal, programmer-friendly data structure differs from the data structure that is actually serialized. This issue is discussed in more depth in the POD documentation below.
Testing strategy

The Data::Morph module has fewer than 500 lines of code, but it does a lot in those 500 lines and requires particularly intensive testing. The test suite currently contains 882 tests covering combinations of module, data sample, and rule. Each combination is tested to verify that (a) the dump string matches an expected value, (b) loading the expected dump string regenerates the original internal representation, and (c) the process of dumping does not modify the original data.
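For readers curious what such a round-trip test looks like, here is a minimal, self-contained sketch of the pattern using only core modules. The helper names (my_dump, my_load) are invented for illustration; the real suite drives Data::Morph's own dump and load methods across many module/sample/rule combinations:

```perl
use strict;
use warnings;
use Storable qw(dclone);
use Data::Dumper;

# Stand-ins for one dump/load pair under test.
sub my_dump {
    local $Data::Dumper::Sortkeys = 1;   # deterministic key order
    local $Data::Dumper::Indent   = 1;
    return Dumper(shift);
}
sub my_load {
    my $VAR1;                            # Dumper output assigns to $VAR1
    return eval shift;
}

my $original = { name => 'trombones', quantity => 76 };
my $snapshot = dclone($original);        # deep copy taken before dumping

my $dumped   = my_dump($original);       # (a) dump matches an expected value
my $reloaded = my_load($dumped);         # (b) reloading restores the data

# (c) dumping must not have modified the original
die "dump modified the original" unless my_dump($original) eq my_dump($snapshot);
die "round trip failed"          unless my_dump($reloaded) eq $dumped;
print "all round-trip checks passed\n";
```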
I am especially concerned about the documentation. Without good documentation, this module is nearly useless. The documentation is rather long: it tries to cover the concept behind the module, the role of the module among the many, many CPAN modules for data conversion and manipulation, and, of course, the specifics of defining data transformations with this module. Despite its length, I suspect that many things are still unclear. I also wonder whether I need all of the material I've added to give the module context.
The POD documentation (Data/Morph.pod)
=head1 NAME

B<Data::Morph> - manage deeply nested data and compound objects with multiple serialization and in-memory formats.

=head1 SYNOPSIS

Associate a serialization protocol with a C<Data::Morph> object:

  my $oYAML       = Data::Morph->new('YAML');
  my $oCustom     = Data::Morph->newCustom(\&freezeMe, \&thawMe);
  my $oSerializer = Data::Morph->newSerializer(serializer => 'XML');

Define a rule for converting in-memory data to and from a format, using one of the following rule definition approaches:

  # bless the data when you read it in
  # strip the blessing when you read it out
  $rule = 'Mine::Foo';

  # pass the data to a constructor when you read it in
  # strip the blessing when you read it out
  $rule = 'Mine::Foo->new';

  # select data from a hash when you read it in
  # strip the blessing when you read it out
  $rule = 'Mine::Foo->new(name,age)';

  # select data from an array when you read it in
  # strip the blessing when you read it out
  $rule = 'Mine::Foo->new(2,4..,3,1)';

  # define your own rule for reading data in and out
  sub myConvertFunc {
      my ($xData, $bLoad, $hReplace, $aPath) = @_;
      # ... do stuff ...
      return $xConverted;
  }
  $rule = \&myConvertFunc;

  # read data in bottom up using a rule of your choice
  # dump data top down, possibly using a different rule
  $rule = Data::Morph::Rule->new($load, $dump, $rule, $default);
  $rule = Data::Morph::makeRule($load, $dump, $rule, $default);
  {
      use Data::Morph qw(makeRule);
      $rule = makeRule($load, $dump, $rule, $default);
  }

Then use the object to dump and load data:

  # To dump the data ($rule is optional)
  $oYAML->dump($someData, $rule);
  $oCustom->dump($someData, $rule);
  $oYAML->freeze($someData, $rule);      # alias for dump
  $oYAML->serialize($someData, $rule);   # alias for dump

  # To load the data ($rule is optional)
  $oYAML->load($someData, $rule);
  $oCustom->load($someData, $rule);
  $oYAML->thaw($someData, $rule);        # alias for load
  $oYAML->deserialize($someData, $rule); # alias for load

  # For additional methods, see L</Converting data>.
=head1 DESCRIPTION

This module provides tools that help you manage data that needs multiple serialization or in-memory representations. Its goal is to remove the gruntwork and pro-forma flow-of-control code so that you can focus on the actual formats and conversion logic you want.

First, what this module does I<not> do. It does not automagically convert any data to any other data. Any tool that tried to do that would be so general and bloated that using it would be like hammering a nail with a jack-hammer. Rather, this module provides a framework for managing the pipeline from object, to data massaging routine, to serialization module, and back again. It also provides several tools for (a) defining how to load and dump data in a particular serialization format and (b) massaging data before dumping and after loading.

=head2 Why multiple formats for the same data?

CPAN provides support for a wide variety of serialization formats, but not all of these modules can deliver data in exactly the format you want. That means you need to massage the incoming and outgoing data. Massaging this data means that you need at least two formats in addition to your normal programmer-friendly in-memory representation: the intermediate massaged format and the actual serialized form of the data.

The need to support multiple formats arises from many other directions as well:

* multiple use cases: data dumps for debugging, pretty user-friendly formats that are easy for end users to edit, and serialization for interprocess communication, to name the most common.

* internal format changes: sometimes the internal format of data changes in ways that have no real effect on configuration data. I want to keep the configuration and data files the same, but the internal format no longer matches the data structures implied by the end-user format.
* backwards compatibility: configuration and data files have changed to support new features, yet I still want (or need) to support older data file formats that may no longer match the internal data structures.

* prototyping: several different formats are being considered, but it isn't clear which one will work out best for use case X. Clients and developers can sometimes have a hard time envisioning how each file format would work for them unless they see a sample. This is especially true if the data stored in the data files is complex and richly interrelated. However, such complex data is also the hardest to hand-craft load and dump routines for. Thus prototyping is most expensive precisely when we need it most. A cheap and easy way to try out multiple formats could take a lot of the guesswork out of these choices.

* role-based tools: a lot of serialization functionality happens after the serialization (encryption, compression, and interprocess communication being just a few examples). Although these tools don't need multiple formats for their own family of objects, they do benefit from the existence of load and dump tools that can handle a wide range of relationships between in-memory data structures and serialization formats. The more capable a load or dump method is, the wider the range of objects it can load and dump; and the wider that range, the more valuable these role-based tools become.

=head2 Alternatives to C<Data::Morph>

Although CPAN has many, many modules for converting data from one format to another, all of the modules I was able to find had one or more of the following limitations:

* they can handle serialization for only a limited set of data structures. For example, L<Data::Any>, L<Data::Table>, and L<Tangram> each handle a variety of serialization formats, but they expect in-memory data to be unblessed hashes or arrays of hashes.
They treat the values assigned to hash keys as strings, numbers, or blobs even if the serialization format (e.g. XML) permits deeply nested data structures.

* they provide support for the first part of the pipeline, but not the second. Although various serialization formats are supported, any data massaging happens before or after the use of the module. Examples of CPAN modules with this limitation include L<Data::Serializer>, L<Data::Any>, and L<Data::Table>, among others.

* they support massaging of data in memory, but only for blessed objects, and only one such massage function is supported per class. For example, L<JSON> and L<YAML> both support class-specific load and dump routines: L<JSON> looks for a method named C<TO_JSON>; L<YAML> looks for methods named C<yaml_dump> and C<yaml_load>. L<Pixie>, and L<Data::Freezer> which is based on it, also take this approach to customization.

C<Data::Morph> is designed to pick up where each of these modules leaves off. It is intended as a complement to these modules rather than a replacement. It knows how to work together with them, so solutions initially implemented with these modules can continue to use that implementation alongside C<Data::Morph>.

=head3 Scalability issues with per-class serialization methods

One significant advantage of using C<Data::Morph> as a framework for your data transformations is that it gives you choices that can make the growth of your code base easier to manage. Class-specific load and dump routines can be quite convenient when one is dealing with a small library of objects, but they may not scale well as a code base grows. Maintaining a set of such routines across tens or hundreds of classes can get confusing quite quickly. Such a large number of routines might be easier to manage if they were centralized in some common store. This is not an option when modules limit customization to methods defined within a class.
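To make the per-class hook concrete, here is a minimal sketch using the core JSON::PP module (the class name C<My::Point> is invented for illustration). Note that the class gets exactly one hard-wired hook, C<TO_JSON>, with no way to vary the dump by use case:

```perl
use strict;
use warnings;
use JSON::PP;

# A tiny class exposing the single dump hook that JSON looks for.
package My::Point;
sub new     { my ($class, %args) = @_; return bless { %args }, $class }
sub TO_JSON { my $self = shift; return { x => $self->{x}, y => $self->{y} } }

package main;

# convert_blessed makes encode() call TO_JSON on blessed references;
# canonical sorts hash keys so the output is deterministic.
my $json  = JSON::PP->new->canonical->convert_blessed;
my $point = My::Point->new(x => 1, y => 2);
print $json->encode($point), "\n";   # {"x":1,"y":2}
```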
C<Data::Morph> is a bit more scalable because you have full control over where you define your data massaging rules. You can define your rules within the class's module if you wish. If that becomes unwieldy, you can move your rules to a central place, if that would be more convenient.

As classes mature and get reused, the per-class customization strategy runs into another problem: there can be only one such method per class. This causes difficulties when there are multiple use cases for a particular serialization format. For example, one might sometimes use YAML for IPC, sometimes for a user-friendly layout, and at yet other times for debugging dumps. IPC demands compactness and efficiency and often replicates the in-memory representation of the data without any massaging; if the data is only being used in memory, integer codes can easily be used in place of strings. User-friendly layouts, on the other hand, generally need much more verbosity: integer codes need to be translated into human-readable strings, and data structures that would generate a lot of arcane syntax may need to be replaced with data structures that can be represented with a subset of human-friendly syntax. Debugging dumps usually require something between the verbosity of a human-friendly data file and a dense IPC string. A single C<dump_YAML> method would have difficulty supporting all three of these use cases: IPC, debugging output, and user-friendly data files.

With C<Data::Morph> you have the option of using either the class's own dump and load rules or an external rule imposed on the class from the outside. It does not look for a method with a specific name, so you can define as many per-class methods for a serialization format as you need. Furthermore, if you don't have the rights to change the class definition, you can define rules and functions for massaging the data that are completely external to the class.
C<Data::Morph> can use those as easily as it uses the per-class methods.

=head2 The data conversion pipeline

The framework divides the conversion process into two phases:

* Phase 1: conversion to and from the serialization format and a Perl data structure that exactly parallels the serial format. We will call this the "outer" Perl data format. Typically this phase is configured by passing the name of an existing CPAN module or by passing references to functions defined in existing CPAN modules.

* Phase 2: customized conversion between the "outer" Perl data format and an internal, programmer-centric Perl data format. We will call this the "inner" Perl data format. Its purpose is to customize the output of standard modules into a form more suitable to the programmer's needs. Because the customization is highly application specific, it is configured with a programmer-defined rule rather than with "canned" modules.

=head3 Defining phase 1: serialization format

Phase 1 takes care of serializing and deserializing data. To implement the serialization, C<Data::Morph> needs to be given two subroutines:

* a dump routine that accepts any scalar or Perl reference as its sole parameter

* a load routine that accepts a string or other stream as its sole parameter

These two routines are defined when the C<Data::Morph> object is constructed.

=head4 Data::Morph object constructors

The serialization format used by a C<Data::Morph> object is chosen when the object is constructed. There are three different constructors: C<new(...)>, C<newSerializer(...)>, and C<newCustom(...)>.

=over

=item C<new(...)>

Selects dump and load routines using the names of modules and functions.

  # $sModule  the name of a module that dumps and loads data
  #           Required
  #
  # $sDump    the name of a function within $sModule. The
  #           function should take the same parameters as
  #           YAML::Dump or Data::Dumper, i.e.
  #           &$sDump($xData);
  #
  #           Optional, defaults to 'Dump'
  #
  # $sLoad    the name of a function within $sModule. The
  #           function should take the same parameters as
  #           YAML::Load, i.e. &$sLoad($xData);
  #
  #           Optional, defaults to 'Load'

  $oConvert = Data::Morph->new($sModule, $sDump, $sLoad);

=item C<newSerializer(...)>

  $oConvert = Data::Morph->newSerializer(key => val, ...);

C<newSerializer(...)> converts data using an instance of L<Data::Serializer>. The parameters passed to the constructor are used to build the L<Data::Serializer> object.

=item C<newCustom(...)>

This method gives one the fullest possible control over conversion. You provide code references to subroutines for dumping and loading data.

  # $crDump   a code reference to a subroutine or closure
  #
  # $crLoad   a code reference to a subroutine or closure

  $oConvert = Data::Morph->newCustom($crDump, $crLoad);

=back

=head4 Recipes for specific serialization formats

CPAN provides a number of excellent modules for converting back and forth between data structures and strings, so you won't normally need to write your own functions.

=over

=item L<YAML>

Since L<YAML> provides a C<Dump(...)> and a C<Load(...)> function, we can just pass the name of the module - it will be automatically required:

  my $oConvert = Data::Morph->new('YAML');

=item L<JSON>

JSON is an alternative to YAML. It is designed to capture data using very simple data types: scalars, arrays, hashes, and combinations of the two. It does not have a way to encode class identity in the string output.

Out of the box, the JSON module can only accept arrays and hashes as input. Other kinds of input (objects, pure scalars) will cause it to fail. It also dumps hashes in random order, which can make it difficult to compare dumped files. To fix these issues one must call the dump and load routines with an options hash. We also need to set the C<-convert_blessed_universally> flag. Normally, JSON only dumps blessed objects that have a C<TO_JSON> method; with C<-convert_blessed_universally>, JSON provides a default definition for that method.
To use these configured calls with C<Data::Morph>, we wrap them in subs and build the C<Data::Morph> object with the C<newCustom(...)> constructor:

  use JSON 2.14 qw(-convert_blessed_universally);

  sub jsonDump {
      return JSON::to_json(shift
          , { canonical       => 1
            , allow_nonref    => 1
            , allow_blessed   => 1
            , convert_blessed => 1
            });
  }
  sub jsonLoad {
      return JSON::from_json(shift, { allow_nonref => 1 });
  }

  my $oConvert = Data::Morph->newCustom(\&jsonDump, \&jsonLoad);

=item L<Data::Dumper>

L<Data::Dumper> provides a dump routine; you must, however, provide your own load routine:

  use Data::Dumper;

  sub evalDump {
      my $sEval = shift;
      my $VAR1;
      return eval($sEval);
  }

  my $oConvert = Data::Morph->newCustom(\&Data::Dumper::Dumper, \&evalDump);

=item L<Storable>

L<Storable> provides C<freeze(...)> and C<thaw(...)> functions for serializing data. Since these functions expect a reference to the data rather than the data itself, we must wrap their calls in subroutines and use the C<newCustom(...)> constructor:

  use Storable;

  sub freezeBinary {
      my $xData = shift;
      # pass a reference to the data
      return Storable::freeze(\$xData);
  }
  sub thawBinary {
      my $xData = shift;
      # dereference the thawed data
      return ${Storable::thaw($xData)};
  }

  my $oConvert = Data::Morph->newCustom(\&freezeBinary, \&thawBinary);

More elaborate uses of L<Storable> are also possible. For example, by passing references to closures one could build a converter that reads and writes a specific file:

  use Storable;

  my $sOutfile = 'foo.dat';
  my $crDump = sub {
      my $xData = shift;
      return Storable::nstore(\$xData, $sOutfile);
  };
  my $crLoad = sub {
      return ${Storable::retrieve($sOutfile)};
  };

  my $oConvert = Data::Morph->newCustom($crDump, $crLoad);

=item L<Data::Serializer>

L<Data::Serializer> provides a standard interface for data serialization, encryption, and compression. Because of the large number of serialization formats supported by this module, C<Data::Morph> provides a constructor dedicated to it.
It accepts any parameters that would normally be passed to the constructor of L<Data::Serializer> and constructs a serializer object with those parameters:

  $oConvert = Data::Morph->newSerializer(serializer => 'YAML');
  $oConvert = Data::Morph->newSerializer(serializer => 'XML');
  $oConvert = Data::Morph->newSerializer(serializer => 'JSON');

=back

=head3 Defining phase 2: conversion rules

This module supports rules for converting data from streams and files containing both single and multiple objects.

=head4 Handling of complex data

Developing data massage routines can get tricky when data is deeply nested or involves shared references, circular references, or part-container relationships. Making it easier to safely and correctly massage such data is one of the goals of the C<Data::Morph> package. Its tools for defining data transformations are integrated with a data navigation engine that automatically performs the following tasks:

* detection of circular references (and prevention of the infinite loops caused by attempts to navigate them)

* protection of shared references, even when transformations cause data to be copied

* preservation of the original data: the transformation leaves the source data unchanged even if the source data includes numerous references

=head4 Data transformation using string rules

String conversion rules let you define a blessing or constructor that should be used to create objects. They are primarily used to dump and load data that is serialized as unblessed objects. You can use a string rule to

* bless loaded data into a class

* pass the loaded data as is to a constructor of your choice

* massage the loaded data into a set of parameters to be passed to the constructor

The string conversion rule has a very simple implicit dump rule: it strips the blessing from any blessed array, hash, or scalar reference.
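The implicit behavior just described can be sketched in plain Perl. This is a simplified model of what a bare C<"Animal::Lion"> string rule would do, not Data::Morph's actual implementation (which also handles nesting and shared references):

```perl
use strict;
use warnings;
use Scalar::Util qw(reftype);

# Load side of a plain "Class::Name" string rule: bless the data.
sub rule_load {
    my ($xData, $sClass) = @_;
    return bless $xData, $sClass;
}

# Dump side: make a shallow, unblessed copy, so the original
# object keeps its blessing and is left unmodified.
sub rule_dump {
    my $xData = shift;
    my $sType = reftype($xData) // '';
    return { %$xData } if $sType eq 'HASH';
    return [ @$xData ] if $sType eq 'ARRAY';
    return \( my $copy = $$xData ) if $sType eq 'SCALAR';
    return $xData;     # non-reference data passes through unchanged
}

my $oLion  = rule_load({ furColor => 'tawny' }, 'Animal::Lion');
my $hPlain = rule_dump($oLion);
# $oLion is still blessed; $hPlain is an ordinary hash reference
```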
This dump rule should be sufficient if your constructor merely cleans up loaded data: checks for bad data, normalizes the representation of values, and sets defaults. However, constructors sometimes modify the basic arrangement of data - converting incoming hashes into inside-out objects, or using the data to generate new arrays and hashes with specialized key names. If the object is meant to be opaque and has a fundamentally different structure from the loaded data, then a dump routine that just unblesses the data won't be enough. You will need to define a custom dump routine. To couple a string rule with a custom dump routine, you must create an instance of C<Data::Morph::Rule>. Please see the section below for details.

  # blesses $xData on load
  # strips blessing on dump
  $rule  = "Animal::Lion";
  $xData = $oConvert->load($sData, $rule);
  $oConvert->dump($xData, $rule);

  # on load: converts $sData to a Perl data structure $xLoaded
  #          calls Animal::Lion->new($xLoaded)
  # on dump: strips blessing
  $rule  = "Animal::Lion->new";
  $xData = $oConvert->load($sData, $rule);
  $oConvert->dump($xData, $rule);

  # on load: massages $xLoaded into a parameter list
  #          Animal::Lion->new($xLoaded->{furColor}
  #                           , $xLoaded->{pawSize});
  # on dump: strips blessing
  $rule  = "Animal::Lion->new(furColor,pawSize)";
  $xData = $oConvert->load($sData, $rule);
  $oConvert->dump($xData, $rule);

You can also use the parameter list to reorder an array of parameters. You may use any parameter list you would use to create an array slice, e.g. C<< @aData[1..4,3,2] >>. In addition, one can use C<< N.. >> to refer to all parameters from N to the end of the array, inclusive.

  # reorders array parameters
  # N.. means insert parameters from N to the end of the array
  $rule = "Animal::Lion->new(2,3,1,0)";
  $rule = "Animal::Lion->new(2..,1,0)";

=head4 Array conversion rules

You can define a rule that applies to all elements of an array by defining a one-element array reference containing that rule.
Any kind of rule (string, hash, C<Data::Morph::Rule> object, or function) may be used this way. The rule will be applied I<only> to array elements:

  # apply the rule to all elements of an array
  my $rule = [ $ruleForEachElement ];

=head4 Hash conversion rules

You can define a rule for specific hash keys by creating a hash whose values are rules. As with array rules, any kind of rule can be used this way. A hash rule looks like this:

  # apply $rule1 to the hash key batman
  # apply $rule2 to the hash key robin
  # no rule defined for other keys (Phase 2 is a no-op)
  my $rule = { batman => $rule1, robin => $rule2 };

=head4 Massaging data with functions

Instead of a string, you can define a function to handle the conversion of the Phase 1 data. Function rules are typically used in one of two very different ways: simple data cleanup that needs to be applied to all members of an array or to specific hash keys, and complex object conversions.

There are two ways to use functions in data transformation rules: you can write a function and use it to construct a C<Data::Morph::Rule> object, or you can use it as a rule in its own right. This section focuses on how to decide which use is best for you and explores some issues to consider if you choose to use a function as a rule in its own right.

A function acting as a rule in its own right has up to four parameters. Unless you are doing something particularly fancy with nested data and references, you can probably ignore the last two of these, the C<$hReplace> and C<$aPath> parameters:
  $rule = sub {
      my ($xData, $bLoad, $hReplace, $aPath) = @_;
      if ($bLoad) {
          # massage $xData into $xInner
          return $xInner;
      } else {
          # $xData is the inner (programmer) representation
          # massage $xData into $xOuter
          return $xOuter;
      }
  };

  my $sData     = $oConvert->dump($xData, $rule);
  my $xReloaded = $oConvert->load($sData, $rule);

=head5 Function rules used for simple data cleanup

Function rules used for simple data cleanup are usually applied to all elements of an array or to selected hash keys. For this reason they are most often found embedded in array and hash rules. For example, if you were reading in a YAML string that defines an array of message template strings, you might want one encoding inside your program and another in the dumped string. To apply the encoding to each element of the array, you might define a function rule like this:

  use Encode;

  sub recode {
      my ($xData, $bLoad) = @_;
      return $bLoad ? decode("utf8", $xData)
                    : encode("utf8", $xData);
  }

  # [ ... ] says apply the rule to each array element
  $rule = [ \&recode ];
  $aDecodedData = $oConvert->load($sData, $rule);
  $sEncodedData = $oConvert->dump($aDecodedData, $rule);

=head5 Using function rules to dump complex object graphs

Another way to use functions is to pass them in a C<Data::Morph::Rule>. If you have an object that stores references to hashes, arrays, and other objects, your objects collectively form an object graph. C<Data::Morph> can navigate that graph if you configure a C<Data::Morph::Rule> object for top-down dumping and bottom-up loading. See C<Data::Morph::Rule> for more information.

You can, of course, also define a function and use it alone for the dumping and loading.
However, if you do that, you will have to manually handle all of the graph navigation - including ensuring that (a) transformations do not alter the original data, (b) objects that share references continue to share references after they are transformed, and (c) your load and dump routines do not end up in infinite loops because of circular references.

If you choose to completely manage the data navigation involved in dumping and loading, and you want your routine to play nicely with other objects that are being dumped, you will need to know how to use the C<$hReplace> and C<$aPath> parameters of the data transformation function.

C<$aPath> stores the path of references that have been navigated so far. You can generally ignore it unless your custom function is going to navigate further into the depths of C<$xData> and you are concerned about circular chains of references.

C<$hReplace> keeps track of references that contain transformed data. When data is transformed, it should leave the original data unchanged. This means that an array, hash, or object reference storing transformed data needs to be copied and then transformed, rather than being modified in place. Sometimes data structures store the same reference in multiple places, and the fact that reference X is used in three places may be significant. If one copies a hash or array, one must also find all the other places where that hash or array is used and change them to refer to the new copy. C<Data::Morph> keeps track of all of the copied references in the C<$hReplace> hash. If you need to copy an array or hash and care that other places referencing it also get changed, then you must add an entry to C<$hReplace> for each array or hash reference you copy. The key of each C<$hReplace> entry is the value returned by passing the original reference to C<Scalar::Util::refaddr>; the value is the new reference.

=head5 Using function rules to dump simple objects

If your object is quite simple (e.g.
all values are scalars), the choice between using a function in its own right and using it as part of a C<Data::Morph::Rule> object is pretty much a matter of style.

For example, suppose you have a detailed database of books. Your application lets each user keep a list of their favorite books. The list contains only the name of each book, but when you load it into memory you look up each name in the database. When you dump the list, you just want the names. So even though the data inside the object may be very complex, the data you actually dump and load is just a simple string. For a case like this there are no complex object graphs to navigate, so you can handle the whole dump and load process quite easily with a simple function rule like this:

  sub dumpOrLoadABook {
      my ($xData, $bLoad) = @_;
      if ($bLoad) {
          # ... look up book with favorite DBx module
          return $oBook;
      } else {
          return $xData->getBookName();
      }
  }

  # [ ... ] says apply the rule to each array element
  $rule = [ \&dumpOrLoadABook ];
  $aBooks     = $oConvert->load($sData, $rule);
  $sBookNames = $oConvert->dump($aBooks, $rule);

But even in a simple case like this, you still might want to use a C<Data::Morph::Rule> object. A rule object would let you take advantage of the fact that the dump routine only needs a getter method. It also lets you define a dedicated load routine:

  sub loadBook {
      my $xData = shift;
      # ... look up book with favorite DBx module
      return $oBook;
  }

  $rule = [ Data::Morph::Rule->new(\&loadBook, 'getBookName') ];

=head4 C<Data::Morph::Rule>

If you would like something between the full control of a function and the simplicity of a string rule, you can define a rule object.

  # $load     a string or function rule, used only for
  #           loading.
  #
  #           This parameter is optional. If missing, the
  #           load rule is a no-op, i.e.
  #           sub { my $xData=shift; return $xData; }
  #
  # $dump     one of four possible values:
  #
  #           - a string rule (see above)
  #           - a function rule (see above)
  #           - the name of a method to call on the data
  #             being converted. The data being converted is
  #             assumed to be an object. The method should
  #             expect the following parameters:
  #
  #             sub {
  #                 my ($hReplace, $aPath) = @_;
  #                 # convert the data
  #                 return $xConverted;
  #             }
  #
  #           - the empty string - causes the data to be
  #             returned as is, with no conversion.
  #
  #           This parameter is optional. If missing, the
  #           dump rule is the same as a string rule's, i.e.
  #           if the data is a blessed reference, it makes a
  #           shallow copy of the referenced data but omits
  #           the blessing from the copy.
  #
  # $rule     a rule for bottom-up loading and top-down
  #           dumping. During the load process, the rule is
  #           used to prepare the data before passing it to
  #           the load rule. When data is dumped, the rule
  #           is applied to the I<output> of the $dump
  #           function.
  #
  #           This parameter is optional. If missing, the
  #           data will be passed to the load rule as is.
  #
  # $default  the default key name. If present, any scalar
  #           data being loaded is converted to a hash
  #           reference: { $default => $xData }

  Data::Morph::Rule->new($load, $dump, $rule, $default);

The C<$load> and C<$dump> parameters let one provide entirely independent logic for loading and dumping data. C<$rule> and C<$default> are used to massage the data passed to C<$load>. C<$rule> is used primarily for bottom-up processing of deeply nested data structures.

For example, when we read in the following YAML data using out-of-the-box functionality, we will get a hash with embedded array references which in turn have embedded hashes (HoAoH):

  ---
  orderNumber: 123
  lines:
    - product: trombones
      quantity: 76
    - product: cornets
      quantity: 110
  ...

Now suppose we have an order object constructor that expects an array of order line objects as one of its parameters.
We can't pass the YAML data as is, because that data represents the order lines as an array of hashes rather than an array of order line objects. To solve this problem, we need to do a bottom-up conversion of the data: first convert the deeply nested order line hashes to order line objects, and only then pass the data to the order object constructor. We can use the C<$rule> parameter of a C<Data::Morph::Rule> object to define such a bottom-up conversion rule:

  # convert an order line into an instance of Acme::OrderLine
  $oOrderLineRule = "Acme::OrderLine->new(product,quantity)";

  # prepare data by converting all order lines into objects
  # for an explanation of {...} and [...] see the sections
  # above on hash and array conversion rules
  $oPrepData = { lines => [ $oOrderLineRule ] };

  # call this constructor after preparing the data
  $sLoadAndDump = "Acme::Order->new(orderNumber,lines)";

  # now put it all together into a single rule that preps
  # the data and loads it
  $oOrderRule = Data::Morph::Rule->new($sLoadAndDump, undef, $oPrepData);

Now we can use the rule to dump and load order objects, like this:

  $aData = $oConvert->load($sData, $oOrderRule);
  $sData = $oConvert->dump($aData, $oOrderRule);

=head4 Composing small rules into larger ones

The tools for building conversion rules are designed to be composable. That is, you can define a rule for a simple object and then make it part of a more complex rule for more complex objects. The arrays, hashes, functions, and objects storing rules can be nested inside one another to create increasingly complex conversion rules.
For example, suppose you wanted to convert a YAML string like the one below into an internal representation containing author and citation objects built with custom constructors:

  ---
  authors:
    - Emily Dickinson
    - William Shakespeare
    - Charlotte Bronte
    - Langston Hughes
  citations:
    - work: Romeo and Juliet
      author: William Shakespeare
    - work: Wuthering Heights
      author: Emily Bronte
    - work: Hold Fast to Dreams
      author: Langston Hughes
  ...

Out of the box, C<YAML::Load> would set up a hash containing two arrays. But suppose instead we wanted each author and each citation to be an object. The following code would ensure that the conversion rules were properly applied to each instance of author and citation in the YAML file:

  # define a rule for converting author name strings to
  # author objects
  my $ruleAuthor = Data::Morph::Rule->new
      ( "Foo::Author->newFromString"
      , sub { return shift->getAuthorName(); }
      );

  # define a rule for converting citation hashes to citation
  # objects
  my $ruleCitation = "Foo::Citation->new(work,author)";

  # * applies the author rule to each element of the array
  #   assigned to the authors key
  # * applies the citation rule to each element of the array
  #   assigned to the citations key
  my $rule = { authors   => [ $ruleAuthor ]
             , citations => [ $ruleCitation ]
             };

  $oYAML->load($someData, $rule);

=head2 Doing only Phase 1 or Phase 2

Serialization generally requires both phases, but there may be times when only one of the two phases is needed by an application. The C<Data::Morph> object provides separate functions for each phase so you can do one or both, as need be:

  # Phase 1 - load
  # load data into the default Perl data structures defined
  # by the serialization modules
  $oConvert->loadOuter($xData);
  $oConvert->thawOuter($xData);          # alias
  $oConvert->deserializeOuter($xData);   # alias

  # Phase 2 - load
  # convert Perl data generated by serialization modules to
  # the programmer-friendly representation
$oConvert->loadInner($xData, $rule); $oConvert->thawInner($xData, $rule); #alias $oConvert->deserializeInner($xData, $rule); #alias # Phase 2 - dump # dump from programmer representation to data structures # easily understood by serialization modules $oConvert->dumpInner($xData, $rule); $oConvert->freezeInner($xData, $rule); #alias $oConvert->serializeInner($xData, $rule); #alias # Phase 1 - dump # serialize data from easily understood format $oConvert->dumpOuter($xData); $oConvert->freezeOuter($xData); #alias $oConvert->serializeOuter($xData); #alias =head2 Transformation chains So far we have focused on conversions back and forth from string to programmer data. However, C<Data::Morph> can also be used to transform data from one programmer representation to another by chaining together calls to C<dumpInner(...)> and C<loadOuter(...)>: my $xMiddle = $oConvert->dumpInner($xStart, $oRule1); my $xEnd = $oConvert->loadOuter($xMiddle, $oRule2); #or my @aChain=($oRule1, $oRule2, $oRule3); foreach my $oRule (@aChain) { $xData = $oConvert->loadOuter($xData, $oRule); } =head1 EXPORTS Nothing is exported by default. You can optionally export the following functions: * C<makeRule(...)> =head1 BUGS and CAVEATS * When Phase II copied a hash, array, or object reference, all references to the copied object change in tandem. This, however, only applies to blessed and unblessed references to hashes and arrays. If you have other sorts of blessed or unblessed data, you will have to insure that references change in tandem manually. =head1 ROADMAP * provide better support for references to references to scalars and code references. * let C<Data::Morph> objects store a mime type corresponding to their serialization format * apply a rule to specific array elements or slice * apply a rule to all keys matching a regex * allow each C<Data::Morph> instance to store a library of named rules, so that applications do not need to keep track of which rules belong to which formats. 
Then you could do something like this to dump in multiple formats foreach ($oYAML, $oXML, $oJSON) { #some code to spit out the mime type for $_ print $_->dump($xData, 'personInUserFriendlyFormat'); #some code to mark the end of this mime type } =head1 SEE ALSO Similar modules are discussed above in the section titled L</Alternatives to Data::Morph>. =head1 AUTHOR Elizabeth Grace Frank-Backman =head1 COPYRIGHT Copyright (c) 2008- Elizabeth Grace Frank-Backman. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
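As a self-contained illustration of the three string-rule forms from the SYNOPSIS (C<'Mine::Foo'>, C<'Mine::Foo-E<gt>new'>, C<'Mine::Foo-E<gt>new(name,age)'>), the regex used by C<_splitStringRule> in the module code below can be exercised on its own. This sketch is not part of the module; C<splitRule> is a stand-in written here just to demonstrate the parse:

```perl
# Standalone sketch: parse the string-rule forms with the regex
# taken from Data::Morph::_splitStringRule.
use strict;
use warnings;

my $reRule = qr/^((?:\w+::)*\w+)(?:->(\w+)(?:\(([^\)]*)\))?)?$/;

sub splitRule {
    my ($sRule) = @_;
    $sRule =~ s/\s//g;           # the module strips whitespace first
    return ($sRule =~ $reRule);  # returns (class, method, params)
}

my @a = splitRule('Mine::Foo');
my @b = splitRule('Mine::Foo->new');
my @c = splitRule('Mine::Foo->new(name, age)');

print "$a[0]\n";              # Mine::Foo
print "$b[0] $b[1]\n";        # Mine::Foo new
print "$c[0] $c[1] $c[2]\n";  # Mine::Foo new name,age
```

Unmatched trailing groups come back as C<undef>, which is how the module distinguishes a bare class name ("bless the data") from a constructor call.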
The module code (Data/Morph.pm)
use strict;
use warnings;

package Data::Morph;
use Scalar::Util;
use Carp;

my $CLASS = __PACKAGE__;

use base 'Exporter';
our @EXPORT_OK = qw(makeRule);

my $MSG_BAD_SLICE = "Parameter list <%s> contains a bad array index: <%s>";

#==================================================================
# FUNCTIONS, I
#==================================================================

sub unblessCopy {
    my $xData = shift;
    return $xData unless Scalar::Util::blessed($xData);

    my $sDataRef = Scalar::Util::reftype($xData);
    if ($sDataRef eq 'ARRAY') {
        return [ @$xData ];
    } elsif ($sDataRef eq 'HASH') {
        return { %$xData };
    } elsif ($sDataRef eq 'SCALAR') {
        my $sTmp = $$xData;
        return \$sTmp;
    } elsif ($sDataRef eq 'REF') {
        my $sTmp = $$xData;
        return \$sTmp;
    } elsif ($sDataRef eq 'CODE') {
        # borrowed from Acme::Curse (author Moritz Lenz)
        return sub { goto &$xData };
    } else {
        return $xData;
    }
}

#==================================================================
# HELPER CLASSES
#==================================================================

my $RULE_CLASS = 'Data::Morph::Rule';

{
    package Data::Morph::Rule;

    sub new {
        my ($sClass, $xLoad, $xDump, $xPrep, $sDefaultParamName) = @_;
        if (! defined($xDump)) {
            $xDump = \&Data::Morph::unblessCopy;
        } elsif ($xDump eq '') {
            $xDump = undef;
        }
        my $hRule = { load      => $xLoad
                    , dump      => $xDump
                    , prep      => $xPrep
                    , paramName => $sDefaultParamName };
        return bless($hRule, $sClass);
    }

    sub getDefaultParamName { return shift->{paramName} }
    sub getDump             { return shift->{dump} }
    sub getLoad             { return shift->{load} }
    sub getPrepRule         { return shift->{prep} }

    sub dump {
        my ($self, $xData, $hReplace, $aPath) = @_;
        my $xDump = $self->getDump();
        return $xData unless defined($xDump);
        return &$xDump($xData, 0, $hReplace, $aPath)
            if (ref($xDump) eq 'CODE');

        #dump is the name of a method
        #print STDERR "Data::Morph::Rule::dump: <$xDump>\n";
        my $sEval = "\$xData->$xDump(\$hReplace, \$aPath)";
        return scalar eval($sEval);
    }

    sub load {
        my ($self, $xData, $hReplace, $aPath) = @_;
        my $xLoad = $self->getLoad();
        return $xData unless defined($xLoad);
        return &$xLoad($xData, 1, $hReplace, $aPath)
            if (ref($xLoad) eq 'CODE');

        my $xParams = $xData;
        if (ref($xData) eq '') {
            my $sDefault = $self->getDefaultParamName();
            $xParams = { $sDefault => $xData } if defined($sDefault);
        }
        return Data::Morph::_applyStringRule($xParams, $xLoad, 1);
    }
}

#==================================================================
# FUNCTIONS, II
#==================================================================

sub makeRule { return $RULE_CLASS->new(@_); }

#==================================================================
# CLASS METHODS
#==================================================================

sub newCustom {
    my ($sClass, $crDump, $crLoad) = @_;
    my $self = { loader => $crLoad, dumper => $crDump };
    return bless($self, $sClass);
}

sub newSerializer {
    my $sClass = shift @_;
    eval("require Data::Serializer; return 1") or do { return undef; };
    my $oSerializer = Data::Serializer->new(@_);
    my $crDump = sub {
        my $xData = shift;
        return $oSerializer->serialize($xData);
    };
    my $crLoad = sub {
        my $xData = shift;
        return $oSerializer->deserialize($xData);
    };
    return $sClass->newCustom($crDump, $crLoad);
}

sub new {
    my ($sClass, $sSerializer, $sDump, $sLoad) = @_;
    $sDump = 'Dump' unless defined($sDump);
    $sLoad = 'Load' unless defined($sLoad);
    #print STDERR "<$sDump> <$sLoad>\n";

    my ($crLoad, $crDump);
    if (defined($sSerializer)) {
        # don't import Dump - otherwise we would
        # redefine Data::Morph::dump and
        # set it to Data::Dumper::Dump
        eval("require $sSerializer;"
             . "\$crDump=\\&${sSerializer}::$sDump if \$sDump;"
             . "\$crLoad=\\&${sSerializer}::$sLoad if \$sLoad;"
             . "return 1;")
            or do { return undef; };
    }
    #print STDERR "<$crLoad> <$crDump>\n";
    return $sClass->newCustom($crDump, $crLoad);
}

#==================================================================
# PUBLIC OBJECT METHODS
#==================================================================

#synonyms
BEGIN {
    *freezeInner      = *dumpInner;
    *serializeInner   = *dumpInner;
    *freezeOuter      = *dumpOuter;
    *serializeOuter   = *dumpOuter;
    *freeze           = *dump;
    *serialize        = *dump;
    *thawInner        = *loadInner;
    *deserializeInner = *loadInner;
    *thawOuter        = *loadOuter;
    *deserializeOuter = *loadOuter;
    *thaw             = *load;
    *deserialize      = *load;
}

sub getDumper() { return shift->{dumper}; }
sub getLoader() { return shift->{loader}; }

# serialize, freeze
sub dump {
    my ($self, $xInner, $xRule) = @_;
    #print STDERR "dump: <$xInner>\n";
    return $self->dumpOuter($self->dumpInner($xInner, $xRule));
}

sub dumpInner {
    my ($self, $xInner, $xRule) = @_;
    #print STDERR "dumpInner: <$xInner>\n";
    my $hReplace = {};
    my $xDump = _applyRule($xInner, $xRule, 0, $hReplace, []);
    return _fixReferences($xDump, $hReplace, []);
}

sub dumpOuter {
    my ($self, $xOuter) = @_;
    my $crDump = $self->getDumper();
    #print STDERR "dumpOuter: <$xOuter> <$crDump>\n";
    return $crDump ? &$crDump($xOuter) : $xOuter;
}

# deserialize, thaw
sub load {
    my ($self, $sData, $xRule) = @_;
    return $self->loadInner($self->loadOuter($sData), $xRule);
}

sub loadInner {
    my ($self, $xOuter, $xRule) = @_;
    my $hReplace = {};
    my $xLoad = _applyRule($xOuter, $xRule, 1, $hReplace, []);
    return _fixReferences($xLoad, $hReplace, []);
}

sub loadOuter {
    my ($self, $sData) = @_;
    my $crLoad = $self->getLoader();
    return $crLoad ? &$crLoad($sData) : $sData;
}

#==================================================================
# PRIVATE OBJECT METHODS
#==================================================================

#==================================================================
# PRIVATE FUNCTIONS
#==================================================================

sub _applyArrayRule {
    my ($xData, $xRule, $bLoad, $hReplace, $aPath) = @_;

    #modify copy of array, not original array
    #but do so in a way that preserves references
    my $idData = Scalar::Util::refaddr($xData);
    my $aData = $hReplace->{$idData};
    unless (defined($aData)) {
        $hReplace->{$idData} = $aData = [ @$xData ];
    }
    #print STDERR "_applyArrayRule: <$xData> <$idData> <$aData>\n";

    my $iRuleCount = $#$xRule;
    for (my $i=0; $i<=$#$xData; $i++) {
        my $xValueRule = $i < $iRuleCount ? $xRule->[$i] : $xRule->[-1];
        $aData->[$i] = _applyRule($xData->[$i], $xValueRule, $bLoad
                                  , $hReplace, $aPath);
    }
    #print STDERR "_applyArrayRule: <@$aData>\n";
    return $aData;
}

sub _applyHashRule {
    my ($xData, $xRule, $bLoad, $hReplace, $aPath) = @_;

    # scan to see if any keys apply to $xData
    my ($k, $v);
    my $bModified = 0;
    while (($k,$v) = each(%$xData)) {
        next unless exists($xRule->{$k});
        $bModified = 1;
        last;
    }
    return $xData unless $bModified;  #internal cursor already reset
    while (each(%$xData)) {};         #reset internal cursor

    # copy data so that changes to hash key values don't affect the
    # original
    my $idData = Scalar::Util::refaddr($xData);
    my $hData = $hReplace->{$idData};
    unless (defined($hData)) {
        $hReplace->{$idData} = $hData = { %$xData };
    }
    #print STDERR "_applyHashRule: <$xData> <$idData> <$hData>\n";

    while (($k,$v) = each(%$xData)) {
        if (exists($xRule->{$k})) {
            my $xValueRule = $xRule->{$k};
            #print STDERR "_applyHashRule: <$k> <$bLoad>\n";
            #print STDERR "_applyHashRule: valueRule=<"
            #    . (defined($xValueRule) ? $xValueRule : 'undef')
            #    . ">\n";
            $hData->{$k} = _applyRule($v, $xValueRule, $bLoad
                                      , $hReplace, $aPath);
            #print STDERR "_applyHashRule: data=<"
            #    . (defined($hData->{$k}) ? $hData->{$k} : 'undef')
            #    . ">\n";
        } else {
            $hData->{$k} = $v;
        }
    }
    return $hData;
}

sub _applyRule {
    my ($xData, $xRule, $bLoad, $hReplace, $aPath, $bNewRule) = @_;
    return $xData unless defined($xRule);

    #if we've replaced the rule for processing the data, then we
    #needn't worry about circles.
    if (!$bNewRule && ref($xData)) {
        #stop traversal at circularities
        foreach (@$aPath) { return $xData if ($xData eq $_); }
        $aPath = [ @$aPath, $xData ];
    }

    my $sRuleRef = ref($xRule);
    #print STDERR "_applyRule: <$sRuleRef>\n";

    if ($sRuleRef eq 'CODE') {
        #apply a function to load and unload the data
        return &$xRule($xData, $bLoad, $hReplace, $aPath);
    } elsif ($sRuleRef eq '') {
        #rule = bless/unbless data
        return _applyStringRule($xData, $xRule, $bLoad);
    } elsif ($sRuleRef eq 'ARRAY') {
        if (ref($xData) eq 'ARRAY') {
            return _applyArrayRule($xData, $xRule, $bLoad
                                   , $hReplace, $aPath);
        }
    } elsif ($sRuleRef eq 'HASH') {
        if (ref($xData) eq 'HASH') {
            return _applyHashRule($xData, $xRule, $bLoad
                                  , $hReplace, $aPath);
        }
    } elsif (Scalar::Util::blessed($xRule)) {
        #apply a custom rule after loading the data
        my $xValueRule = $xRule->getPrepRule();
        #print STDERR "_applyRule: load=<$bLoad> prepData=<"
        #    . (defined($xValueRule) ? $xValueRule : 'undef') . ">\n";
        if ($bLoad) {
            if (defined($xValueRule)) {
                $xData = _applyRule($xData, $xValueRule, $bLoad
                                    , $hReplace, $aPath, 1);
            }
            return $xRule->load($xData, $hReplace, $aPath);
        } else {
            $xData = $xRule->dump($xData, $hReplace, $aPath);
            return defined($xValueRule)
                ? _applyRule($xData, $xValueRule, $bLoad
                             , $hReplace, $aPath)
                : $xData;
        }
    }
    return $xData;
}

sub _applyStringRule {
    my ($xData, $sRule, $bLoad) = @_;
    #print STDERR "_applyStringRule: <$bLoad> <$sRule>\n";

    if ($bLoad) {
        # whitespace is removed so we don't have to fix parameter
        # names with leading or trailing whitespace
        my ($sClass, $sMethod, $sParams) = _splitStringRule($sRule);
        if ($sMethod) {
            my $aParams = _buildParams($sParams, $xData);
            #print STDERR "class=<$sClass> method=<$sMethod> "
            #    . "params=<@$aParams>\n";

            # if there is a bug in the evaluated code and the
            # bug occurs in array context, then () will be returned
            # rather than a scalar. Since callers to this method
            # always expect a scalar result, we force the scalar
            # context using scalar - thanks PerlMonk [Anno]
            # for the suggestion.
            return scalar eval("$sClass->$sMethod(\@\$aParams)");
        }
        return $xData unless ref($xData);
        return bless($xData, $sClass);
    }
    return unblessCopy($xData);
}

sub _buildParams {
    my ($sParams, $xData) = @_;
    my $sDataRef = ref($xData);
    my @aParams;
    if (defined($sParams)) {
        my @aParamNames = split(/,/, $sParams);
        if ($sDataRef eq 'HASH') {
            foreach (@aParamNames) {
                #print STDERR "param=<$_>\n";
                push @aParams, $xData->{$_};
            }
        } elsif ($sDataRef eq 'ARRAY') {
            foreach my $sName (@aParamNames) {
                if ($sName =~ /^(\d+)\.\.$/) {
                    $sName = "$1..$#$xData";
                } elsif ($sName !~ /^\d+(?:\.\.\d+)?$/) {
                    #bad data - skip it
                    carp(sprintf($MSG_BAD_SLICE, $sParams, $sName));
                    next;
                }
                push @aParams, eval "\@\$xData[$sName]";
            }
        } else {
            push @aParams, $xData;
        }
    } else {
        push @aParams, $xData;
    }
    return \@aParams;
}

sub _fixReferences {
    my ($xData, $hReplace, $aPath) = @_;

    # note: it is very important that pure scalar data be
    # returned immediately. Past versions had problems with
    # Storable because their data got stringified when it
    # was used as a hash key. This changed internal flags on
    # the data and caused both Storable and JSON to dump the
    # data as a string even though its internal memory
    # representation was as an integer.
    my $sRef = Scalar::Util::reftype($xData);
    return $xData unless $sRef;

    #stop traversal at circularities
    foreach (@$aPath) { return $xData if ($xData eq $_); }
    $aPath = [ @$aPath, $xData ];

    #check to see if the reference has already been replaced
    my $idData = Scalar::Util::refaddr($xData);
    my $xReplace = $hReplace->{$idData};
    return $xReplace if defined($xReplace);
    #print STDERR "_fixReferences: <$idData> <$xData>\n";

    if ($sRef eq 'HASH') {
        while (my ($k,$v) = each(%$xData)) {
            $xData->{$k} = _fixReferences($v, $hReplace, $aPath);
        }
    } elsif ($sRef eq 'ARRAY') {
        for (my $i=0; $i<=$#$xData; $i++) {
            $xData->[$i] = _fixReferences($xData->[$i], $hReplace, $aPath);
        }
    }
    return $xData;
}

sub _splitStringRule {
    my ($sRule) = @_;
    $sRule =~ s/\s//g;
    return ($sRule =~ /^((?:\w+::)*\w+)(?:->(\w+)(?:\(([^\)]*)\))?)?$/);
}

#==================================================================
# MODULE INITIALIZATION
#==================================================================

1;
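The C<$hReplace> bookkeeping used by C<_applyArrayRule>, C<_applyHashRule>, and C<_fixReferences> above can be shown in isolation. The following standalone sketch is not part of the module; C<convertArray> and its upper-casing "conversion" are made up for illustration. It demonstrates why keying the replacement map by C<Scalar::Util::refaddr> makes data that shared a reference before conversion still share one reference afterwards:

```perl
# Standalone sketch of the refaddr-keyed replacement map idea:
# when two slots hold the same array reference, both slots must
# end up holding the same converted copy.
use strict;
use warnings;
use Scalar::Util qw(refaddr);

my %hReplace;   # refaddr of original array -> its converted copy

sub convertArray {
    my ($aData) = @_;
    my $id = refaddr($aData);
    # reuse the copy if this exact array was already converted
    return $hReplace{$id} if exists $hReplace{$id};
    my $aCopy = [ map { uc } @$aData ];  # toy "conversion"
    $hReplace{$id} = $aCopy;
    return $aCopy;
}

my $aShared = [ 'a', 'b' ];
my $hOuter  = { first => $aShared, second => $aShared };

my %hConverted
    = map { $_ => convertArray($hOuter->{$_}) } keys %$hOuter;

# both keys now point at one and the same converted array
print refaddr($hConverted{first}) == refaddr($hConverted{second})
    ? "shared\n" : "copied twice\n";   # prints "shared"
```

Without the map, each encounter of C<$aShared> would produce an independent copy and the sharing would silently be lost, which is the caveat noted in BUGS and CAVEATS.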