in reply to RFC: Managing data in multiple formats
The text below was moved from the original post so it doesn't obscure the request for comment above.
Existing CPAN modules and test plan:
Modules currently on CPAN

CPAN has many, many modules dedicated to data serialization and transformation. The closest module I've found to this one is Data::Serializer. That module provides a common interface for dumping and loading various serialization formats, but it offers no systematic support for situations where the internal, programmer-friendly data structure differs from the data structure that is actually serialized. This issue is discussed in more depth in the POD documentation below.
Testing strategy

The Data::Morph module has fewer than 500 lines of code, but it does a lot in those 500 lines and requires particularly intensive testing. The test suite currently contains 882 tests covering combinations of module, data sample, and rule. Each combination is tested to verify that (a) the dump string matches an expected value, (b) loading the expected dump string regenerates the original internal representation, and (c) the process of dumping does not modify the original data.
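For readers curious what such a round-trip test looks like, here is a minimal, self-contained sketch of the pattern using only core modules. The helper names (my_dump, my_load) are invented for illustration; the real suite drives Data::Morph's own dump and load methods across many module/sample/rule combinations:

```perl
use strict;
use warnings;
use Storable qw(dclone);
use Data::Dumper;

# Stand-ins for one dump/load pair under test.
sub my_dump {
    local $Data::Dumper::Sortkeys = 1;   # deterministic key order
    local $Data::Dumper::Indent   = 1;
    return Dumper(shift);
}
sub my_load {
    my $VAR1;                            # Dumper output assigns to $VAR1
    return eval shift;
}

my $original = { name => 'trombones', quantity => 76 };
my $snapshot = dclone($original);        # deep copy taken before dumping

my $dumped   = my_dump($original);       # (a) dump matches an expected value
my $reloaded = my_load($dumped);         # (b) reloading restores the data

# (c) dumping must not have modified the original
die "dump modified the original" unless my_dump($original) eq my_dump($snapshot);
die "round trip failed"          unless my_dump($reloaded) eq $dumped;
print "all round-trip checks passed\n";
```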
I am especially concerned about the documentation. Without good documentation, this module is nearly useless. The documentation is rather long: it tries to cover the concept behind the module, the role of the module among the many, many CPAN modules for data conversion and manipulation, and, of course, the specifics of defining data transformations with this module. Despite its length, I suspect that many things are still unclear. I also wonder whether I need all of the material I've added to give the module context.
The POD documentation (Data/Morph.pod)
=head1 NAME

B<Data::Morph> - manage deeply nested data and compound objects with multiple serialization and in-memory formats.

=head1 SYNOPSIS

Associate a serialization protocol with a C<Data::Morph> object:

  my $oYAML       = Data::Morph->new('YAML');
  my $oCustom     = Data::Morph->newCustom(\&freezeMe, \&thawMe);
  my $oSerializer = Data::Morph->newSerializer(serializer => 'XML');

Define a rule for converting in-memory data to and from a format, using one of the following rule definition approaches:

  # bless the data when you read it in
  # strip the blessing when you read it out
  $rule = 'Mine::Foo';

  # pass the data to a constructor when you read it in
  # strip the blessing when you read it out
  $rule = 'Mine::Foo->new';

  # select data from a hash when you read it in
  # strip the blessing when you read it out
  $rule = 'Mine::Foo->new(name,age)';

  # select data from an array when you read it in
  # strip the blessing when you read it out
  $rule = 'Mine::Foo->new(2,4..,3,1)';

  # define your own rule for reading data in and out
  sub myConvertFunc {
      my ($xData, $bLoad, $hReplace, $aPath) = @_;
      # ... do stuff ...
      return $xConverted;
  }
  $rule = \&myConvertFunc;

  # read data in bottom up using a rule of your choice
  # dump data top down, possibly using a different rule
  $rule = Data::Morph::Rule->new($load, $dump, $rule, $default);
  $rule = Data::Morph::makeRule($load, $dump, $rule, $default);
  {
      use Data::Morph qw(makeRule);
      $rule = makeRule($load, $dump, $rule, $default);
  }

Then use the object to dump and load data:

  # To dump the data ($rule is optional)
  $oYAML->dump($someData, $rule);
  $oCustom->dump($someData, $rule);
  $oYAML->freeze($someData, $rule);      # alias for dump
  $oYAML->serialize($someData, $rule);   # alias for dump

  # To load the data ($rule is optional)
  $oYAML->load($someData, $rule);
  $oCustom->load($someData, $rule);
  $oYAML->thaw($someData, $rule);        # alias for load
  $oYAML->deserialize($someData, $rule); # alias for load

  # For additional methods, see L</Converting data>.
=head1 DESCRIPTION

This module provides tools that help you manage data that needs multiple serialization or in-memory representations. Its goal is to remove the gruntwork and pro-forma flow-of-control code so that you can focus on the actual formats and conversion logic you want.

First, what this module does I<not> do. It does not automagically convert any data to any other data. Any tool that tried to do that would be so general and bloated that using it would be like hammering a nail with a jack-hammer. Rather, this module provides a framework for managing the pipeline from object, to data massaging routine, to serialization module, and back again. It also provides several tools for (a) defining how to load and dump data in a particular serialization format and (b) massaging data before dumping and after loading.

=head2 Why multiple formats for the same data?

CPAN provides support for a wide variety of serialization formats, but not all of these modules can deliver data in exactly the format you want. That means you need to massage the incoming and outgoing data. Massaging this data means that you need at least two formats in addition to your normal programmer-friendly in-memory representation: the intermediate massaged format and the actual serialized form of the data.

The need to support multiple formats arises from many other directions as well:

* multiple use cases: data dumps for debugging, pretty user-friendly formats that are easy for end users to edit, and serialization for interprocess communication, to name the most common.

* internal format changes: sometimes the internal format of data changes in ways that have no real effect on configuration data. I want to keep the configuration and data files the same, but the internal format no longer matches the data structures implied by the end-user format.
* backwards compatibility: configuration and data files have changed to support new features, yet I still want (or need) to support older data file formats that may no longer match the internal data structures.

* prototyping: several different formats are being considered, but it isn't clear which one will work out best for use case X. Clients and developers can sometimes have a hard time envisioning how each file format would work for them unless they see a sample. This is especially true if the data stored in the data files is complex and richly interrelated. However, such complex data is also the hardest to hand-craft load and dump routines for. Thus prototyping is most expensive precisely when we need it most. A cheap and easy way to try out multiple formats could take a lot of the guesswork out of these choices.

* role-based tools: a lot of serialization functionality happens after the serialization (encryption, compression, and interprocess communication being just a few examples). Although these tools don't need multiple formats for their own family of objects, they do benefit from the existence of load and dump tools that can handle a wide range of relationships between in-memory data structures and serialization formats. The more capable a load or dump method is, the wider the range of objects it can load and dump; and the wider that range, the more valuable these role-based tools become.

=head2 Alternatives to C<Data::Morph>

Although CPAN has many, many modules for converting data from one format to another, all of the modules I was able to find had one or more of the following limitations:

* they can handle serialization for only a limited set of data structures. For example, L<Data::Any>, L<Data::Table>, and L<Tangram> each handle a variety of serialization formats, but they expect in-memory data to be unblessed hashes or arrays of hashes.
They treat the values assigned to hash keys as strings, numbers, or blobs even if the serialization format (e.g. XML) permits deeply nested data structures.

* they provide support for the first part of the pipeline, but not the second. Although various serialization formats are supported, any data massaging happens before or after the use of the module. Examples of CPAN modules with this limitation include L<Data::Serializer>, L<Data::Any>, and L<Data::Table>, among others.

* they support massaging of data in memory, but only for blessed objects, and only one such massage function is supported per class. For example, L<JSON> and L<YAML> both support class-specific load and dump routines: L<JSON> looks for a method named C<TO_JSON>; L<YAML> looks for methods named C<yaml_dump> and C<yaml_load>. L<Pixie>, and L<Data::Freezer> which is based on it, also take this approach to customization.

C<Data::Morph> is designed to pick up where each of these modules leaves off. It is intended as a complement to these modules rather than a replacement. It knows how to work together with them, so solutions initially implemented with these modules can continue to use that implementation alongside C<Data::Morph>.

=head3 Scalability issues with per-class serialization methods

One significant advantage of using C<Data::Morph> as a framework for your data transformations is that it gives you choices that can make the growth of your code base easier to manage. Class-specific load and dump routines can be quite convenient when one is dealing with a small library of objects, but they may not scale well as a code base grows. Maintaining a set of such routines across tens or hundreds of classes can get confusing quite quickly. Such a large number of routines might be easier to manage if they were centralized in some common store. This is not an option when modules limit customization to methods defined within a class.
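To make the per-class hook concrete, here is a minimal sketch using the core JSON::PP module (the class name C<My::Point> is invented for illustration). Note that the class gets exactly one hard-wired hook, C<TO_JSON>, with no way to vary the dump by use case:

```perl
use strict;
use warnings;
use JSON::PP;

# A tiny class exposing the single dump hook that JSON looks for.
package My::Point;
sub new     { my ($class, %args) = @_; return bless { %args }, $class }
sub TO_JSON { my $self = shift; return { x => $self->{x}, y => $self->{y} } }

package main;

# convert_blessed makes encode() call TO_JSON on blessed references;
# canonical sorts hash keys so the output is deterministic.
my $json  = JSON::PP->new->canonical->convert_blessed;
my $point = My::Point->new(x => 1, y => 2);
print $json->encode($point), "\n";   # {"x":1,"y":2}
```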
C<Data::Morph> is a bit more scalable because you have full control over where you define your data massaging rules. You can define your rules within the class's module if you wish. If that becomes unwieldy, you can move your rules to a central place, if that would be more convenient.

As classes mature and get reused, the per-class customization strategy runs into another problem: there can be only one such method per class. This causes difficulties when there are multiple use cases for a particular serialization format. For example, one might sometimes use YAML for IPC, sometimes for a user-friendly layout, and at yet other times for debugging dumps. IPC demands compactness and efficiency and often replicates the in-memory representation of the data without any massaging; if the data is only being used in memory, integer codes can easily be used in place of strings. User-friendly layouts, on the other hand, generally need much more verbosity: integer codes need to be translated into human-readable strings, and data structures that would generate a lot of arcane syntax may need to be replaced with data structures that can be represented with a subset of human-friendly syntax. Debugging dumps usually require something between the verbosity of a human-friendly data file and a dense IPC string. A single C<dump_YAML> method would have difficulty supporting all three of these use cases: IPC, debugging output, and user-friendly data files.

With C<Data::Morph> you have the option of using either the class's own dump and load rules or an external rule imposed on the class from the outside. It does not look for a method with a specific name, so you can define as many per-class methods for a serialization format as you need. Furthermore, if you don't have the rights to change the class definition, you can define rules and functions for massaging the data that are completely external to the class.
C<Data::Morph> can use those as easily as it uses the per-class methods.

=head2 The data conversion pipeline

The framework divides the conversion process into two phases:

* Phase 1: conversion to and from the serialization format and a Perl data structure that exactly parallels the serial format. We will call this the "outer" Perl data format. Typically this phase is configured by passing the name of an existing CPAN module or by passing references to functions defined in existing CPAN modules.

* Phase 2: customized conversion between the "outer" Perl data format and an internal, programmer-centric Perl data format. We will call this the "inner" Perl data format. Its purpose is to customize the output of standard modules into a form more suitable to the programmer's needs. Because the customization is highly application specific, it is configured with a programmer-defined rule rather than with "canned" modules.

=head3 Defining phase 1: serialization format

Phase 1 takes care of serializing and deserializing data. To implement the serialization, C<Data::Morph> needs to be given two subroutines:

* a dump routine that accepts any scalar or Perl reference as its sole parameter

* a load routine that accepts a string or other stream as its sole parameter

These two routines are defined when the C<Data::Morph> object is constructed.

=head4 Data::Morph object constructors

The serialization format used by a C<Data::Morph> object is chosen when the object is constructed. There are three different constructors: C<new(...)>, C<newSerializer(...)>, and C<newCustom(...)>.

=over

=item C<new(...)>

Selects dump and load routines using the names of modules and functions.

  # $sModule  the name of a module that dumps and loads data
  #           Required
  #
  # $sDump    the name of a function within $sModule. The
  #           function should take the same parameters as
  #           YAML::Dump or Data::Dumper, i.e.
  #           &$sDump($xData);
  #
  #           Optional, defaults to 'Dump'
  #
  # $sLoad    the name of a function within $sModule. The
  #           function should take the same parameters as
  #           YAML::Load, i.e. &$sLoad($xData);
  #
  #           Optional, defaults to 'Load'

  $oConvert = Data::Morph->new($sModule, $sDump, $sLoad);

=item C<newSerializer(...)>

  $oConvert = Data::Morph->newSerializer(key => val, ...);

C<newSerializer(...)> converts data using an instance of L<Data::Serializer>. The parameters passed to the constructor are used to build the L<Data::Serializer> object.

=item C<newCustom(...)>

This method gives one the fullest possible control over conversion. You provide code references to subroutines for dumping and loading data.

  # $crDump   a code reference to a subroutine or closure
  #
  # $crLoad   a code reference to a subroutine or closure

  $oConvert = Data::Morph->newCustom($crDump, $crLoad);

=back

=head4 Recipes for specific serialization formats

CPAN provides a number of excellent modules for converting back and forth between data structures and strings, so you won't normally need to write your own functions.

=over

=item L<YAML>

Since L<YAML> provides a C<Dump(...)> and a C<Load(...)> function, we can just pass the name of the module - it will be automatically required:

  my $oConvert = Data::Morph->new('YAML');

=item L<JSON>

JSON is an alternative to YAML. It is designed to capture data using very simple data types: scalars, arrays, hashes, and combinations of the two. It does not have a way to encode class identity in the string output.

Out of the box, the JSON module can only accept arrays and hashes as input. Other kinds of input (objects, pure scalars) will cause it to fail. It also dumps hashes in random order, which can make it difficult to compare dumped files. To fix these issues one must call the dump and load routines with an options hash. We also need to set the C<-convert_blessed_universally> flag. Normally, JSON only dumps blessed objects that have a C<TO_JSON> method; with C<-convert_blessed_universally>, JSON provides a default definition for that method.
To use these configured calls with C<Data::Morph>, we wrap them in subs and build the C<Data::Morph> object with the C<newCustom(...)> constructor:

  use JSON 2.14 qw(-convert_blessed_universally);

  sub jsonDump {
      return JSON::to_json(shift
          , { canonical       => 1
            , allow_nonref    => 1
            , allow_blessed   => 1
            , convert_blessed => 1
            });
  }
  sub jsonLoad {
      return JSON::from_json(shift, { allow_nonref => 1 });
  }

  my $oConvert = Data::Morph->newCustom(\&jsonDump, \&jsonLoad);

=item L<Data::Dumper>

L<Data::Dumper> provides a dump routine; you must, however, provide your own load routine:

  use Data::Dumper;

  sub evalDump {
      my $sEval = shift;
      my $VAR1;
      return eval($sEval);
  }

  my $oConvert = Data::Morph->newCustom(\&Data::Dumper::Dumper, \&evalDump);

=item L<Storable>

L<Storable> provides C<freeze(...)> and C<thaw(...)> functions for serializing data. Since these functions expect a reference to the data rather than the data itself, we must wrap their calls in subroutines and use the C<newCustom(...)> constructor:

  use Storable;

  sub freezeBinary {
      my $xData = shift;
      # pass a reference to the data
      return Storable::freeze(\$xData);
  }
  sub thawBinary {
      my $xData = shift;
      # dereference the thawed data
      return ${Storable::thaw($xData)};
  }

  my $oConvert = Data::Morph->newCustom(\&freezeBinary, \&thawBinary);

More elaborate uses of L<Storable> are also possible. For example, by passing references to closures one could build a converter that reads and writes a specific file:

  use Storable;

  my $sOutfile = 'foo.dat';
  my $crDump = sub {
      my $xData = shift;
      return Storable::nstore(\$xData, $sOutfile);
  };
  my $crLoad = sub {
      return ${Storable::retrieve($sOutfile)};
  };

  my $oConvert = Data::Morph->newCustom($crDump, $crLoad);

=item L<Data::Serializer>

L<Data::Serializer> provides a standard interface for data serialization, encryption, and compression. Because of the large number of serialization formats supported by this module, C<Data::Morph> provides a constructor dedicated to it.
It accepts any parameters that would normally be passed to the constructor of L<Data::Serializer> and constructs a serializer object with those parameters:

  $oConvert = Data::Morph->newSerializer(serializer => 'YAML');
  $oConvert = Data::Morph->newSerializer(serializer => 'XML');
  $oConvert = Data::Morph->newSerializer(serializer => 'JSON');

=back

=head3 Defining phase 2: conversion rules

This module supports rules for converting data from streams and files containing both single and multiple objects.

=head4 Handling of complex data

Developing data massage routines can get tricky when data is deeply nested or involves shared references, circular references, or part-container relationships. Making it easier to safely and correctly massage such data is one of the goals of the C<Data::Morph> package. Its tools for defining data transformations are integrated with a data navigation engine that automatically performs the following tasks:

* detection of circular references (and prevention of the infinite loops caused by attempts to navigate them)

* protection of shared references, even when transformations cause data to be copied

* preservation of the original data: the transformation leaves the source data unchanged even if the source data includes numerous references

=head4 Data transformation using string rules

String conversion rules let you define a blessing or constructor that should be used to create objects. They are primarily used to dump and load data that is serialized as unblessed objects. You can use a string rule to

* bless loaded data into a class

* pass the loaded data as is to a constructor of your choice

* massage the loaded data into a set of parameters to be passed to the constructor

The string conversion rule has a very simple implicit dump rule: it strips the blessing from any blessed array, hash, or scalar reference.
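The implicit behavior just described can be sketched in plain Perl. This is a simplified model of what a bare C<"Animal::Lion"> string rule would do, not Data::Morph's actual implementation (which also handles nesting and shared references):

```perl
use strict;
use warnings;
use Scalar::Util qw(reftype);

# Load side of a plain "Class::Name" string rule: bless the data.
sub rule_load {
    my ($xData, $sClass) = @_;
    return bless $xData, $sClass;
}

# Dump side: make a shallow, unblessed copy, so the original
# object keeps its blessing and is left unmodified.
sub rule_dump {
    my $xData = shift;
    my $sType = reftype($xData) // '';
    return { %$xData } if $sType eq 'HASH';
    return [ @$xData ] if $sType eq 'ARRAY';
    return \( my $copy = $$xData ) if $sType eq 'SCALAR';
    return $xData;     # non-reference data passes through unchanged
}

my $oLion  = rule_load({ furColor => 'tawny' }, 'Animal::Lion');
my $hPlain = rule_dump($oLion);
# $oLion is still blessed; $hPlain is an ordinary hash reference
```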
This dump rule should be sufficient if your constructor merely cleans up loaded data: checks for bad data, normalizes the representation of values, and sets defaults. However, constructors sometimes modify the basic arrangement of data - converting incoming hashes into inside-out objects, or using the data to generate new arrays and hashes with specialized key names. If the object is meant to be opaque and has a fundamentally different structure from the loaded data, then a dump routine that just unblesses the data won't be enough. You will need to define a custom dump routine. To couple a string rule with a custom dump routine, you must create an instance of C<Data::Morph::Rule>. Please see the section below for details.

  # blesses $xData on load
  # strips blessing on dump
  $rule  = "Animal::Lion";
  $xData = $oConvert->load($sData, $rule);
  $oConvert->dump($xData, $rule);

  # on load: converts $sData to a Perl data structure $xLoaded
  #          calls Animal::Lion->new($xLoaded)
  # on dump: strips blessing
  $rule  = "Animal::Lion->new";
  $xData = $oConvert->load($sData, $rule);
  $oConvert->dump($xData, $rule);

  # on load: massages $xLoaded into a parameter list
  #          Animal::Lion->new($xLoaded->{furColor}
  #                           , $xLoaded->{pawSize});
  # on dump: strips blessing
  $rule  = "Animal::Lion->new(furColor,pawSize)";
  $xData = $oConvert->load($sData, $rule);
  $oConvert->dump($xData, $rule);

You can also use the parameter list to reorder an array of parameters. You may use any parameter list you would use to create an array slice, e.g. C<< @aData[1..4,3,2] >>. In addition, one can use C<< N.. >> to refer to all parameters from N to the end of the array, inclusive.

  # reorders array parameters
  # N.. means insert parameters from N to the end of the array
  $rule = "Animal::Lion->new(2,3,1,0)";
  $rule = "Animal::Lion->new(2..,1,0)";

=head4 Array conversion rules

You can define a rule that applies to all elements of an array by defining a one-element array reference containing that rule.
Any kind of rule (string, hash, C<Data::Morph::Rule> object, or function) may be used this way. The rule will be applied I<only> to array elements:

  # apply the rule to all elements of an array
  my $rule = [ $ruleForEachElement ];

=head4 Hash conversion rules

You can define a rule for specific hash keys by creating a hash whose values are rules. As with array rules, any kind of rule can be used this way. A hash rule looks like this:

  # apply $rule1 to the hash key batman
  # apply $rule2 to the hash key robin
  # no rule defined for other keys (Phase 2 is a no-op)
  my $rule = { batman => $rule1, robin => $rule2 };

=head4 Massaging data with functions

Instead of a string, you can define a function to handle the conversion of the Phase 1 data. Function rules are typically used in one of two very different ways: simple data cleanup that needs to be applied to all members of an array or to specific hash keys, and complex object conversions.

There are two ways to use functions in data transformation rules: you can write a function and use it to construct a C<Data::Morph::Rule> object, or you can use it as a rule in its own right. This section focuses on how to decide which use is best for you and explores some issues to consider if you choose to use a function as a rule in its own right.

A function acting as a rule in its own right has up to four parameters. Unless you are doing something particularly fancy with nested data and references, you can probably ignore the last two of these, the C<$hReplace> and C<$aPath> parameters:
  $rule = sub {
      my ($xData, $bLoad, $hReplace, $aPath) = @_;
      if ($bLoad) {
          # massage $xData into $xInner
          return $xInner;
      } else {
          # $xData is the inner (programmer) representation
          # massage $xData into $xOuter
          return $xOuter;
      }
  };

  my $sData     = $oConvert->dump($xData, $rule);
  my $xReloaded = $oConvert->load($sData, $rule);

=head5 Function rules used for simple data cleanup

Function rules used for simple data cleanup are usually applied to all elements of an array or to selected hash keys. For this reason they are most often found embedded in array and hash rules. For example, if you were reading in a YAML string that defines an array of message template strings, you might want one encoding inside your program and another in the dumped string. To apply the encoding to each element of the array, you might define a function rule like this:

  use Encode;

  sub recode {
      my ($xData, $bLoad) = @_;
      return $bLoad ? decode("utf8", $xData)
                    : encode("utf8", $xData);
  }

  # [ ... ] says apply the rule to each array element
  $rule = [ \&recode ];
  $aDecodedData = $oConvert->load($sData, $rule);
  $sEncodedData = $oConvert->dump($aDecodedData, $rule);

=head5 Using function rules to dump complex object graphs

Another way to use functions is to pass them in a C<Data::Morph::Rule>. If you have an object that stores references to hashes, arrays, and other objects, your objects collectively form an object graph. C<Data::Morph> can navigate that graph if you configure a C<Data::Morph::Rule> object for top-down dumping and bottom-up loading. See C<Data::Morph::Rule> for more information.

You can, of course, also define a function and use it alone for the dumping and loading.
However, if you do that, you will have to manually handle all of the graph navigation - including ensuring that (a) transformations do not alter the original data, (b) objects that share references continue to share references after they are transformed, and (c) your load and dump routines do not end up in infinite loops because of circular references.

If you choose to completely manage the data navigation involved in dumping and loading, and you want your routine to play nicely with other objects that are being dumped, you will need to know how to use the C<$hReplace> and C<$aPath> parameters of the data transformation function.

C<$aPath> stores the path of references that have been navigated so far. You can generally ignore it unless your custom function is going to navigate further into the depths of C<$xData> and you are concerned about circular chains of references.

C<$hReplace> keeps track of references that contain transformed data. When data is transformed, it should leave the original data unchanged. This means that an array, hash, or object reference storing transformed data needs to be copied and then transformed, rather than being modified in place. Sometimes data structures store the same reference in multiple places, and the fact that reference X is used in three places may be significant. If one copies a hash or array, one must also find all the other places where that hash or array is used and change them to refer to the new copy. C<Data::Morph> keeps track of all of the copied references in the C<$hReplace> hash. If you need to copy an array or hash and care that other places referencing it also get changed, then you must add an entry to C<$hReplace> for each array or hash reference you copy. The key of each C<$hReplace> entry is the value returned by passing the original reference to C<Scalar::Util::refaddr>; the value is the new reference.

=head5 Using function rules to dump simple objects

If your object is quite simple (e.g.
all values are scalars), the choice between using a function in its own right and using it as part of a C<Data::Morph::Rule> object is pretty much a matter of style.

For example, suppose you have a detailed database of books. Your application lets each user keep a list of their favorite books. The list contains only the name of each book, but when you load it into memory you look up each name in the database. When you dump the list, you just want the names. So even though the data inside the object may be very complex, the data you actually dump and load is just a simple string. For a case like this there are no complex object graphs to navigate, so you can handle the whole dump and load process quite easily with a simple function rule like this:

  sub dumpOrLoadABook {
      my ($xData, $bLoad) = @_;
      if ($bLoad) {
          # ... look up book with favorite DBx module
          return $oBook;
      } else {
          return $xData->getBookName();
      }
  }

  # [ ... ] says apply the rule to each array element
  $rule = [ \&dumpOrLoadABook ];
  $aBooks     = $oConvert->load($sData, $rule);
  $sBookNames = $oConvert->dump($aBooks, $rule);

But even in a simple case like this, you still might want to use a C<Data::Morph::Rule> object. A rule object would let you take advantage of the fact that the dump routine only needs a getter method. It also lets you define a dedicated load routine:

  sub loadBook {
      my $xData = shift;
      # ... look up book with favorite DBx module
      return $oBook;
  }

  $rule = [ Data::Morph::Rule->new(\&loadBook, 'getBookName') ];

=head4 C<Data::Morph::Rule>

If you would like something between the full control of a function and the simplicity of a string rule, you can define a rule object.

  # $load     a string or function rule, used only for
  #           loading.
  #
  #           This parameter is optional. If missing, the
  #           load rule is a no-op, i.e.
  #           sub { my $xData=shift; return $xData; }
  #
  # $dump     one of four possible values:
  #
  #           - a string rule (see above)
  #           - a function rule (see above)
  #           - the name of a method to call on the data
  #             being converted. The data being converted is
  #             assumed to be an object. The method should
  #             expect the following parameters:
  #
  #             sub {
  #                 my ($hReplace, $aPath) = @_;
  #                 # convert the data
  #                 return $xConverted;
  #             }
  #
  #           - the empty string - causes the data to be
  #             returned as is, with no conversion.
  #
  #           This parameter is optional. If missing, the
  #           dump rule is the same as a string rule's, i.e.
  #           if the data is a blessed reference, it makes a
  #           shallow copy of the referenced data but omits
  #           the blessing from the copy.
  #
  # $rule     a rule for bottom-up loading and top-down
  #           dumping. During the load process, the rule is
  #           used to prepare the data before passing it to
  #           the load rule. When data is dumped, the rule
  #           is applied to the I<output> of the $dump
  #           function.
  #
  #           This parameter is optional. If missing, the
  #           data will be passed to the load rule as is.
  #
  # $default  the default key name. If present, any scalar
  #           data being loaded is converted to a hash
  #           reference: { $default => $xData }

  Data::Morph::Rule->new($load, $dump, $rule, $default);

The C<$load> and C<$dump> parameters let one provide entirely independent logic for loading and dumping data. C<$rule> and C<$default> are used to massage the data passed to C<$load>. C<$rule> is used primarily for bottom-up processing of deeply nested data structures.

For example, when we read in the following YAML data using out-of-the-box functionality, we will get a hash with embedded array references which in turn have embedded hashes (HoAoH):

  ---
  orderNumber: 123
  lines:
    - product: trombones
      quantity: 76
    - product: cornets
      quantity: 110
  ...

Now suppose we have an order object constructor that expects an array of order line objects as one of its parameters.
We can't pass the YAML data as is, because that data represents the order lines as an array of hashes rather than an array of order line objects. To solve this problem, we need to do a bottom-up conversion of the data: first convert the deeply nested order line hashes to order line objects, and only then pass the data to the order object constructor. We can use the C<$rule> parameter of a C<Data::Morph::Rule> object to define such a bottom-up conversion rule:

  # convert an order line into an instance of Acme::OrderLine
  $oOrderLineRule = "Acme::OrderLine->new(product,quantity)";

  # prepare data by converting all order lines into objects
  # for an explanation of {...} and [...] see the sections
  # above on hash and array conversion rules
  $oPrepData = { lines => [ $oOrderLineRule ] };

  # call this constructor after preparing the data
  $sLoadAndDump = "Acme::Order->new(orderNumber,lines)";

  # now put it all together into a single rule that preps
  # the data and loads it
  $oOrderRule = Data::Morph::Rule->new($sLoadAndDump, undef, $oPrepData);

Now we can use the rule to dump and load order objects, like this:

  $aData = $oConvert->load($sData, $oOrderRule);
  $sData = $oConvert->dump($aData, $oOrderRule);

=head4 Composing small rules into larger ones

The tools for building conversion rules are designed to be composable. That is, you can define a rule for a simple object and then make it part of a more complex rule for more complex objects. The arrays, hashes, functions, and objects storing rules can be nested inside one another to create increasingly complex conversion rules.
For example, suppose you wanted to convert a YAML string like the one below into an internal representation containing author and citation objects built with custom constructors:

  ---
  authors:
    - Emily Dickinson
    - William Shakespeare
    - Charlotte Bronte
    - Langston Hughes
  citations:
    - work: Romeo and Juliet
      author: William Shakespeare
    - work: Wuthering Heights
      author: Emily Bronte
    - work: Hold Fast to Dreams
      author: Langston Hughes
  ...

Out of the box, C<YAML::Load> would set up a hash containing two arrays. But suppose instead we wanted each author and each citation to be an object. The following code would ensure that the conversion rules were properly applied to each instance of author and citation in the YAML file:

  # define a rule for converting author name strings to
  # author objects
  my $ruleAuthor = Data::Morph::Rule->new
      ( "Foo::Author->newFromString"
      , sub { return shift->getAuthorName(); }
      );

  # define a rule for converting citation hashes to citation
  # objects
  my $ruleCitation = "Foo::Citation->new(work,author)";

  # * applies the author rule to each element of the array
  #   assigned to the authors key
  # * applies the citation rule to each element of the array
  #   assigned to the citations key
  my $rule = { authors   => [ $ruleAuthor ]
             , citations => [ $ruleCitation ]
             };

  $oYAML->load($someData, $rule);

=head2 Doing only Phase 1 or Phase 2

Serialization generally requires both phases, but there may be times when only one of the two phases is needed by an application. The C<Data::Morph> object provides separate functions for each phase so you can do one or both, as need be:

  # Phase 1 - load
  # load data into the default Perl data structures defined
  # by the serialization modules
  $oConvert->loadOuter($xData);
  $oConvert->thawOuter($xData);          # alias
  $oConvert->deserializeOuter($xData);   # alias

  # Phase 2 - load
  # convert Perl data generated by serialization modules to
  # the programmer-friendly representation
$oConvert->loadInner($xData, $rule); $oConvert->thawInner($xData, $rule); #alias $oConvert->deserializeInner($xData, $rule); #alias # Phase 2 - dump # dump from programmer representation to data structures # easily understood by serialization modules $oConvert->dumpInner($xData, $rule); $oConvert->freezeInner($xData, $rule); #alias $oConvert->serializeInner($xData, $rule); #alias # Phase 1 - dump # serialize data from easily understood format $oConvert->dumpOuter($xData); $oConvert->freezeOuter($xData); #alias $oConvert->serializeOuter($xData); #alias =head2 Transformation chains So far we have focused on conversions back and forth from string to programmer data. However, C<Data::Morph> can also be used to transform data from one programmer representation to another by chaining together calls to C<dumpInner(...)> and C<loadOuter(...)>: my $xMiddle = $oConvert->dumpInner($xStart, $oRule1); my $xEnd = $oConvert->loadOuter($xMiddle, $oRule2); #or my @aChain=($oRule1, $oRule2, $oRule3); foreach my $oRule (@aChain) { $xData = $oConvert->loadOuter($xData, $oRule); } =head1 EXPORTS Nothing is exported by default. You can optionally export the following functions: * C<makeRule(...)> =head1 BUGS and CAVEATS * When Phase II copied a hash, array, or object reference, all references to the copied object change in tandem. This, however, only applies to blessed and unblessed references to hashes and arrays. If you have other sorts of blessed or unblessed data, you will have to insure that references change in tandem manually. =head1 ROADMAP * provide better support for references to references to scalars and code references. * let C<Data::Morph> objects store a mime type corresponding to their serialization format * apply a rule to specific array elements or slice * apply a rule to all keys matching a regex * allow each C<Data::Morph> instance to store a library of named rules, so that applications do not need to keep track of which rules belong to which formats. 
Then you could do something like this to dump in multiple formats foreach ($oYAML, $oXML, $oJSON) { #some code to spit out the mime type for $_ print $_->dump($xData, 'personInUserFriendlyFormat'); #some code to mark the end of this mime type } =head1 SEE ALSO Similar modules are discussed above in the section titled L</Alternatives to Data::Morph>. =head1 AUTHOR Elizabeth Grace Frank-Backman =head1 COPYRIGHT Copyright (c) 2008- Elizabeth Grace Frank-Backman. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
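As a self-contained illustration of the three string-rule forms from the SYNOPSIS (C<'Mine::Foo'>, C<'Mine::Foo-E<gt>new'>, C<'Mine::Foo-E<gt>new(name,age)'>), the regex used by C<_splitStringRule> in the module code below can be exercised on its own. This sketch is not part of the module; C<splitRule> is a stand-in written here just to demonstrate the parse:

```perl
# Standalone sketch: parse the string-rule forms with the regex
# taken from Data::Morph::_splitStringRule.
use strict;
use warnings;

my $reRule = qr/^((?:\w+::)*\w+)(?:->(\w+)(?:\(([^\)]*)\))?)?$/;

sub splitRule {
    my ($sRule) = @_;
    $sRule =~ s/\s//g;           # the module strips whitespace first
    return ($sRule =~ $reRule);  # returns (class, method, params)
}

my @a = splitRule('Mine::Foo');
my @b = splitRule('Mine::Foo->new');
my @c = splitRule('Mine::Foo->new(name, age)');

print "$a[0]\n";              # Mine::Foo
print "$b[0] $b[1]\n";        # Mine::Foo new
print "$c[0] $c[1] $c[2]\n";  # Mine::Foo new name,age
```

Unmatched trailing groups come back as C<undef>, which is how the module distinguishes a bare class name ("bless the data") from a constructor call.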
The module code (Data/Morph.pm)
use strict;
use warnings;

package Data::Morph;
use Scalar::Util;
use Carp;

my $CLASS = __PACKAGE__;

use base 'Exporter';
our @EXPORT_OK = qw(makeRule);

my $MSG_BAD_SLICE = "Parameter list <%s> contains a bad array index: <%s>";

#==================================================================
# FUNCTIONS, I
#==================================================================

sub unblessCopy {
    my $xData = shift;
    return $xData unless Scalar::Util::blessed($xData);

    my $sDataRef = Scalar::Util::reftype($xData);
    if ($sDataRef eq 'ARRAY') {
        return [ @$xData ];
    } elsif ($sDataRef eq 'HASH') {
        return { %$xData };
    } elsif ($sDataRef eq 'SCALAR') {
        my $sTmp = $$xData;
        return \$sTmp;
    } elsif ($sDataRef eq 'REF') {
        my $sTmp = $$xData;
        return \$sTmp;
    } elsif ($sDataRef eq 'CODE') {
        # borrowed from Acme::Curse (author Moritz Lenz)
        return sub { goto &$xData };
    } else {
        return $xData;
    }
}

#==================================================================
# HELPER CLASSES
#==================================================================

my $RULE_CLASS = 'Data::Morph::Rule';

{
    package Data::Morph::Rule;

    sub new {
        my ($sClass, $xLoad, $xDump, $xPrep, $sDefaultParamName) = @_;
        if (! defined($xDump)) {
            $xDump = \&Data::Morph::unblessCopy;
        } elsif ($xDump eq '') {
            $xDump = undef;
        }
        my $hRule = { load      => $xLoad
                    , dump      => $xDump
                    , prep      => $xPrep
                    , paramName => $sDefaultParamName };
        return bless($hRule, $sClass);
    }

    sub getDefaultParamName { return shift->{paramName} }
    sub getDump             { return shift->{dump} }
    sub getLoad             { return shift->{load} }
    sub getPrepRule         { return shift->{prep} }

    sub dump {
        my ($self, $xData, $hReplace, $aPath) = @_;
        my $xDump = $self->getDump();
        return $xData unless defined($xDump);
        return &$xDump($xData, 0, $hReplace, $aPath)
            if (ref($xDump) eq 'CODE');

        #dump is the name of a method
        #print STDERR "Data::Morph::Rule::dump: <$xDump>\n";
        my $sEval = "\$xData->$xDump(\$hReplace, \$aPath)";
        return scalar eval($sEval);
    }

    sub load {
        my ($self, $xData, $hReplace, $aPath) = @_;
        my $xLoad = $self->getLoad();
        return $xData unless defined($xLoad);
        return &$xLoad($xData, 1, $hReplace, $aPath)
            if (ref($xLoad) eq 'CODE');

        my $xParams = $xData;
        if (ref($xData) eq '') {
            my $sDefault = $self->getDefaultParamName();
            $xParams = { $sDefault => $xData } if defined($sDefault);
        }
        return Data::Morph::_applyStringRule($xParams, $xLoad, 1);
    }
}

#==================================================================
# FUNCTIONS, II
#==================================================================

sub makeRule { return $RULE_CLASS->new(@_); }

#==================================================================
# CLASS METHODS
#==================================================================

sub newCustom {
    my ($sClass, $crDump, $crLoad) = @_;
    my $self = { loader => $crLoad, dumper => $crDump };
    return bless($self, $sClass);
}

sub newSerializer {
    my $sClass = shift @_;
    eval("require Data::Serializer; return 1") or do { return undef; };
    my $oSerializer = Data::Serializer->new(@_);
    my $crDump = sub {
        my $xData = shift;
        return $oSerializer->serialize($xData);
    };
    my $crLoad = sub {
        my $xData = shift;
        return $oSerializer->deserialize($xData);
    };
    return $sClass->newCustom($crDump, $crLoad);
}

sub new {
    my ($sClass, $sSerializer, $sDump, $sLoad) = @_;
    $sDump = 'Dump' unless defined($sDump);
    $sLoad = 'Load' unless defined($sLoad);
    #print STDERR "<$sDump> <$sLoad>\n";

    my ($crLoad, $crDump);
    if (defined($sSerializer)) {
        # don't import Dump - otherwise we would
        # redefine Data::Morph::dump and
        # set it to Data::Dumper::Dump
        eval("require $sSerializer;"
             . "\$crDump=\\&${sSerializer}::$sDump if \$sDump;"
             . "\$crLoad=\\&${sSerializer}::$sLoad if \$sLoad;"
             . "return 1;")
            or do { return undef; };
    }
    #print STDERR "<$crLoad> <$crDump>\n";
    return $sClass->newCustom($crDump, $crLoad);
}

#==================================================================
# PUBLIC OBJECT METHODS
#==================================================================

#synonyms
BEGIN {
    *freezeInner      = *dumpInner;
    *serializeInner   = *dumpInner;
    *freezeOuter      = *dumpOuter;
    *serializeOuter   = *dumpOuter;
    *freeze           = *dump;
    *serialize        = *dump;
    *thawInner        = *loadInner;
    *deserializeInner = *loadInner;
    *thawOuter        = *loadOuter;
    *deserializeOuter = *loadOuter;
    *thaw             = *load;
    *deserialize      = *load;
}

sub getDumper() { return shift->{dumper}; }
sub getLoader() { return shift->{loader}; }

# serialize, freeze
sub dump {
    my ($self, $xInner, $xRule) = @_;
    #print STDERR "dump: <$xInner>\n";
    return $self->dumpOuter($self->dumpInner($xInner, $xRule));
}

sub dumpInner {
    my ($self, $xInner, $xRule) = @_;
    #print STDERR "dumpInner: <$xInner>\n";
    my $hReplace = {};
    my $xDump = _applyRule($xInner, $xRule, 0, $hReplace, []);
    return _fixReferences($xDump, $hReplace, []);
}

sub dumpOuter {
    my ($self, $xOuter) = @_;
    my $crDump = $self->getDumper();
    #print STDERR "dumpOuter: <$xOuter> <$crDump>\n";
    return $crDump ? &$crDump($xOuter) : $xOuter;
}

# deserialize, thaw
sub load {
    my ($self, $sData, $xRule) = @_;
    return $self->loadInner($self->loadOuter($sData), $xRule);
}

sub loadInner {
    my ($self, $xOuter, $xRule) = @_;
    my $hReplace = {};
    my $xLoad = _applyRule($xOuter, $xRule, 1, $hReplace, []);
    return _fixReferences($xLoad, $hReplace, []);
}

sub loadOuter {
    my ($self, $sData) = @_;
    my $crLoad = $self->getLoader();
    return $crLoad ? &$crLoad($sData) : $sData;
}

#==================================================================
# PRIVATE OBJECT METHODS
#==================================================================

#==================================================================
# PRIVATE FUNCTIONS
#==================================================================

sub _applyArrayRule {
    my ($xData, $xRule, $bLoad, $hReplace, $aPath) = @_;

    #modify copy of array, not original array
    #but do so in a way that preserves references
    my $idData = Scalar::Util::refaddr($xData);
    my $aData = $hReplace->{$idData};
    unless (defined($aData)) {
        $hReplace->{$idData} = $aData = [ @$xData ];
    }
    #print STDERR "_applyArrayRule: <$xData> <$idData> <$aData>\n";

    my $iRuleCount = $#$xRule;
    for (my $i=0; $i<=$#$xData; $i++) {
        my $xValueRule = $i < $iRuleCount ? $xRule->[$i] : $xRule->[-1];
        $aData->[$i] = _applyRule($xData->[$i], $xValueRule, $bLoad
                                  , $hReplace, $aPath);
    }
    #print STDERR "_applyArrayRule: <@$aData>\n";
    return $aData;
}

sub _applyHashRule {
    my ($xData, $xRule, $bLoad, $hReplace, $aPath) = @_;

    # scan to see if any keys apply to $xData
    my ($k, $v);
    my $bModified = 0;
    while (($k,$v) = each(%$xData)) {
        next unless exists($xRule->{$k});
        $bModified = 1;
        last;
    }
    return $xData unless $bModified;  #internal cursor already reset
    while (each(%$xData)) {};         #reset internal cursor

    # copy data so that changes to hash key values don't affect the
    # original
    my $idData = Scalar::Util::refaddr($xData);
    my $hData = $hReplace->{$idData};
    unless (defined($hData)) {
        $hReplace->{$idData} = $hData = { %$xData };
    }
    #print STDERR "_applyHashRule: <$xData> <$idData> <$hData>\n";

    while (($k,$v) = each(%$xData)) {
        if (exists($xRule->{$k})) {
            my $xValueRule = $xRule->{$k};
            #print STDERR "_applyHashRule: <$k> <$bLoad>\n";
            #print STDERR "_applyHashRule: valueRule=<"
            #    . (defined($xValueRule) ? $xValueRule : 'undef')
            #    . ">\n";
            $hData->{$k} = _applyRule($v, $xValueRule, $bLoad
                                      , $hReplace, $aPath);
            #print STDERR "_applyHashRule: data=<"
            #    . (defined($hData->{$k}) ? $hData->{$k} : 'undef')
            #    . ">\n";
        } else {
            $hData->{$k} = $v;
        }
    }
    return $hData;
}

sub _applyRule {
    my ($xData, $xRule, $bLoad, $hReplace, $aPath, $bNewRule) = @_;
    return $xData unless defined($xRule);

    #if we've replaced the rule for processing the data, then we
    #needn't worry about circles.
    if (!$bNewRule && ref($xData)) {
        #stop traversal at circularities
        foreach (@$aPath) { return $xData if ($xData eq $_); }
        $aPath = [ @$aPath, $xData ];
    }

    my $sRuleRef = ref($xRule);
    #print STDERR "_applyRule: <$sRuleRef>\n";

    if ($sRuleRef eq 'CODE') {
        #apply a function to load and unload the data
        return &$xRule($xData, $bLoad, $hReplace, $aPath);
    } elsif ($sRuleRef eq '') {
        #rule = bless/unbless data
        return _applyStringRule($xData, $xRule, $bLoad);
    } elsif ($sRuleRef eq 'ARRAY') {
        if (ref($xData) eq 'ARRAY') {
            return _applyArrayRule($xData, $xRule, $bLoad
                                   , $hReplace, $aPath);
        }
    } elsif ($sRuleRef eq 'HASH') {
        if (ref($xData) eq 'HASH') {
            return _applyHashRule($xData, $xRule, $bLoad
                                  , $hReplace, $aPath);
        }
    } elsif (Scalar::Util::blessed($xRule)) {
        #apply a custom rule after loading the data
        my $xValueRule = $xRule->getPrepRule();
        #print STDERR "_applyRule: load=<$bLoad> prepData=<"
        #    . (defined($xValueRule) ? $xValueRule : 'undef') . ">\n";
        if ($bLoad) {
            if (defined($xValueRule)) {
                $xData = _applyRule($xData, $xValueRule, $bLoad
                                    , $hReplace, $aPath, 1);
            }
            return $xRule->load($xData, $hReplace, $aPath);
        } else {
            $xData = $xRule->dump($xData, $hReplace, $aPath);
            return defined($xValueRule)
                ? _applyRule($xData, $xValueRule, $bLoad
                             , $hReplace, $aPath)
                : $xData;
        }
    }
    return $xData;
}

sub _applyStringRule {
    my ($xData, $sRule, $bLoad) = @_;
    #print STDERR "_applyStringRule: <$bLoad> <$sRule>\n";

    if ($bLoad) {
        # whitespace is removed so we don't have to fix parameter
        # names with leading or trailing whitespace
        my ($sClass, $sMethod, $sParams) = _splitStringRule($sRule);
        if ($sMethod) {
            my $aParams = _buildParams($sParams, $xData);
            #print STDERR "class=<$sClass> method=<$sMethod> "
            #    . "params=<@$aParams>\n";

            # if there is a bug in the evaluated code and the
            # bug occurs in array context, then () will be returned
            # rather than a scalar. Since callers to this method
            # always expect a scalar result, we force the scalar
            # context using scalar - thanks PerlMonk [Anno]
            # for the suggestion.
            return scalar eval("$sClass->$sMethod(\@\$aParams)");
        }
        return $xData unless ref($xData);
        return bless($xData, $sClass);
    }
    return unblessCopy($xData);
}

sub _buildParams {
    my ($sParams, $xData) = @_;
    my $sDataRef = ref($xData);
    my @aParams;
    if (defined($sParams)) {
        my @aParamNames = split(/,/, $sParams);
        if ($sDataRef eq 'HASH') {
            foreach (@aParamNames) {
                #print STDERR "param=<$_>\n";
                push @aParams, $xData->{$_};
            }
        } elsif ($sDataRef eq 'ARRAY') {
            foreach my $sName (@aParamNames) {
                if ($sName =~ /^(\d+)\.\.$/) {
                    $sName = "$1..$#$xData";
                } elsif ($sName !~ /^\d+(?:\.\.\d+)?$/) {
                    #bad data - skip it
                    carp(sprintf($MSG_BAD_SLICE, $sParams, $sName));
                    next;
                }
                push @aParams, eval "\@\$xData[$sName]";
            }
        } else {
            push @aParams, $xData;
        }
    } else {
        push @aParams, $xData;
    }
    return \@aParams;
}

sub _fixReferences {
    my ($xData, $hReplace, $aPath) = @_;

    # note: it is very important that pure scalar data be
    # returned immediately. Past versions had problems with
    # Storable because their data got stringified when it
    # was used as a hash key. This changed internal flags on
    # the data and caused both Storable and JSON to dump the
    # data as a string even though its internal memory
    # representation was as an integer.
    my $sRef = Scalar::Util::reftype($xData);
    return $xData unless $sRef;

    #stop traversal at circularities
    foreach (@$aPath) { return $xData if ($xData eq $_); }
    $aPath = [ @$aPath, $xData ];

    #check to see if the reference has already been replaced
    my $idData = Scalar::Util::refaddr($xData);
    my $xReplace = $hReplace->{$idData};
    return $xReplace if defined($xReplace);
    #print STDERR "_fixReferences: <$idData> <$xData>\n";

    if ($sRef eq 'HASH') {
        while (my ($k,$v) = each(%$xData)) {
            $xData->{$k} = _fixReferences($v, $hReplace, $aPath);
        }
    } elsif ($sRef eq 'ARRAY') {
        for (my $i=0; $i<=$#$xData; $i++) {
            $xData->[$i] = _fixReferences($xData->[$i], $hReplace, $aPath);
        }
    }
    return $xData;
}

sub _splitStringRule {
    my ($sRule) = @_;
    $sRule =~ s/\s//g;
    return ($sRule =~ /^((?:\w+::)*\w+)(?:->(\w+)(?:\(([^\)]*)\))?)?$/);
}

#==================================================================
# MODULE INITIALIZATION
#==================================================================

1;
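The C<$hReplace> bookkeeping used by C<_applyArrayRule>, C<_applyHashRule>, and C<_fixReferences> above can be shown in isolation. The following standalone sketch is not part of the module; C<convertArray> and its upper-casing "conversion" are made up for illustration. It demonstrates why keying the replacement map by C<Scalar::Util::refaddr> makes data that shared a reference before conversion still share one reference afterwards:

```perl
# Standalone sketch of the refaddr-keyed replacement map idea:
# when two slots hold the same array reference, both slots must
# end up holding the same converted copy.
use strict;
use warnings;
use Scalar::Util qw(refaddr);

my %hReplace;   # refaddr of original array -> its converted copy

sub convertArray {
    my ($aData) = @_;
    my $id = refaddr($aData);
    # reuse the copy if this exact array was already converted
    return $hReplace{$id} if exists $hReplace{$id};
    my $aCopy = [ map { uc } @$aData ];  # toy "conversion"
    $hReplace{$id} = $aCopy;
    return $aCopy;
}

my $aShared = [ 'a', 'b' ];
my $hOuter  = { first => $aShared, second => $aShared };

my %hConverted
    = map { $_ => convertArray($hOuter->{$_}) } keys %$hOuter;

# both keys now point at one and the same converted array
print refaddr($hConverted{first}) == refaddr($hConverted{second})
    ? "shared\n" : "copied twice\n";   # prints "shared"
```

Without the map, each encounter of C<$aShared> would produce an independent copy and the sharing would silently be lost, which is the caveat noted in BUGS and CAVEATS.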