Re: A fix for merge keys in YAML::XS, YAML::Syck

Ran into the same problem of '<<' not being supported in YAML::XS today (in 2013 no less, for shame). Many thanks for your effort, this post proved helpful.

I extended your recursive solution to detect cycles, and also to allow for multiple hashes to be merged (in left to right order). The latter is useful for my intended purposes, but is apparently also supported by Ruby's YAML parser, so I figured it would make a nice addition. Ideally this work would be done to YAML::XS, but like you I have the time nor resources. I'm by no means a Perl programmer, I merely tinker, so please forgive any inadequacies.

From your benchmarks it appears that your breadth-first loop solution is the faster method, but the cycle detection relies on the recursive nature of the implementation to be able to detect that it has re-encountered a hash- or arrayref that it has already previously encountered while resolving an inheritance chain. It uses an a fairly blunt method of keeping track of these: by maintaining a stack of the encountered hash/arrayrefs throughout the recursion calls. I've made no effort towards optimizing this, nor the particular way of searching the stack -- if anybody can find better ways, I'd love to hear about them. Of course, the remarks you've made about the initial recursive implementation also remain valid.

For the cyclic references, I've chosen to keep them in the output. I.e., you would be able to infinitely traverse the data structure via the cycle. Alternatively, you can also error out and leave it up to the user to not have cycles in their YAML files. Ruby's implementation apparently simply removes the cycle and keeps on trucking, so I figured I might as well do the same (at least for the keep trucking part).

#!/usr/bin/perl
use strict;
use warnings;
use YAML::XS;
use Data::Dumper;

my $data;
{local $/;$data = <>};

sub mergekeys
{   
    return _mergekeys($_[0], []);
}

sub _mergekeys
{   
    my $ref = shift;
    my $resolveStack = shift; 
    my $reftype = ref $ref;

    # If this hash or array is already on the resolution stack, then s
+omewhere, a child data structure is trying to inherit from one of its
+ parents,
    # and hence by extension trying to inherit itself.
    if ($reftype =~ /HASH|ARRAY/ and (grep $_ == $ref, @$resolveStack)
+ > 0)
    {   
        # Halt and catch fire, or store the cyclic reference and not p
+rocess it further. Not complaining seems to
        # be the behaviour of ruby's YAML parser, so let's go for that
+.

        #die "Cyclic inheritance detected: ".($ref)." is already on th
+e resolution stack!\n" .
        #    "Dump of cyclic data structure (may have inheritance alre
+ady partially resolved):\n".Dumper($ref);
        return $ref;
    }
    
    if (ref($ref) eq 'HASH')
    {   
        push @$resolveStack, $ref;
        if (exists $ref->{'<<'})
        {   
            my $inherits = $ref->{'<<'}; # can be either a single href
+, or an array of hrefs
            die "Undefined value for merge key '<<' in ".Dumper($ref) 
+unless defined $inherits; # catch edge cases that YAML::XS won't catc
+h, like "<<: &foo"
            die "Merge key does not support merging non-hashmaps" unle
+ss ref($inherits) =~ /HASH|ARRAY/;
            $inherits = [$inherits] if ref($inherits) eq 'HASH'; # nor
+malize for further processing

            # For each of the hashes/arrays we're inheriting, have the
+m resolve their inheritance first before applying them onto ourselves
+.
            # Also, remove the '<<' reference only afterwards, since b
+y recursion these will have already been removed from our inheritees,
+                                                                     
+                              # and this also allows us to show the c
+yclic reference by dumping out the structure when we detect one.
            foreach my $inherit (@$inherits)
            {   
                $inherit = _mergekeys($inherit, $resolveStack);
                %$ref = (%$inherit, %$ref);
            }
            delete $ref->{'<<'};
        }

        _mergekeys($_, $resolveStack) for (values %$ref);
        die "Fatal error: imbalanced recursion stack in _mergekeys. Th
+is likely implies a programming error and/or a YAML file from hell." 
+unless pop(@$resolveStack) eq $ref;
    }
    elsif (ref($ref) eq 'ARRAY')
    {   
        push @$resolveStack, $ref;
        _mergekeys($_, $resolveStack) for (@$ref);
        die "Fatal error: imbalanced recursion stack in _mergekeys. Th
+is likely implies a programming error and/or a YAML file from hell." 
+unless pop(@$resolveStack) eq $ref;
    }

    return $ref;

}

my $yml = Load($data);
mergekeys $yml;
print Dumper $yml;
[download]

Here's the most convoluted test file I came up with:

---
key0: &id00
  zero:   0
key1: &id01
  <<: *id00
  subkey_a: asdf
  subkey_b: qwer
  Z:   "key1.Z"
key2: &id02
  everyday: apple
  X:   1
  Y:   2
  Z:   "key2.Z"
  A:
    - <<: *id00
      zero:  "00"
    - <<: *id01
    - <<: *id02   #cycle!
      everyday: shuffling
      Y:        3000
  <<:  *id00
key3: &id03
  <<: [*id01, *id02, *id01]
  subkey_a: foo
  subkey_c: bar
  deeper:
    challenger:   deep
    <<:           *id00
[download]

And the output it produces:

$ perl merge_keys.pl cyclic_test2.yml 
$VAR1 = {
          'key2' => {
                      'everyday' => 'apple',
                      'Z' => 'key2.Z',
                      'A' => [
                               {
                                 'zero' => '00'
                               },
                               {
                                 'Z' => 'key1.Z',
                                 'subkey_b' => 'qwer',
                                 'subkey_a' => 'asdf',
                                 'zero' => 0
                               },
                               {
                                 'A' => $VAR1->{'key2'}{'A'},
                                 'Z' => 'key2.Z',
                                 'everyday' => 'shuffling',
                                 'X' => 1,
                                 'zero' => 0,
                                 'Y' => 3000
                               }
                             ],
                      'X' => 1,
                      'zero' => 0,
                      'Y' => 2
                    },
          'key1' => {
                      'subkey_b' => 'qwer',
                      'Z' => 'key1.Z',
                      'subkey_a' => 'asdf',
                      'zero' => 0
                    },
          'key0' => {
                      'zero' => 0
                    },
          'key3' => {
                      'A' => $VAR1->{'key2'}{'A'},
                      'subkey_b' => 'qwer',
                      'subkey_a' => 'foo',
                      'X' => 1,
                      'Y' => 2,
                      'everyday' => 'apple',
                      'Z' => 'key1.Z',
                      'subkey_c' => 'bar',
                      'zero' => 0,
                      'deeper' => {
                                    'challenger' => 'deep',
                                    'zero' => 0
                                  }
                    }
        };
[download]

Notice that the cyclic references are maintained, and the values of "everyday" and "Y" are correctly overridden by their local values in key2.A[2] in the original YML file.

Note also that key3's final value for Z is the one inherited from key1, due to *id01 appearing a second time after the *id02 in key3's inheritance list.

Compare and contrast to ruby's result:

$ ruby -ryaml -rpp -e 'x = YAML.load_file("cyclic_test2.yml"); print P
+P.pp(x);'
{"key0"=>{"zero"=>0},
 "key1"=>{"zero"=>0, "subkey_a"=>"asdf", "subkey_b"=>"qwer", "Z"=>"key
+1.Z"},
 "key2"=>
  {"everyday"=>"apple",
   "X"=>1,
   "Y"=>2,
   "Z"=>"key2.Z",
   "A"=>
    [{"zero"=>"00"},
     {"zero"=>0, "subkey_a"=>"asdf", "subkey_b"=>"qwer", "Z"=>"key1.Z"
+},
     {"everyday"=>"shuffling", "X"=>1, "Y"=>3000, "Z"=>"key2.Z"}],
   "zero"=>0},
 "key3"=>
  {"zero"=>0,
   "subkey_a"=>"foo",
   "subkey_b"=>"qwer",
   "Z"=>"key1.Z",
   "everyday"=>"apple",
   "X"=>1,
   "Y"=>2,
   "A"=>
    [{"zero"=>"00"},
     {"zero"=>0, "subkey_a"=>"asdf", "subkey_b"=>"qwer", "Z"=>"key1.Z"
+},
     {"everyday"=>"shuffling", "X"=>1, "Y"=>3000, "Z"=>"key2.Z"}],
   "subkey_c"=>"bar",
   "deeper"=>{"challenger"=>"deep", "zero"=>0}}}
[download]

Notice that Ruby has (must have) performed the cyclic merge at least partially, as evidenced by the presence of the mapping "X"=>1 in the resulting hash for key2.A[2]. However, the mapping "zero"=>0 is missing from this value, which is present in Perl's result. Indeed, key2 inherits from id00, which imports the mapping "zero"=>0.

This difference may be due to the order in which merges are performed. Since the cyclic merge appears in the document before the key2's <<: *id00 does, maybe Ruby had not yet processed that merge at the time of doing the cyclic merge. I'm just guessing here, and I don't know which behaviour is correct, but at least you know about the difference.

Comment on Re: A fix for merge keys in YAML::XS, YAML::Syck Select or Download Code