peterp has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a large dataset in a configuration file which I plan on eventually porting to a more practical format, but for various reasons this will be complex, therefore for now I have to deal with it. Historically, multidimensional data structures have been stored as sets of selector style keys and values. For example:

0.foo.bar = 0foobar 0.foo.baz = 0foobaz 1.foo = 1foo equates to: [ { foo => { bar => '0foobar', baz => '0foobaz' } }, { foo => '1foo' } ]

The data is parsed by splitting the key into its individual components then running them through Data::Diver. The problem is I sometimes need to add data into the middle of the dataset and as a result have to increment index components within keys below this, which can be tedious to say the least. I decided the easiest quick fix would be to pre-process the keys beforehand and automatically calculate the array indexes. Thus the above dataset would look:

>.foo.bar = 0foobar <.foo.baz = 0foobaz >.foo = 1foo

Where '>' means "next index // 0" and '<' means "last index // 0", within scope of the dimension. I wrote the following demo:

use strict; use warnings; use Data::Diver qw#DiveVal DiveError#; use Data::Dumper qw#Dumper#; $/ = qq#\r\n#; my $state = { }; my $ref = undef; while ( my $line = <DATA> ) { next if ( $line =~ qr#^(\#|\s*$)# ); # ignore line if comment or b +lank. chomp $line; # remove newline. my ( $selector, $value ) = split qr#\s*=\s*#, $line; # (selector)= +(value). todo: unless escaped. next if ( not defined $selector ); # ignore line if no selector. my @selector = split qr#\.#, $selector; # (one).(two).(three)... t +odo: unless escaped. $ref //= ( $selector[0] =~ qr#^([><]|\d+)$# ) ? [ ] : { } ; _dive( $ref, \@selector, $value ); } print Dumper $state, $ref; sub _dive { my ( $ref, $selector, $value ) = @_; return if ( not defined $ref or not defined $selector or not scala +r @$selector ); # return if no ref or no selectees. my @selector_b = qw##; for my $selectee ( @$selector ) { if ( $selectee =~ qr#^([><])$# ) # if incognito selectee. todo +: unless escaped. { my $selector_b = join q#.#, @selector_b; if ( $1 eq q#># ) # incognito selectee is of increment typ +e. { if ( defined $state->{$selector_b} ) # we have seen th +is state before. { push @selector_b, $state->{$selector_b} += 1; # pu +sh current index + 1. } else { push @selector_b, $state->{$selector_b} = 0; # pus +h 0 (first) index. } } elsif ( $1 eq q#<# ) # incognito selectee is of maintain t +ype. { push @selector_b, $state->{$selector_b} //= 0; # push +current index or 0 (first) index. } } else # else non inconito selectee. { push @selector_b, $selectee; # push selectee. } } DiveVal( $ref, @selector_b ) = $value; my ( $error ) = DiveError( ); $error and warn $error; return 1; } __DATA__ >.name = john <.location = uk <.interests.> = programming <.interests.> = cycling >.name = laura <.location = <.interests.> = knitting <.interests.> = tennis <.interests.> = dancing >.name <.location = canada <.interests.>.> = dogs <.interests.<.> = horses <.interests.> = cars # test.error = blah

Output:

$VAR1 = { '2.interests' => 1, '' => 2, '2.interests.0' => 1, '0.interests' => 1, '1.interests' => 2 }; $VAR2 = [ { 'interests' => [ 'programming', 'cycling' ], 'location' => 'uk', 'name' => 'john' }, { 'interests' => [ 'knitting', 'tennis', 'dancing' ], 'location' => '', 'name' => 'laura' }, { 'interests' => [ [ 'dogs', 'horses' ], 'cars' ], 'location' => 'canada', 'name' => undef } ];

Is this a good approach, or is there a better alternative? Perhaps I should look into extending Data::Diver's DiveRef function? Can my code be improved, I don't mind a bit of golfing, but I'm concerned there could be a particular scenario that I've missed where my code could break?

Chris

Replies are listed 'Best First'.
Re: Dive data with automatic array indexing
by Loops (Curate) on Oct 27, 2014 at 05:25 UTC

    Hi Chris, I'm not sure exactly your end goal, or whether the $VAR1 data from the Dumper output you show is valuable to you at all. But if not you might consider using one of the config file parsers from CPAN. It might take some coercing to get one to conform to your needs, but you seem to have the flexibility to change the format of your files; an automated translation should be doable. Plus you'll get the escaping and quoting features, that you mention are lacking in your code.

    Here's a quick example

    use strict; use warnings; use Data::Dumper; use Config::Scoped; my $cs = Config::Scoped->new(); my @lines = ("data{all = [\n", <DATA>, "]}\n"); my $Config = $cs->parse(text => join('', @lines))->{data}->{all}; print Dumper($Config); __DATA__ { name = john location = uk interests = [ programming cycling ] } # An ignored comment { name = laura interests = [ knitting tennis dancing ] } { location = canada interests = [[ dogs horses ] cars] }

    Which displays:

    $VAR1 = [ { 'interests' => [ 'programming', 'cycling' ], 'location' => 'uk', 'name' => 'john' }, { 'interests' => [ 'knitting', 'tennis', 'dancing' ], 'name' => 'laura' }, { 'location' => 'canada', 'interests' => [ [ 'dogs', 'horses' ], 'cars' ] } ];

    If you absolutely need the keys that have undefined values as in your example output, the code can be changed to include them. It should be as easy to write a translator from your existing files into this format, as for the new syntax you are proposing. However, other CPAN options might be more to your liking.

      Hi Loops,

      Firstly thank you for your reply.

      Your alternative solution is brilliant and is an option I will highly consider when I get round porting to a more practical format. I didn't know such a configuration parser existed.

      As briefly mentioned in my question, there will be complexities in doing this, and is out of scope of the task I have been assigned.

      Some of the reasons behind this are:

      - This was assigned as a quickfix task, which can be revisted when there is more time in a few months. The more I have to change, the more time in development and testing.
      - There are multiple configuration files containing unrelated data but are all parsed via the same parser. I do not have permission to update these configuration files just yet. Although implementing two parsers might be an option.
      - The particular configuration file I am dealing with contains unrelated data, notably Log::Log4Perl configuration data which as far as I am aware must be in their documented format (selector based). Mixing formats might pose an issue. Although looking at the documentation it looks as though you can init with a ref which could be derived from the configuration file.
      - The same parser is used to process application/x-www-form-urlencoded multidimensional http parameters, the task includes implementing automatic array indexing of these too whilst retaining their existing explicit array indexing usage. Therefore either way I'll have to perfect the above approach.
      etc

      Lastly, I haven't got around to properly going through the production code that handles this stuff just yet, the demo was just something I devised in my own time whilst fresh in my head. I certainly hope the todos I marked have already been implemented!

      Chris

        Okay. Have a better appreciation of your constraints. Have played around with your code a bit and I think it covers all of the cases that you showed well. There may be a bit of an opportunity to make the config file syntax a little less visually cluttered though.

        If you can trust that there are empty lines between the blocks and not within, you can strip off the need for the > or < symbols at the start of each line. Also if you just assume an empty selector means >, then you only have to place the < selectors in cases where you don't want to nudge the index forward. Check out the example DATA below the code. The code will still parse your original example as well if you don't like these changes.

        All bugs are mine after twisting your code around like this:

        use strict; use warnings; use Data::Diver qw( DiveVal DiveError ); use Data::Dumper qw( Dumper ); my $state = { }; my $ref = [ ]; my $prefix = '<'; while (<DATA>) { chomp; next if /^\s*#.*$/; if (/^\s*$/) { $prefix = '>'; next; }; my ($selector, $value) = split /\s*=\s*/; next unless defined $selector; my @selector = split /\./, $selector =~ s/\.$/.>/r; unshift(@selector, $prefix) unless $selector =~ /^[><]\./; _dive( $ref, \@selector, $value ); $prefix = '<'; } print Dumper $ref; sub _dive { my ( $ref, $selector, $value ) = @_; return unless defined $ref and defined $selector and scalar @$sele +ctor; my @selector_b; for ( @$selector ) { if ( /^(>|<|)$/ ) { my $selector_b = join '.', @selector_b; if ( $1 eq '<') { push @selector_b, $state->{$selector_b} //= 0; } else { $state->{$selector_b} += defined $state->{$selector_b} + ? 1 : 0; push @selector_b, $state->{$selector_b}; } } else { push @selector_b, $_; } } DiveVal( $ref, @selector_b ) = $value; my ( $error ) = DiveError( ); $error and warn $error; return 1; } __DATA__ name = john location = uk interests. = programming interests. = cycling name = laura location = interests. = knitting interests. = tennis interests. = dancing name location = canada interests.. = dogs interests.<. = horses interests. = cars # test.error = blah