Parsing a complex config file

solitaryrpr has asked for the wisdom of the Perl Monks concerning the following question:

Ok, I'm rusty as hell and am stumped as to the best way to accomplish this. I need to parse a fairly complex config file and create a hash based on its contents.

The only argument to this script should be the path to the libnames.parm file. This script will then parse that file and validate the paths that have been defined within it.

The goal is to create a hash array of every library in the libnames.parm file. We have to account for every option.

The structure will, hopefully, end up looking like:

'domain1' => [
              {
                'libname' => 'foo',
                'pathname' => '/path/to/metadata',
                'owner' => 'someuser',
                'libaclinherit' => 'yes|no',
                'dynlock' => 'yes|no',
                'options|roptions' => [
                                       {
                                         'datapath' => [
                                                       '/data/path01',
                                                       '/data/path02',
                                                       '/data/path03',
                                                       ...
                                                       ],
                                        'indexpath' => [
                                                        '/indx/path1',
                                                        '/indx/path2',
                                                        '/indx/path3',
                                                       ...
                                                      ],
                                        'workpath' => [
                                                       '/work/path1',
                                                       '/work/path2',
                                                       '/work/path3',
                                                       ...
                                                      ],
                                        'metapath' => [
                                                       '/meta/path1',
                                                       '/meta/path2',
                                                       '/meta/path3',
                                                      ...
                                                      ]
                                       }
                                      ]
              }
             ],
'domain2' => [
              {
                'libname' => 'bar',
                'pathname' => '/path/to/metadata',
                'owner' => 'someuser',
                'libaclinherit' => 'yes|no',
                'dynlock' => 'yes|no',
                'options|roptions' => [
                                       {
                                        'datapath' => [
                                                       '/data/path01',
                                                       '/data/path02',
                                                       '/data/path03',
                                                       ...
                                                      ],
                                         'indexpath' => [
                                                        '/indx/path1',
                                                        '/indx/path2',
                                                        '/indx/path3',
                                                        ...
                                                       ],
                                        'workpath' => [
                                                       '/work/path1',
                                                       '/work/path2',
                                                       '/work/path3',
                                                       ...
                                                      ],
                                        'metapath' => [
                                                       '/meta/path1',
                                                       '/meta/path2',
                                                       '/meta/path3',
                                                       ...
                                                      ]
                                      }
                                     ]
             }
            ]

this would have to be generated from the config file with the format:

libname=foo pathname=/path/to/metadata/foo owner=someuser libaclinherit=no dynlock=no
   roptions="
       datapath=('/data/path1'
                 '/data/path2'
                 '/data/path3'
                 ...)
       indexpath=('/indx/path1'
                  '/indx/path2'
                  '/indx/path3'
                  ...)
       workpath=('/work/path1'
                 '/work/path2'
                 '/work/path3'
                 ...)
       metapath=('/meta/path1'
                 '/meta/path2'
                 '/meta/path3'
                 ...)";

libname=bar pathname=/path/to/metadata/bar owner=someuser libaclinherit=no dynlock=no
   roptions="
       datapath=('/data/path1'
                 '/data/path2'
                 '/data/path3'
                 ...)
       indexpath=('/indx/path1'
                  '/indx/path2'
                  '/indx/path3'
                  ...)
       workpath=('/work/path1'
                 '/work/path2'
                 '/work/path3'
                 ...)
       metapath=('/meta/path1'
                 '/meta/path2'
                 '/meta/path3'
                 ...)";

This parsing has to be able to handle the fact that everything after 'pathname' is optional. The simplest entry being:

libname=foobar pathname=/path/to/metadata;

The most complex is the examples above. It should also be flexible enough to handle new options without the need to recode the parser (dynamic hash creation).

Each block of the config begins with libname and ends with ';'. I've managed to parse it into blocks and dump the entire block into an array (libname=foo...;, libname=bar...;).

I can handle the simple case well enough...it's a simple split on =...it's the roptions part that has me stumped. I'm hoping for something elegant (I can brute force it, but I know Perl can do this more nicely)...when first looking at the config file I thought, this will be easy...how many late nights have begun with that statement?


Re: Parsing a complex config file by ikegami (Patriarch) on Jul 12, 2006 at 05:45 UTC
I don't have time to code a solution right now, but I do have time to comment on the structure you want the parser to output.

What's the point of having an array that always contains exactly one hash reference? `[ { ... } ]` should be replaced with `{ ... }`.

Why does the structure contain more information than the configuration file? Specifically, indexes 0 and 1 acquired the names domain1 and domain2 during parsing. This adds nothing useful, and it is misleading because domain2 could appear before domain1 when iterating over the hash. If you need to convert indexes into numbered names for output/display purposes, do it in the output/display code.

The following structure contains all the same information as yours, but is more concise. Simpler is almost always better.

@domains = (
   {
      'libname' => 'foo',
      ...
      'options|roptions' => {
         'datapath'  => [ ... ],
         'indexpath' => [ ... ],
         'workpath'  => [ ... ],
         'metapath'  => [ ... ],
      },
   },
   {
      'libname' => 'bar',
      ...
      'options|roptions' => {
         'datapath'  => [ ... ],
         'indexpath' => [ ... ],
         'workpath'  => [ ... ],
         'metapath'  => [ ... ],
      },
   },
);
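A minimal sketch of the "do it in the output/display code" advice, assuming a hypothetical `@domains` array with only two keys per record and a hypothetical helper named `domain_labels`. The "domainN" names are derived from array positions at display time and never stored in the data:

```perl
use strict;
use warnings;

# Hypothetical sample data in the flattened array-of-hashes shape suggested above.
my @domains = (
    { libname => 'foo', pathname => '/path/to/metadata/foo' },
    { libname => 'bar', pathname => '/path/to/metadata/bar' },
);

# Build display labels from array positions; "domainN" exists only in the output.
sub domain_labels {
    my @records = @_;
    return map { sprintf 'domain%d: %s', $_ + 1, $records[$_]{libname} }
           0 .. $#records;
}

print "$_\n" for domain_labels(@domains);
```

Because the labels come from the array index, their order always matches the order of the records, which a hash keyed on 'domain1', 'domain2' cannot guarantee.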
Re^2: Parsing a complex config file by Anonymous Monk on Jul 12, 2006 at 13:29 UTC
I agree, simpler is better...no point, now that I think about it. That would be due to me being rusty. This makes sense; I'll be adjusting my structure to reflect this. Thanks.
Re: Parsing a complex config file by Zaxo (Archbishop) on Jul 12, 2006 at 05:59 UTC
Why are your hash references all contained in one-element arrays? Can there be more than one hash in a level? I don't see anything about the data format which would support that.

Look into paragraph mode for reading each chunk of data at the "domainN" level. Is there anything in the data which names the domains for you?

If this is some ad-hoc moving target of a data format, you're in trouble. If there is a real grammar for it, you might look into Parse::RecDescent or one of the other parser generators. The appearance of balanced quotes and parentheses suggests a grammar of some sort, but makes a simple regex-based parser difficult.

After Compline,
Zaxo
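A rough sketch of the paragraph-mode suggestion, assuming records are separated by truly empty lines (a line holding only spaces does not count as a separator). The sample text and the helper name `read_paragraphs` are hypothetical; a real script would open the libnames.parm path instead of an in-memory string:

```perl
use strict;
use warnings;

# Hypothetical sample input: two records separated by one empty line.
my $config = <<'END';
libname=foo pathname=/path/to/metadata/foo owner=someuser
   roptions="
       datapath=('/data/path1'
                 '/data/path2')";

libname=foobar pathname=/path/to/metadata;
END

sub read_paragraphs {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = '';     # paragraph mode: runs of empty lines end a record
    my @chunks = <$fh>;
    close $fh;
    chomp @chunks;     # in paragraph mode chomp strips all trailing newlines
    return @chunks;
}

my @chunks = read_paragraphs($config);
printf "%d records\n", scalar @chunks;
```

Each element of `@chunks` then holds one complete `libname=...;` block, newlines included, ready for whatever option parsing comes next.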
Re^2: Parsing a complex config file by solitaryrpr (Acolyte) on Jul 12, 2006 at 14:00 UTC
Only the libname='foo' component sets the uniqueness. Based on the previous response (and the following), it's obvious I was overthinking this.

I'm trying to avoid using additional modules where possible (I can't guarantee the module availability...ever tried to get an up-to-date ActiveState Perl module?).

There is definitely a set grammar to the config file...your last sentence summed up my issue concisely.
Re: Parsing a complex config file by GrandFather (Saint) on Jul 12, 2006 at 09:19 UTC
The following parses the data into a structure somewhat like the one you describe. It doesn't insert extraneous single-element arrays and it is not robust against nested quoted strings, but it may be a useful starting point for your actual application.

use warnings;
use strict;
use Data::Dump::Streamer;

my @libs;

local $/ = 'libname='; # Read a record at a time
while (<DATA>) {
    chomp;
    next if ! length; # Skip blank lines
    s/\n|\r/ /g; # Remove conventional line end characters
    next if ! s/(\S+)\s*//;

    my $str = $_;
    my %record;

    $record{libname} = $1;
    while ($str =~ /=/) {
        # Process an option
        last if ! ($str =~ s/\s*(\S+?)\s*=\s*//);
        my $opName = $1;
        my $opValue;

        if ($str =~ s/^\s*"([^"]*)"\s*//) {
            # Complicated option value
            $opValue = parseSubOps ($1);
        } elsif ($str =~ s/\s*(\S+)\s*//) {
            # Simple option
            $opValue = $1;
        }
        $record{$opName} = $opValue;
    }

    push @libs, {%record};
}

Dump (\@libs);

sub parseSubOps {
    my $str = shift;
    my %subOps;

    while ($str =~ /=/) {
        # Process a sub-option
        last if ! ($str =~ s/\s*(\S+)\s* = \s*\(\s* ([^)]*?) \)\s*//x);
        my $name = $1;
        my @values = $2 =~ /'([^']*?)'/g;

        $subOps{$name} = \@values;
    }

    return \%subOps;
}

__DATA__
libname=foo pathname=/path/to/metadata/foo owner=someuser libaclinherit=no dynlock=no
   roptions="
       datapath=('/data/path1'
                 '/data/path2'
                 '/data/path3'
                 ...)
       indexpath=('/indx/path1'
                  '/indx/path2'
                  '/indx/path3'
                  ...)
       workpath=('/work/path1'
                 '/work/path2'
                 '/work/path3'
                 ...)
       metapath=('/meta/path1'
                 '/meta/path2'
                 '/meta/path3'
                 ...)";

libname=bar pathname=/path/to/metadata/bar owner=someuser libaclinherit=no dynlock=no
   roptions="
       datapath=('/data/path1'
                 '/data/path2'
                 '/data/path3'
                 ...)
       indexpath=('/indx/path1'
                  '/indx/path2'
                  '/indx/path3'
                  ...)
       workpath=('/work/path1'
                 '/work/path2'
                 '/work/path3'
                 ...)
       metapath=('/meta/path1'
                 '/meta/path2'
                 '/meta/path3'
                 ...)";

libname=foobar pathname=/path/to/metadata;

Prints:

$ARRAY1 = [
            {
              dynlock       => 'no',
              libaclinherit => 'no',
              libname       => 'foo',
              owner         => 'someuser',
              pathname      => '/path/to/metadata/foo',
              roptions      => {
                                 datapath  => [
                                                '/data/path1',
                                                '/data/path2',
                                                '/data/path3'
                                              ],
                                 indexpath => [
                                                '/indx/path1',
                                                '/indx/path2',
                                                '/indx/path3'
                                              ],
                                 metapath  => [
                                                '/meta/path1',
                                                '/meta/path2',
                                                '/meta/path3'
                                              ],
                                 workpath  => [
                                                '/work/path1',
                                                '/work/path2',
                                                '/work/path3'
                                              ]
                               }
            },
            {
              dynlock       => 'no',
              libaclinherit => 'no',
              libname       => 'bar',
              owner         => 'someuser',
              pathname      => '/path/to/metadata/bar',
              roptions      => {
                                 datapath  => [
                                                '/data/path1',
                                                '/data/path2',
                                                '/data/path3'
                                              ],
                                 indexpath => [
                                                '/indx/path1',
                                                '/indx/path2',
                                                '/indx/path3'
                                              ],
                                 metapath  => [
                                                '/meta/path1',
                                                '/meta/path2',
                                                '/meta/path3'
                                              ],
                                 workpath  => [
                                                '/work/path1',
                                                '/work/path2',
                                                '/work/path3'
                                              ]
                               }
            },
            {
              libname  => 'foobar',
              pathname => '/path/to/metadata;'
            }
          ];

DWIM is Perl's answer to Gödel
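In the printed output, the last record keeps the block terminator in its value (`'/path/to/metadata;'`). One possible cleanup pass, sketched under the assumption that no plain option value legitimately ends in ';' (the helper name `strip_terminator` and the sample record are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical record shaped like the simplest entry in the output above.
my %record = (libname => 'foobar', pathname => '/path/to/metadata;');

# Remove one trailing ';' from every plain (non-reference) value.
sub strip_terminator {
    my ($rec) = @_;
    for my $value (values %$rec) {   # values() aliases the hash elements
        next if ref $value;          # leave nested structures such as roptions alone
        $value =~ s/;\z//;
    }
    return $rec;
}

strip_terminator(\%record);
print "$record{pathname}\n";
```

Running this over each record before it is stored would leave quoted `roptions` blocks untouched, since those are already hash references by that point.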
Re^2: Parsing a complex config file by solitaryrpr (Acolyte) on Jul 12, 2006 at 14:08 UTC
This is awesome...I had to look up '$/' and I'm glad I did...talk about simplifying a process. The complex example I used has every option assigned in the config file, so (as it stands right now) that's as complex as it gets, but I think this would handle new additions as well. Thanks.
Re^3: Parsing a complex config file by solitaryrpr (Acolyte) on Jul 12, 2006 at 17:33 UTC
This rocked...worked straight up on the file...I did modify it so that it creates a HoH instead, with the primary key being the libname. Now I only need to make a function out of it and my life is gravy.
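The modified code was not posted, so the following is only a guess at what a hash of hashes keyed by libname could look like. It assumes records shaped like the array of hashes shown earlier in the thread; the function name `index_by_libname` and the choice to die on a duplicate libname are assumptions:

```perl
use strict;
use warnings;

# Hypothetical parsed records, shaped like the array of hashes shown earlier.
my @libs = (
    { libname => 'foo', pathname => '/path/to/metadata/foo', owner => 'someuser' },
    { libname => 'bar', pathname => '/path/to/metadata/bar' },
);

# Turn the list of records into a hash of hashes keyed by libname.
sub index_by_libname {
    my @records = @_;
    my %by_name;
    for my $rec (@records) {
        my %copy = %$rec;                    # shallow copy; nested refs are shared
        my $name = delete $copy{libname};    # the key is not repeated inside the record
        die "Record without a libname\n" unless defined $name;
        die "Duplicate libname '$name'\n" if exists $by_name{$name};
        $by_name{$name} = \%copy;
    }
    return \%by_name;
}

my $libs = index_by_libname(@libs);
print "$_ => $libs->{$_}{pathname}\n" for sort keys %$libs;
```

Keying on libname relies on the earlier statement that libname is what makes each entry unique.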
Re: Parsing a complex config file by dimar (Curate) on Jul 12, 2006 at 15:41 UTC
Something earlier in this thread seemed to indicate that you have some control over the syntax and formatting contained in the config file. Even if this is not the case, you will certainly save yourself a lot of time if you simply use a pre-existing data serialization format, instead of inventing your own (see e.g., YAML, XML, JSON, WDDX).

The benefits of using a pre-established syntax are too numerous to mention here, and the only disadvantage is that you don't get the 'personal growth' experience of going through the tedium of the inventing/parsing/debugging cycle yourself. Learning how to write your own parsing code can be an educational experience, but do you really want to go through all that if all you are doing is reading config files?

Even if you cannot choose a pre-established syntax, you are still probably better off simply converting the "custom" syntax into a pre-existing one. For example, here is some code that converts your sample data into YAML.

### begin_: init perl
use strict;
use warnings;

### p__: standard perl libraries
use YAML;
use Data::Dumper;

### begin_: get sample data
my $sRaw = join '', <DATA>;

### begin_: convert to YAML
for ($sRaw) {
    ### p__: scrub the top part
    s/libname=/\n- domain: begin\n  libname: /gms;
    s/pathname=([^\s]+)/\n  pathname: "$1"/gms;
    s/owner=([^\s]+)/\n  owner: "$1"/gms;
    s/libaclinherit=([^\s]+)/\n  libaclinherit: "$1"/gms;
    s/dynlock=([^\s]+)/\n  dynlock: "$1"/gms;
    s/roptions=\x22//gms;

    ### p__: scrub the roption stuff
    for my $sOpt (qw(datapath indexpath workpath metapath)) {
        s/\s+$sOpt=\x28([^\x29]+)\x29/\n  $sOpt: [$1]/gms;
    }

    ### p__: scrub the oddball stuff
    s/\n^\x20{4,}/,/gms;
    s/,\x2e{3}//gms;
    s/\x22;//gms;
    s/\x5d[\x2c\x20]+/\x5d/gms;
    $_ .= "\n";
}

### begin_: display result

### p__: show raw converted to yaml
print $sRaw;
print "\n---\n";

### p__: show yaml converted to perl
my $oData = YAML::Load($sRaw);
print Data::Dumper->Dump([$oData], [qw(oDomains)]);

### begin_: end_perl
1;
__END__
libname=foo pathname=/path/to/metadata/foo owner=someuser libaclinherit=no dynlock=no
   roptions="
       datapath=('/data/path1'
                 '/data/path2'
                 '/data/path3'
                 ...)
       indexpath=('/indx/path1'
                  '/indx/path2'
                  '/indx/path3'
                  ...)
       workpath=('/work/path1'
                 '/work/path2'
                 '/work/path3'
                 ...)
       metapath=('/meta/path1'
                 '/meta/path2'
                 '/meta/path3'
                 ...)";

libname=foo pathname=/path/to/metadata/foo owner=someuser libaclinherit=no dynlock=no
   roptions="
       datapath=('/data/path1'
                 '/data/path2'
                 '/data/path3'
                 ...)
       indexpath=('/indx/path1'
                  '/indx/path2'
                  '/indx/path3'
                  ...)
       workpath=('/work/path1'
                 '/work/path2'
                 '/work/path3'
                 ...)
       metapath=('/meta/path1'
                 '/meta/path2'
                 '/meta/path3'
                 ...)";

The Raw-To-YAML conversion gives you something like this:

- domain: begin
  libname: foo
  pathname: "/path/to/metadata/foo"
  owner: "someuser"
  libaclinherit: "no"
  dynlock: "no"
  datapath: ['/data/path1','/data/path2','/data/path3']
  indexpath: ['/indx/path1','/indx/path2','/indx/path3']
  workpath: ['/work/path1','/work/path2','/work/path3']
  metapath: ['/meta/path1','/meta/path2','/meta/path3']
- domain: begin
  libname: foo
  pathname: "/path/to/metadata/foo"
  owner: "someuser"
  libaclinherit: "no"
  dynlock: "no"
  datapath: ['/data/path1','/data/path2','/data/path3']
  indexpath: ['/indx/path1','/indx/path2','/indx/path3']
  workpath: ['/work/path1','/work/path2','/work/path3']
  metapath: ['/meta/path1','/meta/path2','/meta/path3']

The YAML-To-Perl conversion gives you something like this (this is all done for you by YAML, no parsing necessary):

$oDomains = [
  {
    'owner' => 'someuser',
    'indexpath' => [
      '/indx/path1',
      '/indx/path2',
      '/indx/path3'
    ],
    'libaclinherit' => 'no',
    'libname' => 'foo',
    'workpath' => [...]
    ...
];

Even if you cannot store the config files as YAML, you can still use simple regex code to convert them. Sure, you will still have to do a little tweaking and debugging to make sure the YAML output is well-formed, but the leverage you get makes the task much simpler, especially if your perl skills are a tad rusty.

=oQDlNWYsBHI5JXZ2VGIulGIlJXYgQkUPxEIlhGdgY2bgMXZ5VGIlhGV
Re^2: Parsing a complex config file by solitaryrpr (Acolyte) on Jul 12, 2006 at 18:46 UTC
While I have some general control over what goes in the config file, the structure is pretty much defined for me. GrandFather pegged it. Would that I were that able.