PerlMonks  

Parsing bracket formatted file

by Stilgar (Initiate)
on Sep 23, 2022 at 23:13 UTC (#11147096, perlquestion)

Stilgar has asked for the wisdom of the Perl Monks concerning the following question:

I have a bunch of files I need to pull into a hash-of-hashes data structure. The overall record is enclosed in brackets and is composed of KEY, VALUE pairs. The KEY is always text followed by a space, then the VALUE, which can be simple text or another bracketed sub-record. For example:

sys ecm cloud-provider /Common/aws-ec2 { description "The aws-ec2 parameters" property-template { account { } availability-zone { valid-values { a b c d } } instance-type { valid-values { t2.micro t2.small t2.medium } } region { valid-values { us-east-1 us-west-1 } } } }

That's a simple one; the real records are nested to arbitrary depth. It was originally formatted with newlines and spaces as well, but that's been removed. So, for example, KEYS are usually separated by a newline, but sometimes just by spaces. It's always some type of whitespace. I first tried to parse it with regexes after slurping the file into a scalar, then tried writing a recursive function to do it. Any advice on the best way to approach it would be greatly appreciated!

Replies are listed 'Best First'.
Re: Parsing bracket formatted file (updated)
by choroba (Cardinal) on Sep 24, 2022 at 11:20 UTC
    Use Marpa::R2 (or a similar module) to write a parser for the format.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    use Marpa::R2;
    use Data::Dumper;

    my $input = q(sys ecm cloud-provider /Common/aws-ec2 {
        description "The aws-ec2 parameters"
        property-template {
            account { }
            availability-zone {
                valid-values { a b c d }
            }
            instance-type {
                valid-values { t2.micro t2.small t2.medium }
            }
            region {
                valid-values { us-east-1 us-west-1 }
            }
        }
    });

    my $dsl = << '__DSL__';

    lexeme default = latm => 1
    :default ::= action => first

    Top      ::= atom Attrs Struct       action => top
    Attrs    ::= atom Attrs              action => merge
              |  atom                    action => newlist
    Struct   ::= ('{') Elements ('}')
    Elements ::= Element Elements        action => merges
              |  Element
    Element  ::= atom Value              action => struct
              |  atom Struct             action => struct
    Value    ::= Struct
              |  ('"') string ('"')
              |  ('{ }')                 action => empty
              || ('{') List ('}')
    List     ::= atom List               action => merge
              |  atom                    action => newlist

    :discard ~ [\s]
    string   ~ [^"]+
    atom     ~ [^\s{}]+

    __DSL__

    sub top     { +{ $_[1] => { attrs => $_[2], contents => $_[3] } } }
    sub first   { $_[1] }
    sub empty   { [] }
    sub newlist { [ $_[1] ] }
    sub merge   { [ $_[1], @{ $_[2] } ] }
    sub struct  { +{ $_[1] => $_[2] } }
    sub merges  { +{ %{ $_[1] }, %{ $_[2] } } }

    my $grammar = 'Marpa::R2::Scanless::G'->new({ source => \$dsl });
    my $recce   = 'Marpa::R2::Scanless::R'->new({
        grammar           => $grammar,
        semantics_package => 'main',
    });
    $recce->read(\$input);

    # Collapse runs of spaces to compact the dump.
    print Dumper($recce->value) =~ s/ {2,}/ /gr;
    Output:
    $VAR1 = \{
      'sys' => {
        'contents' => {
          'property-template' => {
            'account' => [],
            'region' => {
              'valid-values' => [ 'us-east-1', 'us-west-1' ]
            },
            'availability-zone' => {
              'valid-values' => [ 'a', 'b', 'c', 'd' ]
            },
            'instance-type' => {
              'valid-values' => [ 't2.micro', 't2.small', 't2.medium' ]
            }
          },
          'description' => 'The aws-ec2 parameters'
        },
        'attrs' => [ 'ecm', 'cloud-provider', '/Common/aws-ec2' ]
      }
    };

    Update: Fixed the missing + in the top rule, compacted the output, reverted the order of the merge rule.

    Update2: Added the default action.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Parsing bracket formatted file
by hv (Parson) on Sep 24, 2022 at 02:07 UTC

    This isn't quite sufficiently specified to write code for.

    1) You talk about a bracketed "record" and nested within it bracketed "sub-records", but the main record and its single direct sub-record appear to have key-value pairs (like a hash structure), while the further nested sub-records appear to be simple lists (like an array structure). How is it intended to distinguish the one from the other? Is it just that the key "valid-values" introduces a list while anything else introduces a hash?

    2) You talk about keys as "text" and values as "simple text" (if not a record), but there's an example of a quoted string ("The aws-ec2 parameters") and the unquoted string values include other punctuation marks. What characters can appear in unquoted text? What types of quoting can appear (double quotes, single quotes, other)? Can quote marks appear inside a quoted string, perhaps escaped somehow? Can quoted text include other whitespace, such as newlines?

    3) The example shows four bits of text preceding the main record (sys ecm cloud-provider /Common/aws-ec2). What is supposed to happen with that text? Is it to be ignored?

    It would be useful to answer these questions, and confirm the answers by showing what data structure you would ideally like to see from this example (perhaps in the form of Data::Dumper output). Eg:

    $VAR1 = {
      'description' => 'The aws-ec2 parameters',
      'property-template' => {
        'account' => {},
        'availability-zone' => {
          'valid-values' => [ 'a', 'b', 'c', 'd' ]
        },
        'instance-type' => {
          'valid-values' => [ 't2.micro', 't2.small', 't2.medium' ]
        },
        'region' => {
          'valid-values' => [ 'us-east-1', 'us-west-1' ]
        }
      }
    };
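
    A minimal tokenizer sketch shows why question 2 matters: the token pattern can only be written down once the quoting rules are known. The escaping rule used here (a backslash protects the next character inside double quotes) is purely an assumption for illustration, and the names tokenize and $TOKEN are hypothetical; nothing in the original post confirms any of it.

```perl
use strict;
use warnings;

# ASSUMPTION (not confirmed by the original post): inside double quotes,
# a backslash escapes the following character, so \" is a literal quote.
my $TOKEN = qr/
      ( [{}] )                       # group 1: an opening or closing brace
    | " ( (?: [^"\\] | \\. )* ) "    # group 2: body of a double-quoted string
    | ( [^\s{}"]+ )                  # group 3: a bare word
/xs;

# Returns a reference to a list of [ type, text ] pairs; dies on bad input.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while ( $text =~ /\G\s*$TOKEN/gc ) {
        if ( defined $1 ) {
            push @tokens, [ brace => $1 ];
        }
        elsif ( defined $2 ) {
            my $string = $2;
            $string =~ s/\\(.)/$1/gs;    # drop the escaping backslashes
            push @tokens, [ string => $string ];
        }
        else {
            push @tokens, [ word => $3 ];
        }
    }

    # Anything left other than trailing whitespace could not be tokenized,
    # for example an unterminated quoted string.
    $text =~ /\G\s*\z/gc
        or die "cannot tokenize input at offset ", pos($text) // 0, "\n";
    return \@tokens;
}

my $tokens = tokenize('description "say \"hi\"" { a-1 }');
print join( ' | ', map { "$_->[0]:$_->[1]" } @$tokens ), "\n";
```

    If single quotes or multi-line strings turn out to be legal, only $TOKEN needs to change; a parser built on top of the token list would stay the same.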
Re: Parsing bracket formatted file (third update)
by tybalt89 (Monsignor) on Sep 24, 2022 at 09:43 UTC

    Something like this ?

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11147096
    use warnings;

    my $testdata = <<END;
    sys ecm cloud-provider /Common/aws-ec2 {
        description "The aws-ec2 parameters"
        property-template {
            account { }
            availability-zone {
                valid-values { a b c d }
            }
            instance-type {
                valid-values { t2.micro t2.small t2.medium }
            }
            region {
                valid-values { us-east-1 us-west-1 }
            }
        }
    }
    END

    local $_ = $testdata; # NOTE expr() expect input in $_
    my $parse = expr();

    use Data::Dump 'dd'; dd $parse;

    use List::AllUtils qw( all );

    sub fixhash
      {
      local $_ = shift;
      if( ref $_ eq 'ARRAY' )
        {
        all { ref $_ eq 'HASH' } @$_ and
          return { map { map fixhash($_), %$_ } @$_ };
        return [ map fixhash($_), @$_ ];
        }
      elsif( ref $_ eq 'HASH' )
        {
        return { map fixhash($_), %$_ };
        }
      else
        { return $_ };
      }

    sub expr
      {
      /\G\s+/gc;
      my $e = [];
      $e =
        /\G\s+/gc ? $e :
        /\G([^{}\s]+) "(.*?)"/gc ? [ @$e, { $1 => $2 } ] :
        /\G([^{}\s]+) \{/gc ? [ @$e, { "$1" => (expr(), /\G\}/gc ||
          die pos($_), ' b missing }', substr $_, pos($_))[0] } ] :
        /\G([^{}\s]+)/gc ? [ @$e, $1 ] :
        return fixhash($e) while 1;
      }

    Outputs:

    [
      "sys",
      "ecm",
      "cloud-provider",
      {
        "/Common/aws-ec2" => {
          "description" => "The aws-ec2 parameters",
          "property-template" => {
            "account" => {},
            "availability-zone" => { "valid-values" => ["a" .. "d"] },
            "instance-type" => { "valid-values" => ["t2.micro", "t2.small", "t2.medium"] },
            "region" => { "valid-values" => ["us-east-1", "us-west-1"] },
          },
        },
      },
    ]

    UPDATE: cleaned up fixhash and added "incomplete parse" check.

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11147096
    use warnings;

    my $testdata = <<END;
    sys ecm cloud-provider /Common/aws-ec2 {
        description "The aws-ec2 parameters"
        property-template {
            account { }
            availability-zone {
                valid-values { a b c d }
            }
            instance-type {
                valid-values { t2.micro t2.small t2.medium }
            }
            region {
                valid-values { us-east-1 us-west-1 }
            }
        }
    }
    END

    sub expr
      {
      /\G\s+/gc;
      my $e = [];
      $e =
        /\G\s+/gc ? $e :
        /\G([^{}\s]+) "(.*?)"/gc ? [ @$e, { $1 => $2 } ] :
        /\G([^{}\s]+) \{/gc ? [ @$e, { "$1" => (expr(), /\G\}/gc ||
          die pos($_), ' missing }', substr $_, pos($_))[0] } ] :
        /\G([^{}\s]+)/gc ? [ @$e, $1 ] :
        return $e while 1;
      }

    use List::AllUtils qw( all );

    sub fixhash
      {
      local $_ = shift;
      return ref $_ eq 'ARRAY'
        ? ( all { ref $_ eq 'HASH' } @$_ )
          ? { map fixhash($_), map %$_, @$_ }
          : [ map fixhash($_), @$_ ]
        : ref $_ eq 'HASH' ? { map fixhash($_), %$_ }
        : $_;
      }

    local $_ = $testdata; # NOTE expr() expects input in $_
    my $parse = fixhash expr();
    pos($_) < length $_ and die "incomplete parse ", substr $_, pos($_);

    use Data::Dump 'dd'; dd $parse;

    SECOND UPDATE: eliminating fixhash() by building it into expr()

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11147096
    use warnings;
    use List::AllUtils qw( all );

    my $testdata = <<END;
    sys ecm cloud-provider /Common/aws-ec2 {
        description "The aws-ec2 parameters"
        property-template {
            account { }
            availability-zone {
                valid-values { a b c d }
            }
            instance-type {
                valid-values { t2.micro t2.small t2.medium }
            }
            region {
                valid-values { us-east-1 us-west-1 }
            }
        }
    }
    END

    sub expr
      {
      /\G\s+/gc;
      my $e = [];
      $e =
        /\G\s+/gc ? $e :
        /\G([^{}\s]+) "(.*?)"/gc ? [ @$e, { $1 => $2 } ] :
        /\G([^{}\s]+) \{/gc ? [ @$e, { "$1" => (expr(), /\G\}/gc ||
          die pos($_), ' missing } ', substr $_, pos($_))[0] } ] :
        /\G([^{}\s]+)/gc ? [ @$e, $1 ] :
        return ref $e eq 'ARRAY' && ( all { ref $_ eq 'HASH' } @$e )
          ? { map %$_, @$e } : $e
        while 1
      }

    local $_ = $testdata; # NOTE expr() expects input in $_
    my $parse = expr();
    pos($_) < length $_ and die "incomplete parse ", substr $_, pos($_);

    $Data::Dump::LINEWIDTH = 26;
    use Data::Dump 'dd'; dd $parse;

    Outputs:

    [
      "sys",
      "ecm",
      "cloud-provider",
      {
        "/Common/aws-ec2" => {
          "description" => "The aws-ec2 parameters",
          "property-template" => {
            "account" => {},
            "availability-zone" => {
              "valid-values" => ["a" .. "d"],
            },
            "instance-type" => {
              "valid-values" => [
                "t2.micro",
                "t2.small",
                "t2.medium",
              ],
            },
            "region" => {
              "valid-values" => ["us-east-1", "us-west-1"],
            },
          },
        },
      },
    ]

    THIRD UPDATE: factoring out a regex and shifting things around a little, maybe making things slightly clearer.

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11147096
    use warnings;

    sub expr
      {
      my $val = [];
      $val =
        /\G\s+/gc ? $val :
        /\G([^{}\s]+)/gc ? do { my $key = $1; [ @$val,
          /\G \{/gc ? { $key => (expr(), /\G\}/gc || die ' missing } ')[0] } :
          /\G "(.*?)"/gc ? { $key => $1 } :
          $key ] } :
        return ref $val eq 'ARRAY' && ( @$val == grep ref $_ eq 'HASH', @$val )
          ? { map %$_, @$val } : $val
        while 1
      }

    sub parse
      {
      local $_ = join '', @_;
      my $parse = expr;
      pos($_) == length $_ or
        die "incomplete parse stopped at ", substr $_, pos($_);
      return $parse;
      }

    my $parse = parse( <DATA> );

    $Data::Dump::LINEWIDTH = 26;
    use Data::Dump 'dd'; dd $parse;

    __DATA__
    sys ecm cloud-provider /Common/aws-ec2 {
        description "The aws-ec2 parameters"
        property-template {
            account { }
            availability-zone {
                valid-values { a b c d }
            }
            instance-type {
                valid-values { t2.micro t2.small t2.medium }
            }
            region {
                valid-values { us-east-1 us-west-1 }
            }
        }
    }
Re: Parsing bracket formatted file
by LanX (Sage) on Sep 24, 2022 at 10:49 UTC
    I have to second what hv already said, this description

    > The overall record is enclosed with brackets and is composed of KEY, VALUE pairs.

    doesn't fit the demonstrated sample. There are more types like LIST and QUOTED-STRINGS and especially the first "KEY" (?) sys ecm cloud-provider /Common/aws-ec2 is very confusing.

    You would do better to provide an SSCCE (update: especially the expected output).

    > Any advice on the best way to approach it would be greatly appreciated!

    Regarding recursive structures

    > the VALUE, which can be simple text or another bracketed sub-record.

    you might want to have a look at

    EDIT

    FWIW: I think after tr/-/_/ I could parse this as a non-strict Perl DSL, just by predefining the keywords as subs. But without a better specification of the desired outcome (what's a keyword, what's a bareword/string), there is no point in attempting it.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Re: Parsing bracket formatted file
by perlsherpa (Initiate) on Sep 26, 2022 at 05:56 UTC
    In the spirit of TIMTOWTDI, I figured the record looked similar enough to a Perl data structure to warrant some gratuitous use of regular expressions. So I came up with the following:
    use strict;
    use warnings;

    # parse the following text block (as given) - not trying
    # to make a general solution here...
    my $text = do { local $/; <DATA> };
    $text =~ s/^(.+)\{/qw|$1|,\n\r{/;
    $text =~ s/ \{/ => {/g;
    $text =~ s/{(.+)}/\[qw\/$1\/\]/g;
    $text =~ s/}/},/g;
    $text =~ s/]/],/g;
    $text =~ s/(\w+-\w+) =/'$1' =/g;
    $text =~ s/ (\w+) "(.+)"/ $1 => "$2",/;
    $text = sprintf qq{[%s]}, $text;

    my $record1 = eval $text;

    require Data::Dumper;
    print Data::Dumper::Dumper($record1);

    __DATA__
    sys ecm cloud-provider /Common/aws-ec2 {
        description "The aws-ec2 parameters"
        property-template {
            account { }
            availability-zone {
                valid-values { a b c d }
            }
            instance-type {
                valid-values { t2.micro t2.small t2.medium }
            }
            region {
                valid-values { us-east-1 us-west-1 }
            }
        }
    }
    Which outputs,
    $VAR1 = [
      'sys',
      'ecm',
      'cloud-provider',
      '/Common/aws-ec2',
      {
        'property-template' => {
          'account' => [],
          'region' => {
            'valid-values' => [ 'us-east-1', 'us-west-1' ]
          },
          'instance-type' => {
            'valid-values' => [ 't2.micro', 't2.small', 't2.medium' ]
          },
          'availability-zone' => {
            'valid-values' => [ 'a', 'b', 'c', 'd' ]
          }
        },
        'description' => 'The aws-ec2 parameters'
      }
    ];
Re: Parsing bracket formatted file
by Anonymous Monk on Sep 25, 2022 at 07:05 UTC

    Your example shows that "sub-records" can have an odd number of elements and therefore can't all be "composed of KEY, VALUE pairs". Obviously, some sub-records are arrays, not hashes. With such a loose brief, there's room for interpretation as to whether a sub-record should be parsed into an array or a hash. (E.g. for the "us-east-1 us-west-1" sub-record, is that a key-value pair or a 2-element list?) My attempt below assumes "keep arrays for an odd number of elements, or if unapproved keys were encountered". Obviously, these rules can be adjusted, but the idea was to let Perl parse the input as Perl source, with only minimal text pre-processing, and always assume arrays. Afterwards, some arrays are promoted to hashes if they pass the rules mentioned above.

    use strict;
    use warnings;

    use Data::Dumper;

    my $s = <<'END';
    sys ecm cloud-provider /Common/aws-ec2 {
        description "The aws-ec2 parameters"
        property-template {
            account { }
            availability-zone {
                valid-values { a b c d }
            }
            instance-type {
                valid-values { t2.micro t2.small t2.medium }
            }
            region {
                valid-values { us-east-1 us-west-1 }
            }
        }
    }
    END

    my %valid_keys = map { $_, 1 } qw/
        description
        property-template
        account
        availability-zone
        instance-type
        valid-values
        region
    /;

    $s =~ s/
        (?|
            "([^"{}]+)"
            |
            (?:^|(?<=\s)) ([^\s{}]+) (?:$|(?=\s))
        )
    /"$1"/xg;

    $s =~ s/("|})\s\K/,/g;
    $s =~ tr/{}/[]/;

    print Dumper fix_aref([ eval $s ]);

    sub fix_aref {
        my $aref = shift;
        ref and $_ = fix_aref($_) for @$aref;
        return $aref unless $#$aref % 2;
        for ( 0 .. $#$aref ) {
            next if $_ % 2;
            return $aref unless exists $valid_keys{ ${$aref}[$_] }
        }
        return +{ @$aref }
    }

    Output:

    $VAR1 = [
      'sys',
      'ecm',
      'cloud-provider',
      '/Common/aws-ec2',
      {
        'property-template' => {
          'region' => {
            'valid-values' => [ 'us-east-1', 'us-west-1' ]
          },
          'availability-zone' => {
            'valid-values' => [ 'a', 'b', 'c', 'd' ]
          },
          'instance-type' => {
            'valid-values' => [ 't2.micro', 't2.small', 't2.medium' ]
          },
          'account' => {}
        },
        'description' => 'The aws-ec2 parameters'
      }
    ];
Re: Parsing bracket formatted file
by Anonymous Monk on Sep 25, 2022 at 01:39 UTC
    You do not explain where your trouble is.
    Anyway, recursion is fine.
    It would probably help to split the recursion into two steps: the opening code that recognizes the key, and the code that handles the value, which can itself be another key and value.

    That way you can handle the input line by line.
    And pass only references; passing the whole scalar around could be resource-consuming.
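
    The key-then-value recursion described above might be sketched as follows. This is only one possible reading, not a definitive implementation: the names tokenize, finish and parse_block are hypothetical, and the sketch assumes that a value is either a bracketed block or a double-quoted string, so a bare word with no such value after it is kept as a plain list element. Only a reference to the token array is handed down through the recursion.

```perl
use strict;
use warnings;
use Data::Dumper;

# Split the text into braces, double-quoted strings, and bare words.
sub tokenize {
    my ($text) = @_;
    return [ $text =~ /[{}]|"[^"]*"|[^\s{}"]+/g ];
}

# Turn the collected items into a hash when every item is a key/value pair
# (an empty block also becomes an empty hash); otherwise keep the list.
sub finish {
    my ($items) = @_;
    my $pairs = grep { ref($_) eq 'HASH' } @$items;
    return $pairs == @$items ? +{ map { %$_ } @$items } : $items;
}

# Parse items up to the matching '}' (or to end of input at the top level).
# $tokens is a reference that is consumed in place; no text is copied.
sub parse_block {
    my ( $tokens, $nested ) = @_;
    my @items;
    while (@$tokens) {
        my $key = shift @$tokens;
        if ( $key eq '}' ) {
            die "unexpected '}'\n" unless $nested;
            return finish( \@items );
        }
        die "unexpected '{'\n" if $key eq '{';

        # Opening step done: the key is known, now look at what follows it.
        my $next = $tokens->[0] // '';
        if ( $next eq '{' ) {                 # value is a nested block
            shift @$tokens;
            push @items, +{ $key => parse_block( $tokens, 1 ) };
        }
        elsif ( $next =~ /\A"(.*)"\z/s ) {    # value is a quoted string
            shift @$tokens;
            push @items, +{ $key => $1 };
        }
        else {                                # no value: a plain list element
            push @items, $key;
        }
    }
    die "missing '}'\n" if $nested;
    return finish( \@items );
}

my $text = 'sys ecm cloud-provider /Common/aws-ec2 { description "The aws-ec2 parameters" '
         . 'property-template { account { } region { valid-values { us-east-1 us-west-1 } } } }';
my $result = parse_block( tokenize($text), 0 );
print Dumper($result);
```

    Because the tokenizer treats every kind of whitespace alike, it makes no difference whether keys are separated by newlines or by spaces.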
