Stilgar has asked for the wisdom of the Perl Monks concerning the following question:
I have a bunch of files I need to pull into a hash-of-hashes data structure. The overall record is enclosed in brackets and is composed of KEY, VALUE pairs. The KEY is always text followed by a space, then the VALUE, which can be simple text or another bracketed sub-record. For example:
sys ecm cloud-provider /Common/aws-ec2 { description "The aws-ec2 parameters" property-template { account { } availability-zone { valid-values { a b c d } } instance-type { valid-values { t2.micro t2.small t2.medium } } region { valid-values { us-east-1 us-west-1 } } } }
That's a simple one; the real files contain arbitrarily nested records. It was originally formatted with newlines and spaces as well, but that's been removed. So, for example, KEYS are usually separated by a newline, but sometimes just by spaces. It's always some type of whitespace. I've been trying to parse it with regexes after slurping the file into a scalar, then tried writing a recursive function to do it. Any advice on the best way to approach it would be greatly appreciated!
Re: Parsing bracket formatted file (updated)
by choroba (Cardinal) on Sep 24, 2022 at 11:20 UTC
Output:
Update: Fixed the missing + in the top rule, compacted the output, reverted the order of the merge rule. Update2: Added the default action.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Parsing bracket formatted file
by hv (Prior) on Sep 24, 2022 at 02:07 UTC
This isn't quite sufficiently specified to write code for.
1) You talk about a bracketed "record" and nested within it bracketed "sub-records", but the main record and its single direct sub-record appear to have key-value pairs (like a hash structure), while the further nested sub-records appear to be simple lists (like an array structure). How is it intended to distinguish the one from the other? Is it just that the key "valid-values" introduces a list while anything else introduces a hash?
2) You talk about keys as "text" and values as "simple text" (if not a record), but there's an example of a quoted string ("The aws-ec2 parameters") and the unquoted string values include other punctuation marks. What characters can appear in unquoted text? What types of quoting can appear (double quotes, single quotes, other)? Can quote marks appear inside a quoted string, perhaps escaped somehow? Can quoted text include other whitespace, such as newlines?
3) The example shows four bits of text preceding the main record (sys ecm cloud-provider /Common/aws-ec2). What is supposed to happen with that text? Is it to be ignored?
It would be useful to answer these questions, and confirm the answers by showing what data structure you would ideally like to see from this example (perhaps in the form of Data::Dumper output). Eg:
Re: Parsing bracket formatted file (third update)
by tybalt89 (Monsignor) on Sep 24, 2022 at 09:43 UTC
Something like this?
Outputs:
UPDATE: cleaned up fixhash and added "incomplete parse" check.
SECOND UPDATE: eliminating fixhash() by building it into expr()
Outputs:
THIRD UPDATE: factoring out a regex and shifting things around a little, maybe making things slightly clearer.
Re: Parsing bracket formatted file
by LanX (Saint) on Sep 24, 2022 at 10:49 UTC
> The overall record is enclosed with brackets and is composed of KEY, VALUE pairs.
That doesn't fit the demonstrated sample. There are more types, like LIST and QUOTED-STRINGS, and especially the first "KEY" (?) sys ecm cloud-provider /Common/aws-ec2 is very confusing. It would be better to provide an SSCCE (update: especially the expected output).
> Any advice on the best way to approach it would be greatly appreciated!
Regarding recursive structures
> the VALUE, which can be simple text or another bracketed sub-record.
you might want to have a look at
EDIT
FWIW: I think after tr/-/_/ I could parse this as a non-strict Perl DSL, just by predefining the keywords as subs. But w/o a better specification (what's a keyword, what's a bareword/string) of the desired outcome, there is no point in attempting it.
Cheers Rolf
Re: Parsing bracket formatted file
by perlsherpa (Novice) on Sep 26, 2022 at 05:56 UTC
Which outputs,
Re: Parsing bracket formatted file
by Anonymous Monk on Sep 25, 2022 at 07:05 UTC
Your example shows that "sub-records" can have an odd number of elements and therefore can't all be "composed of KEY, VALUE pairs". Obviously, some sub-records are arrays, not hashes. With such a loose brief, there's room for interpretation as to whether a sub-record should be parsed into an array or a hash. My attempt below assumes "keep arrays for an odd number of elements or if unapproved keys were encountered". (E.g. for the "us-east-1 us-west-1" sub-record, is it a key-value pair or a 2-element list?) Obviously, these rules can be adjusted, but the idea was to let Perl parse the input as Perl source, with only minimal text pre-processing, and always assume arrays. Afterwards, promote some arrays to hashes if they pass the rules mentioned above.
Output:
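The code for the reply above did not survive on this page. As a rough illustration of its second step only (promoting arrays to hashes), here is a minimal sketch in Python rather than Perl. It assumes the input has already been parsed into nested lists of strings. The names `APPROVED_KEYS` and `promote` are hypothetical, the whitelist contents are guessed from the sample record, and keeping an empty sub-record as an empty list is an assumption, since the reply does not say how that case was handled.

```python
# Hypothetical whitelist: the "approved keys" of the original reply are not
# shown, so these are simply the keys that appear in the sample record.
APPROVED_KEYS = {
    "description", "property-template", "account",
    "availability-zone", "instance-type", "region", "valid-values",
}


def promote(node):
    """Return node with lists turned into dicts where the pairing rule allows.

    A list becomes a dict only if it is non-empty, has an even number of
    elements, and every element in a key position is an approved string.
    Anything else stays a list. Duplicate keys would overwrite each other.
    """
    if not isinstance(node, list):
        return node  # a plain string value
    children = [promote(child) for child in node]  # bottom-up
    keys = children[0::2]
    looks_like_pairs = (
        len(children) > 0
        and len(children) % 2 == 0
        and all(isinstance(k, str) and k in APPROVED_KEYS for k in keys)
    )
    if looks_like_pairs:
        return dict(zip(keys, children[1::2]))
    return children


if __name__ == "__main__":
    tree = ["region", ["valid-values", ["us-east-1", "us-west-1"]]]
    print(promote(tree))
```

Under this rule the two-element sub-record "us-east-1 us-west-1" stays a list, because "us-east-1" is not an approved key, while the record that contains "valid-values" becomes a hash.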
Re: Parsing bracket formatted file
by Anonymous Monk on Sep 25, 2022 at 01:39 UTC
Anyway, recursion is fine. It would eventually be good to split the recursion into the opening code (recognizing the key) and the handling of the value, which can obviously again be a key and a value. This way you are able to handle the input line by line. And pass only references; passing the whole scalar could be resource-consuming.
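One reading of this advice, sketched in Python rather than Perl: tokenize once, then let the recursive function receive only the shared token list and an integer position, so no call copies the input text. The token grammar is an assumption (double-quoted strings without escapes, single braces, bare words), and `tokenize`, `parse_items` and `parse` are hypothetical names. The sketch returns nested lists only and does not decide which blocks are hashes.

```python
import re

# Assumed token grammar, not confirmed by the original poster: a
# double-quoted string without escapes, a single brace, or a run of
# characters that are not whitespace, braces, or quotes. Characters that
# match none of these (an unmatched quote) are skipped silently.
TOKEN = re.compile(r'"([^"]*)"|([{}])|([^\s{}"]+)')


def tokenize(text):
    """Return (kind, value) pairs; kind is 'brace' or 'word'."""
    tokens = []
    for match in TOKEN.finditer(text):
        quoted, brace, bare = match.groups()
        if brace:
            tokens.append(("brace", brace))
        else:
            # A quoted "{" is tagged as a word, never as a brace.
            tokens.append(("word", quoted if quoted is not None else bare))
    return tokens


def parse_items(tokens, pos):
    """Parse items from tokens[pos] up to a closing brace or end of input.

    Returns (items, new_pos). Each nested block becomes a nested list.
    """
    items = []
    while pos < len(tokens) and tokens[pos] != ("brace", "}"):
        kind, value = tokens[pos]
        if (kind, value) == ("brace", "{"):
            block, pos = parse_items(tokens, pos + 1)
            if pos >= len(tokens):
                raise ValueError("missing closing brace")
            items.append(block)
            pos += 1  # step over the closing brace
        else:
            items.append(value)
            pos += 1
    return items, pos


def parse(text):
    """Parse a whole input string into nested lists of strings."""
    tokens = tokenize(text)
    items, pos = parse_items(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unbalanced closing brace at token %d" % pos)
    return items


if __name__ == "__main__":
    print(parse('region { valid-values { us-east-1 us-west-1 } }'))
```

Because the parser only consumes whitespace-separated tokens, it does not matter whether keys were separated by newlines or by spaces in the original files.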