Tricky Parsing Ko'an

jmmistrot has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks, OK, I have been meditating on this for a while and can't seem to find enlightenment. I have been lurking here for years and usually find what I am looking for by searching around the site. This question is more of a regexp problem than anything else, I suppose. Here is what I am doing: I am trying to parse the following from a file:


//File Snippet Begin//
.
float myVarA 
<
float UIMin = 1;
float UIMax = 0;
float UIStep = .001;
string UIWidget = "slider";
> =  0.5f;

float myVarB = 1.0;

float4 myVarC = {1,0,0,1};
.
//File Snippet End//

What I would like to do is capture the info in a hash whose keys are the variable names listed above. For example, let's say I parsed myVarA; this is what I want to capture:

$vars{myVarA}->{type} = "float"
$vars{myVarA}->{value} = 0.5
$vars{myVarA}->{ui}->{min}=1
$vars{myVarA}->{ui}->{max}=0
$vars{myVarA}->{ui}->{step}=0.001
$vars{myVarA}->{ui}->{widget}="slider"

The difficulty (at least for me) is capturing the "UI" information only if it exists. I have tried different regex fencings but am unable to capture both variables with "UI" information and variables without it. The real problem is the data types inside the "UI" description, which is delimited by the "<" and ">" characters. I also tried simplifying things by loading the whole file into a string, marking up the data types, and throwing out the whitespace, as follows:

#after dumping file to string
$file_str=~ s/(float|float2|float3|float4|string)\s+/$+\#/g; 
$file_str =~ s/\s*//g;

But even in this form I can't come up with a way to split off the UI description because of the data types inside the angle brackets. I tried:

$file_str=~s/\<.*(string|float|float2|float3|float4).*\>//g;

as well as:

@lines = split/(float|float2|float3|float4)\#\<.*\>=.*\;/,$file_str;

but no go... I am hoping that the enlightened here can show me the way.. :)
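
A side note on the `<.*>` attempts above, as a minimal sketch: `.*` is greedy, so on a whitespace-stripped string with more than one annotated variable it runs from the first `<` to the last `>`. A negated character class such as `[^>]*` stops at the first `>` instead. The input string and the second variable `myVarD` are made up for illustration, and the sketch assumes that no `>` ever appears inside a UI block.

```perl
use strict;
use warnings;

# Hypothetical whitespace-stripped input with two annotated variables.
my $str = 'float#myVarA<float#UIMin=1;>=0.5f;float#myVarD<float#UIMax=2;>=1.0;';

# Greedy: .* runs to the LAST '>' in the string, so one match spans both blocks.
my @greedy = $str =~ /<(.*)>/g;

# Bounded: [^>]* cannot cross a '>', so each block is matched separately.
my @bounded = $str =~ /<([^>]*)>/g;

print scalar(@greedy),  " greedy match(es): @greedy\n";
print scalar(@bounded), " bounded match(es): @bounded\n";
```

With the bounded form, each UI description can be captured or removed on its own, which leaves only plain `type name = value;` declarations to deal with afterwards.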


Replies are listed 'Best First'.
Re: Tricky Parsing Ko'an by BrowserUk (Patriarch) on Jan 22, 2007 at 03:18 UTC
There's too much missing information for a complete and tested solution, and if your 'structure' thingies can be nested, then this probably won't help, but on the basis of what you've posted this seems to come pretty close to your requirements:

#! perl -slw
use strict;
use Data::Dumper;

my $re = qr[
    (string|float|float2|float3|float4) \s+
    (\w+) \s*
    (?: < \s* ( [^>]+? ) \s* > \s* )?
    = \s* ( \S+ ) \s* ;
]x;

my $data = do{ local $/; <DATA> };

my %vars;
while( $data =~ m[$re]smg ) {
    my( $type, $name, $structure, $value ) = ( $1, $2, $3, $4 );
    $vars{ $name } = { type => $type, value => $value };
    if( defined $structure ) {
        while( $structure =~ m[$re]smg ) {
            my( $stype, $sname, $svalue ) = ( $1, $2, $4 );
            if( my( $prefix, $rest ) = $sname =~ m[([A-Z]+)([A-Z][a-z]+)] ) {
                $vars{ $name }{ $prefix }{ $rest } = $svalue;
            }
        }
    }
}

print Dumper \%vars;

__DATA__
float myVarA
<
float UIMin = 1;
float UIMax = 0;
float UIStep = .001;
string UIWidget = "slider";
> = 0.5f;

float myVarB = 1.0;

float4 myVarC = {1,0,0,1};

It produces:

C:\test>junk2
$VAR1 = {
    'myVarA' => {
        'UI' => {
            'Step' => '.001',
            'Max' => '0',
            'Widget' => '"slider"',
            'Min' => '1'
        },
        'value' => '0.5f',
        'type' => 'float'
    },
    'myVarC' => {
        'value' => '{1,0,0,1}',
        'type' => 'float4'
    },
    'myVarB' => {
        'value' => '1.0',
        'type' => 'float'
    }
};

You'd probably need to strip the quotes around string initialisers, and you don't specify how the initialiser {1,0,0,1} should be treated.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
Re^2: Tricky Parsing Ko'an by jmmistrot (Sexton) on Jan 25, 2007 at 06:26 UTC
BrowserUK, that output looks like exactly what I want! Thanks, I will give it a run. Cheers, jm
Re: Tricky Parsing Ko'an by kyle (Abbot) on Jan 22, 2007 at 03:34 UTC
This was quite a bit shorter before I threw in all the comments.

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my %vars;
my $data = do{ local $/; <DATA> };

# first, get the items that have a set of UI elements
while ( $data =~ s/
    (\S+)          # non-spaces -- the type
    \s+            # spaces
    (\S+)          # non-spaces -- the name
    \s+            # spaces (including newlines)
    \<             # opening angle thingy
    \s+            # spaces (including newlines)
    (              # start capturing the UI stuff
      (?:          # grouping UI lines
        \S+        # non-spaces -- the UI stuff type
        \s+        # spaces
        UI\S+      # UI name
        \s*=\s*    # equal sign, maybe spaces
        .*         # anything except a newline
        \;\s*      # trailing semicolon, optional spaces (including newline)
      )+           # UI lines repeat
    )              # capture all the UI lines
    \s*            # optional spaces (including newline)
    \>             # ending angle thingy
    \s*=\s*        # equal sign, maybe spaces
    (.*)           # the main variable value
    \;             # trailing semicolon
  //mx ) {
    my ( $main_type, $main_varname, $ui_stuff, $main_value )
        = ( $1, $2, $3, $4 );
    my $ui_ref = {};

    # pull out the individual UI elements
    while ( $ui_stuff =~ s/
        \S+        # non-spaces -- the ui element type
        \s+        # spaces
        UI(\S+)    # UI element name
        \s*=\s*    # equal sign, maybe spaces
        (.*)       # UI element value
        \;         # trailing semicolon
      //mx ) {
        my ( $name, $value ) = ( $1, $2 );

        # lowercase the name
        $name =~ tr/A-Z/a-z/;

        $ui_ref->{$name} = $value;
    }

    $vars{ $main_varname }{ type }  = $main_type;
    $vars{ $main_varname }{ value } = $main_value;
    $vars{ $main_varname }{ ui }    = $ui_ref;
}

# get the rest of the stuff that doesn't have UI thingies
while ( $data =~ s/
    (\S+)          # type
    \s+            # spaces
    (\S+)          # name
    \s*=\s*        # equal, maybe spaces
    (.*)           # value
    \;             # trailing semicolon
  //mx ) {
    my ( $type, $name, $value ) = ( $1, $2, $3 );
    $vars{$name}{value} = $value;
    $vars{$name}{type}  = $type;
}

print Dumper( \%vars );

__DATA__
float myVarA
<
float UIMin = 1;
float UIMax = 0;
float UIStep = .001;
string UIWidget = "slider";
> = 0.5f;

float myVarB = 1.0;

float4 myVarC = {1,0,0,1};

Output:

$VAR1 = {
    'myVarA' => {
        'ui' => {
            'step' => '.001',
            'min' => '1',
            'max' => '0',
            'widget' => '"slider"'
        },
        'value' => '0.5f',
        'type' => 'float'
    },
    'myVarC' => {
        'value' => '{1,0,0,1}',
        'type' => 'float4'
    },
    'myVarB' => {
        'value' => '1.0',
        'type' => 'float'
    }
};

There are a few loose ends still (the trailing "f" on the float value, for instance), but those should be easy to clean up.
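
The loose ends mentioned above could be tidied with a small post-processing step. What follows is only a sketch: `clean_value` is a hypothetical helper, not part of the code above, and it assumes that brace initialisers are flat comma-separated lists and that the `f` suffix only ever follows a numeric literal.

```perl
use strict;
use warnings;

# Hypothetical helper: normalise one raw captured value.
sub clean_value {
    my ($raw) = @_;
    $raw =~ s/^\s+|\s+$//g;    # trim surrounding whitespace

    # "slider" -> slider
    return $1 if $raw =~ /^"(.*)"$/;

    # {1,0,0,1} -> array ref [1, 0, 0, 1] (assumes a flat list, no nesting)
    if ( $raw =~ /^\{(.*)\}$/ ) {
        return [ map { clean_value($_) } split /,/, $1 ];
    }

    # 0.5f -> 0.5 (only strip the suffix from something that looks numeric)
    $raw =~ s/^([-+]?(?:\d+\.?\d*|\.\d+))f$/$1/;
    return $raw;
}

print clean_value('0.5f'), "\n";
print clean_value('"slider"'), "\n";
print join( ' ', @{ clean_value('{1,0,0,1}') } ), "\n";
```

Each `value` (and each UI element value) could be passed through such a helper before it is stored in `%vars`.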
Re^2: Tricky Parsing Ko'an by jmmistrot (Sexton) on Jan 25, 2007 at 06:29 UTC
Awesome, thanks!
Re: Tricky Parsing Ko'an by dewey (Pilgrim) on Jan 22, 2007 at 03:33 UTC
Let me see if I have this straight. You have a file, some sections of which are enclosed in hoinkies (angle brackets). There is only one level, meaning that nothing will be inside two sets of hoinkies. The file starts with a non-hoinky character. Also, there will be no other uses of brackets; for example:

if(x<10){string oops = ">";}

would not appear. Given these assumptions, you want to pull from this file the sections which are enclosed by the hoinkies. Is this accurate? Here is some code which I believe will solve this problem:

my @separated = split /<|>/, $snippet;

This way, every even entry in the array (starting with the 0th) will have non-UI-related text in it and every odd entry will have UI text in it. Maybe not enlightened, but it works for me ;) If the problem is more complex than this, regexen may not be the greatest solution... parsing code is tough and some module may be able to help.

Update: After working on this for a while... parsing code is so hard! What if a value contains <, or >, "semi;", /float(.*)=\;/, etc.? I don't have a module to suggest, but this seems analogous to HTML parsing: doing it with regexen will be full of difficulties and exceptions, and doing it with a module would be preferable. If you can change the format of the UI data it might also make your job easier.

Good luck!
~dewey
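
To make the even/odd behaviour of that split concrete, here is a small sketch on a shortened copy of the sample data. The last two statements illustrate the caveat from the update; the `UIHint` declaration is invented purely to show a value that contains a bracket.

```perl
use strict;
use warnings;

my $snippet = <<'END';
float myVarA
<
float UIMin = 1;
string UIWidget = "slider";
> = 0.5f;
float myVarB = 1.0;
END

# Split on either angle bracket; the pieces alternate outside/inside.
my @separated = split /<|>/, $snippet;

for my $i ( 0 .. $#separated ) {
    my $kind = $i % 2 ? 'UI' : 'non-UI';
    ( my $text = $separated[$i] ) =~ s/\s+/ /g;    # squeeze whitespace for display
    print "$i ($kind): $text\n";
}

# The assumption breaks as soon as a value contains a bracket
# (hypothetical declaration, not from the original file):
my @broken = split /<|>/, 'string s < string UIHint = "a>b"; > = "x";';
print scalar(@broken), " pieces instead of 3\n";
```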
Re: Tricky Parsing Ko'an by BerntB (Deacon) on Jan 22, 2007 at 04:15 UTC
This isn't a quarter as elegant as the already posted solutions, but I'll take the shame of posting, since it might be relevant anyway. :-)

This does a recursive traversal, so it allows hierarchical variable specifications.

(I have to learn to write better regexps. They are so damn elegant when they parse a large chunk in one go...)

I didn't understand how to do the prefix things either (all uc chars except the first, as BrowserUK did it?)

use strict;
use warnings;
use Carp;
use Dumpvalue;

my $d = new Dumpvalue;
$d->compactDump(1);

my($h) = {};
my(@lines) = <DATA>;
parse_var_spec($h, \@lines);
$d->dumpValue($h);

# This is used recursively.
sub parse_var_spec {
    my($hash, $lines, $in_sub_parse) = @_;

    # This is probably very inefficient, so rewrite to send an offset
    # along, instead. :-)
    while(scalar(@$lines)) {
        my $l = shift @$lines;
        next if $l =~ m"^\s*#";    # Comment.
        next if $l =~ /^\s*$/;     # Empty line.

        # Handle return if parsed a subdef:
        if ($in_sub_parse && $l =~ s/^\s*>//) {
            # Value will come directly after this.
            unshift @$lines, $l;
            return;
        }

        if ($l =~ s/^\s*(float|string)\s+([a-zA-Z][a-zA-Z0-9_]*)//) {
            # Got type and var name.
            my($type) = $1;
            my($var)  = $2;
            $hash->{$var}->{type} = $type;

            # Are there subdata?
            if ($l =~ /^\s*$/) {
                $l = $lines->[0];
                croak "Bad def of $type '$var'" if !($l =~ s/^\s*<\s*//);
                $lines->[0] = $l;    # Put back without '<'.
                my(%subs);
                # Recursive call that parses a bit differently:
                parse_var_spec(\%subs, $lines, 1);

                # Setup sub-values:
                # Will it always be a 'UI' prefix? Should you look at the
                # start of the variables?
                # Ah, do the details as an exercise.
                $hash->{$var}->{ui} = \%subs;
                $l = shift @$lines;
            }

            # Now, is it just a value?
            if ($l =~ /^\s*=\s*(.*)\s*;\s*$/) {
                my($val) = $1;
                if ($type eq 'string') {
                    if ($val =~ /^"(.*)"$/) {
                        # XXXX Extra parsing of string here for \n, etc.
                        $hash->{$var}->{value} = $1;
                    } else {
                        croak "Bad value '$val' for string '$var'";
                    }
                } elsif ($type eq 'float') {
                    # XXXXX Parse out float value from $val better
                    # than this :-)
                    $val =~ s/f$//;
                    $hash->{$var}->{value} = $val + 0.0;
                } else {
                    # XXXXX etc.
                    croak "Unknown type $type for var '$var'";
                }
            } else {
                croak "Couldn't parse value from '$var', value '$l'";
            }
        }
    }
}

__DATA__
string UIWidget = "slider";
float foohoo = 0.4532;

string bahoo
<
string SUBtjo = "gznk";
string SUBhej = "sassa rassa";
float SUBba
<
float XXfoo = 4711f;
string XXallan= "trutt trutt";
> = -122.22f;
float SUBbaa = 23.23f;
> = "hejsvjs";

string foo = "barf";

float myVarA
<
float UIMin = 1;
float UIMax = 0;
float UIStep = .001;
string UIWidget = "slider";
> = 0.5f;

float myVarB = 1.0;

float4 myVarC = {1,0,0,1};

The result of the run is:

'UIWidget' => HASH(0x8148b44)
   'type' => 'string', 'value' => 'slider'
'bahoo' => HASH(0x81c994c)
   'type' => 'string'
   'ui' => HASH(0x81c96ac)
      'SUBba' => HASH(0x81c9e44)
         'type' => 'float'
         'ui' => HASH(0x81c9ce8)
            'XXallan' => HASH(0x81ca0fc)
               'type' => 'string', 'value' => 'trutt trutt'
            'XXfoo' => HASH(0x81c9b8c)
               'type' => 'float', 'value' => 4711
         'value' => '-122.22'
      'SUBbaa' => HASH(0x81c9bbc)
         'type' => 'float', 'value' => 23.23
      'SUBhej' => HASH(0x81c9b68)
         'type' => 'string', 'value' => 'sassa rassa'
      'SUBtjo' => HASH(0x81c9b50)
         'type' => 'string', 'value' => 'gznk'
   'value' => 'hejsvjs'
'foo' => HASH(0x81c9be0)
   'type' => 'string', 'value' => 'barf'
'foohoo' => HASH(0x81c9640)
   'type' => 'float', 'value' => 0.4532
'myVarA' => HASH(0x81c9c04)
   'type' => 'float'
   'ui' => HASH(0x81c9964)
      'UIMax' => HASH(0x81cee38)
         'type' => 'float', 'value' => 0
      'UIMin' => HASH(0x81c9c1c)
         'type' => 'float', 'value' => 1
      'UIStep' => HASH(0x81cee5c)
         'type' => 'float', 'value' => 0.001
      'UIWidget' => HASH(0x81cee80)
         'type' => 'string', 'value' => 'slider'
   'value' => 0.5
'myVarB' => HASH(0x81c9c64)
   'type' => 'float', 'value' => 1
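
The code comment above about sending an offset along could be approached with Perl's own match position: `\G` together with the `/gc` flags lets a recursive parser resume exactly where the previous match stopped, so nothing has to be shifted off an array. This is a sketch of that different technique, not a rewrite of the code above. `parse_decls` is a hypothetical name, any `\w+` word is accepted as a type, values are kept raw (quotes and `f` suffix included), and it assumes that quoted strings contain no embedded double quotes and that brace lists are not nested.

```perl
use strict;
use warnings;

# Parse declarations from $$text, starting at the string's current match
# position. Returns a hash ref. Stops at a closing '>' (left unconsumed
# for the caller) or at the end of the input.
sub parse_decls {
    my ($text) = @_;
    my %vars;
    while (1) {
        $$text =~ /\G\s+/gc;                                    # skip whitespace
        my $at = pos($$text) || 0;
        last if $at >= length $$text;                           # end of input
        last if substr( $$text, $at, 1 ) eq '>';                # end of a nested block

        $$text =~ /\G(\w+)\s+(\w+)\s*/gc
            or die "Expected 'type name' at offset $at";
        my ( $type, $name ) = ( $1, $2 );
        my %entry = ( type => $type );

        if ( $$text =~ /\G</gc ) {                              # nested block
            $entry{ui} = parse_decls($text);
            $$text =~ /\G\s*>\s*/gc or die "Missing '>' for $name";
        }

        $$text =~ /\G=\s*("[^"]*"|\{[^}]*\}|[^;\s]+)\s*;/gc
            or die "Expected '= value;' for $name";
        $entry{value} = $1;
        $vars{$name} = \%entry;
    }
    return \%vars;
}

my $src = <<'END';
string bahoo
<
    float SUBba < float XXfoo = 4711f; > = -122.22f;
    string SUBhej = "sassa rassa";
> = "hejsvjs";
float4 myVarC = {1,0,0,1};
END

my $vars = parse_decls( \$src );
print "$vars->{bahoo}{ui}{SUBba}{ui}{XXfoo}{value}\n";
print "$vars->{myVarC}{type} $vars->{myVarC}{value}\n";
```

Because the position lives on the string itself, the recursive call needs no extra argument: each level simply continues matching from wherever the level below left off.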