Veltro has asked for the wisdom of the Perl Monks concerning the following question:
Hello, I hope someone can help me with some ideas on this.
Quite often I end up working with big text files (~500k lines) which have configuration data that I want to change using a Perl program. The data files often don't have any official format. The structure of these kinds of files is often similar, and the content could look something like the following examples:
#ObjectType1
Param1: 8
Param2: SomeText
#ObjectType1.NestedObject
Param1: 3
Param2: SomeText
#ObjectType1
...
#ObjectType2
...
or
ObjectType1
{
  Param1 = 8
  Param2 = SomeText
  NestedObject
  {
    Param1 = 3
    Param2 = SomeText
  }
}
ObjectType2
{
  ...
}
ObjectType1
{
  ...
}
Most of the time I want to do something like changing the values of parameters for a certain object type and leave all the other lines inside the data file 'untouched'. A very simplistic approach that I used looks like the next code example (second data example). It reads the file line by line, keeps track of which 'context' it is currently reading, and acts depending on that context. It works fine (as long as the format does not change too much), however the more complex the things I want to do, the more these kinds of snippets tend to become very complex and difficult to maintain.
use strict ;
use warnings ;
my $file = "test" ;
open (my $fhi, "<", $file . ".dat" ) or die "Cannot open $file.dat: $!\n" ;
open (my $fho, ">", $file . "_out.dat" ) or die "Cannot open $file" . "_out.dat: $!\n" ;
my $context = "" ;
while ( my $line = <$fhi> ) {
    chomp $line ;
    # A line that names the object type opens the context
    if ( $line =~ /ObjectType1/ ) {
        $context = "ObjectType1" ;
    }
    # Only a closing brace at the start of the line ends the context
    if ( $line =~ /^\}/ ) {
        $context = "" ;
    }
    if ( $context eq "ObjectType1" ) {
        if ( $line =~ /Param1/ ) {
            print $fho "Param1 = 0\n" ;
        } elsif ( $line =~ /Param2/ ) {
            print $fho "Param2 = SomeOtherText\n" ;
        } else {
            print $fho $line . "\n" ;
        }
    } else {
        print $fho $line . "\n" ;
    }
}
Does anyone know of a better or more generic way to do this kind of thing? I am looking for a very simple approach (search and replace, not reading the entire data file to memory) where I can flexibly define a formula that is applied to a parameter within the scope of the context it is in.
Thanks, Veltro
edit: /\}/ => /^\}/
Re: Contextual find and replace large config file
by haukex (Archbishop) on Jan 02, 2019 at 17:27 UTC
It works fine (as long as the format does not change too much), however the more complex the things I want to do, the more these kinds of snippets tend to become very complex and difficult to maintain. ... I am looking for a very simple approach (search and replace, not reading the entire data file to memory)
It depends a lot on how much you can trust how strict the configuration file format is. For example, if you can be absolutely certain that, like in your example, the opening and closing braces are always on a line by themselves, then it'd be possible to implement a fairly simple line-by-line parser that keeps the names of the current sections on a stack, so that you can differentiate between different nested sections that happen to have the same name - I'm thinking something like the following:
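(The code that belonged here is not reproduced in this copy of the thread. What follows is only a minimal sketch of the stack idea described above, not the original code. The sub name rewrite_lines, the "Section/Subsection" path keys and the callback-per-parameter rules are assumptions made for illustration, and it assumes section names and braces each sit on a line of their own.)

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out line by line and rewrite
# "Name = Value" lines, but only inside the section path named in $rules
# (e.g. "ObjectType1" or "ObjectType1/NestedObject").
sub rewrite_lines {
    my ($in, $out, $rules) = @_;
    my @stack;      # names of the sections we are currently inside
    my $pending;    # last bare name seen; becomes a section if "{" follows
    while ( my $line = <$in> ) {
        if ( $line =~ /^\s*\{\s*$/ ) {
            die "line $.: '{' without a section name\n" unless defined $pending;
            push @stack, $pending;
            undef $pending;
        }
        elsif ( $line =~ /^\s*\}\s*$/ ) {
            die "line $.: unbalanced '}'\n" unless @stack;
            pop @stack;
        }
        elsif ( $line =~ /^(\s*)(\w+)(\s*=\s*)(.*?)\s*$/ ) {
            my ($indent, $name, $eq, $value) = ($1, $2, $3, $4);
            my $rule = $rules->{ join '/', @stack };
            if ( $rule && exists $rule->{$name} ) {
                # The callback receives the current value and returns the new one
                $line = $indent . $name . $eq . $rule->{$name}->($value) . "\n";
            }
        }
        elsif ( $line =~ /^\s*(\w[\w ]*?)\s*$/ ) {
            $pending = $1;   # a bare name: possibly the start of a section
        }
        print {$out} $line;  # every line is written, changed or not
    }
    die "unexpected end of input inside '@stack'\n" if @stack;
    return;
}

# Usage with in-memory filehandles; real files work the same way.
my $config = <<'END';
ObjectType1
{
  Param1 = 8
  NestedObject
  {
    Param1 = 3
  }
}
ObjectType2
{
  Param1 = 8
}
END

open my $in,  '<', \$config    or die "Cannot read from string: $!";
open my $out, '>', \my $result or die "Cannot write to string: $!";
rewrite_lines( $in, $out, { 'ObjectType1' => { Param1 => sub { $_[0] * 2 } } } );
close $out;
print $result;
```

Because the rule is looked up by the full path on the stack, only the Param1 directly inside ObjectType1 is touched; the Param1 lines in NestedObject and ObjectType2 pass through unchanged, and only one line is held in memory at a time.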
But once things start getting more complex, I'd recommend a "real" parser instead. You can check the Config:: namespace to see if there happen to be any modules that match your config format. 500k lines isn't all too much to read into memory at once, IMO, unless you're running on some really memory-restricted machine. In the worst case, you can write a parser yourself, e.g. using the m/\G.../gc technique (there's one example in the Perl docs in perlop under "\G assertion"), or using a full grammar (Parse::RecDescent, Regexp::Grammars, Marpa::R2, ...).
Here's a solution using m/\G.../gc, followed by a Regexp::Grammars example (the latter only parses, it doesn't do the replacement). In both, I've made some assumptions about the file format, such as that a Name = Value pair must appear on a single line by itself, that the section names may or may not contain whitespace, and so on (I've chosen slightly different rules in both). What I like about this kind of solution is that it's "just" regular expressions, and as long as one can deal with those, it should hopefully be understandable.
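(Neither of the two programs is reproduced in this copy of the thread. As a stand-in, here is a small sketch of the m/\G.../gc technique only; it is not the original solution. The sub name rewrite_string and the shape of the $changes hash are made up for illustration, and the format rules, such as braces on their own line, are assumptions.)

```perl
use strict;
use warnings;

# Hypothetical sketch: walk the whole config string with m/\G.../gc,
# copy everything to the output and replace selected values.
# $changes maps "Section/Subsection" paths to { Name => new value }.
sub rewrite_string {
    my ($text, $changes) = @_;
    my ($out, @stack) = ('');
    pos($text) = 0;
    while ( pos($text) < length $text ) {
        # A name on one line followed by "{" on the next opens a section
        if ( $text =~ m/\G (\h* ([\w\ ]*\w) \h*\n \h*\{ \h* (?:\z|\n)) /xgc ) {
            push @stack, $2;
            $out .= $1;
        }
        # "}" on a line by itself closes the innermost section
        elsif ( $text =~ m/\G (\h*\} \h* (?:\z|\n)) /xgc ) {
            die "unbalanced '}'\n" unless @stack;
            pop @stack;
            $out .= $1;
        }
        # "Name = Value" on a single line
        elsif ( $text =~ m/\G (\h* (\w+) \h*=\h*) ([^\n]*?) (\h* (?:\z|\n)) /xgc ) {
            my ($pre, $name, $value, $post) = ($1, $2, $3, $4);
            my $section = $changes->{ join '/', @stack };
            my $new     = $section ? $section->{$name} : undef;
            $out .= $pre . ( defined $new ? $new : $value ) . $post;
        }
        # Anything else (comments, blank lines, "...") is copied unchanged
        elsif ( $text =~ m/\G ([^\n]* (?:\z|\n)) /xgc ) {
            $out .= $1;
        }
        else {
            die "parse error at offset " . pos($text) . "\n";
        }
    }
    die "missing '}' for section '$stack[-1]'\n" if @stack;
    return $out;
}

my $config = <<'END';
ObjectType1
{
  Param1 = 8
  NestedObject
  {
    Param1 = 3
  }
}
END

my $rewritten = rewrite_string( $config,
    { 'ObjectType1/NestedObject' => { Param1 => 42 } } );
print $rewritten;
```

The /c modifier keeps pos() where it was when an alternative fails, so each branch gets to try at the same position, and \G makes sure nothing is skipped silently. Unlike a line-by-line loop, this needs the whole file in one string.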
This is great stuff haukex
I think that using Regexp::Grammars is probably the best solution, however I am getting this YACC feeling and think this kind of thing is programming on an entirely different level. So currently I am looking at your second approach, which I think will offer me the flexibility that I am looking for.
Actually I think this will help me to take this even one step further and build a more advanced configuration which will allow me to specify a filter and formulas to act on parameters. And for this I am thinking along the same lines as LanX (using a cache, separating functionality into functions etc. etc.).
I understand about 95% of the code, but I am still struggling with some of the regex items which are:
- Why (?:\z|\n) and not just \z when \z is 'up to and including \n'
- Why \h*\n* and not \s*
Thanks for your elaborate post
Why (?:\z|\n) and not just \z when \z is 'up to and including \n'
Not quite: \z only ever matches at the very end of the string, whereas \Z also matches before a newline at the end of the string, and the /m modifier changes the meaning of $ so that it matches before every newline or at the end of the string. When I want to express "match up to the end of this line", I sometimes prefer (?:\z|\n) over $ with /m because the former explicitly consumes the \n.
Why \h*\n* and not \s*
Because /\s*/ would also match e.g. \t\n\t, which causes a following /^.../ to no longer match, since /\s*/ consumed the \t at the beginning of the line.
Update: Regarding the first point:
$ perl -MData::Dump -e 'dd split /($)/m, "x\ny\nz"'
("x", "", "\ny", "", "\nz")
$ perl -MData::Dump -e 'dd split /(\z|\n)/m, "x\ny\nz"'
("x", "\n", "y", "\n", "z")
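The second point can be shown in a similar way. This is just a small sketch with a made-up two-line input in which the second line is indented with a tab:

```perl
use strict;
use warnings;

my $text = "Param1 = 8\t\n\tParam2 = 9\n";

# Variant 1: the trailing \s* also eats the "\n\t" after the value, so the
# next attempt starts after the tab and /\G^/m can no longer match there.
pos($text) = 0;
my @with_s;
push @with_s, $1 while $text =~ /\G^\h*(\w+)\h*=\h*\S+\s*/gcm;

# Variant 2: \h*\n* stops right after the newline and leaves the
# indentation of the next line for the next match.
pos($text) = 0;
my @with_h;
push @with_h, $1 while $text =~ /\G^\h*(\w+)\h*=\h*\S+\h*\n*/gcm;

print "with \\s*:     @with_s\n";   # only Param1 is found
print "with \\h*\\n*: @with_h\n";   # Param1 and Param2 are found
```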
Re: Contextual find and replace large config file
by tybalt89 (Monsignor) on Jan 02, 2019 at 19:26 UTC
"The data files often don't have any official format." -> Then it's hopeless and you should give up. :)
Or
The following program works for your test case #2 (and some things you might have missed).
You should only have to change the "configuration section" to alter different things,
after, of course, fixing it to actually read and write files.
If it doesn't work on one of your large files, please show a small failed test case,
and we'll see what we can do :)
#!/usr/bin/perl
# https://perlmonks.org/?node_id=1227916
use strict;
use warnings;
##################### configuration section
my $section = 'ObjectType1';
my %changes = ( Param1 => 0, Param2 => 'SomeOtherText', Param3 => 'Foobar');
##################### end configuration section
my $allkeys = join '|', keys %changes;
my $pattern = qr/\b($allkeys)\b/;
local $/ = "\n}\n";
while( <DATA> )
{
  if( /\b$section\b/ )
  {
    my @context;
    print $& while
      @context && $context[-1] eq $section && /\G(\h*$pattern = ).*\n/gc ?
        "$1$changes{$2}\n" =~ /.*/s :
      @context && /\G\h*\}\n/gc ? pop @context :
      /\G\h*([\w ]+)\n\h*\{\n/gc ? push @context, $1 :
      /\G.*\n/gc;
  }
  else
  {
    print;
  }
}
__DATA__
ObjectType1
{
  Param1 = 8
  NestedObject
  {
    Param1 = 3
    Param2 = SomeText
  }
  Param2 = SomeText
}
ObjectType2
{
  Foo
  {
    Param1 = StaySame
    ObjectType1
    {
      Param3 = ReplaceThis
    }
  }
}
ObjectType1
{
  ...
}
Outputs:
ObjectType1
{
  Param1 = 0
  NestedObject
  {
    Param1 = 3
    Param2 = SomeText
  }
  Param2 = SomeOtherText
}
ObjectType2
{
  Foo
  {
    Param1 = StaySame
    ObjectType1
    {
      Param3 = Foobar
    }
  }
}
ObjectType1
{
  ...
}
I'm also curious about benchmark times vs any other solution (since I'm not going to generate a 500000 line test file).
Re: Contextual find and replace large config file
by kschwab (Vicar) on Jan 02, 2019 at 18:34 UTC
"Does anyone know of a better or more generic way to do this kind of thing?"
There are lots of choices for config files. JSON and YAML are popular. Your second example is pretty close to JSON already. It would look like this as JSON:
{
  "ObjectType1": {
    "Param1": 8,
    "Param2": "SomeText",
    "NestedObject": {
      "Param1": 3,
      "Param2": "SomeText"
    }
  },
  "ObjectType2": {
    "Param1": 10
  }
}
There are perl modules to parse JSON, some streaming, if you really can't load it all into memory. There's also a really nice command line utility called "jq", see some examples here.
Note that JSON doesn't support comments, which is probably the biggest complaint about it as a configuration file format.
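As a rough illustration of the Perl side, here is a small sketch using the core JSON::PP module on a compact version of the JSON above. It reads the whole document into memory, so it is not the streaming variant, and the new values are just examples:

```perl
use strict;
use warnings;
use JSON::PP;

my $json_text = <<'END';
{
  "ObjectType1": {
    "Param1": 8,
    "Param2": "SomeText",
    "NestedObject": { "Param1": 3, "Param2": "SomeText" }
  },
  "ObjectType2": { "Param1": 10 }
}
END

# canonical sorts the keys on output, pretty adds indentation
my $json   = JSON::PP->new->canonical->pretty;
my $config = $json->decode($json_text);

# Change parameters of ObjectType1 only; the nested object is left alone
$config->{ObjectType1}{Param1} = 0;
$config->{ObjectType1}{Param2} = 'SomeOtherText';

my $new_text = $json->encode($config);
print $new_text;
```

Keep in mind that a round trip like this rewrites the whole file: the original key order and layout are not preserved, which may or may not matter for your 'untouched lines' requirement.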
> Note that JSON doesn't support comments, which is probably the biggest complaint about it as a configuration file format.
I never noticed this - most probably because I never came into a situation to need it.
What surprises me is that JSON historically started as an eval'ed JS object, so why did they skip the comment feature?
Especially since CSS inherited JS comments too.
So I did some research to find out that Douglas Crockford disabled it deliberately, because he wanted to prevent people from hiding data there. ...
... well, Douglas again. :/
Anyway, for config purposes I'd try splitting the data up into multiple JSON chunks and commenting them, or resort to YAML, which allows JSON as a subset.
--- # Comment
{
"name": "John Smith",
"age": 33
}
#!/usr/bin/env perl
use strict;
use warnings;
use JSON::Tiny qw(decode_json encode_json);
use Data::Dump;
my $conf = encode_json {
    foo => qw(bar),
    nose => qw(cuke),
    comment => qw(RTFM)
};
my $hash = decode_json($conf);
dd $hash;
__END__
{ comment => "RTFM", foo => "bar", nose => "cuke" }
Best regards, Karl
«The Crux of the Biscuit is the Apostrophe»
perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help
Comments in most languages can appear anywhere where insignificant whitespace is possible. Your approach can't transform structures that comment both on the keys and values, as in
{
"name" /* represented as "shortname" in the DB */
: "John Doe" /* full name */,
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Contextual find and replace large config file
by LanX (Saint) on Jan 02, 2019 at 21:52 UTC
These are several different questions
First let me warn you that your code has an error
This will fail if you don't care about indentation:
if ( $line =~ /ObjectType1/ ) {
$context = "ObjectType1" ;
}
if ( $line =~ /\}/ ) {
$context = "" ;
}
Here you rather want to test for /^\}/ at the start of the line!
My suggestions
- separate parsing of syntax logic from processing of semantic logic
- parse all lines of an object into a cache ( a string or nested hashes) before handling it
- with nested objects use recursion
- keep track of the indentation level, like counting open and closed braces
- you should handle parsing errors in case the input is corrupted
- use functions and packages instead of piling up if cases
- use a function dispatcher if you need to handle semantics of different "ObjectTypes"
Like this you will get reusable and maintainable code!
edit
some may miss example code, but you got a generic answer for a generic question.
Feel free to pick some points and ask for clarification.
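For those who do miss example code, the dispatcher point could be sketched roughly like this. The handler bodies and the name process_object are invented for illustration; each object is assumed to have been parsed into a hash ref already:

```perl
use strict;
use warnings;

# Hypothetical handlers, one per object type. Each receives a hash ref
# with the object's parameters and changes it in place.
my %handler_for = (
    ObjectType1 => sub {
        my ($obj) = @_;
        $obj->{Param1} = 0;
        $obj->{Param2} = 'SomeOtherText';
    },
    ObjectType2 => sub {
        my ($obj) = @_;
        $obj->{Param1} *= 2;
    },
);

# Look up the handler by type instead of piling up if/elsif cases
sub process_object {
    my ($type, $obj) = @_;
    my $handler = $handler_for{$type} or return $obj;   # unknown types pass through
    $handler->($obj);
    return $obj;
}

my $obj1 = process_object( ObjectType1 => { Param1 => 8, Param2 => 'SomeText' } );
my $obj2 = process_object( ObjectType2 => { Param1 => 10 } );
my $obj3 = process_object( ObjectType3 => { Param1 => 5 } );
print "$obj1->{Param1} $obj1->{Param2} $obj2->{Param1} $obj3->{Param1}\n";
# prints "0 SomeOtherText 20 5"
```

Adding support for another object type then means adding one entry to the table, and the parsing code does not have to change.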
Hi LanX,
Yes, I was actually aware of that error. Since you mentioned it, I have edited the OP.
It is not strictly necessary to provide example code (plus others have already done so). I am just trying to redesign some code and find a different approach, so your generic answer is of course welcome.
The only thing I am unsure about is your first suggestion (separate parsing...semantic logic). What do you mean with that? Do you mean parsing and gathering the data first and then splitting the processing of that data into different function blocks, or something else?
Thanks, Veltro
> (separate parsing...semantic logic). What do you mean with that?
Your two examples seem to hold the same information (semantic) while having different format (syntax).
So better write parsers for the different formats which "cache" them in an intermediate format. These parsers should be ignorant about the meaning just concentrating on correctness.
The semantics - the meaning of the data - could be handled by one central module which only operates on the intermediate format. This module could be reused for all formats.
A possible intermediate format could be nested hashes
$cache = {
  ObjectType1 => {
    Param1 => 8,
    Param2 => "SomeText",
    NestedObject => {
      Param1 => 3,
      Param2 => "SomeText"
    }
  }
};
Of course this highly depends on the nature of your data,
like
- does order matter?
- are repeated elements allowed?
Using nested arrays may be better then°
And after transforming your data you can also have emitter modules to write it to a new output file.
This way you can even transform between different formats, or add new ones.
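A rough sketch of that pipeline, assuming nested hashes are good enough (repeated objects get merged and order is lost, so this would not fit every data set). The names parse_hash_format and emit_brace_format are made up; the first reads the "#Object" / "Name: Value" format, the second writes the brace format:

```perl
use strict;
use warnings;

# Parser for the "#Object.Nested" / "Name: Value" format. It knows nothing
# about what the parameters mean, it only builds the intermediate hashes.
sub parse_hash_format {
    my ($text) = @_;
    my ( %cache, $current );
    for my $line ( split /\n/, $text ) {
        if ( $line =~ /^#\s*([\w.]+)\s*$/ ) {
            # Walk (and create) one hash level per dotted name part
            $current = \%cache;
            for my $name ( split /\./, $1 ) {
                $current = $current->{$name} //= {};
            }
        }
        elsif ( $line =~ /^\s*(\w+)\s*:\s*(.*?)\s*$/ ) {
            die "parameter '$1' outside of any object\n" unless $current;
            $current->{$1} = $2;
        }
    }
    return \%cache;
}

# Emitter for the brace format. It only sees the intermediate hashes.
sub emit_brace_format {
    my ( $node, $depth ) = @_;
    $depth //= 0;
    my $pad = '  ' x $depth;
    my $out = '';
    for my $key ( sort keys %$node ) {
        my $value = $node->{$key};
        if ( ref $value eq 'HASH' ) {
            $out .= $pad . $key . "\n" . $pad . "{\n"
                  . emit_brace_format( $value, $depth + 1 )
                  . $pad . "}\n";
        }
        else {
            $out .= "$pad$key = $value\n";
        }
    }
    return $out;
}

my $cache = parse_hash_format(<<'END');
#ObjectType1
Param1: 8
Param2: SomeText
#ObjectType1.NestedObject
Param1: 3
END

$cache->{ObjectType1}{Param1} = 0;    # the semantic step, independent of any format
my $converted = emit_brace_format($cache);
print $converted;
```

The middle step never looks at braces, colons or hash signs, which is the separation meant above. Keys are emitted in sorted order here only because plain hashes do not remember their order.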
HTH! :)
edit
NB: this approach is also useful when handling only one input format, because you can cleanly separate code, hence much better maintain it.
update
°) or a mix of hashes and arrays. Or even using Perl objects blessing elements into different "ObjectTypes", ...
Re: Contextual find and replace large config file
by tybalt89 (Monsignor) on Jan 03, 2019 at 15:32 UTC
#!/usr/bin/perl
# https://perlmonks.org/?node_id=1227916
use strict;
use warnings;
$SIG{__WARN__} = sub {die @_};
##################### configuration section
my %changes =
  (
  ObjectType1 => { Param1 => 0, Param2 => 'SomeOtherText' },
  ObjectType4 => { Param3 => 'Replacement' },
  Foo => { Param2 => 'FooChanged' },
  );
##################### end configuration section
my $allcontexts = join '|', sort keys %changes;
my $contextpattern = qr/\b($allcontexts)\b/;
my %patterns;
for my $section (keys %changes)
{
  my $all = join '|', keys %{ $changes{$section} };
  $patterns{$section} = qr/\b($all)\b/;
}
local $/ = "\n}\n";
while( <DATA> )
{
  if( /$contextpattern/ )
  {
    my @context;
    print $& while
      @context && $patterns{$context[-1]} &&
        /\G(\h*$patterns{$context[-1]} = ).*\n/gc ?
        "$1$changes{$context[-1]}{$2}\n" =~ /.*/s :
      @context && /\G\h*\}\n/gc ? pop @context :
      /\G\h*([\w ]+)\n\h*\{\n/gc ? push @context, $1 :
      /\G.*\n/gc;
  }
  else
  {
    print;
  }
}
__DATA__
ObjectType1
{
  Param1 = 8
  NestedObject
  {
    Param1 = 3
    Param2 = SomeText
  }
  Param2 = SomeText
}
ObjectType2
{
  Foo
  {
    Param1 = StaySame
    Param2 = FooChange
    ObjectType4
    {
      Param1 = DoNotReplaceThis
      Param3 = ReplaceThis
    }
  }
}
ObjectType1
{
  Param1 = ReplaceThis
  Param3 = DoNotReplaceThis
  Foo
  {
    Param1 = StaySame
    ObjectType4
    {
      Param1 = DoNotReplaceThis
      Param3 = ReplaceThis
    }
  }
}
Re: Contextual find and replace large config file
by Veltro (Hermit) on Jan 05, 2019 at 11:35 UTC
Thanks again for your input everyone.
With your help I am now able to change a foreign datafile like:
# comment
GlobalParam = 1
Object Type1 {
Param1 = Foo
NestedObject {
Param 1 = Bar
}
# just another comment
}
# comment
ObjectType2 {
Param1 = Quz = z
Param2 = 3
NestedObjectX {
Param1 = Baz
NestedObjectZ {
Param1 = Baz
}
}
NestedObjectY {
Param1 = 5
} }
by applying a filter like:
[
[
# Filter
{
'Object Type1' => {
'Param1' => [ "Foo" ],
},
'GlobalParam' => [ '1' ],
# 'Junk' => [ 'more junk' ], # Will break the filter
},
# Changes
{
'Object Type1' => {
'NestedObject' => {
'Param 1' => "\"Box\"",
},
},
}
],
[
# Filter
{
'Object Type1' => {
'Param1' => [ "Foo" ],
'NestedObject' => {
'Param 1' => [ "Box" ],
},
},
# 'GlobalParam' => [ '2' ], # Will disable this filter,
# but first filter is still
# applied
},
# Changes
{
'Object Type1' => {
'NestedObject' => {
'Param 1' => "\$curVal . \" Baz\"",
},
},
}
],
[
# Filter
{
'ObjectType2' => {
'Param2' => [ '1', '2', '3' ],
},
},
# Changes
{
'ObjectType2' => {
'NestedObjectY' => {
'Param1' => "\$curVal * 2",
},
},
}
],
] ;
Which changes the configured parameters into:
# comment
GlobalParam = 1
Object Type1 {
Param1 = Foo
NestedObject {
Param 1 = Box Baz
}
# just another comment
}
# comment
ObjectType2 {
Param1 = Quz = z
Param2 = 3
NestedObjectX {
Param1 = Baz
NestedObjectZ {
Param1 = Baz
}
}
NestedObjectY {
Param1 = 10
} }
edit 2019 Jan 07: Without further testing of this particular program I have removed a '^' from my $re_comment = qr/ ^ \h* \# [^\n]* \n / ; and from qr/ (?<pre> ^\h* ) because it was killing the performance of this program.
code if you want:
Re: Contextual find and replace large config file
by trippledubs (Deacon) on Jan 08, 2019 at 19:41 UTC
Not sure if this is too much or too little for you to plug in, but it was fun to learn some Parse::RecDescent. I could not figure out how to get the array list as the hash I wanted except by using unroll. Each parsing module requires its own learning investment, just browsing Regexp::Grammars from haukex's answer. If you need such a thing.
#!/usr/bin/env perl
use strict;
use warnings;
use Parse::RecDescent;
use Data::Dumper;
$::RD_ERRORS = 1;
$::RD_WARN = 1;
$::RD_HINT = 1;
#$::RD_TRACE = 1;
#$::RD_AUTOACTION = q { print Dumper \@item };
my $grammar = q{
{
use Data::Dumper;
sub unroll {
my @list = @{$_[0]};
my $unrolled;
for my $href (@list) {
for my $key (keys %{$href}) {
$unrolled->{$key} = $href->{$key};
}
}
return $unrolled;
};
}
Expression: Object(s) { $return = unroll($item[1]) }
Object: String '{' Param(s) '}'
{
$return = { $item[1] => unroll($item[3]) }
}
Param: String '=' String
{
$return = { $item[1] => $item[3] }
}
| Object(s)
{
$return = unroll($item[1])
}
String: /[\w\d]+/ { $return = $item[1] }
};
my $parser = Parse::RecDescent->new($grammar);
my $text = do { local $/; <DATA> };   # slurp without clobbering the global $/
my $tree = $parser->Expression($text) or die "Parse failed\n";
$tree->{ObjectType1}{NestedObject}{DeeplyNested}{Param60} = 'tuna';
print Dumper $tree;
__DATA__
ObjectType1
{
Param1 = 8
Param2 = SomeText
NestedObject
{
Param1 = 3
Param2 = MoreText
DeeplyNested
{
Param50 = 500
Param60 = squid
}
}
}
ObjectType2
{
Param1 = 3
Param2 = 40
}