eyepopslikeamosquito has asked for the wisdom of the Perl Monks concerning the following question:
To give some context to my question, here is a test program:
use strict;
use warnings;
# Given an ini file, return a string containing section contents.
# (Note that ini file comment lines start with a ;)
sub get_section
{
my $fcontents = shift; # in: ini file contents string
my $section = shift; # in: section name to get
# Note that the regex below will find multiple sections;
# it's terminated by the start of a new section or end of file.
my $s = join( "", $fcontents =~ /^[ \t]*\[$section\](.*?)(?:\Z|^[ \t]*\[)/msg );
$s =~ s/^[ \t]+//mg; # remove leading whitespace from each line
$s =~ s/[ \t]+$//mg; # remove trailing whitespace from each line
$s =~ s/^\s+//; # remove leading whitespace
$s =~ s/\s+$//; # remove trailing whitespace
# Remove up to three trailing comment lines
$s =~ s/^;.*\Z//m; chomp $s;
$s =~ s/^;.*\Z//m; chomp $s;
$s =~ s/^;.*\Z//m; chomp $s;
return $s;
}
my $inifile_contents = <<'BUK_LIKES_SUNDIALS';
[MySection]
; This is a comment line for MySection
fld1 = 'value of field 1'
fld2 = 42
; This is the heading for AnotherSection
[AnotherSection]
; another comment
asfld=69
BUK_LIKES_SUNDIALS
my $section1 = get_section( $inifile_contents, 'MySection' );
print "This is the contents of MySection -------\n$section1\n";
my $section2 = get_section( $inifile_contents, 'AnotherSection' );
print "This is the contents of AnotherSection -------\n$section2\n";
Running the test program above produces:
This is the contents of MySection -------
; This is a comment line for MySection
fld1 = 'value of field 1'
fld2 = 42
This is the contents of AnotherSection -------
; another comment
asfld=69
I added the code to remove trailing comment lines because I found,
in practice, that trailing comment lines in a section tended to be
unrelated to that section; rather, they were usually header comment
lines for the following section.
Though general suggestions for code improvements are welcome,
my specific question relates to this eyesore:
# Remove up to three trailing comment lines
$s =~ s/^;.*\Z//m; chomp $s;
$s =~ s/^;.*\Z//m; chomp $s;
$s =~ s/^;.*\Z//m; chomp $s;
that I am currently using to remove trailing comment lines from a section.
What's a better way to do it?
Re: Removing multiple trailing comment lines from a string (\n)
by tye (Sage) on Dec 23, 2016 at 05:01 UTC
|
$s =~ s/\n;.*\Z// for 1..3;
Re: Removing multiple trailing comment lines from a string
by Laurent_R (Canon) on Dec 23, 2016 at 07:28 UTC
|
It would be good to see some of your data, to try to figure out what you're doing and why.
Just a couple of comments on some code details, although it might be that it would be better to change it overall.
Why would you need this:
$s =~ s/^[ \t]+//mg; # remove leading whitespace from each line
$s =~ s/^\s+//; # remove leading whitespace
when the second line will do everything that the first line is doing? (The same goes for trailing spaces.)
Similarly, I don't see the reason to run the same pair of statements three times:
$s =~ s/^;.*\Z//m; chomp $s;
$s =~ s/^;.*\Z//m; chomp $s;
$s =~ s/^;.*\Z//m; chomp $s;
Once should be enough, no? And I doubt the chomp is useful here.
|
Why would you need this:
$s =~ s/^[ \t]+//mg; # remove leading whitespace from each line
$s =~ s/^\s+//; # remove leading whitespace
when the second line will do everything that the first line is doing?
Remember that $s is a multi-line string.
So the regex mg modifier in the first regex above ensures that
each line in the multi-line string has leading spaces and
tabs removed from it.
The second regex, OTOH, does not have any modifiers, so it does not
apply to every line in the multi-line string; instead, it trims leading
whitespace (this time, including newlines) from the front of the multiline string --
trimming multiple blank lines from the front of a multi-line string,
for example.
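A small illustration of the difference between the two substitutions. The sample string is made up for this sketch and is not taken from the original program:

```perl
use strict;
use warnings;

# Hypothetical sample: two blank lines, then three indented lines.
my $text = "\n\n  \tfirst\n\t second\n   third";

# With /mg, ^ matches at the start of every line, so each line loses
# its leading spaces and tabs; the blank lines themselves survive.
(my $per_line = $text) =~ s/^[ \t]+//mg;

# Without modifiers, ^ matches only at the start of the string, so only
# the first run of whitespace (newlines included) is removed.
(my $whole = $text) =~ s/^\s+//;

print "per line: [$per_line]\n";
print "whole:    [$whole]\n";
```

The first result still starts with two newlines; the second still has indented second and third lines. That is why the function applies both.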
|
$s =~ s/^;.*\Z//m;
removes the contents of the last comment line of a multi-line string (note that \Z matches just the end of
the (multi-line) string, not the end of each line).
So if you ran it again without the chomp it would do nothing
because you have already removed the last comment line!
The chomp is needed to remove the newline now sitting at the end of the string.
An alternative to chomp, suggested above by tye,
is to eschew the m modifier and remove the newline as part of the regex, like so:
s/\n;.*\Z//
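To make the comparison concrete, here is a minimal sketch that runs both variants on the same input. The sample text and the variable names are assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical section text ending in two comment lines.
my $section = "fld1 = 1\nfld2 = 2\n; trailer 1\n; trailer 2";

# Original approach: /m lets ^ match after any newline, while \Z still
# anchors at the end of the string, so only the last line can match.
# The newline in front of it is left behind, hence the chomp.
my $with_chomp = $section;
for my $pass (1 .. 3) {
    $with_chomp =~ s/^;.*\Z//m;
    chomp $with_chomp;
}

# tye's variant: consume the preceding newline in the pattern itself.
my $with_newline = $section;
$with_newline =~ s/\n;.*\Z// for 1 .. 3;

print "with chomp:   [$with_chomp]\n";
print "with newline: [$with_newline]\n";
```

Both end up with just the two field lines here. One difference worth keeping in mind: the `\n;` form cannot remove a comment that is the very first line of the string, because there is no newline in front of it.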
Re: Removing multiple trailing comment lines from a string
by Marshall (Canon) on Dec 23, 2016 at 11:12 UTC
|
I changed your 3 expressions into a while loop and I modified the test cases. Is this what you want?
use strict;
use warnings;
# Given an ini file, return a string containing section contents.
# (Note that ini file comment lines start with a ;)
sub get_section
{
my $fcontents = shift; # in: ini file contents string
my $section = shift; # in: section name to get
# Note that the regex below will find multiple sections;
# it's terminated by the start of a new section or end of file.
my $s = join( "", $fcontents =~ /^[ \t]*\[$section\](.*?)(?:\Z|^[ \t]*\[)/msg );
$s =~ s/^[ \t]+//mg; # remove leading whitespace from each line
$s =~ s/[ \t]+$//mg; # remove trailing whitespace from each line
$s =~ s/^\s+//; # remove leading whitespace
$s =~ s/\s+$//; # remove trailing whitespace
# Remove all trailing comment lines (the loop is no longer limited to three)
#### Modified #####
while ($s =~ s/^;.*\Z//m){chomp $s;} ######## NEW ##########
chomp $s; ## NEW ##
return $s;
}
my $inifile_contents = <<'BUK_LIKES_SUNDIALS';
[MySection]
; This is a comment line for MySection
fld1 = 'value of field 1'
fld2 = 42
; comment 1 inside the section
; comment 2 inside the section
fld3 =89
; trailer
; trailer 2
; trailer 3
; trailer 4
; trailer 5
; This is the heading for AnotherSection
[AnotherSection]
; another comment
asfld=69
BUK_LIKES_SUNDIALS
my $section1 = get_section( $inifile_contents, 'MySection' );
print "This is the contents of MySection -------\n$section1\n";
my $section2 = get_section( $inifile_contents, 'AnotherSection' );
print "This is the contents of AnotherSection -------\n$section2\n";
__END__
Prints:
This is the contents of MySection -------
; This is a comment line for MySection
fld1 = 'value of field 1'
fld2 = 42
; comment 1 inside the section
; comment 2 inside the section
fld3 =89
This is the contents of AnotherSection -------
; another comment
asfld=69
|
Is this what you want?
Yep. Thanks.
I was hoping it could be done with a single regex,
but your solution looks good.
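For what it's worth, a single substitution along these lines might do it. This is only a sketch: the helper name drop_trailing_comments is invented here, it removes every trailing comment line rather than at most three, and it assumes the string has already been trimmed as in get_section:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread). Assumes $s has already been
# trimmed, so it does not end in spaces or tabs.
sub drop_trailing_comments {
    my $s = shift;
    # Each repetition eats one newline plus an optional ';' comment,
    # so blank lines between trailing comments are removed as well.
    $s =~ s/(?:\n(?:;[^\n]*)?)+\z//;
    return $s;
}

my $demo = drop_trailing_comments(
    "; kept: leading comment\nfld1 = 1\n; kept: inner comment\nfld2 = 2\n; t1\n\n; t2\n; t3\n; t4"
);
print "$demo\n";
```

Because every repetition must start with a newline, a section that consists only of comments would keep its first line. Whether that matters depends on the real data.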
|
#!/usr/bin/perl
use strict;
use warnings;
# Given an ini file, return a string containing section contents.
# (Note that ini file comment lines start with a ;)
sub get_section
{
my $fcontents = shift; # in: ini file contents string
my $section = shift; # in: section name to get
# Note that the regex below will find multiple sections;
# it's terminated by the start of a new section or end of file.
my $s = join( "", $fcontents =~ /^[ \t]*\[$section\](.*?)(?:\Z|^[ \t]*\[)/msg );
$s =~ s/^[ \t]+//mg; # remove leading whitespace from each line
$s =~ s/[ \t]+$//mg; # remove trailing whitespace from each line
$s =~ s/^\s+//; # remove leading whitespace
$s =~ s/\s+$//; # remove trailing whitespace
# Remove up to three trailing comment lines
$s =~ s/ ( \n (;.*)? ){1,3} \z //x; # hopefully less of an eyesore
return $s;
}
my $inifile_contents = <<'BUK_LIKES_SUNDIALS';
[MySection]
; This is a comment line for MySection
fld1 = 'value of field 1'
fld2 = 42
; comment 1 inside the section
; comment 2 inside the section
fld3 =89
; trailer
; trailer 2
; trailer 3
; trailer 4
; trailer 5
; This is the heading for AnotherSection
[AnotherSection]
; another comment
asfld=69
BUK_LIKES_SUNDIALS
my $section1 = get_section( $inifile_contents, 'MySection' );
print "This is the contents of MySection -------\n$section1\n";
my $section2 = get_section( $inifile_contents, 'AnotherSection' );
print "This is the contents of AnotherSection -------\n$section2\n";
which prints
This is the contents of MySection -------
; This is a comment line for MySection
fld1 = 'value of field 1'
fld2 = 42
; comment 1 inside the section
; comment 2 inside the section
fld3 =89
; trailer
; trailer 2
; trailer 3
; trailer 4
This is the contents of AnotherSection -------
; another comment
asfld=69
|
Re: Removing multiple trailing comment lines from a string
by kcott (Archbishop) on Dec 23, 2016 at 17:19 UTC
|
G'day eyepopslikeamosquito,
I can see what you've done to run the tests;
however, I don't know how that translates to your real-world code.
The solution I've provided below is substantially different from your code.
The main differences are:
-
In your code, you pass the entire INI file as a string to &get_section
and parse it with a regex. You do this every time that function is called.
In my solution, I read the INI file once, clean it up and store the result in a hash (&get_clean_ini_data).
&get_section now only contains a single statement which accesses the data in that hash.
-
I've reduced your four whitespace removal regexes to a single regex: s/^\s*(.*?)\s*$/$1/.
-
There's only one other regex (for capturing the section name): /^\[([^]]+)/.
-
The removal of trailing comments is done by &strip_trailing_comments.
This simply works backwards through a section's lines;
removing comments until a non-comment line is found.
The index function, rather than a regex,
is used to identify these comments.
-
I've also added a [WhitespaceSection] with test data for checking the whitespace cleanup.
-
You could probably adapt this to your real-world requirements
by making the INI filename an argument to &get_clean_ini_data;
adding an open statement;
and changing <DATA> to <$ini_fh>.
I think everything else should work as is.
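A rough sketch of that adaptation, under a few assumptions: the section data lives in a file-level hash, strip_trailing_comments is rewritten with a while loop, and the demo passes a reference to a string (which open reads as an in-memory file) where real code would pass a path:

```perl
use strict;
use warnings;

my %section_lines_for;

# $ini_file is normally a path; a scalar reference also works with open().
sub get_clean_ini_data {
    my $ini_file = shift;
    open my $ini_fh, '<', $ini_file
        or die "Cannot open INI source: $!";
    my $current_section;
    while (<$ini_fh>) {
        s/^\s*(.*?)\s*$/$1/;    # trim both ends, newline included
        next unless length;
        if (/^\[([^]]+)/) {
            my $new_section = $1;
            strip_trailing_comments($current_section);
            $current_section = $new_section;
        }
        elsif (defined $current_section) {
            push @{ $section_lines_for{$current_section} }, $_;
        }
    }
    close $ini_fh;
    strip_trailing_comments($current_section);
    return;
}

# Pop lines off the end of a section while they start with ';'.
sub strip_trailing_comments {
    my $section = shift;
    return unless defined $section;
    my $lines = $section_lines_for{$section} or return;
    pop @$lines while @$lines && index($lines->[-1], ';') == 0;
    return;
}

sub get_section { join "\n", @{ $section_lines_for{ $_[0] } // [] } }

# Demo data standing in for a real file.
my $ini_text = <<'END_INI';
[MySection]
; This is a comment line for MySection
fld1 = 'value of field 1'
fld2 = 42

; This is the heading for AnotherSection
[AnotherSection]
; another comment
asfld=69
END_INI

get_clean_ini_data(\$ini_text);
print "Contents of '$_':\n", get_section($_), "\n"
    for qw{MySection AnotherSection};
```

Lines that appear before the first section header are skipped in this sketch; adjust that if the real files have a root section.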
Here's "pm_1178405_ini_file_clean.pl":
#!/usr/bin/env perl -l
use strict;
use warnings;
get_clean_ini_data();
for (qw{MySection AnotherSection WhitespaceSection}) {
print "Contents of '$_':\n", get_section($_);
}
{
my %section_lines_for;
sub get_clean_ini_data {
my $current_section;
while (<DATA>) {
s/^\s*(.*?)\s*$/$1/;
next unless length;
if (/^\[([^]]+)/) {
my $new_section = $1;
strip_trailing_comments($current_section);
$current_section = $new_section;
}
else {
push @{$section_lines_for{$current_section}}, $_;
}
}
strip_trailing_comments($current_section);
}
sub strip_trailing_comments {
my $section = shift;
return unless defined $section;
for my $i (reverse 0 .. $#{$section_lines_for{$section}}) {
if (0 == index $section_lines_for{$section}[$i], ';') {
pop @{$section_lines_for{$section}};
}
else {
last;
}
}
}
sub get_section { join "\n", @{$section_lines_for{$_[0]}} }
}
__DATA__
[MySection]
; This is a comment line for MySection
fld1 = 'value of field 1'
fld2 = 42
; This is the heading for AnotherSection
[AnotherSection]
; another comment
asfld=69
; Heading for WhitespaceSection
[WhitespaceSection]
; Comment starting with a tab
; Comment starting with a tab and a space
; Comment starting with a space
; Comment ending with a tab
; Comment ending with a tab and a space
; Comment ending with a space
; tab+space+comment+space+tab
; space+tab+comment+tab+space
qwe=rty
asd=fgh
; trailing 1
; tab + trailing 2
; space + trailing 3
; trailing 4
Output:
$ pm_1178405_ini_file_clean.pl
Contents of 'MySection':
; This is a comment line for MySection
fld1 = 'value of field 1'
fld2 = 42
Contents of 'AnotherSection':
; another comment
asfld=69
Contents of 'WhitespaceSection':
; Comment starting with a tab
; Comment starting with a tab and a space
; Comment starting with a space
; Comment ending with a tab
; Comment ending with a tab and a space
; Comment ending with a space
; tab+space+comment+space+tab
; space+tab+comment+tab+space
qwe=rty
asd=fgh
Because whitespace is difficult to see (especially differentiating spaces from tabs),
I passed the script and output through `cat -vet`.
I used this for my own testing; you might also find it useful.
The relevant parts are in the spoiler.
$ cat -vet pm_1178405_ini_file_clean.pl
...
__DATA__$
[MySection]$
; This is a comment line for MySection$
fld1 = 'value of field 1' $
fld2 = 42$
$
; This is the heading for AnotherSection$
[AnotherSection]$
; another comment$
asfld=69$
$
; Heading for WhitespaceSection$
[WhitespaceSection]$
^I; Comment starting with a tab$
^I ; Comment starting with a tab and a space$
; Comment starting with a space$
; Comment ending with a tab^I$
; Comment ending with a tab and a space^I $
; Comment ending with a space $
^I ; tab+space+comment+space+tab ^I$
^I; space+tab+comment+tab+space^I $
$
qwe=rty$
asd=fgh$
; trailing 1$
^I; tab + trailing 2$
; space + trailing 3$
; trailing 4$
$ pm_1178405_ini_file_clean.pl | cat -vet
Contents of 'MySection':$
; This is a comment line for MySection$
fld1 = 'value of field 1'$
fld2 = 42$
Contents of 'AnotherSection':$
; another comment$
asfld=69$
Contents of 'WhitespaceSection':$
; Comment starting with a tab$
; Comment starting with a tab and a space$
; Comment starting with a space$
; Comment ending with a tab$
; Comment ending with a tab and a space$
; Comment ending with a space$
; tab+space+comment+space+tab$
; space+tab+comment+tab+space$
qwe=rty$
asd=fgh$
|
I don't know how that translates to your real-world code.
The solution I've provided below is substantially different from your code.
It's early days yet and requirements are a bit unclear right now.
I was after ideas for general approaches and you've provided some
interesting and useful code. Thanks.
Re: Removing multiple trailing comment lines from a string (array of tokens)
by Anonymous Monk on Dec 23, 2016 at 04:22 UTC
|
Though general suggestions for code improvements are welcome, my specific question relates to this eyesore:
:) Sorry, it's all eyesore :P
Mostly the problem is you're trying to s///ubstitute when you should be manipulating an array after m//atching, for example
push @stack, [ COMMENT => $1 ];
...
$stack[-1][0] eq 'COMMENT' and pop @stack for 1..3;
But feel free to do your own benchmarks:
$ perl -le " $_ = qq{;banana\n;ro\n;sham\n;bo\n}; print; s{(?:^;[^\r\n]*[\r\n]+){1,3}\Z}{}m; print; "
;banana
;ro
;sham
;bo
;banana
That's just what I think at the moment :)
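The array-of-tokens idea might be fleshed out roughly as follows. Only the COMMENT token comes from the post above; the LINE token name, the sample text and the split-based tokenizer are assumptions for this sketch:

```perl
use strict;
use warnings;

# Hypothetical section body with four trailing comment lines.
my $body = "; intro\nfld1 = 1\n; t1\n; t2\n; t3\n; t4";

# Tokenize: one [ TYPE => text ] pair per line.
my @stack;
for my $line (split /\n/, $body) {
    if ($line =~ /^\s*(;.*)/) {
        push @stack, [ COMMENT => $1 ];
    }
    else {
        push @stack, [ LINE => $line ];
    }
}

# Pop at most three trailing comment tokens, stopping early at the
# first non-comment token (and guarding against an empty stack).
for (1 .. 3) {
    last unless @stack and $stack[-1][0] eq 'COMMENT';
    pop @stack;
}

my $result = join "\n", map { $_->[1] } @stack;
print "$result\n";
```

With four trailing comments in the input, three are popped and "; t1" remains, which matches the "up to three" behaviour of the original code.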
Re: Removing multiple trailing comment lines from a string
by Marshall (Canon) on Dec 26, 2016 at 01:48 UTC
|
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %HoA;
while (my $line=<DATA>)
{
my $next_line;   # declared separately: "my ... if ..." has undefined behaviour
$next_line = process_section ($1) if $line =~ /\s*\[\s*(\S+)\s*\]/;
if (defined $next_line) {$line = $next_line; redo}
}
print Dumper \%HoA;
sub process_section
{
my $section = shift;
my $line;
while (defined ($line=<DATA>) and $line !~ /\s*\[\s*(\S+)\s*\]/)
{
chomp $line;
next if $line =~ /^\s*$/; # skip blank lines
push @{$HoA{$section}}, $line;
}
# delete the "trailing comments" in this [section] heading
my $comment;
while ($comment = pop @{$HoA{$section}} and $comment =~ /^\s*;/ ){};
push @{$HoA{$section}}, $comment if defined $comment; # put back the last non-comment line
return $line;
}
=prints
$VAR1 = {
'AnotherSection' => [
'; another comment',
'asfld=69'
],
'MySection' => [
'; This is a comment line for MySection',
'fld1 = \'value of field 1\' ',
'fld2 = 42'
],
'WhitespaceSection' => [
' ; Comment starting with a tab',
' ; Comment starting with a tab and a space',
' ; Comment starting with a space',
'; Comment ending with a tab ',
'; Comment ending with a tab and a space ',
'; Comment ending with a space ',
' ; tab+space+comment+space+tab ',
' ; space+tab+comment+tab+space ',
'qwe=rty',
'asd=fgh'
]
};
=cut
__DATA__
; this is root section
a = 2
; some comment in root
b = 3
; some trailing comment in root
[MySection]
; This is a comment line for MySection
fld1 = 'value of field 1'
fld2 = 42
; This is the heading for AnotherSection
[AnotherSection]
; another comment
asfld=69
; Heading for WhitespaceSection
[WhitespaceSection]
; Comment starting with a tab
; Comment starting with a tab and a space
; Comment starting with a space
; Comment ending with a tab
; Comment ending with a tab and a space
; Comment ending with a space
; tab+space+comment+space+tab
; space+tab+comment+tab+space
qwe=rty
asd=fgh
; trailing 1
; tab + trailing 2
; space + trailing 3
; trailing 4
PS: my previous code explicitly allowed more than 3 trailing comments because I thought that was a requirement and was one of the "problems". I am also not sure why some .ini file comments should be ignored and others not? That is a strange thing to me.