PerlMonks  

Removing multiple trailing comment lines from a string

by eyepopslikeamosquito (Bishop)
on Dec 23, 2016 at 04:07 UTC (#1178405)

eyepopslikeamosquito has asked for the wisdom of the Perl Monks concerning the following question:

To give some context to my question, here is a test program:

    use strict;
    use warnings;

    # Given an ini file, return a string containing section contents.
    # (Note that ini file comment lines start with a ;)
    sub get_section {
        my $fcontents = shift;   # in: ini file contents string
        my $section   = shift;   # in: section name to get

        # Note that the regex below will find multiple sections;
        # it's terminated by the start of a new section or end of file.
        my $s = join( "", $fcontents =~ /^[ \t]*\[$section\](.*?)(?:\Z|^[ \t]*\[)/msg );
        $s =~ s/^[ \t]+//mg;   # remove leading whitespace from each line
        $s =~ s/[ \t]+$//mg;   # remove trailing whitespace from each line
        $s =~ s/^\s+//;        # remove leading whitespace
        $s =~ s/\s+$//;        # remove trailing whitespace

        # Remove up to three trailing comment lines
        $s =~ s/^;.*\Z//m; chomp $s;
        $s =~ s/^;.*\Z//m; chomp $s;
        $s =~ s/^;.*\Z//m; chomp $s;
        return $s;
    }

    my $inifile_contents = <<'BUK_LIKES_SUNDIALS';
    [MySection]
    ; This is a comment line for MySection
    fld1 = 'value of field 1'
    fld2 = 42
    ; This is the heading for AnotherSection
    [AnotherSection]
    ; another comment
    asfld=69
    BUK_LIKES_SUNDIALS

    my $section1 = get_section( $inifile_contents, 'MySection' );
    print "This is the contents of MySection -------\n$section1\n";
    my $section2 = get_section( $inifile_contents, 'AnotherSection' );
    print "This is the contents of AnotherSection -------\n$section2\n";

Running the test program above produces:

    This is the contents of MySection -------
    ; This is a comment line for MySection
    fld1 = 'value of field 1'
    fld2 = 42
    This is the contents of AnotherSection -------
    ; another comment
    asfld=69

I added the code to remove trailing comment lines because I found, in practice, that trailing comment lines in a section tended to be unrelated to that section, rather they were usually header comment lines for the following section.

Though general suggestions for code improvements are welcome, my specific question relates to this eyesore:

    # Remove up to three trailing comment lines
    $s =~ s/^;.*\Z//m; chomp $s;
    $s =~ s/^;.*\Z//m; chomp $s;
    $s =~ s/^;.*\Z//m; chomp $s;

that I am currently using to remove trailing comment lines from a section. What's a better way to do it?

Replies are listed 'Best First'.
Re: Removing multiple trailing comment lines from a string (\n)
by tye (Sage) on Dec 23, 2016 at 05:01 UTC
Re: Removing multiple trailing comment lines from a string
by Laurent_R (Canon) on Dec 23, 2016 at 07:28 UTC
    It would be good to see some of your data, to try to figure out what you're doing and why.

    Just a couple of comments on some code details, although it might be that it would be better to change it overall.

    Why would you need this:

        $s =~ s/^[ \t]+//mg; # remove leading whitespace from each line
        $s =~ s/^\s+//;      # remove leading whitespace

    when the second line will do everything that the first line is doing? (Same for trailing spaces.)

    Similarly, I don't see the reason to run the same pair of statements three times:

        $s =~ s/^;.*\Z//m; chomp $s;
        $s =~ s/^;.*\Z//m; chomp $s;
        $s =~ s/^;.*\Z//m; chomp $s;

    Once should be enough, no? And I doubt the chomp is useful here.

      Why would you need this:

          $s =~ s/^[ \t]+//mg; # remove leading whitespace from each line
          $s =~ s/^\s+//;      # remove leading whitespace

      when the second line will do everything that the first line is doing?

      Remember that $s is a multi-line string. So the regex mg modifier in the first regex above ensures that each line in the multi-line string has leading spaces and tabs removed from it.

      The second regex, OTOH, does not have any modifiers, so it does not apply to every line in the multi-line string; instead, it trims leading whitespace (this time, including newlines) from the front of the multiline string -- trimming multiple blank lines from the front of a multi-line string, for example.
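
To make the difference concrete, here is a small sketch that applies both substitutions to the same multi-line string. The sample string and variable names are made up for illustration; they are not from the original program.

```perl
use strict;
use warnings;

# Hypothetical sample: two leading blank lines, then two indented lines.
my $s = "\n\n  \tfirst\n    second\n";

# With /mg, ^ matches at the start of every line, so each line loses its
# leading spaces and tabs. The blank lines themselves are left alone.
(my $per_line = $s) =~ s/^[ \t]+//mg;

# With no modifiers, ^ matches only at the start of the string, and \s
# also matches newlines, so the leading blank lines (and the first line's
# indent) are removed while later lines keep their indentation.
(my $whole = $s) =~ s/^\s+//;

print "per line: [$per_line]\n";
print "whole:    [$whole]\n";
```

So the two substitutions overlap only on the first non-blank line; neither one makes the other redundant.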

      Similarly, I don't see the reason to run the same pair of statements three times:

          $s =~ s/^;.*\Z//m; chomp $s;
          $s =~ s/^;.*\Z//m; chomp $s;
          $s =~ s/^;.*\Z//m; chomp $s;

      Once should be enough, no? And I doubt the chomp is useful here.
      Again, remember we are dealing with a multi-line string. So running it once removes just the last comment line, not the last three comment lines. Also, please note that the first:
      $s =~ s/^;.*\Z//m;
      removes the contents of the last comment line of a multi-line string (note that \Z matches only at the end of the (multi-line) string, or just before a final newline, not at the end of each line). So if you ran it again without the chomp, it would not get far: the string now ends in a newline, so at most one more comment line is removed before the leftover newlines block any further match. The chomp is needed to remove the newline now sitting at the end of the string. An alternative to chomp, suggested above by tye, is to eschew the m modifier and remove the newline as part of the regex, like so:
      s/\n;.*\Z//
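
As a rough sketch of where that form can lead: because the newline is consumed together with the comment line, the substitution can simply be repeated until it stops matching, with no chomp in between. This assumes the string has already had trailing whitespace trimmed, as get_section does; the helper name and sample data below are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: repeatedly drop the last line while it is a comment.
# Assumes trailing whitespace was already trimmed from the string.
sub strip_trailing_comment_lines {
    my $s = shift;
    # Each pass removes one newline plus the comment line that follows it.
    # Without /s, . does not match a newline, so ;.* stays on one line,
    # and \z matches only at the very end of the string.
    1 while $s =~ s/\n;.*\z//;
    return $s;
}

my $section  = "; leading comment\nfld1 = 1\n; trailer 1\n; trailer 2";
my $stripped = strip_trailing_comment_lines($section);
print "$stripped\n";
```

One caveat of this sketch: a comment on the very first line has no newline in front of it, so it is never removed, even if the section contains nothing but comments.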

Re: Removing multiple trailing comment lines from a string
by Marshall (Canon) on Dec 23, 2016 at 11:12 UTC
    I changed your 3 expressions into a while loop and I modified the test cases. Is this what you want?

        use strict;
        use warnings;

        # Given an ini file, return a string containing section contents.
        # (Note that ini file comment lines start with a ;)
        sub get_section {
            my $fcontents = shift;   # in: ini file contents string
            my $section   = shift;   # in: section name to get

            # Note that the regex below will find multiple sections;
            # it's terminated by the start of a new section or end of file.
            my $s = join( "", $fcontents =~ /^[ \t]*\[$section\](.*?)(?:\Z|^[ \t]*\[)/msg );
            $s =~ s/^[ \t]+//mg;   # remove leading whitespace from each line
            $s =~ s/[ \t]+$//mg;   # remove trailing whitespace from each line
            $s =~ s/^\s+//;        # remove leading whitespace
            $s =~ s/\s+$//;        # remove trailing whitespace

            # Remove up to three trailing comment lines #### Modified #####
            while ($s =~ s/^;.*\Z//m){chomp $s;} ######## NEW ##########
            chomp $s; ## NEW ##
            return $s;
        }

        my $inifile_contents = <<'BUK_LIKES_SUNDIALS';
        [MySection]
        ; This is a comment line for MySection
        fld1 = 'value of field 1'
        fld2 = 42
        ; comment 1 inside the section
        ; comment 2 inside the section
        fld3 =89
        ; trailer
        ; trailer 2
        ; trailer 3
        ; trailer 4
        ; trailer 5

        ; This is the heading for AnotherSection
        [AnotherSection]
        ; another comment
        asfld=69
        BUK_LIKES_SUNDIALS

        my $section1 = get_section( $inifile_contents, 'MySection' );
        print "This is the contents of MySection -------\n$section1\n";
        my $section2 = get_section( $inifile_contents, 'AnotherSection' );
        print "This is the contents of AnotherSection -------\n$section2\n";
        __END__
        Prints:
        This is the contents of MySection -------
        ; This is a comment line for MySection
        fld1 = 'value of field 1'
        fld2 = 42
        ; comment 1 inside the section
        ; comment 2 inside the section
        fld3 =89
        This is the contents of AnotherSection -------
        ; another comment
        asfld=69

      Is this what you want?
      Yep. Thanks. I was hoping it could be done with a single regex, but your solution looks good.

        That looks wrong to me because it removes more than three lines.

        Does this look correct?

            #!/usr/bin/perl
            use strict;
            use warnings;

            # Given an ini file, return a string containing section contents.
            # (Note that ini file comment lines start with a ;)
            sub get_section {
                my $fcontents = shift;   # in: ini file contents string
                my $section   = shift;   # in: section name to get

                # Note that the regex below will find multiple sections;
                # it's terminated by the start of a new section or end of file.
                my $s = join( "", $fcontents =~ /^[ \t]*\[$section\](.*?)(?:\Z|^[ \t]*\[)/msg );
                $s =~ s/^[ \t]+//mg;   # remove leading whitespace from each line
                $s =~ s/[ \t]+$//mg;   # remove trailing whitespace from each line
                $s =~ s/^\s+//;        # remove leading whitespace
                $s =~ s/\s+$//;        # remove trailing whitespace

                # Remove up to three trailing comment lines
                $s =~ s/ ( \n (;.*)? ){1,3} \z //x; # hopefully less of an eyesore
                return $s;
            }

            my $inifile_contents = <<'BUK_LIKES_SUNDIALS';
            [MySection]
            ; This is a comment line for MySection
            fld1 = 'value of field 1'
            fld2 = 42
            ; comment 1 inside the section
            ; comment 2 inside the section
            fld3 =89
            ; trailer
            ; trailer 2
            ; trailer 3
            ; trailer 4
            ; trailer 5

            ; This is the heading for AnotherSection
            [AnotherSection]
            ; another comment
            asfld=69
            BUK_LIKES_SUNDIALS

            my $section1 = get_section( $inifile_contents, 'MySection' );
            print "This is the contents of MySection -------\n$section1\n";
            my $section2 = get_section( $inifile_contents, 'AnotherSection' );
            print "This is the contents of AnotherSection -------\n$section2\n";

        which prints

            This is the contents of MySection -------
            ; This is a comment line for MySection
            fld1 = 'value of field 1'
            fld2 = 42
            ; comment 1 inside the section
            ; comment 2 inside the section
            fld3 =89
            ; trailer
            ; trailer 2
            ; trailer 3
            ; trailer 4
            This is the contents of AnotherSection -------
            ; another comment
            asfld=69

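
One thing to watch with the regex in this reply: because (;.*)? is optional, a bare newline also counts as one of the up-to-three units, so blank lines use up part of the quota. The sketch below compares it with a stricter variant that only accepts comment lines. The sample string, including its blank line, and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical trimmed section text with a blank line before the last comment.
my $s = "fld3 =89\n; trailer 4\n; trailer 5\n\n; heading for next section";

# Form used in the reply: each unit is a newline optionally followed by a
# comment, so the blank line is counted as one of the three removed lines.
(my $loose = $s) =~ s/ ( \n (;.*)? ){1,3} \z //x;

# Stricter variant: every unit must be a comment line. The blank line
# breaks the run, so only the final comment is removed.
(my $strict = $s) =~ s/ (?: \n ;.* ){1,3} \z //x;

print "loose:  [$loose]\n";
print "strict: [$strict]\n";
```

Which behaviour is wanted depends on whether blank lines between trailing comments should be treated as part of the trailer; the stricter form leaves a trailing newline behind in this case, so a final trim may still be needed.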
Re: Removing multiple trailing comment lines from a string
by kcott (Archbishop) on Dec 23, 2016 at 17:19 UTC

    G'day eyepopslikeamosquito,

    I can see what you've done to run the tests; however, I don't know how that translates to your real-world code. The solution I've provided below is substantially different from your code. The main differences are:

    • In your code, you pass the entire INI file as a string to &get_section and parse it with a regex. You do this every time that function is called. In my solution, I read the INI file once, clean it up and store the result in a hash (&get_clean_ini_data). &get_section now only contains a single statement which accesses the data in that hash.
    • I've reduced your four whitespace removal regexes to a single regex: s/^\s*(.*?)\s*$/$1/.
    • There's only one other regex (for capturing the section name): /^\[([^]]+)/.
    • The removal of trailing comments is done by &strip_trailing_comments. This simply works backwards through a section's lines; removing comments until a non-comment line is found. The index function, rather than a regex, is used to identify these comments.
    • I've also added a [WhitespaceSection] with test data for checking the whitespace cleanup.
    • You could probably adapt this to your real-world requirements by making the INI filename an argument to &get_clean_ini_data; adding an open statement; and changing <DATA> to <$ini_fh>. I think everything else should work as is.
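
The adaptation described in the last bullet might look roughly like the sketch below. It is a variation on the approach rather than the posted code: the function name load_ini_sections, the returned hash reference, and the temporary sample file are all assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical adaptation: read sections from a named file instead of DATA.
# Returns a hash ref mapping section name => array ref of cleaned lines.
sub load_ini_sections {
    my ($ini_path) = @_;
    open my $ini_fh, '<', $ini_path
        or die "Cannot open '$ini_path': $!";

    my (%lines_for, $current);
    while (my $line = <$ini_fh>) {
        $line =~ s/^\s*(.*?)\s*$/$1/;   # trim both ends (drops the newline too)
        next unless length $line;       # skip blank lines
        if ($line =~ /^\[([^]]+)/) {    # section header
            $current = $1;
            $lines_for{$current} ||= [];
        }
        elsif (defined $current) {      # ignore lines before the first section
            push @{ $lines_for{$current} }, $line;
        }
    }
    close $ini_fh;

    # Work backwards through each section, dropping trailing comment lines.
    for my $lines (values %lines_for) {
        pop @$lines while @$lines && index($lines->[-1], ';') == 0;
    }
    return \%lines_for;
}

# Usage with a throwaway sample file (contents made up for this example).
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "[MySection]\n; note\nfld1 = 1\n; heading for next\n[Other]\nx=2\n";
close $tmp_fh;

my $sections = load_ini_sections($tmp_path);
print join("\n", @{ $sections->{MySection} }), "\n";
```

Returning a hash reference instead of keeping the data in a closure is just one option; it makes the function easier to call for several files, at the cost of the caller holding the data.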

    Here's "pm_1178405_ini_file_clean.pl":

        #!/usr/bin/env perl -l

        use strict;
        use warnings;

        get_clean_ini_data();

        for (qw{MySection AnotherSection WhitespaceSection}) {
            print "Contents of '$_':\n", get_section($_);
        }

        {
            my %section_lines_for;

            sub get_clean_ini_data {
                my $current_section;

                while (<DATA>) {
                    s/^\s*(.*?)\s*$/$1/;
                    next unless length;

                    if (/^\[([^]]+)/) {
                        my $new_section = $1;
                        strip_trailing_comments($current_section);
                        $current_section = $new_section;
                    }
                    else {
                        push @{$section_lines_for{$current_section}}, $_;
                    }
                }

                strip_trailing_comments($current_section);
            }

            sub strip_trailing_comments {
                my $section = shift;

                return unless defined $section;

                for my $i (reverse 0 .. $#{$section_lines_for{$section}}) {
                    if (0 == index $section_lines_for{$section}[$i], ';') {
                        pop @{$section_lines_for{$section}};
                    }
                    else {
                        last;
                    }
                }
            }

            sub get_section { join "\n", @{$section_lines_for{$_[0]}} }
        }

        __DATA__
        [MySection]
        ; This is a comment line for MySection
        fld1 = 'value of field 1'
        fld2 = 42
        ; This is the heading for AnotherSection
        [AnotherSection]
        ; another comment
        asfld=69
        ; Heading for WhitespaceSection
        [WhitespaceSection]
        ; Comment starting with a tab
        ; Comment starting with a tab and a space
        ; Comment starting with a space
        ; Comment ending with a tab
        ; Comment ending with a tab and a space
        ; Comment ending with a space
        ; tab+space+comment+space+tab
        ; space+tab+comment+tab+space
        qwe=rty
        asd=fgh
        ; trailing 1
        ; tab + trailing 2
        ; space + trailing 3
        ; trailing 4

    Output:

        $ pm_1178405_ini_file_clean.pl
        Contents of 'MySection':
        ; This is a comment line for MySection
        fld1 = 'value of field 1'
        fld2 = 42
        Contents of 'AnotherSection':
        ; another comment
        asfld=69
        Contents of 'WhitespaceSection':
        ; Comment starting with a tab
        ; Comment starting with a tab and a space
        ; Comment starting with a space
        ; Comment ending with a tab
        ; Comment ending with a tab and a space
        ; Comment ending with a space
        ; tab+space+comment+space+tab
        ; space+tab+comment+tab+space
        qwe=rty
        asd=fgh

    Because whitespace is difficult to see (especially differentiating spaces from tabs), I passed the script and output through `cat -vet`. I used this for my own testing; you might also find it useful. The relevant parts are in the spoiler.

    — Ken

      I don't know how that translates to your real-world code. The solution I've provided below is substantially different from your code.
      It's early days yet and requirements are a bit unclear right now. I was after ideas for general approaches and you've provided some interesting and useful code. Thanks.

Re: Removing multiple trailing comment lines from a string (array of tokens)
by Anonymous Monk on Dec 23, 2016 at 04:22 UTC

    Though general suggestions for code improvements are welcome, my specific question relates to this eyesore:

    :) Sorry, it's all eyesore :P

    Mostly the problem is you're trying to s///ubstitute when you should be manipulating an array after m//atching , for example

        push @stack, [ COMMENT => $1 ];
        ...
        $stack[-1][0] eq 'COMMENT' and pop @stack for 1..3;

    But feel free to do your own benchmarks

        $ perl -le " $_ = qq{;banana\n;ro\n;sham\n;bo\n}; print; s{(?:^;[^\r\n]*[\r\n]+){1,3}\Z}{}m; print; "
        ;banana
        ;ro
        ;sham
        ;bo

        ;banana

    That's just what I think at the moment :)
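
A fuller sketch of the array-of-tokens idea might look like the following. The token names, the sample text, and the guard against an empty stack are assumptions added for illustration; only the push/pop shape comes from the snippet above.

```perl
use strict;
use warnings;

# Hypothetical input: one leading comment, one field, four trailing comments.
my $text = "; intro\nfld1 = 1\n; t1\n; t2\n; t3\n; t4\n";

# Classify each non-blank line as a COMMENT or LINE token.
my @stack;
for my $line (split /\n/, $text) {
    next unless $line =~ /\S/;
    if ($line =~ /^\s*;/) { push @stack, [ COMMENT => $line ] }
    else                  { push @stack, [ LINE    => $line ] }
}

# Pop at most three trailing COMMENT tokens, stopping early at the first
# non-comment token or when the stack runs out.
for (1 .. 3) {
    last unless @stack && $stack[-1][0] eq 'COMMENT';
    pop @stack;
}

my $result = join "\n", map { $_->[1] } @stack;
print "$result\n";
```

With four trailing comments and a cap of three, the oldest trailing comment survives, which mirrors the "up to three" behaviour of the original code.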

Re: Removing multiple trailing comment lines from a string
by Marshall (Canon) on Dec 26, 2016 at 01:48 UTC
    I looked again at this thread and at kcott's line by line solution at Re: Removing multiple trailing comment lines from a string
    I would also be thinking of parsing line by line instead of as string.
    I've shown another method for that.

    Some general comments re: .ini files:

    • I find the comments meaningless unless I have to modify and re-write the .ini file.
    • These kinds of files are often a mixture of program-generated and user-modified contents.
      The comments can be very helpful to the user although they are meaningless to the program. I am unsure whether modifying the user's typed-in comments in any way is a good idea. Mileage varies.
    • Getting an exact definition of what the file looks like is, in my opinion, important.
    • Some files have an important "root" un-named section at the beginning with maybe a version number and stuff like that, as well as sometimes parameters which override variables within the sections. Whoa! If you are writing or can influence the spec, I would recommend not doing that.
    • Some Perl modules like Config::Tiny do a good job of getting the basic parms into an HoH. That is something this code (like the others in this thread) does not handle.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Data::Dumper;

        my %HoA;

        while (my $line=<DATA>)
        {
            # declared on its own line: "my ... if ..." has undefined behaviour
            my $next_line;
            $next_line = process_section ($1) if $line =~ /\s*\[\s*(\S+)\s*\]/;
            if (defined $next_line) {$line = $next_line; redo}
        }
        print Dumper \%HoA;

        sub process_section
        {
            my $section = shift;
            my $line;
            while (defined ($line=<DATA>) and $line !~ /\s*\[\s*(\S+)\s*\]/)
            {
                chomp $line;
                next if $line =~ /^\s*$/;  # skip blank lines
                push @{$HoA{$section}}, $line;
            }
            # delete the "trailing comments" in this [section] heading
            my $comment;
            while (defined ($comment = pop @{$HoA{$section}}) and $comment =~ /^\s*;/ ){};
            # put back the non-comment line that stopped the loop, if there was one
            push @{$HoA{$section}}, $comment if defined $comment;
            return $line;
        }
        =prints
        $VAR1 = {
                  'AnotherSection' => [
                                        '; another comment',
                                        'asfld=69'
                                      ],
                  'MySection' => [
                                   '; This is a comment line for MySection',
                                   'fld1 = \'value of field 1\' ',
                                   'fld2 = 42'
                                 ],
                  'WhitespaceSection' => [
                                           ' ; Comment starting with a tab',
                                           ' ; Comment starting with a tab and a space',
                                           ' ; Comment starting with a space',
                                           '; Comment ending with a tab ',
                                           '; Comment ending with a tab and a space ',
                                           '; Comment ending with a space ',
                                           ' ; tab+space+comment+space+tab ',
                                           ' ; space+tab+comment+tab+space ',
                                           'qwe=rty',
                                           'asd=fgh'
                                         ]
                };
        =cut
        __DATA__
        ; this is root section
        a = 2
        ; some comment in root
        b = 3
        ; some trailing comment in root
        [MySection]
        ; This is a comment line for MySection
        fld1 = 'value of field 1'
        fld2 = 42
        ; This is the heading for AnotherSection
        [AnotherSection]
        ; another comment
        asfld=69
        ; Heading for WhitespaceSection
        [WhitespaceSection]
        ; Comment starting with a tab
        ; Comment starting with a tab and a space
        ; Comment starting with a space
        ; Comment ending with a tab
        ; Comment ending with a tab and a space
        ; Comment ending with a space
        ; tab+space+comment+space+tab
        ; space+tab+comment+tab+space
        qwe=rty
        asd=fgh
        ; trailing 1
        ; tab + trailing 2
        ; space + trailing 3
        ; trailing 4

    PS: my previous code explicitly allowed more than 3 trailing comments because I thought that was a requirement and was one of the "problems". I am also not sure why some .ini file comments should be ignored and others not? That is a strange thing to me.
