WHolcomb has asked for the wisdom of the Perl Monks concerning the following question:

I posted this a bit ago in the regex section, but being unfamiliar with the quoting standards I mangled it a bit. Here it is (hopefully) in readable form.

I am trying to write a regex that removes the comments from lines, but I want people to be able to use the comment character in their input if they escape it, à la Perl/C/Java backslash metaquoting. I would like a regex I can use like s/$regex/$1/ to remove the comments.

As it is I have been using the following (squished a bit to save room):
$m = "\\"; # Meta
$c = "\#"; # Comment
print "Begin: $_\n";
split /\Q$c\E/;
$string = $_ = $_[0]; # Because split puts the first non-blank line in $_
for($i = 1; $i <= $#_; $i++) {
    $_ = $_[$i - 1];
    (/((^|[^\Q$m\E])((\Q$m\E){2})*$)/) ? (last) : ($string .= "$c" . $_[$i]);
}
print "  End: $string\n";
And that has caught every test case I have made up, like:
this is a line with a \# pound # and a comment
this is a \# line with three \#'\#'s # and a comment
this is a line \\\\ with shashes \\\\# and a comment
this is a line \\\\ with shashes \\\# and a pound # and a comment
#this line is only comment
\#this line begins with a pound
\\# This line begins with a slash
\# This line \# \\# has a pound at the beginning
Can anyone come up with a regex to do the same job?

Will

Replies are listed 'Best First'.
Re: quoting characters
by ahunter (Monk) on Apr 14, 2000 at 19:28 UTC
    Think I've got it...
    while ($_ = <STDIN>) {
        chomp;
        s/^((([^\#\\])|(\\.))*)\#.*$/$1/;
        print "$_\n";
    }
    Seems to do the trick, assuming record delimiters in STDIN are correctly set. Basically, the regexp matches any character that is neither the comment character nor the quote character, *or* a quote character followed by any single character. Because the match is anchored at the start of the line and a quoted character is always consumed together with its backslash, it runs up to the first unquoted comment character, and we just substitute away everything from there on...
    I think substituting out the quotes themselves requires another s/// statement, but it's fairly trivial ;-)
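A minimal sketch of how the two steps might fit together, assuming the unquoting rule is simply "a backslash followed by any character becomes that character" (the thread does not spell this out). The helper name strip_comment is hypothetical, and the pattern uses non-capturing groups instead of the capturing ones above, which should not change what it matches.

```perl
use strict;
use warnings;

# strip_comment is a hypothetical helper name, not something from the thread.
sub strip_comment {
    my ($line) = @_;

    # Pass 1: consume either a character that is neither '#' nor '\',
    # or a backslash together with the character it quotes; then delete
    # everything from the first bare '#' onward.
    $line =~ s/^((?:[^#\\]|\\.)*)#.*$/$1/;

    # Pass 2 (assumed unquoting rule): replace each backslash-quoted
    # character with the character itself.
    $line =~ s/\\(.)/$1/g;

    return $line;
}

print strip_comment('this is a line with a \# pound # and a comment'), "\n";
```

The order matters here: unquoting first would turn \# into a bare # and the comment pass could no longer tell the two apart.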
      That is exactly what I have been trying to come up with for the last week. I remembered seeing someone somewhere write a regex to match a C string and I knew that was what I needed, but I couldn't remember the form. Thanks.
Re: quoting characters
by turnstep (Parson) on Apr 14, 2000 at 19:56 UTC

    This seems to do the trick:

    s/(\\*)(#[^#\\]*)/length($1)?length($1)%2?"$1$2":$1:""/eg;

    And here it is again, in a more readable form:

    s/(\\*)(#[^#\\]*)/
      ## Matching anything before a pound sign, putting (back)slashes
      ## (or lack thereof) into $1.
      ## Also make sure that we do not grab any slashes and pounds after
      ## the first pound
      if (length($1)) {         ## If we found some backslashes
        if (length($1)%2) {     ## If there are an odd number of backslashes
          "$1$2";               ## Return all slashes and pound - this is not a true comment
        } else {
          $1;                   ## Even number - return only the slashes
        }
      } else {
        "";                     ## Return nothing, this is a comment
      }
    /eg;

Re: quoting characters
by turnstep (Parson) on Apr 14, 2000 at 19:56 UTC

    After thinking about the problem some more, I realized that my solution would not handle another case, not represented by the examples in the original question, namely something as silly as:

    This code stops here # but comments continue \# with a fake extension here
    A workaround is to make the regex into two expressions:
    s/(\\*)(#[^#\\]*)/length($1)?length($1)%2?"$1$2":$1:"TURNSTEP"/eg;
    s/TURNSTEP.*//;

    This makes sure that everything after a genuine comment is removed, period.

    Ideally, you'd use something shorter and not likely to be in the input, e.g. a control character or something, but then again, 'TURNSTEP' is not very likely either. (If it is, I'd like to see that code! ;)
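A rough sketch of the control-character idea, assuming a NUL byte never appears in the input. The helper name strip_comment_two_pass is hypothetical. One deliberate difference from the two-expression version: this sketch also drops the marker after an even run of backslashes, since the pound sign there is unquoted and presumably starts a genuine comment as well. That is my reading of the intent, not something stated in the thread.

```perl
use strict;
use warnings;

# strip_comment_two_pass is a hypothetical helper name for illustration.
sub strip_comment_two_pass {
    my ($line) = @_;
    my $mark = "\x00";   # sentinel, assumed never to occur in the input

    # Pass 1: odd run of backslashes -> the pound is quoted, keep it all;
    # even run -> keep the backslashes and mark the comment start
    # (assumption); no backslashes -> mark the comment start.
    $line =~ s{(\\*)(#[^#\\]*)}
              {length($1) ? (length($1) % 2 ? "$1$2" : $1 . $mark) : $mark}eg;

    # Pass 2: remove everything from the first marker to the end.
    $line =~ s/\Q$mark\E.*//s;

    return $line;
}

print strip_comment_two_pass('This code stops here # but comments continue \# with a fake extension here'), "\n";
```

Backslashes are left in place, so unquoting would still need its own pass afterwards.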