Perl code to remove '//' comments in other programming languages. (Mr Moderator, remove this if it's already posted somewhere.)
$text =~ s/(
    (?:(?:[^\/][^\/]*?|)*?      # anything, except comment symbol
    ("|').*?\2                  # quoted string (and take everything)
    (?:[^\/][^\/]*?|)*?         # anything, except comment symbol
    )*?
    )
    (?:
    \/\/                        # comment symbol
    [^\\]*                      # anything, except continuation
    )??$
    /$1/x;
print $text, "\n";

Re: Removing '//' comments (tokenize)
by tye (Sage) on Jul 06, 2006 at 05:36 UTC

    First, try not to use a regex delimiter that forces you to escape a lot more characters. More importantly, your regex has several mistakes. Trying to match "anything except this multi-character sequence" is almost always done wrong, even by top experts (at least a few times), so that isn't a big surprise. (:

    For example, [^\/][^\/]*? is really just the same as [^/]+?, and so avoids matching single slashes rather than avoiding double slashes, as the construct hints at. So (?:[^\/][^\/]*?|)*? boils down to ([^/]*?)*?, which is just an inefficient way of writing [^/]*?.
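
    To see the point about single slashes, here is a small sketch that anchors both forms to a whole string and compares them. The variable names and the helper verdicts are made up for this illustration; only the two patterns come from the discussion above.

```perl
use strict;
use warnings;

# The construct from the original post, anchored to cover the whole string.
my $original   = qr{\A(?:[^/][^/]*?|)*?\z};
# The simpler pattern it reduces to.
my $simplified = qr{\A[^/]*?\z};

# Hypothetical helper: reports whether each pattern accepts the text.
sub verdicts {
    my ($text) = @_;
    my $o = $text =~ $original   ? 'match' : 'no match';
    my $s = $text =~ $simplified ? 'match' : 'no match';
    return "original: $o, simplified: $s";
}

for my $sample ('a + b', 'a / b', 'a // b') {
    print "$sample => ", verdicts($sample), "\n";
}
```

    Both patterns reject 'a / b' just as they reject 'a // b', so a lone division operator already stops the "anything except the comment symbol" part.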

    ("|').*?\2 will stop matching too early for "This \"string\" with quotes" but can also "backtrack" and match too much. You really want to force this construct to only match exactly quoted strings. So, '([^'\\]+|\\.)*'|"([^"\\]+|\\.)*" instead.
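
    A quick way to compare the two quoted-string patterns is to run both against the example string. This is only a sketch: first_match is a hypothetical helper, non-capturing groups are used for brevity, and the backreference is \1 here because the pattern stands alone rather than inside an outer capture group.

```perl
use strict;
use warnings;

# Loose pattern from the original post: a quote, anything, then the same quote.
my $loose  = qr/("|').*?\1/;
# Stricter pattern: only unescaped runs or backslash escapes between the quotes.
my $strict = qr/'(?:[^'\\]+|\\.)*'|"(?:[^"\\]+|\\.)*"/;

# Hypothetical helper: returns the first substring of $text matched by $re.
sub first_match {
    my ($text, $re) = @_;
    return $text =~ $re ? substr($text, $-[0], $+[0] - $-[0]) : undef;
}

my $code = q{puts("This \"string\" with quotes");};
print "loose:  ", first_match($code, $loose),  "\n";   # "This \"
print "strict: ", first_match($code, $strict), "\n";   # "This \"string\" with quotes"
```

    The loose form stops at the first escaped quote, while the strict form consumes each backslash escape as a unit and only ends at the real closing quote.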

    Your "stuff I don't care about" needs to avoid matching quotes or slashes so that you don't just skip over a starting quote as "something I don't care about". So your regex needs something like ([^'"/]+|$quotes|...)*.

    And a tricky part is the "skip over / but not over //". Something like (?<!/)/(?!/).
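
    As a small check of that lookaround pair, the sketch below lists the offsets where it matches. The helper name lone_slash_positions is invented for this example.

```perl
use strict;
use warnings;

# A slash that is neither preceded nor followed by another slash.
my $lone_slash = qr{(?<!/)/(?!/)};

# Hypothetical helper: returns the offsets of every lone slash in $text.
sub lone_slash_positions {
    my ($text) = @_;
    my @positions;
    while ($text =~ /$lone_slash/g) {
        push @positions, $-[0];
    }
    return @positions;
}

print join(',', lone_slash_positions('a/b // c/d')), "\n";   # 1,8
```

    The two division-style slashes are found, and neither slash of the // pair is.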

    Which brings us to this:

    $text =~ s<
        (^
            (?: [^/'"]+
              | '([^'\\]+|\\.)*'
              | "([^"\\]+|\\.)*"
              | (?<!/)/(?!/)
            )*
        )
        //.*
    ><$1>xgm;

    Which likely has several bugs. Note that I didn't allow for \ to cause the comment to continue on to subsequent lines because I both believe and hope that such doesn't actually work in the languages that I use //-comments in.

    Note that the pseudo tokenizer needs to match any constructs that could contain quotes or slashes so, for example, /* ... */ would need to be handled if such might be encountered.
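
    One possible way to add such a construct is sketched below, not as a definitive implementation. It departs from the regex above in several places, all of them assumptions of this sketch: it uses s{}{} delimiters, non-capturing groups, possessive quantifiers (Perl 5.10 or later) to limit backtracking, a branch for /* ... */ comments that open and close on the same line, and a lone-slash test that also refuses to treat the start of /* as an ordinary slash. The name strip_line_comments is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: strips // comments from a chunk of source text.
sub strip_line_comments {
    my ($text) = @_;
    $text =~ s{
        (                             # capture everything before the comment
            ^
            (?: [^/'"]++              # ordinary characters
              | '(?:[^'\\]++|\\.)*'   # single-quoted string
              | "(?:[^"\\]++|\\.)*"   # double-quoted string
              | /\*.*?\*/             # block comment closed on the same line
              | (?<!/)/(?![/*])       # lone slash that opens no comment
            )*
        )
        //.*                          # the comment itself
    }{$1}xgm;
    return $text;
}

print strip_line_comments(qq{blah // comment\n});
print strip_line_comments(q{*str = "//\"//"; // set} . "\n");
print strip_line_comments(qq{/* // */ x\n});
```

    Block comments that span several lines are still not handled: a // on a later line inside such a comment would be stripped, so a real tokenizer would need to track that state as well.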

    - tye        

Re: Removing '//' comments
by GrandFather (Saint) on Jul 06, 2006 at 04:51 UTC

    Some interesting cases:

    use warnings;
    use strict;

    while (<DATA>) {
        s/(
            (?:(?:[^\/][^\/]*?|)*?      # anything, except comment symbol
            ("|').*?\2                  # quoted string (and take everything)
            (?:[^\/][^\/]*?|)*?         # anything, except comment symbol
            )*?
            )
            (?:
            \/\/                        # comment symbol
            [^\\]*                      # anything, except continuation
            )??$
            /$1/x;
        print;
    }

    __DATA__
    /* //
    */
    /* */
    *str = "//\"//";

    Prints:

    /* */
    /* */
    *str = "//\"

    DWIM is Perl's answer to Gödel
Re: Removing '//' comments
by davidrw (Prior) on Jul 06, 2006 at 06:12 UTC
    My first thought here was Regexp::Common::comment ... Looking in its source, it basically does s#//[^\n]*$##s; It's interesting to note that it behaves the same way as the regexp GrandFather demo'd above, especially with *str = "//\"//";

    tye's example (though I had to make it s#foo#bar#xgm instead of s<foo><bar>xgm before it would compile) does work with GrandFather's test cases.
    use warnings;
    use Regexp::Common qw /comment/;

    while(<DATA>){
        my ($line, $simple, $RE, $tye) = ($_)x4;
        $simple =~ s#//[^\n]*$##s;
        $RE =~ s#$RE{comment}{Portia}#\n#;
        $tye =~ s#
            (^
                (?: [^/'"]+
                  | '([^'\\]+|\\.)*'
                  | "([^"\\]+|\\.)*"
                  | (?<!/)/(?!/)
                )*
            )
            //.*
        #$1#xgm;
        print " [DATA] $line";
        print "[simple] $simple";
        print " [RE] $RE";
        print " [tye] $tye";
        print "\n";
    }

    __DATA__
    blah // comment
    /* //
    */
    /* */
    *str = "//\"//";
Re: Removing '//' comments
by Zaxo (Archbishop) on Jul 06, 2006 at 04:48 UTC

    Fails if '//' occurs in quoted text:

    int foo = sprintf( line, "Comparing, //-comments in C/C++ act like #-comments in %s. ", "perl");

    Update: oops, misread, see GrandFather's examples instead.

    After Compline,
    Zaxo