ExReg has asked for the wisdom of the Perl Monks concerning the following question:

I am working on a large script to analyze code. I am trying to find key words in code, but not if they are in commented lines. Thus, if I am looking for foo, I want to match
maybe some stuff foo maybe some more stuff;
but not
// maybe some stuff foo maybe some more stuff;
This looks like an appropriate place to use a negative lookbehind assertion. I have not found the right form yet. Experimenting, I have tried
    $a = "abcdefghijklm";
    print "match" if ( $a =~ /(?<!cde).*?jkl/ ); # match because a does not match cde, then bcdefghi follows, then jkl matches?
    print "match" if ( $a =~ /(?<=cde).*?jkl/ ); # match because cde matches, then fghi follows, then jkl matches

    $a = "xyzfghijklm";
    print "match" if ( $a =~ /(?<!cde).*?jkl/ ); # match because cde is not before jkl, then jkl matches
    print "match" if ( $a =~ /(?<=cde).*?jkl/ ); # no match because cde is not before jkl

I am looking for a regex that will not match abcdefghijklm, because cde precedes jkl, but will match xyzfghijklm. I would have thought that something like the first pattern above would be what I wanted.
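To see why the first pattern still matches, it may help to print where the match starts. The sketch below is only an illustration (the helper name match_start is made up here). It suggests that the lookbehind is tested only at the position where it appears in the pattern, so a leading (?<!cde) is checked at the start of the match and .*? is then free to consume cde itself.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset at which the pattern matched,
# or -1 if it did not match. $-[0] is the start of the last match.
sub match_start {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $-[0] : -1;
}

my $string = "abcdefghijklm";

# The lookbehind is tested once, at offset 0, where nothing precedes it.
# After that, .*? consumes "abcdefghi", including "cde".
print match_start($string, qr/(?<!cde).*?jkl/), "\n";   # prints 0

# Placing the lookbehind directly before jkl only inspects the three
# characters immediately preceding it ("ghi"), so this matches as well.
print match_start($string, qr/(?<!cde)jkl/), "\n";      # prints 9
```

Neither form expresses "cde appears nowhere earlier in the string", which is the condition actually wanted.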

If jkl only had to not be preceded by the single character c, the answer would be easy: ^[^c]*jkl. But I have multi-character sequences like cde that I must exclude.
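One way the single-character idea might be extended to a multi-character sequence is to test a negative lookahead at every position before the keyword, instead of using a lookbehind. The following is a minimal sketch, assuming the excluded sequence may appear anywhere earlier on the same line. The variable names are made up for illustration, and it uses only lookaheads, which have no length restriction.

```perl
use strict;
use warnings;

# Each character before jkl is consumed only if "cde" does not start there.
my $no_cde_before_jkl = qr/^(?:(?!cde).)*jkl/;

for my $string ("abcdefghijklm", "xyzfghijklm") {
    my $verdict = $string =~ $no_cde_before_jkl ? "match" : "no match";
    print "$string: $verdict\n";
}

# The same shape applied to the original problem: foo not preceded by //.
my $uncommented_foo = qr{^(?:(?!//).)*\bfoo\b};

for my $line ("maybe some stuff foo maybe some more stuff;",
              "// maybe some stuff foo maybe some more stuff;") {
    my $verdict = $line =~ $uncommented_foo ? "match" : "no match";
    print "$verdict: $line\n";
}
```

With these inputs the first string and the commented line are rejected, while the other two match. The sketch does not know about string literals, so a // inside a quoted string would also suppress the match.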

I am guessing that what I am looking for will have a (?<!cde) and then a jkl in it.

One trick I used several years ago seems to work here. Since you cannot do variable-length lookbehinds, I can reverse the strings and then do variable-length lookaheads, i.e.

    $a = "mlkjihgfedcba";
    print "match" if ( $a =~ /lkj(?!.*?edc)/ ); # no match because edc DOES follow lkj. This is what is desired. (It does not match in the unreversed when cde precedes jkl)

    $a = "mlkjihgfzyx";
    print "match" if ( $a =~ /lkj(?!.*?edc)/ ); # match because edc does not follow lkj. Again, this is what I want. (It does match jkl in the unreversed when not preceded by cde)
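The reversal does not have to be done by hand on the input. A small sketch, assuming the pattern itself is still written in reversed form (the helper name matches_reversed is hypothetical): reverse in scalar context reverses the characters of a string.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse the input, then apply a pattern that was
# written for reversed text. Returns 1 on a match, 0 otherwise.
sub matches_reversed {
    my ($string, $reversed_regex) = @_;
    my $reversed = reverse $string;   # scalar assignment reverses the characters
    return $reversed =~ $reversed_regex ? 1 : 0;
}

# "lkj not followed by edc" in reversed text corresponds to
# "jkl not preceded by cde" in the original text.
my $reversed_regex = qr/lkj(?!.*?edc)/;

for my $string ("abcdefghijklm", "xyzfghijklm") {
    my $verdict = matches_reversed($string, $reversed_regex) ? "match" : "no match";
    print "$string: $verdict\n";
}
```

The cost is that every pattern has to be maintained in mirrored form, which may be awkward for anything longer than a keyword.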

Is there another way?

I am guessing this is one of those areas where regexes come up short. I just learned about (?{ }) and (??{ }), but don't fully understand them yet. Where is a good place to learn advanced regex aside from perlre and Friedl?

I am limited to perl 5.6, if that makes a difference. I cannot use CPAN or any other libraries.

If this question seems disjoint, it is because my brain stores in hashes instead of sorted arrays.

Replies are listed 'Best First'.
Re: Capture uncommented keywords
by atcroft (Abbot) on Jul 27, 2015 at 20:59 UTC
    I cannot use CPAN or any other libraries.

    Actually, Yes, even you can use CPAN. If nothing else, you could copy code from a module and put it directly in your script, although you would not benefit from potential bug fixes, etc. (I would question why this artificial limitation exists, however.)

    As to the main question, it seems like the easiest thing would be to first remove comment lines (using either regexes or, more likely, a state machine), then process the remainder.

    Hope that helps.
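The state machine mentioned above could look roughly like the following. This is a sketch only, assuming C-like source with // and /* */ comments and backslash escapes inside quoted literals. The function name strip_comments is made up, and no CPAN modules are needed.

```perl
use strict;
use warnings;

# Hypothetical helper: strip // and /* */ comments from C-like source,
# leaving string and character literals untouched.
sub strip_comments {
    my ($source) = @_;
    my $state  = 'code';   # code | line_comment | block_comment | string
    my $quote  = '';       # the quote character that opened the literal
    my $output = '';
    my $length = length $source;

    for (my $i = 0; $i < $length; $i++) {
        my $char = substr($source, $i, 1);
        my $pair = substr($source, $i, 2);

        if ($state eq 'code') {
            if    ($pair eq '//') { $state = 'line_comment';  $i++ }
            elsif ($pair eq '/*') { $state = 'block_comment'; $i++ }
            else {
                if ($char eq '"' or $char eq "'") {
                    $state = 'string';
                    $quote = $char;
                }
                $output .= $char;
            }
        }
        elsif ($state eq 'line_comment') {
            # A newline ends the comment and is kept in the output.
            if ($char eq "\n") { $state = 'code'; $output .= $char }
        }
        elsif ($state eq 'block_comment') {
            if    ($pair eq '*/') { $state = 'code'; $i++ }
            elsif ($char eq "\n") { $output .= $char }   # keep line numbers stable
        }
        else {   # inside a string or character literal
            $output .= $char;
            if ($char eq '\\' and $i + 1 < $length) {
                $output .= substr($source, ++$i, 1);   # copy the escaped character
            }
            elsif ($char eq $quote) {
                $state = 'code';
            }
        }
    }
    return $output;
}

print strip_comments(qq{x = foo(1); // foo again\ns = "// kept"; /* gone */ y\n});
```

Newlines inside comments are preserved so that line numbers reported against the stripped text still refer to the original file.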

      I do like CPAN and use it at home; I just can't use it at work. Removing all the commented-out code first is a possibility, but I would much prefer a means to find these via regex alone. The script I am working on allows checking for new constructs by simply writing a regex and adding it with a name to a module. Need to find a new thing? Add a regex to a hash. Done. The old way took dozens of scripts and tens of thousands of lines of Perl. The new way takes fewer than 500.
      It would require that I type the CPAN source code into one computer while reading it from the other. No other way. Period.
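The regex-in-a-hash design described above could probably be kept even if comments are stripped first, because the stripping can live in the one driver that applies the table. A minimal sketch, in which the check names, the patterns, and the checks_hit helper are all hypothetical and only // comments are handled:

```perl
use strict;
use warnings;

# Hypothetical check table: name => pattern. Adding a check is one new entry.
my %checks = (
    uses_foo    => qr/\bfoo\b/,
    calls_scanf => qr/\bscanf\s*\(/,
);

# Hypothetical driver: return the names of all checks that hit the line
# once any trailing // comment has been removed.
sub checks_hit {
    my ($line) = @_;
    (my $code = $line) =~ s{//.*}{};
    return grep { $code =~ $checks{$_} } sort keys %checks;
}

for my $line ("x = foo(1); // scanf(", "// foo", "scanf(\"%d\", &foo);") {
    my @hits = checks_hit($line);
    print "$line => ", (@hits ? join(", ", @hits) : "nothing"), "\n";
}
```

The individual patterns stay simple, and none of them needs to know about comments.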
Re: Capture uncommented keywords
by afoken (Chancellor) on Jul 28, 2015 at 14:41 UTC
    Is there another way?

    Remove comments from the input before matching.

    For trivial code containing only // comments (not /* */) and without string constants containing //, the following should do the trick:

    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<>) {
        chomp;
        my $orig=$_;
        s|//.*||; # strip comments
        if (/\b(printf|scanf|open|close|read|write)\b/) {
            print "Found keyword '$1' in '$_'.\nOriginal line: '$orig'\n";
        }
    }

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      The more I think about the problem, the more it looks like I will have to do additional processing, like creating a copy of the code with the commented lines removed. It will add cycles to the analysis and make it merely blazingly fast instead of insanely fast compared to the old way. Sigh...

      Thanks