Re: Regex Strikes again!

Assuming you are looking for all comments in some nasty (large) Java programs, first ask: what are all the possible ways a comment can be malformed? And are they critical to your life goal?

That said, here is a sample input file with a few interesting possibilities (yes, it compiles):

/**
 *  The HelloWorldApp Class implements an application that displays "Hello World!" to standard output
 */
public class HelloWorldApp
 {
    public static void main (String[] args)
    {
       // Display "Hello World!"
       System.out.println("Hello World!"); // end of line comment
       System.out.println("Hello World!"); /* end of line comment */
       System.out.println("Hello World!/*"); /* end of line comment */
/* this
is
a
multi
line
comment
*/
/* another comment */
// /* is this a valid comment
/*
//  */
       System.out.println("//Hello World!/*"); // is /* this valid
    }
 }

Below is some rough code to start grabbing comments. Coding for the cute exceptions is left as an exercise for the student (I have always wanted to use that line, since it was used on me a long time ago) or as future discussion points. It prints the line number of the file where each comment begins, followed by the comment.

#!/usr/bin/perl
use strict;
use warnings;

my $infile = "HelloWorldApp.java";
my %linehash;                  # line number where a comment starts => comment text
my $comment_started = 0;       # line number of an open /* comment, 0 if none
my $multiline_comment = "";

open my $in, '<', $infile or die "could not open $infile: $!\n";
while (<$in>)
{
## you may need to get creative in matching comments
## because java allows some fun combinations - see HelloWorld
##
## inside a multiline comment: keep collecting until the closing */
  if ($comment_started)
  {
    $multiline_comment .= $_;
## don't mess with $_ as later comparisons may need the newline in place
    chomp $multiline_comment;
    $multiline_comment .= " ";
##  end of multiline comment
    if (m{\*/})
    {
      $linehash{$comment_started} = "$multiline_comment\n";
      $comment_started = 0;
      $multiline_comment = "";
    }
  }
## grab single line comments first else look for multiline
  elsif (m{//} || m{/\*.*\*/})
  {
    $linehash{$.} = $_;
  }
## start multiline comment; $. is the current (1-based) input line number
  elsif (m{/\*})
  {
    $comment_started = $.;
    chomp($multiline_comment = $_);
    $multiline_comment .= " ";
  }
}
close $in;
for (sort { $a <=> $b } keys %linehash)
{
  print "key=$_   value=$linehash{$_}";
}
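
One of the cute exceptions is a comment marker that sits inside a string literal, as in the `"//Hello World!/*"` lines of the sample file. What follows is only a minimal sketch of one way to approach it, not a complete Java lexer. The helper name `mask_literals` is made up for this illustration, and it assumes the line it is given does not start inside a `/* ... */` comment.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): returns a copy of one
# source line in which the contents of string and character literals are
# replaced by underscores. Comments are matched first, so quotes that appear
# inside a comment are left alone.
# Assumption: the line does not begin inside a /* ... */ comment.
sub mask_literals {
    my ($line) = @_;
    $line =~ s{
        ( //.* | /\*.*?\*/ | /\*.* )        # group 1: a comment, kept as is
      | ( " (?: [^"\\\n] | \\. )* "         # group 2: a string literal
        | ' (?: [^'\\\n] | \\. )* ' )       #          or a character literal
    }{
        defined $1
            ? $1
            : substr($2, 0, 1) . ('_' x (length($2) - 2)) . substr($2, -1)
    }gex;
    return $line;
}

# Example: the markers inside the string disappear, the real comment stays.
print mask_literals('System.out.println("Hello World!/*"); // real comment'), "\n";
```

In the loop above, the comment tests could then be run against `mask_literals($_)` while the untouched `$_` is still what gets stored in the hash.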

Enjoy
John

Re: Re: Regex Strikes again! by nofernandes (Beadle) on Jul 16, 2003 at 14:53 UTC
Your code is very good, thank you. If I had had it a few weeks earlier it might have been a very good option for my program. But the other regex works just fine and catches all the cases! It is also easy to change so that it serves other languages such as PLSQL, ProC, etc. The only problem I have is how to get the line numbers. I've written some code that compares two files and it works "almost" fine, though on larger files it might be a bit slow. I say it almost works because it "flips over" when it finds lines like this: /*********************/ And I don't have any idea why! Maybe you can help me figure out what the problem is.

foreach my $line (@fich){
    $i++;
    $flag=0;
    foreach $comm (@com){
        if( (($line eq $comm) || (index($line, $comm) > -1)) && ($flag==0)){
            print"Linha $i: $comm";
            $flag=1;
            print "Flag1: $flag\n";
        }
    }
}

The two arrays hold two files: @com contains the file with the extracted comments and @fich contains the content of the source code.

Thank you very much!
Nuno
Re: Re: Re: Regex Strikes again! by johndageek (Hermit) on Jul 16, 2003 at 19:42 UTC
This will give you the correct line numbers. Counting the newlines in the matched set isn't what you want, unless I misunderstood your code. I slurped the file into the variable $f and changed the references to $slurpedfile to point at the original input file.

hth
John

#!/usr/bin/perl
$file="theinputfile";
undef $/; #In order to read the whole file at once
open(F,"$file") or die "could not open $file: $!\n";
$f = <F>;
my @matches = $f =~ m{
    ( /\* .*? \*\/ )
  | ( \/\/[^\n]* )
  | " (?: [^"\\] | \\. )* "
  | ' (?: [^'\\]* | \\. )* '
  | . [^/"']*
}xgs;
@matches = grep {defined $_} @matches; #get rid of undefs
my $linenum = 1;
foreach my $match (@matches) {
    # $slurpedfile =~ /\Q$match/;
    $f =~ /\Q$match/;
    my $before = $`;
    # $slurpedfile = $';
    $f = $';
    my $matched = $&;
    $linenum += $before =~ tr/\n/\n/;
    print "Line $linenum\t$match\n";
    $linenum += $match =~ tr/\n/\n/;
}
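
As a possible variation on the idea above, the line number can also be taken from the offset of each match, which avoids searching the file a second time with `\Q$match` and copying the remainder through `$'`. This is only a sketch under assumptions: the function name `comments_with_lines` is invented for the example, and the pattern is a simplified comment/string matcher rather than a full Java lexer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [line_number, comment_text] pairs
# for every /* ... */ and // comment in $src. String and character literals
# are matched too, but only so that comment markers inside them are skipped.
sub comments_with_lines {
    my ($src) = @_;
    my @found;
    my $line = 1;    # line number at offset $last
    my $last = 0;    # offset up to which newlines have been counted
    while ($src =~ m{
            ( /\* .*? \*/ | // [^\n]* )     # group 1: a comment
          | " (?: [^"\\] | \\. )* "         # string literal, skipped
          | ' (?: [^'\\] | \\. )* '         # character literal, skipped
          | . [^/"']*                       # anything else, skipped
        }xgs) {
        next unless defined $1;
        my $start = $-[1];                  # offset where group 1 began
        my $gap   = substr($src, $last, $start - $last);
        $line += ($gap =~ tr/\n//);         # count newlines since last comment
        $last  = $start;
        push @found, [$line, $1];
    }
    return @found;
}

# Usage: report the starting line of each comment in a small sample.
my $sample = "/* a\nb */\nString s = \"// not a comment\";\nint x; // yes\n";
for my $c (comments_with_lines($sample)) {
    my ($n, $text) = @$c;
    $text =~ s/\n/ /g;                      # flatten for one-line display
    print "Line $n\t$text\n";
}
```

Because the newlines are counted only in the gap between one comment's start and the next, each character of the file is looked at once for counting, which may help with the slowness on larger files mentioned earlier in the thread.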