Regex Strikes again!

nofernandes has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regex Strikes again! by Abigail-II (Bishop) on Jul 16, 2003 at 09:48 UTC
You are slurping the file in one go, because you set `$/` to undef. That means `<F>` returns a single string: the content of the entire file. So if the file contains a C or C++ style comment, or a single or double quoted string, that single string is passed to `map`. The map makes a two-element list out of it, the first element being the line number of the last "line" of the file, in this case 1, as there is only one line. So you get a one-key hash, with key 1 and the entire content of the file as its value. The problem you are working on is a bit more complicated than what you can do in a simple regexp. Abigail
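To see the effect described above, here is a minimal sketch. It reads the same text once with `$/` set to a newline and once with `$/` undefined, then reports how many records came back and what `$.` holds afterwards. The sample text and the helper name `count_records` are made up for illustration, and the in-memory filehandle only stands in for the real file.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the Java source file.
my $text = "int a; // first\nint b; /* second */\nint c;\n";

# Read $source with the given record separator and return
# (number of records read, value of $. after the last read).
sub count_records {
    my ($source, $separator) = @_;
    local $/ = $separator;               # undef here means "slurp"
    open(my $fh, '<', \$source) or die "Can't open in-memory file: $!";
    my @records = <$fh>;
    my $last_line_number = $.;           # capture before close() resets it
    close($fh);
    return (scalar(@records), $last_line_number);
}

my ($n_lines, $dot_lines) = count_records($text, "\n");
my ($n_slurp, $dot_slurp) = count_records($text, undef);
print "line mode:  $n_lines records, \$. = $dot_lines\n";
print "slurp mode: $n_slurp records, \$. = $dot_slurp\n";
```

In slurp mode there is only ever one record, so any hash keyed on `$.` can only get the key 1.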
Re: Re: Regex Strikes again! by nofernandes (Beadle) on Jul 16, 2003 at 10:07 UTC
Hmm.. I see.. you're right! But can I put this condition in a while cycle? I guess not, because then the file would be read line by line and I could not catch multiline comments. I have to read the whole file at once, don't I? Thank you again
Re: Regex Strikes again! by Abigail-II (Bishop) on Jul 16, 2003 at 11:09 UTC
Put which condition in which while cycle? There's no while in your code fragment. Which part of "what you are trying to do is too complex for a simple regexp" do you fail to understand? I've said this before: what you need is a parser, not a regexp. For simple languages like Java, you can get away with a parser that's not a full-language parser. I told you this three weeks ago, but you are still on the regexp road, and have made remarkably little progress. Don't match - parse! Abigail
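One way to read "don't match - parse" is a small hand-written scanner. The sketch below is not Abigail's code, just an illustration of the idea: it walks the source token by token with `\G` and the `/gc` flags, so a comment marker inside a string literal is consumed as part of the string and never mistaken for a comment. The name `scan_comments` and the sample source are assumptions, and the scanner makes no attempt to report unterminated comments or strings.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the source one token at a time and return
# a list of [start_line, comment_text] pairs.
sub scan_comments {
    my ($src) = @_;
    my @found;
    my $line = 1;
    pos($src) = 0;
    while (pos($src) < length($src)) {
        my ($token, $is_comment);
        if    ($src =~ m{\G(/\*.*?\*/)}gcs)         { ($token, $is_comment) = ($1, 1) }
        elsif ($src =~ m{\G(//[^\n]*)}gc)           { ($token, $is_comment) = ($1, 1) }
        elsif ($src =~ m{\G("(?:[^"\\]|\\.)*")}gcs) { $token = $1 }   # string literal
        elsif ($src =~ m{\G('(?:[^'\\]|\\.)*')}gcs) { $token = $1 }   # char literal
        elsif ($src =~ m{\G([^/"']+)}gc)            { $token = $1 }   # ordinary code
        elsif ($src =~ m{\G(.)}gcs)                 { $token = $1 }   # stray / " or '
        push @found, [$line, $token] if $is_comment;
        $line += ($token =~ tr/\n//);   # advance past newlines inside the token
    }
    return @found;
}

# Made-up input: a comment marker inside a string, then two real comments.
my $java = join "\n",
    'String s = "/* not a comment */"; // real comment',
    '/* spans',
    '   two lines */ int x = 1;',
    '';
printf "Line %d: %s\n", @$_ for scan_comments($java);
```

Because the scanner sees every token in order, the line number comes for free: it is simply the count of newlines consumed so far.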
Re: Regex Strikes again! by ant9000 (Monk) on Jul 16, 2003 at 10:14 UTC
This snippet should give you the idea: `my $file="Finger.java"; my %comments=(); open(F,"<$file") or die "Can't open $file: $!"; while(defined($_=<F>)){ $comments{$.}=$_ if( m{...PUT THE MATCH YOU NEED HERE...} ); } close(F);` Reading the file line by line like I do has a drawback, though: you now have to handle multiline comments within /* ... */ yourself, appending each line to the correct string as needed.
Re: Regex Strikes again! by flounder99 (Friar) on Jul 16, 2003 at 12:46 UTC
This code might point you in the right direction.

    use strict;
    my $slurpedfile = q{Blah blah blah
    blah blah Blah // single line c++ comment
    more blahs "quoted string with /* " blah blah
    blah blah /* single line c comment */
    blah blah blah /*
    multi line
    c style comment */
    some more blahs
    // another single line c++ comment
    blah blah blah
    };
    my @matches = $slurpedfile =~ m{
          ( /\* .*? \*/ )
        | ( \/\/[^\n]* )
        | " (?: [^"\\]* | \\. )* "
        | ' (?: [^'\\]* | \\. )* '
        | . [^/"']*
    }xgs;
    @matches = grep {defined $_} @matches; #get rid of undefs
    my $linenum = 1;
    foreach my $match (@matches) {
        $slurpedfile =~ /\Q$match/;
        my $before = $`;
        $slurpedfile = $';
        my $matched = $&;
        $linenum += $before =~ tr/\n/\n/;
        print "$match\n is on line $linenum\n\n";
        $linenum += $match =~ tr/\n/\n/;
    }
    __OUTPUT__
    // single line c++ comment
     is on line 2

    /* single line c comment */
     is on line 4

    /*
    multi line
    c style comment */
     is on line 5

    // another single line c++ comment
     is on line 9

It will screw up if you have the same comment in a quoted string near the comment itself, like this: `comment = ' /* comment */ '; /* comment */` but maybe you can fiddle with `$slurpedfile =~ /\Q$match/;` to take care of that case. -- flounder
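One possible way around the quoted-string problem mentioned above is to avoid searching for the comment text a second time, and instead take each comment's position straight from the match. The sketch below is a variation on the approach, not flounder's code: it loops with `while (... =~ m{...}g)`, reads the start offset from `$-[0]`, and counts the newlines between the previous comment and this one. The name `comments_with_lines` and the sample text are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return [line_number, comment_text] pairs, locating
# each comment by its character offset rather than by a second search.
sub comments_with_lines {
    my ($text) = @_;
    my @found;
    my $line       = 1;   # line number at offset $counted_to
    my $counted_to = 0;   # offset up to which newlines have been counted
    while ($text =~ m{
              ( /\* .*? \*/ )            # group 1: C style comment
            | ( //[^\n]* )               # group 2: C++ style comment
            | " (?: [^"\\] | \\. )* "    # double quoted string, skipped
            | ' (?: [^'\\] | \\. )* '    # single quoted string, skipped
            | . [^/"']*                  # anything else, skipped
        }xgs) {
        my $comment = defined $1 ? $1 : $2;
        next unless defined $comment;
        my $start   = $-[0];             # offset where this match began
        my $skipped = substr($text, $counted_to, $start - $counted_to);
        $line      += ($skipped =~ tr/\n//);
        $counted_to = $start;
        push @found, [$line, $comment];
    }
    return @found;
}

# Made-up input containing the troublesome case: the same comment text
# appears first inside a quoted string and then as a real comment.
my $sample = "x = 1;\ncomment = ' /* comment */ ';\n/* comment */\n// tail\n";
print "Line $_->[0]\t$_->[1]\n" for comments_with_lines($sample);
```

Since nothing is looked up by content, a comment whose text also occurs inside a string can no longer be attributed to the wrong place.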
Re: Re: Regex Strikes again! by nofernandes (Beadle) on Jul 16, 2003 at 14:34 UTC
Thank you very much for your help! Your code works fine with the variable, but when I try to read a file, the line numbers are wrong. For example, if I take the content of your variable $slurpedfile and add just one line, such as //Hello, the content of the file test.txt is:

    Blah blah blah
    //Hello
    blah blah Blah // single line c++ comment
    more blahs "quoted string with /* " blah blah
    blah blah /* single line c comment */
    blah blah blah /*
    multi line
    c style comment */
    some more blahs
    // another single line c++ comment
    blah blah blah

And now the source code:

    use strict;
    undef $/;#In order to read the whole file at once
    open(F,"test.txt");
    my @matches = <F> =~ m{
          ( /\* .*? \*/ )
        | ( \/\/[^\n]* )
        | " (?: [^"\\]* | \\. )* "
        | ' (?: [^'\\]* | \\. )* '
        | . [^/"']*
    }xgs;
    @matches = grep {defined $_} @matches; #get rid of undefs
    my $linenum = 1;
    foreach my $match (@matches) {
        $slurpedfile =~ /\Q$match/;
        my $before = $`;
        $slurpedfile = $';
        my $matched = $&;
        $linenum += $before =~ tr/\n/\n/;
        print "Line $linenum\t$match\n";
        $linenum += $match =~ tr/\n/\n/;
    }

How can I get the line numbers right? What variable should I use instead of $slurpedfile, given that I want to read directly from a file? And another issue: is it possible for the output to look like this:

    Line 12 //Hello
    Line 23 // single line c++ comment
    Line 34 /* single line c comment */
    Line 45 /*
    Line 46 multi line
    Line 47 c style comment */
    Line 58 // another single line c++ comment

Instead of:

    Line 12 //Hello
    Line 23 // single line c++ comment
    Line 34 /* single line c comment */
    Line 45 /*
    multi line
    c style comment */
    Line 58 // another single line c++ comment

Thank you for your help. Nuno
Re^3: Regex Strikes again! by flounder99 (Friar) on Jul 16, 2003 at 15:31 UTC
Try this:

    use strict;
    undef $/;#In order to read the whole file at once
    open(F,"test.txt") or die $!;
    my $slurpedfile = <F>;
    close F;
    my @matches = $slurpedfile =~ m{
          ( /\* .*? \*/ )
        | ( \/\/[^\n]* )
        | " (?: [^"\\]* | \\. )* "
        | ' (?: [^'\\]* | \\. )* '
        | . [^/"']*
    }xgs;
    @matches = grep {defined $_} @matches; #get rid of undefs
    my $linenum = 1;
    foreach my $match (@matches) {
        $slurpedfile =~ /\Q$match/;
        my $before = $`;
        $slurpedfile = $';
        $linenum += $before =~ tr/\n/\n/;
        foreach (split "\n", $match) {
            print "Line $linenum\t$_\n";
            $linenum++;
        }
        $linenum--; # the foreach above adds one too many
    }
    __OUTPUT__
    Line 2	//Hello
    Line 3	// single line c++ comment
    Line 5	/* single line c comment */
    Line 6	/*
    Line 7	multi line
    Line 8	c style comment */
    Line 10	// another single line c++ comment

-- flounder
Re: Re^3: Regex Strikes again! by nofernandes (Beadle) on Jul 16, 2003 at 16:11 UTC
Re: Regex Strikes again! by johndageek (Hermit) on Jul 16, 2003 at 14:03 UTC
Assuming you are looking for all comments in some nasty (large) java programs: first, what are all the possible ways a comment can be malformed? And are they critical to your life goal? That said, here is a sample input file with a few interesting possibilities (yes, it compiles):

    /**
     * The HelloWorldApp Class implements an application that displays "Hello World!" to standard output
     */
    public class HelloWorldApp {
        public static void main (String[] args) {
            // Display "Hello World!"
            System.out.println("Hello World!"); // end of line comment
            System.out.println("Hello World!"); /* end of line comment */
            System.out.println("Hello World!*/"); /* end of line comment */
            /* this is a multi line
               comment */
            /* another comment */ // /* is this a valid comment /* // */
            System.out.println("//Hello World!*/"); // is /* this valid
        }
    }

Below is some rough code to start grabbing comments. Coding for the cute exceptions is left as an exercise for the student (I have always wanted to use that line, since it was used on me a long time ago) or as future discussion points. It prints the line number of the file where the comment begins and then the comment.

    #!/usr/bin/perl
    use strict;
    my $infile = "HelloWorldApp.java";
    my %linehash;
    my $linecounter = 1; # count from 1, so 0 can mean "not inside a comment"
    my $comment_started = 0;
    my $multiline_comment;
    open in, "$infile" or die "could not open $infile\n";
    while (<in>) {
        ## you may need to get creative in matching comments
        ## because java allows some fun combinations - see HelloWorld
        ##
        ## grab single line comments first else look for multiline
        if (/\/\/.*\n/ || /\/\*.*\*\//) {
            $linehash{$linecounter} = $_;
        } else {
            ## possible multiline comment start
            if ($comment_started) {
                $multiline_comment = $multiline_comment . $_;
                ## don't mess with $_ as later comparisons may need the newline in place
                chomp $multiline_comment;
                $multiline_comment = $multiline_comment . " ";
                ## end of multiline comment
                if (/\*\//) {
                    $linehash{$comment_started} = $multiline_comment;
                    $comment_started = 0;
                    $multiline_comment = "";
                }
            }
            ## start multiline comment
            if (/\/\*/) {
                $comment_started=$linecounter;
            }
        }
        $linecounter++;
    }
    my @keys = sort{$a <=> $b}(keys(%linehash));
    for (@keys) {
        print "key=$_ value=$linehash{$_}";
    }

Enjoy John
Re: Re: Regex Strikes again! by nofernandes (Beadle) on Jul 16, 2003 at 14:53 UTC
Your code is very good, thank you. If I had had it a few weeks earlier it might have been a very good option for my program. But the other regex works just fine and catches all the cases, and it is easy to change so that it serves other languages such as PLSQL, ProC, etc. The only problem I have is how to get the line numbers. I've written some code that compares two files, and it works "almost" fine, but on larger files it might be a bit slow. I say "almost" because it trips over lines like this: /*********************/ and I have no idea why. Maybe you can help me figure out what the problem is.

    foreach my $line (@fich){
        $i++;
        $flag=0;
        foreach $comm (@com){
            if( (($line eq $comm) || (index($line, $comm) > -1)) && ($flag==0)){
                print"Linha $i: $comm";
                $flag=1;
                print "Flag1: $flag\n";
            }
        }
    }

The two arrays hold the two files: @com contains the file with the extracted comments and @fich contains the source code. Thank you very much! Nuno
Re: Re: Re: Regex Strikes again! by johndageek (Hermit) on Jul 16, 2003 at 19:42 UTC
This will give you the correct line number count. Counting the newlines in the matched set isn't what you want, unless I misunderstood your code. I slurped the file into the variable $f and changed the references to $slurpedfile to use it. hth John

    #!/usr/bin/perl
    $file="theinputfile";
    undef $/; #In order to read the whole file at once
    open(F,"$file");
    $f = <F>;
    my @matches = $f =~ m{
          ( /\* .*? \*/ )
        | ( \/\/[^\n]* )
        | " (?: [^"\\]* | \\. )* "
        | ' (?: [^'\\]* | \\. )* '
        | . [^/"']*
    }xgs;
    @matches = grep {defined $_} @matches; #get rid of undefs
    my $linenum = 1;
    foreach my $match (@matches) {
        # $slurpedfile =~ /\Q$match/;
        $f =~ /\Q$match/;
        my $before = $`;
        # $slurpedfile = $';
        $f = $';
        my $matched = $&;
        $linenum += $before =~ tr/\n/\n/;
        print "Line $linenum\t$match\n";
        $linenum += $match =~ tr/\n/\n/;
    }
Re: Regex Strikes again! by jmanning2k (Pilgrim) on Jul 16, 2003 at 18:18 UTC
I had a working solution to your problem yesterday, in this node. It got hidden by the depth threshold, so I've included it below. As I said earlier, you can't use 'undef $/' and expect to get line numbers, since with undef $/ everything is read as a single line. There are certainly more elegant solutions than mine, and many are probably already posted in this thread, but this is a pattern I've used repeatedly in my own code. It is a working solution, though incomplete: I'm not sure what the last few lines of your regex do. I'm guessing they also find all quoted strings in the text. You can modify this to work for that case too.

    my $match;
    my $line;
    while(<F>) {
        if(m{/\*} .. m{\*/}) {
            ## single line
            if(m{(/\*.*?\*/)}) {
                $match = $1;
                $line = $.;
                $hash{$line} = $match;
                #print "Line $line: Got match '$match'\n";
            } else {
                ## multi-line
                if( m{(/\*.*)} ) {
                    ## Initial line '/*'
                    $match = $1 . "\n";
                    $line = $.; # record this line number
                } elsif( m{(.*\*/)} ) {
                    ## Final line '*/'
                    $match .= $1;
                    # $line still holds the line where the comment started
                    $hash{$line} = $match;
                    #print "Line $line: Got match '$match'\n";
                    $match = undef;
                    $line = undef;
                } else {
                    # We are between lines, and have no /* or */
                    $match .= $_;
                }
            }
        } elsif ( m{(//.*)\Z} ) {
            $match = $1;
            $line = $.;
            $hash{$line} = $match;
            #print "Line $line: Got match '$match'\n";
        }
    }

So, this stores the starting line number and comment string in a hash. Hopefully this gives you an idea of how to process several lines. It's certainly not as nice as the quick and simple grep solution, but grep aggregates all the results together, so you can't get line numbers out of the results. I did see this in the output of perldoc -f grep: 'grep returns aliases into the original list', so perhaps you can somehow map the results back onto the original array, but I have no idea how. My ideal solution would be some variant of the grep statement you have, perhaps with a map statement instead of grep, returning some nice (linenumber, string) pairs. I can't seem to find anything that works, though. Hope this helps, ~J
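One way to get (linenumber, string) pairs out of a grep/map pipeline is to run the grep over the array indices rather than over the lines themselves, so the position is never lost. The sketch below is only an illustration of that idea: the helper name `numbered_matches` and the sample lines are assumptions, and because it looks at one line at a time it knows nothing about multi-line comments or about comment markers inside quoted strings.

```perl
use strict;
use warnings;

# Hypothetical helper: pair each line that matches $pattern with its
# 1-based line number, returning a list of [number, line] pairs.
sub numbered_matches {
    my ($pattern, @lines) = @_;
    return map  { [ $_ + 1, $lines[$_] ] }    # index -> [line number, text]
           grep { $lines[$_] =~ $pattern }    # keep indices whose line matches
           0 .. $#lines;
}

# Made-up input standing in for the lines of a source file.
my @lines = ("int a;\n", "// first\n", "int b; // second\n");
for my $pair (numbered_matches(qr{//}, @lines)) {
    my ($number, $text) = @$pair;
    print "Line $number: $text";
}
```

The same shape works with any per-line pattern; only the multi-line and quoted-string cases need the heavier approaches discussed elsewhere in the thread.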
Re: Re: Regex Strikes again! by nofernandes (Beadle) on Jul 17, 2003 at 08:50 UTC
Thank you very much once again! You have all been very helpful, and your advice helps me a lot! Thanks, Nuno