Re^2: Multiline string and one line comments

I realized that this is a bit too hard with a regex, because I need to know which of those three characters (', " or #) appears first, and recheck that each time I find a string or comment.

I went back to plain scripting, and it is actually much easier to just use the index and substr functions. Here is my code:

#!/usr/bin/env perl
use strict;
use warnings;

my $src = do {local $/; <DATA>};

my @strings = ();
my @comments = ();
my $off_set = 0;
my $end_index = 0;

while (my ($char, $start_index) = &next_char($off_set)) {
  last if ($char eq "" && $start_index == -1);

  if ($char eq '#') {
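    $end_index = index $src, "\n", $start_index + 1;
    # A comment on the last line may have no trailing newline.
    $end_index = length($src) - 1 if $end_index == -1;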
    push @comments, substr($src, $start_index, $end_index-$start_index+1);
    $off_set = $end_index + 1;
  } elsif (($char eq '"') || ($char eq "'")) {
    &capture_string($char, $start_index, $end_index);
  }
}

sub capture_string($ $ $) {
  my $quote = shift;
  my $start_index = shift;
  my $end_index = shift;

  $end_index = index ($src, $quote, $start_index+1);
  my $char_before = substr $src, $end_index-1, 1;

  while ($end_index > 0 && $char_before eq '\\') {
    $end_index = index $src, $quote, $end_index + 1;
    $char_before = substr $src, $end_index-1, 1;
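  }
  # An unterminated string runs to the end of the input.
  $end_index = length($src) - 1 if $end_index == -1;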

  push @strings, substr($src, $start_index, $end_index-$start_index+1);
  $off_set = $end_index + 1;
}

print "[Strings]\n";
foreach my $item (@strings) {
  print "$item\n";
}

print "[Comments]\n";
foreach my $item (@comments) {
  print "$item";
}

sub next_char {
  my %has;
  my $position = shift;

  my $s_index = index $src, "'", $position;
  my $d_index = index $src, '"', $position;
  my $c_index = index $src, '#', $position;

  return ("", -1) if ($s_index == -1 &&
                      $d_index == -1 &&
                      $c_index == -1);

  $has{$s_index} = "'" if ($s_index >= 0);
  $has{$d_index} = '"' if ($d_index >= 0);
  $has{$c_index} = '#' if ($c_index >= 0);

  my @sorted_keys = sort { $a <=> $b} keys %has;
  # print "Next char is $has{$sorted_keys[0]}, and position is $sorted_keys[0]\n";
  return ($has{$sorted_keys[0]}, $sorted_keys[0]);
}


__DATA__
# this is a comment, should be matched.
# # "I am not a string" . 'because I am inside a comment'
my $string = " #I am not a comment, because I am quoted";
my $another_string = "I am a multiline string with # on
                      each line #, have fun!";
my $descap_string = "I am a \ escaped \" \"string"; # and some comments;
my $sescap_string = 'I am a \ escaped \' \'string'; # and some comments;
my $empty_d ="";
my $empty_s ='';

And here is the result I wanted:

[Strings]
" #I am not a comment, because I am quoted"
"I am a multiline string with # on
                      each line #, have fun!"
"I am a \ escaped \" \"string"
'I am a \ escaped \' \'string'
""
''
[Comments]
# this is a comment, should be matched.
# # "I am not a string" . 'because I am inside a comment'
# and some comments;
# and some comments;
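
One edge case the script above does not cover: it treats a quote as escaped whenever the character just before it is a backslash, so a string that ends in an escaped backslash (for example "C:\\") would not be closed at the right place. A minimal sketch of a stricter check, which counts the run of backslashes before the quote, could look like this. The helper names is_escaped and find_closing_quote are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: a quote at $pos is escaped only if it is
# preceded by an odd number of consecutive backslashes.
sub is_escaped {
    my ($text, $pos) = @_;
    my $count = 0;
    $count++ while $pos - $count - 1 >= 0
                && substr($text, $pos - $count - 1, 1) eq '\\';
    return $count % 2;
}

# Hypothetical helper: find the closing quote for the string that
# opens at $start. Returns -1 if the string is never closed.
sub find_closing_quote {
    my ($text, $quote, $start) = @_;
    my $end = index $text, $quote, $start + 1;
    $end = index $text, $quote, $end + 1
        while $end >= 0 && is_escaped($text, $end);
    return $end;
}

# q{} keeps \" as-is and turns \\\\ into two backslashes, so the
# sample contains the two literals "C:\\" and "a \" b".
my $code = q{my $dir = "C:\\\\"; my $s = "a \" b";};

my $pos = 0;
while ((my $start = index $code, '"', $pos) >= 0) {
    my $end = find_closing_quote($code, '"', $start);
    last if $end < 0;
    print substr($code, $start, $end - $start + 1), "\n";
    $pos = $end + 1;
}
# prints "C:\\" and then "a \" b", one per line
```

The same check could replace the `$char_before eq '\\'` test in capture_string; that substitution is a suggestion, not something tested against the full script.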


Replies are listed 'Best First'.

Re^3: Multiline string and one line comments
by AnomalousMonk (Archbishop) on Apr 18, 2014 at 04:03 UTC

Here's a regex-based approach. (I agree, however, that a parsing approach may be more appropriate.) It doesn't handle single-quoted strings, but should be easily extensible to cover such. I'm not sure it gives you exactly what you want, but I think it comes close. The critical (IMHO) regexes require Perl version 5.10+.

use warnings;
use strict;


use Test::More
  # tests => ?? + 1  # Test::NoWarnings adds 1 test
    'no_plan'
    ;
use Test::NoWarnings;


use constant TEST1 => <<'EOT';
# this is a comment, should be matched.
# "I am not a string" . 'because I am inside a comment'
my $string = " #I am not a \comment, because I am \" quoted";
my $another_string = "I am a multiline string with # on
                       each \t line #, have fun!";
EOT
# print qq{[[${ \TEST1 }]] \n\n};  # FOR DEBUG


use constant C1 => '# this is a comment, should be matched.';
use constant C2 => q{# "I am not a string" . 'because I am inside a comment'};
use constant S1 => q{" #I am not a \comment, because I am \" quoted"};
use constant S2 => q{"I am a multiline string with # on
                       each \t line #, have fun!"};


# these regexes compatible with 5.8 (and prior? 5.0?)
my $comment = qr{ [#] [^\n]* $ }xms;
my $string  = qr{ " [^"\\]* (?: \\. [^"\\]*)* " }xms;
my $comment_or_string = qr{ $comment | $string }xms;


# these regexes require 5.10+
my $comment_only = qr{ $comment | $string  (*SKIP) (*FAIL) }xms;
my $string_only  = qr{ $string  | $comment (*SKIP) (*FAIL) }xms;


VECTOR:
for my $ar_vector (
    [ TEST1, $comment_or_string, C1, C2, S1, S2, ],
    [ TEST1, $comment_only,      C1, C2,         ],
    [ TEST1, $string_only,       S1, S2,         ],
    ) {

    if (not ref $ar_vector) {  # must be a note...
        note $ar_vector;
        next VECTOR;
        }

    my ($text, $rx, @expected) = @$ar_vector;

    is_deeply [ $text =~ m{ $rx }xmsg ], \@expected,
      # qq{},
        ;

    }  # end for VECTOR
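
The extension to single-quoted strings mentioned at the top of this reply might be sketched as follows. This is an assumption about how it could be done, not code from the original post: $dq_string, $sq_string and $any_string are names invented here, and the single-quoted pattern simply mirrors the double-quoted one. As before, the (*SKIP) (*FAIL) variants need Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;  # (*SKIP) and (*FAIL) need Perl 5.10 or later

my $comment   = qr{ [#] [^\n]* $ }xms;
my $dq_string = qr{ " [^"\\]* (?: \\. [^"\\]* )* " }xms;
# Assumed extension: same shape as the double-quoted pattern.
my $sq_string  = qr{ ' [^'\\]* (?: \\. [^'\\]* )* ' }xms;
my $any_string = qr{ $dq_string | $sq_string }xms;

# Match the unwanted token first, then discard it and resume
# the search after it.
my $comment_only = qr{ $comment | $any_string (*SKIP) (*FAIL) }xms;
my $string_only  = qr{ $any_string | $comment (*SKIP) (*FAIL) }xms;

my $text = <<'EOT';
# a comment with 'quotes' and "quotes"
my $s = 'single # not a comment';  # trailing comment
my $d = "double \" # still a string";
EOT

my @comments = $text =~ m{ $comment_only }xmsg;
my @strings  = $text =~ m{ $string_only }xmsg;

print "comment: $_\n" for @comments;
print "string:  $_\n" for @strings;
```

On this sample the sketch should report two comments (the first line and "# trailing comment") and two strings. A stray apostrophe outside a comment or string would still open a single-quoted string, so this remains a heuristic and not a parser.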