Komodo... watch out!

Maybe ActiveState will get a hold of me...

I've just crafted some... err, crafty... Perl code, using Perl 5.6's regex abilities. It highlights the portion of a string that is matched by a selected fragment of a regex. This has two practical applications:

helps in regex debugging (which I'm all for making easier)
does what /(pat)+/ DOESN'T

Here's the code (if you're too lazy to click the above link):

$text = "1981 born; 1993 7th grade; 1995 9th grade; 1999 RPI; 2001 TPC
+";
@section = ('\d{4}', '([^;]+)');

$text =~ m{
  (?{ @s = map [], 1 .. @section })

  (?:
    (?{ local $s[0] = [ @{ $s[0] }, [ length "$`$&" ] ] })
     \d{4}
    (?{ $s[0][-1][1] = length "$`$&" })

    \s+

    (?{ local $s[1] = [ @{ $s[1] }, [ length "$`$&" ] ] })
     [^;]+
    (?{ $s[1][-1][1] = length "$`$&" })

    (?: ;\s+ | $ )
  )+

  (?{ @watch = @s })
}x;


for (0 .. $#watch) {
  print "$section[$_] matched in the following places:\n";
  while (my $find = shift @{ $watch[$_] }) {
    my ($s, $e) = @$find;
    print
      substr($text, 0, $s),
      "<\e[1m", substr($text, $s, $e-$s), "\e[m>",
      substr($text, $e),
      "\n";
  }
}
[download]

What's it do? Well, right now, it's all manually done, but I'll be subclassing YAPE::Regex to make it automatic. But this is what it does: it keeps track, with an array, of the position in the string BEFORE and AFTER a specific chunk of the regex. It has all those local()s to make sure that on a failed attempt, whatever was just done is gotten rid of. Anyway, it builds up a list of array references, holding start/end position pairs. Then, it goes through and highlights (puts < and > around the selection, and tries ANSI reversal) the sections matched by the selected regex chunks.

This was fun to write. It'll be more fun to automate.

Now, what was that about /(pat)+/? Well, as I've said before, doing "abc" =~ /(.)+/ puts "c" in $1. How can we get the repeated sense we were looking for? I'm so glad you asked... ;)

# extracts the attributes from an HTML tag
# and displays them, separately, with one regex

$text = q{<img src="foo.jpg" ismap border=0>};
@section = ('attributes');
   
$text =~ m{
  (?{ @s = map [], 1 .. @section })

  < \w+
  
  (?:
    \s+
    
    (?{ local $s[0] = [ @{ $s[0] }, [ length "$`$&" ] ] })
     \w+
    (?{ $s[0][-1][1] = length "$`$&" })
   
    (?:
      \s* = \s*
      (?:
        "
        (?{ local $s[1] = [ @{ $s[1] }, [ length "$`$&" ] ] })
         [^"]*
        (?{ $s[1][-1][1] = length "$`$&" })
        "
        |
        '
        (?{ local $s[1] = [ @{ $s[1] }, [ length "$`$&" ] ] })
         [^']*
        (?{ $s[1][-1][1] = length "$`$&" })
        '
        |
        (?{ local $s[1] = [ @{ $s[1] }, [ length "$`$&" ] ] })
         [^\s>]+
        (?{ $s[1][-1][1] = length "$`$&" })
      )
      |
      (?{ local $s[1] = [ @{ $s[1] }, [ -1 ] ] })
    )
  )*

  \s* >
  
  (?{ @watch = @s })
}x; 
    
    
print "The following attributes were found:\n";
for (@watch) {
  my $i = 0;
  while (my $find = shift @{ $watch[$i++] }) {
    my ($s, $e) = @$find;
    print("\n"), next if $s == -1;
    print "=" if $i == 2;
    print substr($text, $s, $e-$s);
    print "\n" if $i == 2;
  }
}

__END__
output:

The following attributes were found:
src=foo.jpg
ismap
border=0
[download]

(Note to self: damn, I'm good.)

japhy -- Perl and Regex Hacker

Comment on Komodo... watch out! Select or Download Code

Replies are listed 'Best First'.
Re: Komodo... watch out! by tachyon (Chancellor) on Jul 05, 2001 at 01:41 UTC
`"abc" =~ /(.+)/; print "$1\n"; "aaabcabccc" =~ /((?:abc)+)/; print "$1\n";` [download] tachyon s&&rsenoyhcatreve&&&s&n\w+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print	[reply] [d/l]
Re: Re: Komodo... watch out! by japhy (Canon) on Jul 05, 2001 at 01:54 UTC
Which you then have to parse or split up a second time. `japhy` -- Perl and Regex Hacker	[reply]
Re: Re: Re: Komodo... watch out! by tachyon (Chancellor) on Jul 05, 2001 at 02:07 UTC
Surely you can't have it both ways? Either you want to match one lot of "pat" or several lots of "pat". You match what you want and do what you want with it, no further parsing required. You use a while (/(pat)/g) construct to get all the occurences of "pat" into $1 in sequence if that is what you want. cheers tachyon s&&rsenoyhcatreve&&&s&n\w+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print	[reply]
Re: Re: Re: Re: Komodo... watch out! by japhy (Canon) on Jul 05, 2001 at 04:44 UTC