I plan on having a draft of my regex article ready for review by the end of June. Hopefully, by early July, Regexp::Parser will be on CPAN. Once that's ready to use, I'm going to make a couple of sub-modules (like Regexp::Explain), and then I'm going to work on subclassing it to match Perl 6 regexes.

What follows is rescinded by me; I won't delete the text, but it's here in a small red font to let you know it's (already) out-dated.

That being said, I'm also going to release (if I can figure out how to do it safely) re::capture, which will introduce a new assertion: (?N=pat). It will allow you to specify which capture group you're assigning to. Here's an example of its use:

# parses text like:
#   name = japhy   age = "22"   lang = 'Perl'
# into a hash... but it retains those pesky quotes :/
my %data = $text =~ m{
  ([^=\s]+) \s* = \s* ( ' [^']* ' | " [^"]* " | \S+ )
}xg;
That's pesky because then you have to post-process the values to strip the quotes. re::capture (isn't that a witty name?) will allow you to say:
# parses text like:
#   name = japhy   age = "22"   lang = 'Perl'
# into a hash... but doesn't capture the quotes!
my %data = $text =~ m{
  ([^=\s]+) \s* = \s*
  (?: ' (?2= [^']* ) ' | " (?2= [^"]* ) " | ( \S+ ) )
}xg;
This case might be resolved in other ways, but it's a good demonstration of what the module does. The other thing I think I'll make it implement is captures that exist only in the regex, and are ignored (that is, not returned) afterwards. That means you can write:
# parses text like:
#   name = japhy   age = "22"   lang = 'Perl'
# into a hash... but doesn't capture the quotes!
my %data = $text =~ m{
  ([^=\s]+) \s* = \s*
  (?: (?*3= ['"] ) (?2= .*? ) \3 | ( \S+ ) )
}xg;
and the regex will only return ($1, $2) each time it matches.
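
For comparison, here is a minimal sketch of one of those "other ways" that runs on a stock perl. It does not use re::capture at all: it assumes Perl 5.10 or later and swaps in the branch-reset group (?|...), which restarts capture numbering in each alternative so that every branch fills $2. The sample $text is made up for illustration.

```perl
use strict;
use warnings;
use 5.010;    # branch reset (?|...) needs Perl 5.10 or later

# Sample input, invented for this sketch.
my $text = q{name = japhy age = "22" lang = 'Perl'};

# Each alternative inside (?|...) numbers its captures from the same
# starting point, so all three branches store the value in $2 and the
# quotes are never captured.
my %data = $text =~ m{
    ([^=\s]+) \s* = \s*
    (?| ' ([^']*) '
      | " ([^"]*) "
      | (\S+)
    )
}xg;

print "$_ => $data{$_}\n" for sort keys %data;
```

This only covers the "assign to group 2 from several places" half of the idea; it has no equivalent of a capture that is used inside the regex and then dropped from the returned list.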

This is not going to be a filter, but rather will work like re, and redefine the functions Perl uses to do its compiling and matching. It won't change much, but it will add support for this new assertion.

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Regex Report
by demerphq (Chancellor) on Jun 27, 2004 at 20:30 UTC

    Just something to think about, but the dotNet regex library supports named captures:

    (?<NAME>PATTERN)

    While you are hacking this funky stuff maybe such a thing would also be cool. Maybe you could use %+ to hold the captures? So

    if ('demerphq'=~/(?<perlmonk>\w+)/) { print $+{perlmonk} }

    would work. I mean its a bit embarrassing that dotNet has a cool regex feature that perl doesn't. (IMO anyway. :-)

    Thanks for your efforts japhy.

    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi


Re: Regex Report
by japhy (Canon) on Jun 28, 2004 at 03:59 UTC
    Ok. Regexp::Parser and the Perl 6 regex parser will be my primary concerns (apart from the article). That whole re-ordered captures and temporary-captures thing will have to wait.

    In the meantime, if you want named captures, I suggest you look at Steve Grazzini's Regexp::Fields, which does what you want. I'm glad I didn't try implementing it -- it doesn't look like a cake-walk.

Re: Regex Report
by Enlil (Parson) on Jun 28, 2004 at 02:23 UTC
    Sounds interesting, and though I cannot ATM think of any situation where I might want to use it, I am almost certain that I have in the past wished for something similar (if not the same). I am not sure whether you meant the last example to read:
    # parses text like:
    #   name = japhy   age = "22"   lang = 'Perl'
    # into a hash... but doesn't capture the quotes!
    my %data = $text =~ m{
      ([^=\s]+) \s* = \s*
      (?: (?3= ['"] ) (.*? ) \3 | ( \S+ ) )
    }xg;
    so that what was previously in the (?2= .* ) would be returned.

    This case might be resolved in other ways,
    Sorry, couldn't help presenting one:

    -enlil

    (as a side note I would have to agree with demerphq that named captures would be cool as well)
      I would make your example a little safer:
      my %data = $_ =~ m{
        ([^=\s]+) \s* = \s* ["']?
        (
          (?<= ' ) [^']* (?= ' ) |
          (?<= " ) [^"]* (?= " ) |
          (?<!['"]) \S+ (?! ['"] )
        )
        ["']?
      }xg;
      That, to me, seems safer, because it ensures a string that starts quoted ends quoted, and a string that doesn't start quoted doesn't end quoted (for some bizarre reason).
Re: Regex Report
by diotalevi (Canon) on Jun 28, 2004 at 13:32 UTC

    When you post this, please be sure not to leave important parts of the grammar in inaccessible lexicals. I tried to do something once with YAPE::Regexp but had to write Re: Stealing lexicals - best practice suggestions just to get access to the grammar in my %pat. That really sucked. The only change I really needed was to have %pat be a global so it'd be reusable.

    Purty please? Won't you think of the children?

      Heh, sorry. You'll be happy to know all the grammar is stored (gasp?) as methods of the object. This means you have method names like "(" and "[" and "|". If you think this is blasphemous, tough cookies. In fact, the only non-weird looking method name is "atom", which is the starting node for the grammar.

      Another thing. Right now, the grammar is determined on the fly. That is, each rule (upon successful match) tells the object what possible rules follow it. Perhaps I should implement that differently.
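
To make the "each rule says what may follow it" idea concrete, here is a toy sketch. It is not the Regexp::Parser code: it deliberately swaps the punctuation-named methods for a plain dispatch hash, and the rule names, the two-rule grammar and the tokenize helper are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical toy grammar. Each rule tries to match at the current
# position of the string and, on success, returns an array ref naming
# the rules that are allowed to come next.
my %rule = (
    'atom' => sub { my ($s) = @_; $$s =~ /\G\w/gc ? [ 'atom', '|' ] : undef },
    '|'    => sub { my ($s) = @_; $$s =~ /\G\|/gc ? [ 'atom' ]      : undef },
);

# Walk the string, trying only the rules the previous match allowed.
sub tokenize {
    my ($str) = @_;
    my @next = ('atom');    # "atom" is the starting rule
    my @seen;
    pos($str) = 0;
    while (pos($str) < length $str) {
        my ($name, $follow);
        for my $try (@next) {
            $follow = $rule{$try}->(\$str) or next;
            $name = $try;
            last;
        }
        die "no rule matches at offset " . pos($str) . "\n"
            unless defined $name;
        push @seen, $name;
        @next = @$follow;    # the grammar is decided on the fly
    }
    return @seen;
}

print join(' ', tokenize('a|b')), "\n";    # prints "atom | atom"
```

Because the follow set comes back from each successful match rather than from a static table, an input such as '|a' dies at offset 0: the start state only allows "atom".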

        So there's code that looks like $next = List::Util::first { eval { $self->$_ } } @tokens? Oof. Holey AUTOLOAD batman! Why not just make the token a parameter to some function instead of passing the value via the function name? Or is this so you can get overriding? When do we get to see this code and are you sure you couldn't have written this using a mundane method?