stevieb has asked for the wisdom of the Perl Monks concerning the following question:

Now that I'm back home, we're getting a bit back to normalcy. Going to stay in one of my cabins tonight (~500 metres from my main house) and do some fishing (rainbow trout today).

While I'm gone, I'm hoping the Monks can chew on a couple of regexes I've got. They are part of my Devel::Examine::Subs distribution, and their purpose is to identify subroutine definitions/declarations within a Perl file. Thanks to PPI, I already know where these things are, but I'm throwing it out as a challenge: can these regexen be improved on? I've tried to imagine every scenario where a sub definition could sit on a single line or span multiple lines.

my $single_line = qr/
    sub\s+\w+\s*(?:\(.*?\)\s+)?\{\s*(?!\s*[\S])
    |
    sub\s+\{\s*(?!\s*[\S])
/x;

my $multi_line = qr/sub\s+\w+\s*(?![\S])/;

I'd love for regex experts to show me examples of sub def lines that the regexes won't catch, and fixes to them. I'm especially interested in help with fixes to catch signatures and prototypes ;)
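A quick way to probe the two patterns is to run them over a handful of declaration lines and see which ones slip through. The sketch below is only an illustration: the sample lines are made up for this purpose and are not taken from Devel::Examine::Subs or its test suite.

```perl
use strict;
use warnings;

# The two patterns from the post, unchanged.
my $single_line = qr/
    sub\s+\w+\s*(?:\(.*?\)\s+)?\{\s*(?!\s*[\S])
    |
    sub\s+\{\s*(?!\s*[\S])
/x;
my $multi_line = qr/sub\s+\w+\s*(?![\S])/;

# Hypothetical sample lines, chosen to poke at likely weak spots.
my @lines = (
    'sub foo {',                  # plain declaration
    'sub foo($x){',               # signature with no space before the brace
    'sub Foo::bar {',             # package-qualified name
    'sub foo :lvalue {',          # attribute between name and block
    '# this sub does nothing',    # the word "sub" inside a comment
);

my %hit;
for my $line (@lines) {
    $hit{$line} = {
        single => ( $line =~ $single_line ? 1 : 0 ),
        multi  => ( $line =~ $multi_line  ? 1 : 0 ),
    };
    printf "%-28s single=%d multi=%d\n",
        $line, @{ $hit{$line} }{qw(single multi)};
}
```

With these inputs, neither pattern matches the signature without a space before the brace or the package-qualified name, `$single_line` misses the attribute form, and `$multi_line` matches the comment line because it only requires `sub`, a word, and then whitespace or end of string.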

Replies are listed 'Best First'.
Re: Regex critique
by LanX (Saint) on Sep 07, 2018 at 21:25 UTC
    Hi stevieb

    It's said that every programmer has to invent a template system or a vocabulary trainer.

    And it seems every Perl hacker has to try to parse Perl. ;-P

    ... well ... how can I help you?

    Do you want ...

    • ... links to other projects trying to do the job with regexes in order to have a look into the code? *
    • ... suggestions on how to make your regex more maintainable by nesting regexes for different sub-grammars?
    • ... a link where Merlyn shows that Perl can't be statically parsed because any imported sub with prototype can change how the parser proceeds?
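The idea of nesting regexes for sub-grammars could look roughly like the sketch below: small `qr//` pieces for an identifier, a qualified name, a parenthesised list, and an attribute, combined into one declaration-head pattern. All variable names here are hypothetical and nothing is taken from Devel::Examine::Subs; the parenthesised part is deliberately flat (no nesting), and the pattern makes no attempt to skip strings or POD.

```perl
use strict;
use warnings;
use 5.010;    # named captures and the // operator

# Hypothetical building blocks, one per sub-grammar.
my $ident  = qr/[A-Za-z_]\w*/;
my $name   = qr/${ident}(?:::${ident})*/;      # foo, Foo::bar
my $parens = qr/\([^()]*\)/;                   # prototype or flat signature
my $attr   = qr/:\s*${ident}(?:${parens})?/;   # :lvalue, :attr(args)

# The head of a sub declaration, assembled from the pieces above.
my $sub_head = qr/
    \b sub \b
    (?: \s+ (?<name> $name ) )?       # name is absent for anonymous subs
    (?: \s* $attr )*                  # attributes may precede the list
    (?: \s* (?<params> $parens ) )?   # prototype or signature
    (?: \s* $attr )*                  # or follow it
    \s* \{
/x;

for my $line ( 'sub Foo::bar {', 'sub foo($x){', 'sub foo :lvalue {', 'my $cb = sub {' ) {
    next unless $line =~ $sub_head;
    printf "%-20s name=%s params=%s\n",
        $line, $+{name} // '(anonymous)', $+{params} // '(none)';
}
```

Each piece can be tested and replaced on its own, for example by swapping `$parens` for a recursive pattern once nested parentheses need to be supported.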

    Just a simple example

    • you need to exclude subs inside POD or strings from your listing.
    • strings are not easily detected because we have q and qq operators.
    • ...
      q! sub :attr() { print "bla" } !
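To make the point about POD and quote-like operators concrete, here is a minimal sketch that runs a deliberately naive pattern over a small source string. The pattern and the sample source are made up for illustration; they are not the regexes from the original post.

```perl
use strict;
use warnings;

# Deliberately naive: "sub" followed by a name, anywhere in the text.
my $naive = qr/\bsub\s+(\w+)/;

# Hypothetical source with one real sub, one inside q!!, one inside POD.
my $source = <<'END_SOURCE';
sub real_one { 1 }

my $text = q! sub in_string { print "bla" } !;

=pod

sub in_pod { ... }

=cut
END_SOURCE

# List-context /g returns every captured name.
my @found = $source =~ /$naive/g;
print "$_\n" for @found;
```

All three names are reported, although only `real_one` is an actual subroutine. Excluding the other two requires knowing where strings and POD begin and end, which is the part a line-based regex cannot see.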
    edit

    *) like PPR,

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery
    FootballPerl is like chess, only without the dice

    update

    From your module

    > Files are parsed using PPI, not by inspecting packages or op-trees.

    I think it's clearer if you say

    > Files are statically parsed using PPI, not by dynamically inspecting packages or coderefs.

Re: Regex critique
by Corion (Patriarch) on Sep 08, 2018 at 06:10 UTC

    Depending on how much you can constrain your input, you could look at the regex I use in Filter::signatures to recognize subroutine signatures. The regexes somewhat cheat as they leave parsing of quotes and quote-likes to Filter::Simple / Text::Balanced, so you either need to use these modules or restrict your code to sane / easy string constructs.

    The simplest version is this. It is compatible with 5.8, but doesn't handle nested parentheses.

    # This is the version that is most downwards compatible but doesn't handle
    # parentheses in default assignments
    sub transform_arguments {
        # This should also support
        # sub foo($x,$y,@) { ... }, throwing away additional arguments
        # Named or anonymous subs
        no warnings 'uninitialized';
        s{\bsub(\s*)(\w*)(\s*)\((\s*)((?:[^)]*?\@?))(\s*)\)(\s*)\{}{
            parse_argument_list("$2","$5","$1$3$4$6$7")
        }mge;
        $_
    }

    This fails for example for:

    sub fail58( $time = localtime() ) {

    The recursive regex used for 5.010 onwards is more complex because it handles matched parentheses and curly braces.

    sub transform_arguments {
        # We also want to handle arbitrarily deeply nested balanced parentheses here
        no warnings 'uninitialized';
        s{\bsub(\s*) #1
           (\w*)     #2
           (\s*)     #3
           \(
           (\s*)     #4
           (         #5
               (     #6
                   (?:
                       \\.                # regex escapes and references
                     |
                       \(
                           (?6)?          # recurse for parentheses
                       \)
                     |
                       \{
                           (?6)?          # recurse for curly brackets
                       \}
                     |
                       (?>[^\\\(\)\{\}]+) # other stuff
                   )+
               )*
               \@?                        # optional slurpy discard argument at the end
           )
           (\s*)\)
           (\s*)\{}{
            parse_argument_list("$2","$5","$1$3$4$8$9")
        }mgex;
        $_
    }
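The difference between the two approaches can be seen in a reduced form. The sketch below is not the Filter::signatures code: `$flat` and `$balanced` are simplified stand-ins written for this illustration, and `$balanced` only balances parentheses, using the `(?2)` recursion available from Perl 5.10.

```perl
use strict;
use warnings;
use 5.010;    # recursive subpatterns such as (?2)

# Flat version: the argument list may not contain a closing parenthesis.
my $flat = qr/\bsub\s*(\w*)\s*\(([^)]*?)\)\s*\{/;

# Recursive version: group 2 matches one balanced parenthesised list.
my $balanced = qr/
    \bsub \s* (\w*) \s*
    (                           # group 2
        \(
            (?: [^()]++ | (?2) )*
        \)
    )
    \s* \{
/x;

my $decl = 'sub fail58( $time = localtime() ) {';

print "flat:     ", ( $decl =~ $flat ? "match" : "no match" ), "\n";
if ( $decl =~ $balanced ) {
    print "balanced: name=$1 list=$2\n";
}
```

The flat pattern stops at the `)` of `localtime()` and then cannot find the opening brace, while the recursive one consumes the inner pair as a unit. A parenthesis inside a string default would still confuse this sketch, which is why the real code leaves quote handling to other modules.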

    If this is for parsing your own code, you usually can rewrite/restrict your code to a limited subset of what Perl allows. If you want a generic "subroutine declaration finder", you have a hard task in front of you.

Re: Regex critique ( intention? )
by beech (Parson) on Sep 07, 2018 at 21:29 UTC

    Hi,

    PPI has no problem with this

    sub # keyword
    Routine # identifier
    { # new block
    '
    sub # keyword
    Routine # identifier
    { # new block
    "PPI"
    }
    '
    }