Extract subroutines from a Perl script. OR: "Only perl can parse Perl." But can it help me to do so?

muba has asked for the wisdom of the Perl Monks concerning the following question:

We all know (as I hope) one of the Perl slogans. The most famous, of course, is the Timtoady one. Next most famous is, I think, "Perl makes Easy Things Easy and Hard Things Possible". Well, with thise two slogans given, I'd like to add another one: "Only perl can parse Perl".

Great! So now I want to write a Perl script (interpreted by perl so hopefully able to parse Perl, because hard things should be possible, in more than one way) to extract subroutines from another Perl script.

Globally, this would be:

search for sub NAME [(PROTOTYPE)] [: ATTRIBUTES]

search for the opening curly, until the matching closing curly.

Well, this shouldn't be too hard. But mind you! What if there are closing curlies within strings? Of course it is not too hard just to ignore everything between quotes. But what if something like qq() or qw() is used? What if the fancy => operator is used? What if here docs are used? And so on...

So... the main question is: how can I extract a subroutine from a Perl file, beginning with the sub keyword, then the name, prototype and attribute specifications, then the opening curly and from there, everything until the closing curly?
I would be glad if this can be done using regexes, but I don't think they're up to the job (unless they become really, really complex). Another possibility is just to scan byte-by-byte, keeping track of opened and closed curly brackets and opened and closed string (this isn't easy, for there are many types of strings, as mentioned above).

To make a long story short, is there an easy way (module or whatever) to easily extract subroutines from a Perl script using a Perl script?

"2b"||!"2b";$$_="the question"
Besides that, my code is untested unless stated otherwise.
One more: please review the article about regular expressions (do's and don'ts) I'm working on.

Comment on Extract subroutines from a Perl script. OR: "Only perl can parse Perl." But can it help me to do so? Select or Download Code

Replies are listed 'Best First'.
Re: Extract subroutines from a Perl script. OR: "Only perl can parse Perl." But can it help me to do so? by hv (Prior) on Sep 24, 2004 at 10:57 UTC
If you need to extract the exact source code, it may be quite difficult. If it is sufficient to get a functional equivalent, look into B::Deparse; from the docs: `use B::Deparse; $deparse = B::Deparse->new("-p", "-sC"); $body = $deparse->coderef2text(\&func);` [download] and here's a specific example as a one-liner: `zen% perl -MB::Deparse -MCarp -wle 'print B::Deparse->new("-p","-sC" +)->coderef2text(\&Carp::confess)' { package Carp; die(longmess(@_)); } zen%` [download] See the rest of the docs for full details of the options available. Hugo	[reply] [d/l] [select]
Re: Extract subroutines from a Perl script. OR: "Only perl can parse Perl." But can it help me to do so? by theorbtwo (Prior) on Sep 24, 2004 at 12:17 UTC
Well, the first question to ask is, "can I run the code in order to find out more about it". This isn't just a question of security, mind you -- some scripts take action at BEGIN time that you'd prefer to avoid. If you can't, you run afowl of the meaning of the "Only `perl` can parse Perl" quote -- as something other then the perl interpreter (even if it's written in Perl) trying to parse Perl (the language), you aren't going to be able to do a perfect job. You can do a /decent/ job, using complex regexes, PPI, or even simple regexes and Text::Balanced. If you /can/, then a whole world of introspection opens up to you. `perl` has already parsed the program (or will parse the program, depending on when you're talking about, which depends on how you structure your program.) You can walk the symbol tables to find your subs, and use some B magic to find the begin and end lines. To whit: Given a coderef, running `B::svref_2object` on it will give you an object somewhere in the B tree (I don't quite recall; B::SVRV quite likely; let's call it `$cv`). Running `$cv->GV->LINE`, according to my comments, gives you the line number of the /last/ line of the sub. `$cv->FILE;` will give you the filename. `$cv->START;` will give you the opcode where execution starts. If this ISA B::COP (and it always will be), then you can call ->line on it, to get it's line number. IIRC, that's the line number of the first /statement/ within the sub, not of the `sub foo {` line. Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).	[reply] [d/l] [select]
Re: Extract subroutines from a Perl script. OR: "Only perl can parse Perl." But can it help me to do so? by Anonymous Monk on Sep 24, 2004 at 08:55 UTC
PPI	[reply]
Re^2: Extract subroutines from a Perl script. OR: "Only perl can parse Perl." But can it help me to do so? by muba (Priest) on Sep 24, 2004 at 09:03 UTC
That is a rather complex set of modules. It seems that the documention is (as mentioned) still incomplete. But I thank you for the quick reply and I will play around with PPI for a while, to see if it fits my needs. So, thank you, Anonymous Monk! `"2b"\|\|!"2b";$$_="the question"` Besides that, my code is untested unless stated otherwise. One more: please review the article about regular expressions (do's and don'ts) I'm working on.	[reply] [d/l]
Re: Extract subroutines from a Perl script. OR: "Only perl can parse Perl." But can it help me to do so? by ikegami (Patriarch) on Sep 24, 2004 at 14:58 UTC
Text::Balanced does a pretty good job. Parse::RecDescent uses it to extract Perl action blocks from a grammar.	[reply]
Re: Extract subroutines from a Perl script. OR: "Only perl can parse Perl." But can it help me to do so? by TedPride (Priest) on Sep 24, 2004 at 10:36 UTC
The following should work fine assuming there are no regular expressions, text assignments, etc. containing { or } inside the subs. With a little additional code (make a copy of the original text, remove harmful code, then search it and return sections of the original) this could be made to work for pretty much anything. Enjoy. `&get_subs('spider.pl', \@arr); print @arr[0]; sub get_subs { my ($start, $n, $inp, $path, $p); $path = shift; $p = shift; open(INP, 'spider.pl') \|\| die; $inp = ' ' . join('', <INP>); close(INP); while ($inp =~ /[\n ]sub[\n ]/g) { $start = pos($inp) - 4; $n = 1; pos($inp) = index($inp, '{', pos($inp)) + 1; while ($inp =~ /(}\|{)/g) { if ($1 eq '}') { if (!(--$n)) { last; } } else { $n++; } } push(@$p, substr($inp, $start, pos($inp) - $start)); } }` [download]	[reply] [d/l]
Re: Extract subroutines from a Perl script. OR: "Only perl can parse Perl." But can it help me to do so? by samtregar (Abbot) on Sep 24, 2004 at 17:24 UTC
To make a long story short, is there an easy way (module or whatever) to easily extract subroutines from a Perl script using a Perl script? To make a long answer short, no. Perl is a context-sensitive language which defies any attempt to write a simple parser. You'll either have to settle for something simple that doesn't handle edge cases or something complex that does. As an aside, you might be intersted in this take on "only perl can parse Perl": Chasing Shadows -sam	[reply]


"be consistent"
	PerlMonks