sundialsvc4 has asked for the wisdom of the Perl Monks concerning the following question:

“Gleaning” is the best word to think of to describe what I want to do here.   I’ll be combing through large SAS programs (and SQL queries and yah-yah) looking for very specific needles in the haystacks.   The rest I want to ignore.   (And, I want the approach to be pretty flexible because some of these programs are “old and nas-s-s-ty.”)

These “needles” are not simply string-patterns.   It seems to me that the best way to describe them is in terms of a grammar (or sub-grammar).   But it will have a large number of “here be dragons” gaps in it, and these gaps are okay.   If I decide that I don’t care about something, I don’t want the program to be tripped up by it.   And I don’t want to have to describe in any sort of detail what it is that I don’t care about.   (I know of about 9,000 input files so far, and there may be many more.)

I can predict that, as time goes on with this project, we’ll find new things that we want to “glean” for.   Looking for specific macros, for example, and picking certain things out of them.   So, we’ll be re-mining this same vein again and again and again.

Speaking of macros, macro expansion is a whole ’nuther kettle of fish here.   I’m going to need to be able to find the various %let statements in the code in a “first pass,” then expand the &varname references as best I can, and then re-parse.   (And these guys put macros in their macro-expansions, sometimes six or seven levels deep.)
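A first cut at that two-pass expansion can be sketched in plain Perl.   (A sketch only:  `expand_lets` is a made-up name, the ten-pass cap is an arbitrary guard against circular definitions, and real SAS scoping — %local, %global, %macro bodies — is ignored entirely.)

```perl
use strict;
use warnings;

# First pass: harvest  %let name = value;  definitions.
# Later passes: substitute known &name (or &name.) references
# repeatedly, since macro values may themselves contain macro
# references several levels deep.
sub expand_lets {
    my ($src) = @_;

    my %let;
    while ($src =~ /%let\s+(\w+)\s*=\s*([^;]*?)\s*;/gi) {
        $let{lc $1} = $2;
    }
    return $src unless %let;

    my $names = join '|', map { quotemeta } keys %let;
    for (1 .. 10) {                  # cap the depth; survives cycles
        last unless $src =~ s/&($names)\b\.?/$let{lc $1}/gi;
    }
    return $src;
}
```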

What I emphatically don’t want to find myself doing is ... effectively, starting over, or playing whack-a-mole on a big ugly wad of (my...) code that is constantly growing bigger and uglier as requirements evolve.   And, although I know regexes very well, I really want to try to stay clear of “regex hell.”   This is a task for a lexer and a parser.

It is also attracting a lot of high-management attention...   which, as we all know, is both a very good thing and a not so very good thing.

Parse::RecDescent is already being used, to very good effect.   Other parsers would be much more problematic to deploy.   Computing resources are plentiful and fast.
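For what it’s worth, the “ignore the dragons” idea can be prototyped in Parse::RecDescent with a catch-all junk rule that swallows one token of anything the needle rules don’t claim.   (This is only a sketch — the single %let needle and the rule names are illustrative assumptions, not a real SAS grammar.)

```perl
use strict;
use warnings;
use Parse::RecDescent;

# One needle rule (%let statements, as an example) plus a junk rule
# that eats one whitespace-delimited token of anything else.  The
# top rule keeps only the hashrefs the needle rule produced.
my $grammar = q{
    file:     chunk(s)  { $return = [ grep { ref } @{ $item[1] } ] }
    chunk:    let_stmt | junk
    let_stmt: /%let\b/i name '=' value ';'
              { $return = { name => $item{name}, value => $item{value} } }
    name:     /\w+/
    value:    /[^;]*/
    junk:     /\S+/  { 1 }    # "here be dragons" -- swallow and move on
};

my $parser = Parse::RecDescent->new($grammar) or die 'bad grammar';
my $needles = $parser->file('data w; set x; run; %let lib = mylib; proc print; run;');
# $needles should hold one needle: { name => 'lib', value => 'mylib' }
```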

Any thoughts on a general, high-level approach to this sort of problem?

Re: Using (say) Parse::RecDescent to glean from a file
by moritz (Cardinal) on Oct 10, 2010 at 08:57 UTC
    Any thoughts on a general, high-level approach to this sort of problem?

    A general, high-level approach is to reuse the parser that ends up parsing (and preprocessing) the text files in the existing applications.

    Or an existing parser, such as SAS::Parser/SQL::Parser.

    You have given us a high level overview, but not many details about what you want to parse, how you know what to search for, how these patterns look etc., so it's hard to know what kind of approaches might be feasible.

Re: Using (say) Parse::RecDescent to glean from a file
by sundialsvc4 (Abbot) on Oct 10, 2010 at 14:52 UTC

    Okay, first of all, I had overlooked SAS::Parser.   Mea culpa on that one.

    One thing that I know we’ll be doing is looking at the SQL statements that are issued.   Sometimes we are looking at a mixture of in-line code of different types.   (Over the years, many now long-gone programmers did things in different ways.)

    The general notion that I have right now is that ... “somewhere in all this unpredictable mess is a particular string (of tokens) that I am looking for.” When and if I find it, I know that “this thing” has the following general structure, which I would now like to parse.   And, when I do parse it, I want to look for particular things and ignore the rest.

    One strategy, of course, is to use regular expressions or what-not to carve out chunks of source-code that can then be subjected to a regular parser.   But there are two elements to that approach which give me pause:

    1. Now I am writing, and maintaining, two sets of code that are jointly responsible for handling the file.
    2. Within each isolated chunk of code that I want to parse, I still fear that I would have to provide complete-and-correct grammar information.

    So, from a parsing point-of-view, it is rather like looking for the good-stuff in a file chock-full of “syntax errors.”
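    For concreteness, the carving stage of that two-part strategy might look like this.   (The `sql_chunks` name and the PROC SQL ... QUIT pattern are hypothetical, and a regex this naive would be fooled by a QUIT inside a quoted string — which is exactly the maintenance worry above.)

```perl
use strict;
use warnings;

# A cheap regex carves out candidate regions; only those chunks are
# ever handed to the real parser, so everything outside a chunk
# needs no grammar at all.
sub sql_chunks {
    my ($src) = @_;
    return $src =~ /( proc \s+ sql \b .*? \b quit \s* ; )/gsix;
}
```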

    (Kind of like the Gary Larson cartoon about “what a dog actually hears” ... blah blah blah blah dog food blah blah blah play catch blah blah blah ...   I want to be able to tell the computer about dog-food and games; about what I am truly interested in and nothing else.)