For the last few months, I've been spending part of my time on an elaborate parsing project. (See To model or not to model and To make the Model WORK.) My first pass was a regex-based approach, and it worked fairly well, but it was difficult to expand and had problems discovering what wasn't there, i.e., variables that hadn't been defined.

So, I sucked it up and rewrote it with Parse::RecDescent. My first pass used five-element productions to allow me to choose operator precedence (there are seven levels in the grammar), but a cow-orker with a noggin reminded me of Bjarne Stroustrup's elegant Calculator parser. The first grammar worked, but it took about ten hours for 1800 variable-definition equations. Defining a grammar that implemented Stroustrup's cascade got me the results in eight hours, but watching the megabytes of trace stream by showed me that it wasn't storing intermediate results; it was recalculating everything hundreds of times. It is certainly possible that my understanding of P::RD is the culprit, but I had several cow-orkers give it a read, and they couldn't see anything amiss.

Quandary time: what to do? If I continued to use P::RD, I would have to rewrite it so that all the work was done in the next-token { action blocks }, which seems to defeat the purpose of P::RD.
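
For anyone who hasn't played with P::RD, here is roughly what I mean by { action } blocks. This is a toy two-level grammar, nothing like the real one, with the computation done inline in the actions:

use strict;
use warnings;
use Parse::RecDescent;

# Toy grammar: the values are computed right in the { action } blocks.
my $grammar = q{
    expr    : term '+' expr   { $item[1] + $item[3] }
            | term
    term    : factor '*' term { $item[1] * $item[3] }
            | factor
    factor  : '(' expr ')'    { $item[2] }
            | /\d+(?:\.\d+)?/
};

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar\n";
print $parser->expr('2 * 3 + 4'), "\n";    # prints 10

Doing all the real work in blocks like those, across seven precedence levels plus user-defined functions and if-then-else, is what I mean by defeating the purpose.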

I chose to rewrite everything from the ground up using a raw-language implementation of Stroustrup's algorithm expanded to include the precedence levels, user-defined functions, and nested if-then-else's of the language I'm parsing. It took me a little over two weeks to do the rewrite. The bottom line is that my solution time went from over eight hours to 36 seconds, an improvement of almost three orders of magnitude.

There are times when modules are a gift, and there are times when they get in the way. My raw-language implementation gives me access to everything, and it's not really harder to understand than the P::RD grammar, which was loaded with { action } cases for exceptions like divide-by-zero and nested elements. Each of the precedence levels is handled by a subroutine that's less than twenty pretty-printed lines long, and the whole program took me only about as long to write and debug as I spent on the P::RD grammar in its several passes.
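
To give a feel for the cascade, here is a Perl sketch boiled down to three levels. It is only an illustration; the real solver has all seven levels, user-defined functions, and if-then-else, and nothing in it is named like this:

use strict;
use warnings;

my @tokens;    # filled by the tokenizer

sub peek       { $tokens[0] }
sub next_token { shift @tokens }

# Tightest level: numbers and parenthesized sub-expressions.
sub primary {
    my $t = next_token();
    if ($t eq '(') {
        my $v = additive();
        next_token();              # consume the ')'
        return $v;
    }
    return $t + 0;                 # a bare number
}

# Right-associative level: recurse into the same level for the
# right-hand side, so 2 ** 3 ** 2 groups as 2 ** (3 ** 2).
sub power {
    my $base = primary();
    if (defined peek() and peek() eq '**') {
        next_token();
        return $base ** power();
    }
    return $base;
}

# Left-associative levels: loop over this level's operators and
# defer anything tighter to the level below.
sub multiplicative {
    my $v = power();
    while (defined peek() and peek() =~ m{^[*/]$}) {
        my $op  = next_token();
        my $rhs = power();
        $v = $op eq '*' ? $v * $rhs : $v / $rhs;
    }
    return $v;
}

sub additive {
    my $v = multiplicative();
    while (defined peek() and peek() =~ /^[+-]$/) {
        my $op  = next_token();
        my $rhs = multiplicative();
        $v = $op eq '+' ? $v + $rhs : $v - $rhs;
    }
    return $v;
}

@tokens = ('2', '+', '3', '*', '(', '4', '-', '1', ')');
print additive(), "\n";    # prints 11

The intermediate results live right in the call stack as each level returns, which is exactly what I was missing in the P::RD traces.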

So, before you load up your code with a lot of use XYZ::Foo::Bar statements, think about it. Your language might just be your best friend!

Re: On modules
by merlyn (Sage) on Aug 18, 2005 at 14:06 UTC
    TheDamian once explained to me why my attempt to parse all 22 levels of Perl's precedence (at that time) with a recursive descent parser was a really bad idea. I don't fully remember why at this point, but you're obviously stumbling on the same situation. Like everything else, Recursive Descent has plusses and minuses.

    I think your meditation is misnamed then. The problem is not using a module: the problem is using the wrong module simply because you're familiar with it. For example, Parse::Yapp makes a different kind of parser (LALR or something like that) that might have solved your issues lickety-split.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      That could be true as well, merlyn. Recursive descent (the methodology) is the right way to do this sort of thing, but what TheDamian's module does isn't exactly classic RD. The sticking point is left recursion: a natural rule like expr : expr '-' term sends a descent parser into infinite recursion and has to be refactored, whereas a YACC-style (LALR) parser handles it directly. That's why this is hard to do in P::RD where it might be better done in a YACC-style parser.

      I was as frustrated with using the module itself as I was with the slowness of the program. Part of the reason I rewrote it was so that I could build my own trace functions and examine my own variables and stacks. I felt that understanding TheDamian's code would have taken me as long as the approach I took, and might not have given me the same results. That's the main reason I focused on the use of modules, and it's a broader concern than just with P::RD. As you say, any module and/or approach has its pluses and minuses, and these must be evaluated for each problem one approaches.
        Recursive descent (the methodology) is the right way to do this sort of thing
        No, this is apparently the problem. RD is great for some things, but not for this. There's a reason that gcc is not RD, but LALR. Again, I'm a bear of very little brane about this, but when TheDamian explained it to me, it made sense at the time. It doesn't matter how fast you speed it up: something with 22 levels of precedence should be parsed with LALR (if it can), not RD.

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

Re: On modules
by 5mi11er (Deacon) on Aug 18, 2005 at 14:02 UTC
    Good write-up, but I'm wondering if you aren't selling modules a little bit short. I would contend that during the various iterations of solving your problem, you probably continued to understand the problem better and more deeply. So, perhaps a better way of viewing your recent experience is that the P::RD iteration of your solution was a stepping stone on your journey to your current iteration?

    -Scott

      Absolutely true. The module worked properly and did solve the problem. I redid it because the people running the simulations want to be able to iterate parameter changes quickly and examine boundary conditions.
Re: On modules
by Courage (Parson) on Aug 18, 2005 at 14:26 UTC
    I always considered P::RD a very good exercise for the brain, for understanding parsing techniques in general: excellent documentation, excellent tutorial, very deep.
    I never considered it for practical use, though, mainly because of its inefficiency.

    One more piece of advice: as long as you're deep in parsers, you may consider looking into Haskell. As autrijus showed us all with his marvellous Pugs perl6 implementation, Haskell is extremely good at this.

    But be warned: Haskell is very hard to start playing with :)

      Actually, this latest version is written (shhh!) in Ruby. My group is collectively evaluating Ruby and looking for situations in which to test it. Since the SPICE parser is basically an exercise in logical abstraction, it was ideal. Haskell is free with FreeBSD, so I'll take a gander there, too. :) More tools and understanding are always Good Things.

      As 5mi11er suggested, my understanding of the problem space has evolved, and P::RD was an important step along that way.
      When I was a much younger programmer and had the time to delve into the details, I spent a lot of time reading books about LISP implementation and its underlying theory. Alonzo Church's Lambda Calculus was part of that research. I drove my math profs nuts because none of the rest of the class was prepared to go there, but the profs wanted to, and did. We got a lot of interesting conversation accomplished, but the rest of the class was left very confused and the syllabus didn't get completed. ;)

      The nice thing about Perl (or Ruby, for that matter) is that one can do functional programming in it as well as imperative programming. One can look at objects, or one can look at the transition methods and data transformation and information flow. It's all a matter of the level of data abstraction that you model, and the actual language used is immaterial to the understanding embedded in your model.

      Haskell does seem to be a very clean way of expressing abstractions, and perhaps it would be very suitable for my parser/equation solver. The next stage of our project, though, is to use my solver in conjunction with the analog simulators and statistical correlative software, and that's an iterative situation more suited to the imperative, input-driven nature of a general-purpose language.

      Looking at Haskell's statelessness, I can see that it would help if we were to cluster the solution, but there are more equations to solve than there would likely be CPU nodes, so any real cluster I could run this on would be happy solving the equations separately.
Re: On modules
by tilly (Archbishop) on Aug 19, 2005 at 03:05 UTC
    I am willing to make several bets.

    First of all, that the experience of having tried to do it with Parse::RecDescent made the rewrite easier to do.

    Secondly, I'll bet that if Parse::RecDescent used /\G.../gc matches to tokenize rather than nibbling strings, its performance would have been good enough. Maybe not as good as your final version, but good enough to make do.

    Thirdly, I'll bet that there are more Perl programmers who are inclined to not use modules that would have helped than there are Perl programmers who are inclined to use modules that are not worthwhile. (Though it is certainly possible to err both ways.)

      I was swayed by reports I'd seen from people who swore that P::RD was the cat's meow. You're absolutely right that I gained insight from working with P::RD. Seeing those trace stacks going endlessly through the same fetches and recalculations was mortifying, not edifying! The key to a true RD algorithm is that your early results are saved in the recursive context as you call it again to go deeper. As you come back up the call stack, those results are available.

      As for the tokenizer, I actually did a two-pass system, tokenizing the whole string and then working with tokens saved as TTYPE/TVALUE pairs. It would have been faster if I had only done one-token lookahead, but the maintenance advantages of separating the tokenizer from the solver outweighed the need for speed. I suspect that the tokenizer is far less a part of the problem than the recalcs. Even with shortcuts, not having access to the actual token stack to replace sub-strings meant that everything was re-fetched again and again.
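
      Roughly the idea, as a Perl sketch using your /\G.../gc suggestion (the token types and patterns here are made up for illustration; the real tokenizer isn't written in Perl and has many more token types):

      sub tokenize {
          my ($src) = @_;
          my @tokens;
          pos($src) = 0;
          while (pos($src) < length $src) {
              if    ($src =~ /\G\s+/gc)                   { next }
              elsif ($src =~ /\G(\d+(?:\.\d+)?[a-z]?)/gc) { push @tokens, [ NUMBER => $1 ] }
              elsif ($src =~ /\G(\w+)/gc)                 { push @tokens, [ IDENT  => $1 ] }
              elsif ($src =~ /\G(\*\*|&&|\|\||[<>!=]=|[-+*\/(),<>=])/gc) { push @tokens, [ OP => $1 ] }
              else  { die "unrecognized input at position " . pos($src) . "\n" }
          }
          return \@tokens;
      }

      # e.g. tokenize('foo = bar ** 2') gives
      #   [ [ IDENT => 'foo' ], [ OP => '=' ], [ IDENT => 'bar' ], [ OP => '**' ], [ NUMBER => '2' ] ]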

      On your third point, it's a tribute to the languages we have now, especially Perl, that one CAN do so much with the base language. I'm soooo glad to have the choice, to have the Open Source OSen available to run them, AND to have these multi-gigahertz PCs that labor so under Doze but just fly when set free with open source!!!
Re: On modules
by samizdat (Vicar) on Aug 19, 2005 at 14:05 UTC
    tlm has asked me to recap the problem space for everyone's benefit. These equations are extracted from files of SPICE model parameters produced for either Cadence Spectre or HSPICE, which are electronic chip simulators that produce moment-by-moment values for how chips behave. Each file contains model parameters for how transistors, capacitors, and such function in a given semiconductor process. The original SPICE simulator that all of these are based on was one of the very first non-OS applications written as Open Source in the 1970s, at UC Berkeley.

    The grammar for these is both deep and recursively defined. Equations can have operators that range from ** down to && and ||, plus unit-scaling factors, and any given non-operator spot might hold a number, a variable, a library function, a user-defined function, an if-then-else, or another whole equation with or without parentheses. Most operators are left-associative, but ** and units are right-associative.

    An example might be something like this:
    foo = if bar >= 0 then if statistic_p(1, bar/3, 0) > 4 then 1.5u elseif statistic_p(1, bar/3, 0) > 2 then 1.2u else 1.0u endif else 0.5u endif
    On top of all of that, == might be written as =, and elseif and elsif are equally valid. It's a hairy mess of a grammar, and our work is complicated by the fact that these values are not guaranteed to be provably correct, only that they will work on a certain high-dollar workstation we do not have the budget to buy. The code is littered with divide-by-zeros, parameters that are not defined in the code, and assumptions made by the workstation that are not documented. Top it off with the fact that the process we are designing for is being changed on the fly by the vendor, and it becomes quite probable that the values I'm given won't be useful two weeks later, so we really need this parser to fly!
Re: On modules
by anonymized user 468275 (Curate) on Aug 22, 2005 at 16:25 UTC
    I agree with the OP: The (most reasonable) design process consists of first identifying the functional (i.e. language-independent) algorithm that solves the problem in the theoretically most appropriate way.

    After that, modules may well exist to fill in some of the coding requirements, but it is often wrong to use a module which misses the mark even slightly. For example, if a module can do everything with a set of files except just one of the things you need, it might be hard to justify doubling the number of passes through all the files just to use the module for most of the requirements, especially if there are hundreds of files adding up to several gigabytes of data to be processed.

    One world, one people