in reply to Appropriate CPAN namespace for perl parser

As I asked you in private email in reply to your module-author mailing, I'll ask again:
How do you plan on handling the issues raised in my post On Parsing Perl?

-- Randal L. Schwartz, Perl hacker

Re: Re: Appropriate CPAN namespace for perl parser
by adamk (Chaplain) on Feb 12, 2002 at 23:32 UTC
    Sorry about the lack of reply Randal... I have a nasty work email situation, so I should get to it shortly. But to answer your question, handling / is currently the most "fuzzy" part. It's the one character I had real problems with.

    Currently, it's written to handle the most common cases only.

    Line 144 of http://ali.as/PSP/source/Perl/Tokenizer/Classes.html in the browsable source code is the relevant section.

    Since I don't have exposure to the relevant sections of the perl C source, it was fairly difficult, but I'm sure there's a method that covers the 99.9% of standard cases.

    With the difficulties in overcoming POD, __END__ etc tags, quote parsing, and the rest mostly solved, I wouldn't want to cancel the whole thing just because of a single character :)
      But that's only one example. It's not just / (divide or regex). It's also dot (concatenate or decimal point), less-than (less than or filehandle read), two less-thans (left shift or here-doc), star (glob or multiply), percent (hash or modulus), ampersand (subroutine or bit-wise and), and question mark (regex or question-colon).

      If you aren't handling all of those, you aren't parsing Perl!

      Put another way, you cannot tokenize Perl without at all times knowing whether you are expecting a value or an operator, because all of the ones I just listed have double duty, depending on context. And yet, to know that, you also need to know if you have a prototyped function to the left that takes args or not. What a mess!
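To make the "value or operator" point concrete, here is a minimal sketch of a tokenizer that carries that one bit of state and uses it to decide what a slash means. It is written in Python purely as a neutral illustration. The token set, the `tokenize` function, and the `expect_term` flag are all assumptions invented for this sketch, not code from perl or from the module under discussion. It handles only scalars, integers, a few operators, and unescaped regexes.

```python
import re

# Hypothetical token pattern for a tiny Perl-like subset (an assumption for
# illustration only; real Perl has far more token types than this).
TOKEN = re.compile(r"(?P<num>\d+)|(?P<var>\$\w+)|(?P<op>=~|[-+*=(,;])|(?P<close>\))")


def tokenize(src):
    """Return (kind, text) pairs, deciding what '/' means from context."""
    tokens = []
    pos = 0
    expect_term = True  # at the start of a statement a value is expected
    while pos < len(src):
        ch = src[pos]
        if ch.isspace():
            pos += 1
            continue
        if ch == "/":
            if expect_term:
                # A value is expected here, so '/' opens a regex literal.
                # Escaped slashes are ignored to keep the sketch short.
                end = src.find("/", pos + 1)
                if end == -1:
                    raise ValueError("unterminated regex at %d" % pos)
                tokens.append(("regex", src[pos:end + 1]))
                pos = end + 1
                expect_term = False  # a regex is itself a value
            else:
                # A value has just ended, so '/' is the division operator.
                tokens.append(("divide", "/"))
                pos += 1
                expect_term = True
            continue
        m = TOKEN.match(src, pos)
        if m is None:
            raise ValueError("unexpected character %r at %d" % (ch, pos))
        kind = m.lastgroup
        tokens.append((kind, m.group(0)))
        pos = m.end()
        # Numbers, variables and ')' end a value; operators, '(' and ','
        # mean another value must follow.
        expect_term = kind == "op"
    return tokens


print(tokenize("$x / 2"))
print(tokenize("$x =~ /2/"))
```

Barewords are deliberately left out. After a function name, whether a value or an operator comes next depends on that function's prototype, which is exactly the information a standalone tokenizer does not have.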

      -- Randal L. Schwartz, Perl hacker

        Does your list of issues you think people need to worry about include things like the (soon to be core) Switch module?

        As soon as you open up the gates to code that uses things like Filter::Simple, it becomes utterly impossible for anything without a working interpreter to figure out how to parse Perl. And that is an idea which looks to be used more and more aggressively as time goes by.

        But if I understand some of the docs correctly, even perl itself doesn't really know what everything is; it guesses based on heuristics ("Do What I Think"). For example, take deciding what D'oh or s'e'f'g is: the first evaluates as 'D::oh', while the second is equivalent to $_ =~ s/e/f/g;
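One way such a guess could be made is sketched below, in Python for illustration. The rule is to look at the word in front of the first apostrophe: if it is a quote-like operator, treat the apostrophe as a delimiter, otherwise treat it as the old-style package separator. The function name `classify_apostrophe_word` and the rule itself are assumptions made for this sketch. It is not a description of how perl actually decides.

```python
import re

# Quote-like operators that accept an arbitrary delimiter, including "'".
QUOTE_LIKE = {"q", "qq", "qw", "qr", "qx", "m", "s", "y", "tr"}


def classify_apostrophe_word(text):
    """Guess whether text is a quote-like construct or a package-qualified name."""
    m = re.match(r"[A-Za-z_]\w*", text)
    if m is None:
        raise ValueError("text does not start with a word: %r" % text)
    word = m.group(0)
    rest = text[m.end():]
    if not rest.startswith("'"):
        # No apostrophe follows the word, so it is a plain identifier.
        return ("identifier", word)
    if word in QUOTE_LIKE:
        # The leading word is a quote-like operator: "'" is its delimiter.
        return ("quote-like", word)
    # Otherwise read "'" as the old package separator, e.g. D'oh -> D::oh.
    return ("identifier", re.sub(r"'(?=[A-Za-z_])", "::", text))


print(classify_apostrophe_word("D'oh"))
print(classify_apostrophe_word("s'e'f'g"))
```

Checking the operator list first is what keeps s'e'f'g from being read as the name s::e::f::g.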

        If Perl itself has to take educated guesses, can I allow myself the same luxury? As it currently stands, I take guesses in certain situations which, while not as accurate as Perl's, do the job in a percentage of cases, hopefully a large one.

        As the module evolves, I would hope that the guesses get better and better. I personally believe that that is good enough.

        And should the need arise, I'll merge the tokenizer and lexer into a single unit, add prototype checking and context tracking, or whatever else is required ( goddammit :) ). I don't plan to be perfect. And given the number of man-years spent on perl itself, it's probably a lost cause trying to get all the way to perfect. But that's no reason not to have something that provides value in other ways.

        BTW, thanks for the SLUG visit, I certainly enjoyed it, if only for the 'use base' alone. ( I asked the icky symbol table question )

        Adam