in reply to Re^4: regex issue
in thread regex issue

Perl is impossible to parse in the same sense that it's impossible to determine whether an arbitrary program halts.

Given the following Perl script:

#!/usr/bin/perl
print("1\n");

It should certainly be possible to parse it (without executing it), and it should also be possible to detect that it will halt.

However, it is not possible to write a parser that takes an arbitrary valid Perl script as input and always produces a parse tree as output without executing any of the program.

PPI can parse a very large subset of Perl scripts. It does so very well, but there will always be some scripts it simply can't decide. The canonical example is:

whatever / 25 ; # / ; die "this dies!";

This can be parsed in two very different ways, depending on the prototype of whatever. If it has a prototype of (), it takes no arguments, so the line is interpreted as the following, plus a comment:

whatever() / 25;

If whatever has a prototype of ($), and so takes one argument, then the slash starts a regex match and the line is interpreted as:

whatever($_ =~ m{ 25 ; # }); die "this dies!";
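The two readings can be checked with a small script. This is only a sketch: the package names NoArgs and OneArg, the run helper, and the sub bodies (returning 100 and 1) are made up for illustration. The ambiguous line itself is unchanged in both packages.

```perl
use strict;
use warnings;

{
    package NoArgs;
    # Empty prototype: whatever takes no arguments, so the "/" that
    # follows it is the division operator and "# ..." is a comment.
    sub whatever() { 100 }
    sub run {
        my $r = whatever / 25 ; # / ; die "this dies!";
        return $r;
    }
}

{
    package OneArg;
    # ($) prototype: whatever expects one argument, so the "/" that
    # follows it opens a regex, and the die is live code.
    sub whatever($) { 1 }
    sub run {
        local $_ = '';   # the regex is matched against $_
        eval {
            whatever / 25 ; # / ; die "this dies!";
        };
        return $@;       # the message from die, if it was reached
    }
}

print "empty prototype: ", NoArgs::run(), "\n";
print "(\$) prototype:   ", OneArg::run();
```

The first line of output should be "empty prototype: 4", and the second should carry the "this dies!" message, even though the source text of the line is character-for-character the same in both subs.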

If the prototype of whatever is determined at runtime, e.g.:

BEGIN {
    *sum = sub ($$) { (shift) + (shift) };
    *whatever = (sum(2,2) == 5) ? sub ($) {} : sub () {};
}

then the Perl cannot be parsed without executing part of it. (The parser needs to call the sub sum.)
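To see the dependence on executed code directly, one can compile the same source text twice with string eval and vary only the condition inside the BEGIN block. The following is a rough sketch, not anything PPI or perl does internally: outcome_for, the DemoN package names, and the sub bodies (returning 1 and 100 rather than being empty, to avoid undef warnings) are assumptions made for the demonstration.

```perl
use strict;
use warnings;

my $counter = 0;

# Compile and run the ambiguous line in a fresh package. $condition is
# Perl source text that the BEGIN block evaluates to pick the prototype.
sub outcome_for {
    my ($condition) = @_;
    my $pkg = 'Demo' . ++$counter;
    my $src = join "\n",
        "package $pkg;",
        'BEGIN {',
        '    *sum = sub ($$) { (shift) + (shift) };',
        "    *whatever = ($condition) ? sub (\$) { 1 } : sub () { 100 };",
        '}',
        'local $_ = "";',
        'my $r = whatever / 25 ; # / ; die "this dies!";',
        '"divided: $r";';
    my $result = eval $src;
    return defined $result ? $result : "died: $@";
}

print outcome_for('sum(2,2) == 5'), "\n";   # condition false: () prototype
print outcome_for('sum(2,2) == 4');         # condition true: ($) prototype
```

With the false condition the line is a division and the eval returns "divided: 4". With the true condition the same line becomes a regex match followed by a die. Nothing but the result of calling sum at compile time distinguishes the two cases.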

Which is not to say that PPI and the other fine projects you mention are without value. Parsing a large subset of Perl is still very useful. Having a large subset of a fortune is better than having no money at all.

Re^6: regex issue
by JavaFan (Canon) on Feb 17, 2012 at 12:30 UTC
    Perl is impossible to parse in the same sense that it's impossible to determine whether an arbitrary program halts.
    Exactly. Which means that in most cases, it is possible (see PPI, see perlcritic, see perltidy, see the Javascript compiler, see the syntax highlighters). The problem the OP was facing with his syntax highlighting had nothing to do with whether there may exist a program that cannot be fully parsed without executing some of it.