in reply to Recognizing Perl in text

I'd say all extracted lines which pass a syntax check with perl -c without error are by definition Perl code.

(I suppose man pages don't include syntactically wrong code).

Anyway, you'll run into security issues, because perl -c still executes BEGIN blocks and use statements, and avoiding that isn't trivial.

(see Vulnerabilities when editing untrusted code... (Komodo) and especially 847484)

Another option would be to use PPI for a static parse, which never executes the code.

Cheers Rolf

UPDATE: When you say "analyze a man page", do you mean one generated from POD (as IMHO >90% of all man pages about Perl are)? POD has a pretty clear convention for marking embedded code: indentation.

The best solution anyway would be to parse the original POD instead of the man page. Some POD parsers (like Pod::POM) already recognize code sections (which are not necessarily Perl code, though).

If you then run a syntax check on those extracted chunks you should get almost 100% reliable results.
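A minimal sketch of the "parse the POD, keep the code sections" idea. It swaps in the core module Pod::Simple rather than Pod::POM, and the class name `VerbatimCollector` and the function `verbatim_chunks` are made up for illustration. It only collects the indented (verbatim) paragraphs; whether they are actually Perl still has to be decided afterwards.

```perl
use strict;
use warnings;

{
    # Hypothetical helper class: remembers the text of every verbatim paragraph.
    package VerbatimCollector;
    use parent 'Pod::Simple';

    sub _handle_element_start {
        my ($self, $name) = @_;
        if ($name eq 'Verbatim') {
            $self->{vc_in_verbatim} = 1;
            push @{ $self->{vc_chunks} }, '';   # start a new chunk
        }
    }

    sub _handle_text {
        my ($self, $text) = @_;
        $self->{vc_chunks}[-1] .= $text if $self->{vc_in_verbatim};
    }

    sub _handle_element_end {
        my ($self, $name) = @_;
        $self->{vc_in_verbatim} = 0 if $name eq 'Verbatim';
    }
}

# Returns the verbatim (indented) chunks found in a POD string.
sub verbatim_chunks {
    my ($pod) = @_;
    my $parser = VerbatimCollector->new;
    $parser->parse_string_document($pod);
    return @{ $parser->{vc_chunks} || [] };
}

my $demo_pod = <<'END_POD';
=head1 SYNOPSIS

Some explanatory prose.

    my $x = 1;
    print $x;

=cut
END_POD

print "--- chunk ---\n$_\n" for verbatim_chunks($demo_pod);
```

Each returned chunk could then be handed to whatever check or scoring step comes next, while the surrounding prose is left for the spell checker.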

Re^2: Recognizing Perl in text
by DrHyde (Prior) on Jan 06, 2011 at 10:25 UTC
    I'd say all extracted lines which pass a syntax check with perl -c without error are by definition Perl code.
    Really?
    $foo=1
    What language is that? It could be perl, but it could also be PHP, and I expect a few other things. What language it is may perhaps be divined from the surrounding lines of code - it's certainly more likely to be perl if the lines before and after also match your rule.
      > It could be perl, but it could also be PHP

      So what?

      Of course it is Perl and PHP and whatever else, just like the words "name", "hand" and "finger" are German AND English (well, modulo capitalization).

      The OP didn't ask about distinguishing Perl from other languages, but from text.

      Cheers Rolf

        Be careful: so is 'Gift'. ;-)

        The person I was responding to said "I'd say all extracted lines which pass a syntax check with perl -c without error are by definition Perl code". This is demonstrably false.

Re^2: Recognizing Perl in text
by Anonyrnous Monk (Hermit) on Jan 06, 2011 at 13:47 UTC

    I think a problem with the syntax check idea is, what exactly would you feed to perl -c ?

    If you do it line by line, and have a code snippet like this

    for my $foo (@foo) {
        for my $bar (@$foo) {
            push @{ $self->{results} },
                {
                    baz => foo( $bar->{baz},
                                $bar->{quux}[1] )
                };
        }
    }

    not a single line (on its own) would pass a syntax check, while taken as a whole, the snippet is perfectly valid Perl code.

    Of course, you could try to work around that problem by passing multiline snippets to the syntax checks, but then the number of possible combinations is going to explode rather soon, even for moderate file sizes...  So you'd at least need some additional heuristic to identify likely beginnings of code sections, or some such, in order to make this approach feasible in practice.
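To make the "what do you feed to perl -c" question concrete, here is a rough sketch of a checker. The name `syntax_ok` is hypothetical, and the shell redirection assumes a Unix-like system. Note that perl -c still runs BEGIN blocks and use statements, so this should only ever see trusted text.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: true if "perl -c" accepts the given text.
# WARNING: perl -c executes BEGIN blocks and "use" statements,
# so do not feed it text from untrusted sources.
sub syntax_ok {
    my ($code) = @_;
    my ($fh, $file) = tempfile(SUFFIX => '.pl', UNLINK => 1);
    print {$fh} $code;
    close $fh or die "Cannot close $file: $!";

    # Run the same perl binary in check-only mode, discarding its messages.
    my $cmd    = sprintf '"%s" -c "%s" 2>&1', $^X, $file;
    my $output = `$cmd`;
    return $? == 0;
}

my $demo_snippet = <<'END_CODE';
for my $foo (@foo) {
    for my $bar (@$foo) {
        push @{ $self->{results} },
            {
                baz => foo( $bar->{baz},
                            $bar->{quux}[1] )
            };
    }
}
END_CODE

printf "whole snippet: %s\n", syntax_ok($demo_snippet) ? 'ok' : 'fails';
printf "first line:    %s\n",
    syntax_ok("for my \$foo (\@foo) {\n") ? 'ok' : 'fails';
```

Every call starts a new perl process, which is why the number of candidate chunks matters so much in practice.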

      With a clever strategy it's possible to significantly limit the number of possible chunks to check!

      Simply start checking the most indented line and successively add surrounding lines.

      for my $foo (@foo) {                          # 8 fails
          for my $bar (@$foo) {                     # 6 fails
              push @{ $self->{results} },           # 5 works
                  {                                 # 3 fails
                      baz => foo( $bar->{baz},      # 2 works
                                  $bar->{quux}[1] ) # 1 fails
                  };                                # 4 works
          }                                         # 7 works
      }                                             # 9 works

      This way the overhead for identifying n lines of code is (statistically) at most linear!

      UPDATE: And it's still possible to rely on the existence of trailing semicolons or braces before running a syntax check.

      Cheers Rolf
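The growing strategy above might be sketched like this. The function names are hypothetical, and the validity check is passed in as a callback: the demo uses a cheap bracket-balance test as a stand-in (an assumption of this sketch), but a perl -c based check could be plugged in instead. Ties between the line above and the line below go to the line above, which reproduces the numbering in the example.

```perl
use strict;
use warnings;

# Stand-in check (assumption): the text counts as "valid" if brackets balance.
sub brackets_balanced {
    my ($code) = @_;
    my %open_for = (')' => '(', ']' => '[', '}' => '{');
    my @stack;
    for my $ch ($code =~ /[(){}\[\]]/g) {
        if ($ch =~ /[({\[]/) { push @stack, $ch; next }
        return 0 unless @stack && pop(@stack) eq $open_for{$ch};
    }
    return @stack ? 0 : 1;
}

sub indent_of {
    my ($line) = @_;
    my ($ws) = $line =~ /^([ \t]*)/;
    return length $ws;
}

# Start at the most indented line and grow the window one line at a time,
# always adding the more indented neighbour first (ties: line above).
# Returns ([first, last] of the largest valid window, log of check results).
sub grow_chunk {
    my ($lines, $is_valid) = @_;
    return unless @$lines;
    my @indent = map { indent_of($_) } @$lines;

    my $start = 0;
    for my $i (1 .. $#indent) {
        $start = $i if $indent[$i] > $indent[$start];
    }

    my ($lo, $hi) = ($start, $start);
    my (@best, @log);
    while (1) {
        my $ok = $is_valid->(join "\n", @{$lines}[$lo .. $hi]) ? 1 : 0;
        push @log, $ok;
        @best = ($lo, $hi) if $ok;
        last if $lo == 0 && $hi == $#$lines;

        my $up   = $lo > 0        ? $indent[$lo - 1] : -1;
        my $down = $hi < $#$lines ? $indent[$hi + 1] : -1;
        if ($up >= $down) { $lo-- } else { $hi++ }
    }
    return (\@best, \@log);
}

my @demo_lines = (
    'for my $foo (@foo) {',
    '    for my $bar (@$foo) {',
    '        push @{ $self->{results} },',
    '            {',
    '                baz => foo( $bar->{baz},',
    '                            $bar->{quux}[1] )',
    '            };',
    '    }',
    '}',
);
my ($best, $log) = grow_chunk(\@demo_lines, \&brackets_balanced);
print "checks: @$log\n";
print "largest valid window: lines $best->[0] to $best->[1]\n";
```

The window grows by exactly one line per check, so n candidate lines cost n checks. In real text one would also need a rule for when to stop growing into the surrounding prose, which this sketch leaves out.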

      Sure, but that's why I added the update about the indentation convention.

      Do you know any man pages with Perl code that don't originate from POD? I don't...

      And I agree with Marshall who recommended scanning for trailing /;\s*$/ or /;\s*#.*$/ for a pretty good weighting heuristic.

      Cheers Rolf
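A small sketch of such a weighting heuristic, using the two patterns quoted above plus a trailing-brace test. The function name `code_score` and the choice to return a simple fraction are assumptions of this sketch, not something proposed in the thread.

```perl
use strict;
use warnings;

# Fraction of non-blank lines that end like a Perl statement or block:
# a trailing ";" (optionally followed by a comment) or a trailing brace.
sub code_score {
    my (@lines) = @_;
    my @nonblank = grep { /\S/ } @lines;
    return 0 unless @nonblank;

    my $hits = grep {
           /;\s*$/        # statement terminator at end of line
        || /;\s*#.*$/     # terminator followed by a comment
        || /[{}]\s*$/     # opening or closing brace at end of line
    } @nonblank;

    return $hits / scalar @nonblank;
}

my @demo_code  = ('my $x = 1;', 'if ($x) {', '    print $x;  # show it', '}');
my @demo_prose = ('The following Perl program will print a value.',
                  'It is rather short.');

printf "code:  %.2f\n", code_score(@demo_code);
printf "prose: %.2f\n", code_score(@demo_prose);
```

A score like this only says that a chunk looks statement-like; as noted earlier in the thread, it cannot tell Perl from PHP or other languages with the same line endings.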

Re^2: Recognizing Perl in text
by Anonymous Monk on Jan 06, 2011 at 17:29 UTC

    I am working on technical documents in general. Man pages happen to be a particular example that I (and you) are quite familiar with. Sure, it is easy to pick the code out of a man page's POD source. Compiled man pages, or man translated to HTML, LaTeX or PDF would not have the POD markers in it any more. But a person can also find Perl in engineering and science articles, and POD won't be present there. We might see text similar to: "The following Perl program will ..." in the text, which at least tells us that Perl is present in the document.

    But I want to take out the jargon (Perl in this instance), so that I can analyze the prose. At a minimum for spelling.

      > Compiled man pages, or man translated to HTML, LaTeX or PDF would not have the POD markers in it any more

      Multiline code in POD has only one markup: indentation!

      Generated man pages mirror this.

      HTML and PDF additionally flag code with a dedicated style/font.

      So style and indentation are IMHO by far the best criteria for your scoring technique.

      And LaTeX... You want to parse LaTeX code? Seriously?

      Cheers Rolf