
Recognizing Perl in text

by Anonymous Monk
on Jan 06, 2011 at 02:20 UTC ( [id://880735]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

If a person were to analyze a man page, could they pick out which lines are Perl and which aren't, knowing that Perl is likely? I've run across this problem before and, other than thinking about it, did nothing. I've now started to play with something. While some lines score the same as ordinary text (typically < 1), most lines seem to score above that.

    sub perl_score {
        my $line = shift;
        my $score = 0;
        if( $line =~ m| #!\s*/usr/bin/perl| ) {
            if( length( $line ) > 40 ) {
                # In line of text, probably. Maybe see how close
                # to beginning of line, and if comment after?
                # $score = 0;
            }
            elsif( length( $line ) > 20 ) {
                $score = 1;
            }
            else {
                $score = 2;
            }
            return $score;
        }
        my @vars = split(/[-\s\(\)\{\}\[\]\<\>=\'\"\~]+/, $line );
        foreach my $v (@vars) {
            $score++ if( $v =~ /^[\$\@\%]/ );
            $score++ if( $v =~ /[a-zA-Z]+::[a-zA-Z]+/ );
        }
        my $t = 0;
        my @chars = qw|( ) { } [ ] = ;|;
        foreach my $c (@chars) {
            my $qc = quotemeta( $c );
            my @tmp = split(/$qc/, $line );
            my $n = $#tmp;
            $n = $n < 0 ? 0 : $n;
            $t += $n;
        }
        $score += $t / 2; #?
        $score += 2 if( $line =~ /[=!]~/ );
        # Look for reserved words?
        return $score;
    }

Replies are listed 'Best First'.
Re: Recognizing Perl in text
by LanX (Saint) on Jan 06, 2011 at 03:20 UTC
    I'd say all extracted lines which pass a syntax check with perl -c without error are by definition Perl code.

    (I suppose man pages don't include syntactically wrong code).

    Anyway you'll run into security issues, because avoiding the execution of BEGIN blocks isn't trivial.

    (see Vulnerabilities when editing untrusted code... (Komodo) and especially 847484)

    Another option would be to use PPI for a static parse.

    Cheers Rolf

    UPDATE: When you say "analyze a man page", do you mean one generated from POD (as IMHO >90% of all man pages about Perl should be)? POD has a pretty clear convention to signal embedded code by indentation.

    The best solution anyway would be to parse the original POD instead of the man page. Some POD parsers (like Pod::POM) already allow recognizing code sections (which are not necessarily Perl code, though).

    If you now run a syntax check on those extracted chunks you should get almost 100% reliable results.
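The indentation convention mentioned above can be sketched in a few lines of core Perl. This is only a rough illustration, not a replacement for a real POD parser: the helper name verbatim_paragraphs and the sample POD are made up for the example, and it assumes the input is pure POD whose paragraphs are separated by blank lines, with verbatim paragraphs starting with a space or tab.

```perl
use strict;
use warnings;

# Hypothetical helper: return the verbatim (indented) paragraphs of a POD
# string. Assumes the whole input is POD and that paragraphs are separated
# by one or more blank lines.
sub verbatim_paragraphs {
    my ($pod) = @_;
    my @found;
    for my $para ( split /\n(?:[ \t]*\n)+/, $pod ) {
        # A verbatim paragraph is one whose first character is a space or tab.
        push @found, $para if $para =~ /\A[ \t]/;
    }
    return @found;
}

# Made-up sample document for the demonstration.
my $pod = <<'END_POD';
=head1 SYNOPSIS

Create an object and call a method:

    my $obj = Foo::Bar->new;
    $obj->frobnicate(42);

Then check the result.

=cut
END_POD

print "$_\n---\n" for verbatim_paragraphs($pod);
```

The extracted chunks are only candidates for code; as noted, a verbatim paragraph need not contain Perl.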

      > I'd say all extracted lines which pass a syntax check with perl -c without error are per definition perl code.

      What language is that? It could be perl, but it could also be PHP, and I expect a few other things. What language it is may perhaps be divined from the surrounding lines of code - it's certainly more likely to be perl if the lines before and after also match your rule.
        > It could be perl, but it could also be PHP

        so what?

        Of course it is Perl and PHP and whatever else, just as the words "name", "hand" and "finger" are German AND English (well, modulo capitalization).

        The OP didn't ask about distinguishing Perl from other languages, but from text.

        Cheers Rolf

      I think a problem with the syntax check idea is, what exactly would you feed to perl -c ?

      If you do it line by line, and have a code snippet like this

          for my $foo (@foo) {
              for my $bar (@$foo) {
                  push @{ $self->{results} },
                      {
                          baz => foo( $bar->{baz},
                                      $bar->{quux}[1] )
                      };
              }
          }

      not a single line (on its own) would pass a syntax check, while taken as a whole, the snippet is perfectly valid Perl code.

      Of course, you could try to work around that problem by passing multiline snippets to the syntax checks, but then the number of possible combinations is going to explode rather soon, even for moderate file sizes...  So you'd at least need some additional heuristic to identify likely beginnings of code sections, or some such, in order to make this approach feasible in practice.

        With a clever strategy it's possible to significantly limit the number of possible chunks to check!

        Simply start checking the most indented line and successively add surrounding lines.

            for my $foo (@foo) {                        # 8 fails
                for my $bar (@$foo) {                   # 6 fails
                    push @{ $self->{results} },         # 5 works
                        {                               # 3 fails
                            baz => foo( $bar->{baz},    # 2 works
                                        $bar->{quux}[1] )   # 1 fails
                        };                              # 4 works
                }                                       # 7 works
            }                                           # 9 works

        This way the overhead for identifying n lines of code is (statistically) at most linear!

        UPDATE: And it's still possible to rely on the existence of trailing semicolons or braces before running a syntax check.
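The grow-from-the-deepest-line strategy can be sketched roughly as follows. Note the assumptions: the checker here is a simple bracket-balance test standing in for perl -c (so no untrusted code is compiled), and the names looks_balanced, indent_of and grow_chunks are invented for the example. The checker is passed in as a code reference, so a real syntax check could be substituted.

```perl
use strict;
use warnings;

# Stand-in for `perl -c` (an assumption for illustration): true if (), {}
# and [] are balanced and properly nested. Much weaker than a syntax check.
sub looks_balanced {
    my ($text) = @_;
    my %open_for = ( ')' => '(', '}' => '{', ']' => '[' );
    my @stack;
    for my $ch ( $text =~ /[(){}\[\]]/g ) {
        if ( $ch =~ /[({\[]/ ) {
            push @stack, $ch;
        }
        else {
            return 0 unless @stack && pop(@stack) eq $open_for{$ch};
        }
    }
    return @stack ? 0 : 1;
}

# Width of the leading whitespace of a line.
sub indent_of {
    my ($line) = @_;
    my ($ws) = $line =~ /^(\s*)/;
    return length $ws;
}

# Start at the most indented line and add one neighbouring line per step,
# preferring the neighbour with the deeper indentation. Returns the
# [first, last] index pairs of every chunk that passed the check.
sub grow_chunks {
    my ( $check, @lines ) = @_;
    my ($start) = sort { indent_of( $lines[$b] ) <=> indent_of( $lines[$a] ) || $a <=> $b }
        0 .. $#lines;
    my ( $lo, $hi ) = ( $start, $start );
    my @passing;
    while (1) {
        push @passing, [ $lo, $hi ] if $check->( join "\n", @lines[ $lo .. $hi ] );
        last if $lo == 0 && $hi == $#lines;
        my $up   = $lo > 0       ? indent_of( $lines[ $lo - 1 ] ) : -1;
        my $down = $hi < $#lines ? indent_of( $lines[ $hi + 1 ] ) : -1;
        if ( $up >= $down ) { $lo-- } else { $hi++ }
    }
    return @passing;
}

my $snippet = <<'END_SNIPPET';
for my $foo (@foo) {
    for my $bar (@$foo) {
        push @{ $self->{results} },
            {
                baz => foo( $bar->{baz},
                            $bar->{quux}[1] )
            };
    }
}
END_SNIPPET

my @lines   = split /\n/, $snippet;
my @passing = grow_chunks( \&looks_balanced, @lines );
printf "lines %d-%d pass\n", $_->[0] + 1, $_->[1] + 1 for @passing;
```

Each step adds exactly one line, so n lines need at most n checks, which is where the linear bound comes from.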

        Cheers Rolf

        Sure, but that's why I added the update about the indentation convention.

        Do you know any man pages with Perl code that don't originate from POD? I don't...

        And I agree with Marshall who recommended scanning for trailing /;\s*$/ or /;\s*#.*$/ for a pretty good weighting heuristic.

        Cheers Rolf

      I am working on technical documents in general. Man pages happen to be a particular example that I (and you) are quite familiar with. Sure, the source of the man page as POD is easy to pick these out from. Compiled man pages, or man translated to HTML, LaTeX or PDF would not have the POD markers in it any more. But a person can also find Perl in engineering and science articles, and POD won't be present there. We might see text similar to: "The following Perl program will ..." in the text, which at least tells us that Perl is present in the document.

      But I want to take out the jargon (Perl in this instance), so that I can analyze the prose. At a minimum for spelling.

        > Compiled man pages, or man translated to HTML, LaTeX or PDF would not have the POD markers in it any more

        Multiline code in POD has only one markup: indentation!

        Generated man pages mirror this.

        HTML and PDF additionally flag code with a dedicated style/font.

        So style and indentation are IMHO the by far best criteria for your scoring technique.

        And LaTeX ...You want to parse LaTeX-Code? Seriously?

        Cheers Rolf

Re: Recognizing Perl in text
by Marshall (Canon) on Jan 06, 2011 at 06:11 UTC
    If you are looking for code in general, I would count ";" and give ";\n" a very, very high ranking. That is very unusual for normal English text, and just a couple of those is a "red flag" for not plain text. I would guess that two lines like that is a score of 99.999999. For Perl, look for Perl-style comments, perhaps ';' followed by '#.*\n'. Not sure what you are wanting.

    Detecting that there is some significant code (5+lines or so) on the page should be easy. Guessing that Perl is part of it probably also (as opposed to C or Java). Extracting the exact lines that are Perl is going to be pretty darn hard.
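A rough sketch of the trailing-semicolon heuristic, assuming a hypothetical helper code_line_ratio that reports the fraction of non-blank lines ending in ';' (optionally followed by a '#' comment). The sample strings are made up, and any threshold for calling a page "code" would have to be chosen by experiment.

```perl
use strict;
use warnings;

# Hypothetical scorer: fraction of non-blank lines that end in ';',
# optionally followed by whitespace and a '#' comment.
sub code_line_ratio {
    my ($text) = @_;
    my @lines = grep { /\S/ } split /\n/, $text;
    return 0 unless @lines;
    my $hits = grep { /;\s*(?:#.*)?$/ } @lines;
    return $hits / @lines;
}

# Made-up samples: a semicolon in the middle of a sentence does not count.
my $prose = "This is a sentence.\nAnother one; with a semicolon inside.\n";
my $code  = "my \$x = 1;\nprint \$x; # show it\n";

printf "prose: %.2f\n", code_line_ratio($prose);
printf "code:  %.2f\n", code_line_ratio($code);
```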

Re: Recognizing Perl in text
by chrestomanci (Priest) on Jan 06, 2011 at 10:11 UTC

    Perhaps you could use a Bayesian classifier to identify the perl.

    I would take a look at Algorithm::NaiveBayes, and train it with loads of perl (from CPAN), and English (Wikipedia?), and then see how good it is at recognising perl.
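To give a feel for the idea without depending on a CPAN install, here is a tiny hand-rolled naive Bayes classifier in core Perl. It is a stand-in for illustration, not the Algorithm::NaiveBayes API: the function names, the choice of single characters as features, and the eight training lines are all assumptions, and a serious attempt would train on far more text, as suggested above.

```perl
use strict;
use warnings;

my %count;    # $count{$label}{$feature} = occurrences seen in training
my %total;    # $total{$label} = number of features seen for that label
my %docs;     # $docs{$label}  = number of training lines for that label
my %vocab;    # every feature seen under any label

# Features are the non-space characters of a line (a deliberate simplification).
sub features {
    my ($line) = @_;
    return grep { /\S/ } split //, $line;
}

sub train {
    my ( $label, @lines ) = @_;
    for my $line (@lines) {
        $docs{$label}++;
        for my $f ( features($line) ) {
            $count{$label}{$f}++;
            $total{$label}++;
            $vocab{$f} = 1;
        }
    }
}

# Return the label with the highest log-probability, with add-one smoothing.
sub classify {
    my ($line) = @_;
    my $n_docs = 0;
    $n_docs += $_ for values %docs;
    my $v = keys %vocab;
    my ( $best, $best_lp );
    for my $label ( sort keys %docs ) {
        my $lp = log( $docs{$label} / $n_docs );
        for my $f ( features($line) ) {
            my $c = $count{$label}{$f} // 0;
            $lp += log( ( $c + 1 ) / ( $total{$label} + $v ) );
        }
        ( $best, $best_lp ) = ( $label, $lp )
            if !defined $best_lp || $lp > $best_lp;
    }
    return $best;
}

# Toy training data, far too small for real use.
train( perl =>
    'my $x = shift;',
    'foreach my $v (@vars) { $score++; }',
    'push @{ $self->{results} }, $bar->{baz};',
    'return $score;',
);
train( english =>
    'The following program reads a file and prints each line',
    'Man pages are usually generated from POD',
    'I have run across this problem before',
    'most lines seem to score above that',
);

print classify('$n = $#tmp; $t += $n;'),               "\n";
print classify('this is ordinary text about nothing'), "\n";
```

Character features mostly pick up on punctuation density, so this sketch separates "code-like" from "prose-like" lines rather than Perl from other languages.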

Re: Recognizing Perl in text
by Your Mother (Archbishop) on Jan 06, 2011 at 15:12 UTC

    This is from a memory of a discussion I couldn't find and this is just an outline but I think it'll work and be safe (someone will surely shoot that down if it's wrong).

    # Read the file.
    # Inject 'BEGIN { exit }' at the beginning.
    # Run 'perl -ce' on the text.
    # Catch the output (eg, IO::CaptureOutput, Capture::Tiny).
    # Look for "syntax OK"

    As mentioned, there are a few things that might not actually be Perl that'll parse fine as Perl.

      # Inject 'BEGIN { exit }' at the beginning. ...

      But wouldn't that make any gibberish pass with "syntax OK"?

      $ echo 'BEGIN { exit }' | cat - perl-5.12.2.tar.gz >foo
      $ perl -c foo
      foo syntax OK

      Didn't know Perl is written in Perl ;)

        I think you're right! Sorry for that... you could do it without the block but it becomes less "safe" and side-effect free.

Re: Recognizing Perl in text
by sundialsvc4 (Abbot) on Jan 06, 2011 at 16:47 UTC

    Look for the “idioms” of the language, such as keywords like sub and the presence of identifiers beginning with $, @, %, $$, ->.   Score the text in favor of the various candidate languages you think might be there, giving various weights to the idioms that you see, and take the highest candidate score.

    Some idioms are just-about “show stoppers.”   For example, the presence of   <?...?>   pretty well screams   (ick...)   “PHP.”
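The weighted-idiom approach might look something like this minimal sketch. The idiom tables, weights, and function names (score_languages, best_language) are illustrative guesses rather than tuned values; the heavy weight on the PHP tags reflects the "show stopper" idea.

```perl
use strict;
use warnings;

# Hypothetical idiom tables: [ pattern, weight ] pairs per candidate language.
my %idioms = (
    perl => [
        [ qr/\bsub\s+\w+/, 2 ],    # named subroutine
        [ qr/[\$\@%]\w+/,  1 ],    # sigil followed by an identifier
        [ qr/->/,          1 ],    # arrow operator
        [ qr/[=!]~/,       2 ],    # binding operators
    ],
    php => [
        [ qr/<\?(?:php)?/, 5 ],    # opening tag, a near "show stopper"
        [ qr/\?>/,         5 ],    # closing tag
        [ qr/\$\w+/,       1 ],    # dollar variable
    ],
);

# Sum weight * number of matches for every idiom of every language.
sub score_languages {
    my ($text) = @_;
    my %score;
    for my $lang ( keys %idioms ) {
        $score{$lang} = 0;
        for my $idiom ( @{ $idioms{$lang} } ) {
            my ( $re, $weight ) = @$idiom;
            my $hits = () = $text =~ /$re/g;
            $score{$lang} += $hits * $weight;
        }
    }
    return %score;
}

# Highest-scoring language, or 'text' when no idiom matched at all.
sub best_language {
    my ($text) = @_;
    my %score = score_languages($text);
    my ($best) = sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
    return $score{$best} > 0 ? $best : 'text';
}

print best_language('sub foo { my $x = shift; $x->bar; }'), "\n";
print best_language('<?php echo $name; ?>'),                "\n";
print best_language('Just a plain sentence.'),              "\n";
```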

      That is in part what I was doing here. Some "other" languages:

      2H_2 + O_2 -> 2H_2O
      Ca-48 + n -> g + Ca-49 -> B- + g + Sc-49 -> B- + g + Ti-49
      X = R^{-1} D R

      But the object is to separate the human written language (English) from something else. Perl happens to be a good example of something else, when the document is a man page. Ideally, the documents wouldn't have content of the above examples, but rather an instruction to import that content from somewhere. Skipping over the import instruction would be easy. But people will manually enter stuff like the above anyway.

Re: Recognizing Perl in text
by planetscape (Chancellor) on Jan 07, 2011 at 04:00 UTC

Approved by ww
Front-paged by LanX