heuristic to detect (perl) code

LanX has asked for the wisdom of the Perl Monks concerning the following question:

I'm meditating on a regex-based heuristic to roughly detect whether a text paragraph (multiple lines, delimited by '\n\n') is Perl source code rather than normal text.

The best idea I've had so far: use regexes to count the lines ending with ';' or '}', possibly followed by a '#' comment.
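A minimal sketch of that line-ending count in JavaScript, assuming paragraphs are split on blank lines. The function names and the 0.5 threshold are made up for illustration and have not been tuned against real posts.

```javascript
// Hypothetical helper: share of non-blank lines that end in ';' or '}',
// optionally followed by whitespace and a '#' comment.
function codeLineRatio(paragraph) {
  const lines = paragraph.split("\n").filter((line) => line.trim() !== "");
  if (lines.length === 0) return 0;
  const codeEnding = /[;}]\s*(?:#.*)?$/;
  const hits = lines.filter((line) => codeEnding.test(line)).length;
  return hits / lines.length;
}

// Split a post into paragraphs on blank lines and flag the code-like ones.
// The default threshold is an untuned guess.
function flagParagraphs(text, threshold = 0.5) {
  return text.split(/\n{2,}/).map((paragraph) => {
    const ratio = codeLineRatio(paragraph);
    return { paragraph, ratio, looksLikeCode: ratio >= threshold };
  });
}
```

Lines that only open a block (ending in '{') are not counted here, so short snippets dominated by block openers will score low.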

Another is to check the frequency of words starting with a sigil.
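The sigil idea could be sketched like this in JavaScript. The regex and the name `sigilDensity` are assumptions: it counts `$`, `@` or `%` followed by an identifier, and skips sigils glued to a preceding word character so that e-mail addresses are not counted.

```javascript
// Matches $name, @name, %name when not preceded by a word character
// or a backslash (so "user@example.com" and "\$x" are ignored).
const SIGIL_VAR = /(?<![\w\\])[$@%][A-Za-z_]\w*/g;

// Hypothetical metric: sigil-prefixed identifiers per whitespace-separated word.
function sigilDensity(paragraph) {
  const words = paragraph.split(/\s+/).filter(Boolean);
  if (words.length === 0) return 0;
  const sigils = paragraph.match(SIGIL_VAR) || [];
  return sigils.length / words.length;
}
```

Prices such as "$5" and percentages such as "50%" do not match, because the character after the sigil must be a letter or underscore.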

I'm not talking about a valid parser, just a fuzzy detector.

Any better ideas?

One use case could be a JavaScript snippet that checks the contents of a posting in the monastery, warns about missing <code> tags, and offers to insert them.
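A rough sketch of such a checker as plain string functions, without any DOM wiring. Both function names are hypothetical, and the detector is passed in as a parameter, since which heuristic to use is exactly the open question here.

```javascript
// Hypothetical helper: return the paragraphs of a post that sit outside
// <code>...</code> but are flagged by the supplied detector function.
function findUnwrappedCode(postText, looksLikeCode) {
  // Drop everything that is already inside <code> tags.
  const outside = postText.replace(/<code>[\s\S]*?<\/code>/gi, "");
  return outside
    .split(/\n{2,}/)
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph !== "" && looksLikeCode(paragraph));
}

// Build the suggested fix: wrap the first occurrence of each flagged paragraph.
// A function is used as the replacement so that '$' in Perl code is not
// interpreted as a replacement pattern.
function wrapFlagged(postText, flagged) {
  return flagged.reduce(
    (text, paragraph) =>
      text.replace(paragraph, () => `<code>\n${paragraph}\n</code>`),
    postText
  );
}
```

This assumes a flagged paragraph appears verbatim in the post and is not duplicated inside an existing <code> block; a real checker would track positions instead.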

(I'm a bit tired of unreadable posts here, and all the following edit-considerations and replies)

Cheers Rolf

PS: I'm not sure whether this thread belongs in PM-Discussions instead.

Update

Other ideas:

(average) line length: lines of code are shorter than lines of regular text
indentation: text rarely has indented parts
word frequency: statistics should show significantly different keyword frequencies in text and in code
genetic algorithm trained on the archive: download old posts to optimize the mix of different metrics
typical starters: shebang, use strict; ...
Conditional_probability / Naive_Bayes_classifier: combine the results of the different checks
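To illustrate how a few of these checks might be combined naive-Bayes style, here is a small JavaScript sketch. Everything numeric in it is an assumption: the cut-offs (60 characters, 20% indented lines) and the likelihood table are placeholders that would have to be estimated from archived posts.

```javascript
// Turn a paragraph into three boolean features taken from the list above.
function features(paragraph) {
  const lines = paragraph.split("\n").filter((line) => line.trim() !== "");
  const n = lines.length || 1;
  const avgLen = lines.reduce((sum, line) => sum + line.length, 0) / n;
  const indented = lines.filter((line) => /^(?: {2,}|\t)/.test(line)).length / n;
  return {
    shortLines: avgLen < 60, // assumed cut-off
    indented: indented > 0.2, // assumed cut-off
    starter: /^\s*(?:#!.*perl|use\s+(?:strict|warnings)\b)/m.test(paragraph),
  };
}

// Placeholder likelihoods P(feature is true | class). Made-up numbers.
const LIKELIHOOD = {
  shortLines: { code: 0.9, text: 0.3 },
  indented: { code: 0.7, text: 0.05 },
  starter: { code: 0.3, text: 0.01 },
};

// Naive Bayes: treat the features as independent and combine them in log space.
function probabilityCode(paragraph, priorCode = 0.5) {
  const f = features(paragraph);
  let logCode = Math.log(priorCode);
  let logText = Math.log(1 - priorCode);
  for (const [name, p] of Object.entries(LIKELIHOOD)) {
    logCode += Math.log(f[name] ? p.code : 1 - p.code);
    logText += Math.log(f[name] ? p.text : 1 - p.text);
  }
  return 1 / (1 + Math.exp(logText - logCode));
}
```

The independence assumption is clearly wrong for these features (indented paragraphs also tend to have short lines), which is the usual caveat with naive Bayes.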

Interesting links

highlight.js

SyntaxHighlighter.js

naive bayes classification course (Perl)

identify-programming-languages-with-source-classifier (Ruby)

how-to-detect-programming-language-from-a-string (SO)

detecting-programming-language-from-a-snippet (SO)


Replies are listed 'Best First'.
Re: heuristic to detect (perl) code by tobyink (Canon) on Jan 19, 2013 at 11:45 UTC
```perl
use 5.010;
use strict;
use warnings;
use File::Slurp qw(slurp);

# Read this script itself as the sample input.
my $text   = slurp(__FILE__);
my $length = length $text;

# Count the characters that are typical for Perl code.
my $perlish = ($text =~ y(@$%;{}[]<>=~)//);
my $metric  = $perlish / $length;

say "Metric is $metric";
if ($metric > 0.10) {
    say "Looks like code";
}
elsif ($metric < 0.03) {
    say "Looks like text";
}
else {
    say "Debatable";
}
```

I've only tried this on a few sample inputs, but it hasn't failed once. It correctly detects itself. As you can see, it's a very simple metric, so it should be trivial to port to JavaScript or anything else.

`perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'`
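A possible JavaScript port of the metric above, as a sketch. The character set and the 0.10 / 0.03 thresholds are copied from the Perl snippet; the function names are invented here.

```javascript
// Same character list as the y(...)// count in the Perl version.
const PERLISH = /[@$%;{}\[\]<>=~]/g;

// Share of "perlish" punctuation characters in the text.
function perlishMetric(text) {
  if (text.length === 0) return 0;
  const hits = (text.match(PERLISH) || []).length;
  return hits / text.length;
}

// Three-way verdict using the thresholds from the Perl snippet.
function classify(text) {
  const metric = perlishMetric(text);
  if (metric > 0.10) return "code";
  if (metric < 0.03) return "text";
  return "debatable";
}
```

One small difference: JavaScript's `length` counts UTF-16 code units, so text with characters outside the Basic Multilingual Plane would score slightly differently than in Perl.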
Re^2: heuristic to detect (perl) code by LanX (Saint) on Jan 19, 2013 at 12:03 UTC
yep, it's an extended version of my sigil-frequency idea... but it's a good start, THANKS! =)

But IMHO it would be necessary to roughly strip comments, strings, here-docs and __DATA__/__END__ sections first, since adding a few comments already makes your example "debatable".

BTW: I tried your nodelet hack in Re^3: CSS Show and Tell: Colored Code but it didn't work for me... :(

Do you know if CodeMagic provides an API to do code sniffing, or is the logic internal?

Cheers Rolf
Re^3: heuristic to detect (perl) code by 7stud (Deacon) on Jan 20, 2013 at 09:22 UTC
1) Comments:

```perl
# Remove a '#' comment up to the end of the line.
my $count = ($line =~ s/[#] .*? \n//xms);
$total += $count;
```

2) Strings:

```perl
# Remove single-line quoted strings.
$line =~ s/["'] [^'"\n]* ['"]//gxms;
```

3) use statements:

```perl
$count = ($line =~ s/use [^;]+ ;//gxms);
$total += $count;
```

4) Here docs:

```perl
# Remove everything from <<TAG up to the next occurrence of TAG.
my $count = ($all_text =~ s/<<(\w+) .*? \1//gxms);
$total += $count;
```

5) __DATA__, __END__:

```perl
# Drop the marker and everything after it.
$total++ if $all_text =~ s/(__DATA__|__END__) .*//xms;
```

Although, I would argue that if __DATA__ or __END__ appears anywhere in the text, then you couldn't go wrong by declaring then and there that the text has Perl code in it.
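For the JavaScript side, the same stripping steps could look roughly like this. It is a sketch with invented names, not a tokenizer: it will be fooled by '#' inside regexes, `$#array`, multi-line strings and similar. Strings are stripped before comments so that a '#' inside a string does not eat the rest of the line.

```javascript
// Hypothetical helper: remove the parts of a snippet that read like prose
// and count how many Perl-specific constructs were seen on the way.
function stripNoise(source) {
  let hits = 0;
  let text = source;

  // __DATA__ / __END__ at the start of a line: drop it and everything after.
  text = text.replace(/^__(?:DATA|END)__\b[\s\S]*/m, () => {
    hits++;
    return "";
  });

  // Here-docs: <<TAG (optionally quoted) up to a line consisting of TAG.
  // The rest of the opening line (e.g. the ';') is kept.
  text = text.replace(
    /<<["']?(\w+)["']?(.*\n)(?:[\s\S]*?\n)?\1[ \t]*$/gm,
    (match, tag, restOfLine) => {
      hits++;
      return restOfLine;
    }
  );

  // Single-line quoted strings become empty strings.
  text = text.replace(/"[^"\n]*"|'[^'\n]*'/g, () => '""');

  // '#' comments up to the end of the line.
  text = text.replace(/#.*$/gm, () => {
    hits++;
    return "";
  });

  return { text, hits };
}
```

The returned `text` can then be fed to whatever metric is used, while `hits` could serve as additional evidence, in the spirit of the `$total` counter above.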
Re^4: heuristic to detect (perl) code by Anonymous Monk on Jan 20, 2013 at 10:25 UTC
Re^4: heuristic to detect (perl) code by LanX (Saint) on Jan 20, 2013 at 10:46 UTC
Re: heuristic to detect (perl) code by Anonymous Monk on Jan 19, 2013 at 08:40 UTC
Perl::PrereqScanner / scan_prereqs
Perl::MinimumVersion / perlver
codestat - gather code statistics on the command line
Re^2: heuristic to detect (perl) code by LanX (Saint) on Jan 19, 2013 at 08:48 UTC
thanks but you forgot to link to `PPI.js`

Cheers Rolf
Re^3: heuristic to detect (perl) code by Anonymous Monk on Jan 19, 2013 at 09:02 UTC
"thanks but you forgot to link to PPI.js"

PPI is fairly straightforward, s/// is easily converted to .replace, the regexes are the simple variety, it is possible OTOH :)

Re^2: CSS Show and Tell: Colored Code
Re^4: heuristic to detect (perl) code by LanX (Saint) on Jan 19, 2013 at 09:53 UTC