for instance:
1) a comment '#' should (mostly) follow a newline or a semicolon; or, to be more precise, the '#' shouldn't be preceded by a quote-like operator (a lone s, y, tr, q, qq, qr, qw or whatever)
2) strings are closed by the same quote that opened them, so you need to capture the opening one and check the closing one with \1.
3) __DATA__ must appear at the start of a line; OTOH the existence of DATA is already a good indicator for Perl code.
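A rough sketch of the three checks above, written in Python purely for illustration (the regexes port to Perl almost unchanged). The names `COMMENT`, `STRING`, `DATA_TOKEN` and `features` are made up for this example, and the patterns are deliberately crude: the comment check only implements the "follows a newline or semicolon" rule, not the full quote-like-operator rule.

```python
import re

# 1) '#' at the start of a line or right after a semicolon (crude rule;
#    a '#' used as delimiter in s#a#b# mid-line is not matched by this).
COMMENT = re.compile(r"(?:^|;)\s*#", re.MULTILINE)

# 2) A string closed by the same quote that opened it: capture the opening
#    quote and require it again via the backreference \1.
STRING = re.compile(r"""(["'])(?:\\.|(?!\1).)*\1""")

# 3) __DATA__ alone at the start of a line.
DATA_TOKEN = re.compile(r"^__DATA__\s*$", re.MULTILINE)


def features(text):
    """Return which of the three hypothetical indicators fire for `text`."""
    return {
        "comment": bool(COMMENT.search(text)),
        "string": bool(STRING.search(text)),
        "data": bool(DATA_TOKEN.search(text)),
    }
```

For example, `features('my $x = "it\'s"; # set x\n__DATA__\nfoo\n')` fires all three indicators, while a sentence like `Issue #42 was closed` fires none. Each check on its own is easy to fool, which is why they are only useful as inputs to a combined score.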
I think discussing single strategies is in vain; in the end you have to test and train different criteria against a suitably large number of PerlMonks posts, to see if the code sections are found.
A Bayes classifier gives you a very good mathematical method to combine the probabilities of such criteria.
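To make the combining step concrete, here is a minimal sketch assuming a *naive* Bayes model, i.e. the indicators are treated as independent given the class (my simplification; real indicators are correlated). The likelihood numbers are invented placeholders, not measured on any posts; in practice they would be estimated from a labelled training set. `LIKELIHOODS` and `posterior_code` are hypothetical names.

```python
import math

# HYPOTHETICAL values: (P(indicator fires | code), P(indicator fires | prose)).
# These are placeholders and would have to be trained on real posts.
LIKELIHOODS = {
    "comment": (0.40, 0.02),
    "string": (0.70, 0.20),
    "data": (0.05, 0.001),
}


def posterior_code(features, prior=0.5, likelihoods=LIKELIHOODS):
    """Combine indicator flags into P(code | indicators) via log-odds.

    `features` maps indicator name -> bool (fired or not).
    """
    log_odds = math.log(prior / (1.0 - prior))
    for name, fired in features.items():
        p_code, p_prose = likelihoods[name]
        if not fired:
            # Absence of an indicator is evidence too.
            p_code, p_prose = 1.0 - p_code, 1.0 - p_prose
        log_odds += math.log(p_code / p_prose)
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))
```

With these placeholder numbers, a single fired "comment" indicator and a 50/50 prior gives odds of 0.40/0.02 = 20, i.e. a posterior of 20/21. Working in log-odds keeps the arithmetic stable when many weak indicators are multiplied together.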
Some of the products I listed in the OP use this approach; they are just not trained for PerlMonks posts (where tiny code snippets also appear inside the text) and maybe have too heavy a footprint to be integrated here.
For instance, highlight.js has a function (highlightAuto) whose result includes the guessed language.
Cheers Rolf
In reply to Re^4: heuristic to detect (perl) code by LanX, in thread heuristic to detect (perl) code by LanX