how to guess the kind of spoken or programming language a text is written in

leocharre has asked for the wisdom of the Perl Monks concerning the following question:

Is there an existing system, or set of modules, to detect what kind of code (if any) a chunk of text is?

For example, I read in a file, and I want to check its syntax and figure out whether it's HTML, JavaScript, Perl, C, etc.

The closest thing I saw on CPAN was HTML::CGIChecker, which detects certain tags in text. It seems like a great app that I might use sometime, but it's not what I am thinking of.

The ideal program would take a chunk of text, and guess what it is.

Also, this program could be fed different languages, so you could detect English or French; really it would be exactly the same procedure.

It seems to me that an endless set of heuristics is needed. That is, the program has to be fed tons of code and told what each sample is. Then, with that data, it can determine that the text being analyzed is, say, 15% Perl, 10% JavaScript and 75% unknown (and therefore clearly plain text).

It reminds me a lot of SpamAssassin. Perhaps it does something similar.

Is this already made? Would it have more than academic value? Is the task much hairier than my little mind can glimpse?


Replies are listed 'Best First'.
Re: how to guess the kind of spoken or programming language a text is written in by planetscape (Chancellor) on Aug 12, 2006 at 00:47 UTC
The following nodes may contain helpful suggestions: Idiom guessing script NLP - natural language regex-collections? Constructive criticism of a dictionary / text comparison script Natural language text processing Text Analysis Tools to compare Slinker and Stinker? Natural Language Index Stemming Status of English modules... How good is Perl for AI? English/Language/Grammer Perl NLP HTH, planetscape
Re: how to guess the kind of spoken or programming language a text is written in by Zaxo (Archbishop) on Aug 11, 2006 at 22:42 UTC
You can pick an offset and count the positions where a character equals the character that many places further on. Make the offset large enough and the count tends to converge to some percentage. Natural languages show much higher counts than would be expected from random strings of characters. The percentage is characteristic of a language and can be used to identify it. Kahn's popular cryptography book covers this nicely, along with the work of Kasiski and of Friedman. Computer languages are complicated by having an unlimited vocabulary of made-up words in function and variable names. Better programming, as all perlers know, is more like natural language ;-)) After Compline, Zaxo
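The offset-counting idea above can be sketched in a few lines. This is only an illustration, not Zaxo's code or anything from Kahn's book: the sub names `coincidence_rate` and `mean_coincidence` are made up for this example, and stripping everything except the ASCII letters a-z is an assumption that would need changing for accented text.

```perl
use strict;
use warnings;

# Fraction of positions i where the letter at i equals the letter at
# i + $offset. Hypothetical helper, not taken from any CPAN module.
# Assumption: only ASCII letters count; case and everything else is ignored.
sub coincidence_rate {
    my ($text, $offset) = @_;
    (my $letters = lc $text) =~ s/[^a-z]//g;
    my @chars = split //, $letters;
    my $pairs = @chars - $offset;          # number of positions we can compare
    return 0 if $offset < 1 || $pairs < 1;
    my $hits = 0;
    for my $i (0 .. $pairs - 1) {
        $hits++ if $chars[$i] eq $chars[$i + $offset];
    }
    return $hits / $pairs;
}

# Average the rate over several offsets to smooth out noise.
sub mean_coincidence {
    my ($text, @offsets) = @_;
    return 0 unless @offsets;
    my $sum = 0;
    $sum += coincidence_rate($text, $_) for @offsets;
    return $sum / @offsets;
}

# Small demonstration; a sample this short gives a very noisy estimate.
my $sample = "Natural languages show much higher counts than random strings";
printf "offset %d: %.3f\n", $_, coincidence_rate($sample, $_) for 1 .. 5;
```

For comparison, uniformly random letters from a 26-letter alphabet coincide about 1/26 of the time (roughly 0.038), while long English text comes out near 0.066. Telling languages apart this way needs far more text than the sample above.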
Re: how to guess the kind of spoken or programming language a text is written in by snowhare (Friar) on Aug 12, 2006 at 01:33 UTC
What you want is N-gram analysis. Take a look at the CPAN Search Site and you will find a whole host of modules to help you.
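To make "N-gram analysis" concrete, here is a minimal sketch that builds character trigram counts for a few labelled samples and picks the label whose profile is most similar to the input. It is not the API of any CPAN module: the sub names (`ngram_counts`, `cosine`, `guess_language`), the choice of cosine similarity as the comparison, and the two toy training strings are all assumptions made for illustration. A real profile would be trained on far more text.

```perl
use strict;
use warnings;

# Count overlapping character n-grams (default: trigrams) in a text.
sub ngram_counts {
    my ($text, $n) = @_;
    $n ||= 3;
    my $s = lc $text;
    $s =~ s/\s+/ /g;                        # collapse whitespace runs
    my %count;
    $count{ substr($s, $_, $n) }++ for 0 .. length($s) - $n;
    return \%count;
}

# Cosine similarity between two n-gram count hashes:
# 0 means no n-grams in common, 1 means identical proportions.
sub cosine {
    my ($p, $q) = @_;
    my ($dot, $pp, $qq) = (0, 0, 0);
    for my $g (keys %$p) {
        $pp  += $p->{$g} ** 2;
        $dot += $p->{$g} * $q->{$g} if exists $q->{$g};
    }
    $qq += $_ ** 2 for values %$q;
    return 0 unless $pp && $qq;
    return $dot / sqrt($pp * $qq);
}

# Score $text against every profile. Returns the best label in scalar
# context, or (best label, hashref of all scores) in list context.
sub guess_language {
    my ($text, $profiles) = @_;
    my $sample = ngram_counts($text);
    my %score  = map { $_ => cosine($sample, $profiles->{$_}) } keys %$profiles;
    my ($best) = sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
    return wantarray ? ($best, \%score) : $best;
}

# Toy training data (assumption: real use needs much larger samples).
my %profiles = (
    english => ngram_counts("the quick brown fox jumps over the lazy dog and then the cat sat on the mat"),
    perl    => ngram_counts('my $x = shift; foreach my $k (keys %h) { print "$k => $h{$k}\n"; }'),
);

my ($lang, $scores) = guess_language('foreach my $k (keys %h)', \%profiles);
printf "%-8s %.3f\n", $_, $scores->{$_} for sort keys %$scores;
print "best guess: $lang\n";
```

The per-label scores could be normalised into the kind of percentage breakdown the original question asks for, and a low best score could be reported as "unknown".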
Re^2: how to guess the kind of spoken or programming language a text is written in by perlfan (Parson) on Aug 12, 2006 at 13:18 UTC
Speaking of which, the Google Research Blog has a recent, relevant post about this.
Re: how to guess the kind of spoken or programming language a text is written in by rhesa (Vicar) on Aug 12, 2006 at 00:05 UTC
See http://blog.sykosopp.com/wp-content/rulethemall.txt, and good luck ;^) Some more polyglots here.
Re: how to guess the kind of spoken or programming language a text is written in by Anonymous Monk on Aug 12, 2006 at 00:21 UTC
What's wrong with the 'file' command? There's an interface for it here if it's really needed: http://search.cpan.org/~pmison/File-Type-0.22/lib/File/Type.pm
Re^2: how to guess the kind of spoken or programming language a text is written in by leocharre (Priest) on Aug 12, 2006 at 07:07 UTC
You mean the Unix 'file' command? Whoa, I just tried it. Oh my crackers. You guys have got to check this out if you don't know it. Snap up a terminal, run `file yourtext.txt`, and put different things in there: some C code, some JavaScript (which by default snaps to C++), HTML, etc. I had no clue little ol' "file" did that; I thought it just read a MIME type off a header somewhere. It's close. Still, it makes an overall deduction, or guess. I'm not sure whether it will accept a true mix of code. It's insanely sexy. If you have a text file with long lines, it will tell you 'english, really long lines'. I'm gonna pick at that some more. You rock, oh Anonymous Monk. Looking through the magic numbers file... wow, a lot of work went into this.
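For anyone who wants that same guess from inside a script, a minimal sketch follows. It assumes a `file` binary is on the PATH and uses its `-b` (brief) option to drop the leading filename from the output. The sub name `guess_with_file` is made up for this example, and the wording of the returned description depends entirely on the installed version of `file` and its magic database.

```perl
use strict;
use warnings;

# Ask the system's file(1) utility for its guess about $path.
# Returns the description string, or undef if file(1) is missing,
# cannot be run, or prints nothing. Hypothetical helper name.
sub guess_with_file {
    my ($path) = @_;
    # List-form pipe open: the path is never passed through a shell.
    # Caveat: a path beginning with '-' would be read as an option.
    open(my $fh, '-|', 'file', '-b', $path) or return;
    my $guess = <$fh>;
    close $fh;
    return unless defined $guess;
    chomp $guess;
    return length $guess ? $guess : undef;
}

# Usage: perl guess.pl some.txt other.c
for my $path (@ARGV) {
    my $guess = guess_with_file($path);
    printf "%s: %s\n", $path,
        defined $guess ? $guess : 'no guess (is file(1) installed?)';
}
```

As noted above, `file` gives one overall verdict per file, so a true mix of languages in one file would still need something like per-chunk scoring.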
Re: how to guess the kind of spoken or programming language a text is written in by perlfan (Parson) on Aug 12, 2006 at 00:00 UTC
That is actually a very interesting problem, and it might serve you well to experiment. I suppose it would work best with highly structured languages, but then again it would be easily defeated by the likes of Perl, unless of course you train it with enough code of sufficient kwalitee.
Re: how to guess the kind of spoken or programming language a text is written in by artist (Parson) on Aug 12, 2006 at 10:31 UTC
Maybe you could use AI::Categorizer or similar modules along with other methods. --Artist
Re: how to guess the kind of spoken or programming language a text is written in by mobby_6kl (Novice) on Aug 13, 2006 at 01:11 UTC
Depends on how good you want the guess to be. For example, something like this might work: `my @lang = qw/perl c pascal basic english german french/; print $lang[rand @lang];`