leocharre has asked for the wisdom of the Perl Monks concerning the following question:

Is there an existing system, or a set of modules, that detects what kind of code (if any) a chunk of text is?

For example, I read in a file and want to check its syntax and figure out whether it's HTML, JavaScript, Perl, C, etc.

The closest thing I saw on CPAN was HTML::CGIChecker, which detects certain tags in text. It seems like a great app that I want to use sometime, but it's not what I am thinking of.

The ideal program would take a chunk of text, and guess what it is.

This program could also be fed different natural languages, so you could detect English or French; really, it would be exactly the same procedure.

It seems to me an endless set of heuristics is needed. That is, the program has to be fed tons of code and told what each sample is. Then, with that data, it can determine that the analyzed text is, say, 15% Perl, 10% JavaScript and 75% unknown (and therefore clearly plain text).

It reminds me a lot of SpamAssassin. Perhaps it does something similar.

Has this already been made? Would it have more than academic value? Is the task much hairier than my little mind can glimpse?

Replies are listed 'Best First'.
Re: how to guess the kind of spoken or programming language a text is written in
by planetscape (Chancellor) on Aug 12, 2006 at 00:47 UTC
Re: how to guess the kind of spoken or programming language a text is written in
by Zaxo (Archbishop) on Aug 11, 2006 at 22:42 UTC

    You can pick an offset, slide the text along itself by that many characters, and count the positions where the two characters are equal. Make the offset large enough and the fraction of matches tends to converge to some percentage.

    Natural languages show much higher counts than would be expected from random strings of characters. The percentage is characteristic of a language and can be used to identify it.
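
    The counting described above can be sketched roughly as follows. This is only an illustration, not anyone's published method: the name coincidence_rate is made up for this example, and reducing the text to lowercase letters a-z first is an assumption so that spaces and punctuation do not dominate the count. The commonly quoted figures are about 0.038 (1/26) for uniformly random letters and roughly 0.067 for English.

```perl
use strict;
use warnings;

# Fraction of positions where a character equals the one $offset places later.
# (Hypothetical helper name; the a-z normalisation is an assumption.)
sub coincidence_rate {
    my ($text, $offset) = @_;

    # Keep lowercase letters only so spacing and punctuation do not skew the count.
    (my $clean = lc $text) =~ s/[^a-z]//g;

    my $pairs = length($clean) - $offset;
    return 0 if $offset < 1 or $pairs < 1;

    my $hits = 0;
    for my $i (0 .. $pairs - 1) {
        $hits++ if substr($clean, $i, 1) eq substr($clean, $i + $offset, 1);
    }
    return $hits / $pairs;
}

# A very small sample; real texts should be much longer for a stable figure.
my $sample = 'It seems to me an endless set of heuristics is needed to guess '
           . 'what kind of language a chunk of text is written in.';
printf "offset %2d: %.3f\n", $_, coincidence_rate($sample, $_) for 1 .. 5;
```

    Averaging the rate over several offsets, on a sample of a few thousand characters, should give a steadier number to compare against per-language reference values.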

    Kahn's popular cryptography book covers this nicely, along with the work of Kasiski and of Friedman.

    Computer languages are complicated by having an unlimited vocabulary of made-up words in the function and variable names. Better programming, as all perlers know, is more like natural language ;-))

    After Compline,
    Zaxo

Re: how to guess the kind of spoken or programming language a text is written in
by snowhare (Friar) on Aug 12, 2006 at 01:33 UTC
    What you want is N-Gram analysis. Go take a look at the CPAN Search Site and you will find a whole host of modules to help you.
      Speaking of which, the Google Research Blog has a recent, relevant post about this.
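
      As a rough sketch of the N-gram idea suggested above, one could build a character-trigram profile per known language and pick the profile closest to the unknown text. Everything here is an assumption made for illustration: the helper names (trigram_profile, similarity, guess_language), the choice of trigrams, the use of cosine similarity, and the one-line training samples. It does not use any particular CPAN module.

```perl
use strict;
use warnings;

# Count overlapping character trigrams in a lowercased, whitespace-squeezed string.
sub trigram_profile {
    my ($text) = @_;
    (my $t = lc $text) =~ s/\s+/ /g;
    my %count;
    $count{ substr($t, $_, 3) }++ for 0 .. length($t) - 3;
    return \%count;
}

# Cosine similarity between two profiles: 0 = nothing shared, 1 = same proportions.
sub similarity {
    my ($p, $q) = @_;
    my ($dot, $pp, $qq) = (0, 0, 0);
    $dot += $p->{$_} * ($q->{$_} // 0) for keys %$p;
    $pp  += $_ ** 2 for values %$p;
    $qq  += $_ ** 2 for values %$q;
    return 0 unless $pp && $qq;
    return $dot / sqrt($pp * $qq);
}

# Return the label whose training profile is closest to the text.
sub guess_language {
    my ($text, $profiles) = @_;
    my $target = trigram_profile($text);
    my ($best, $best_score) = (undef, -1);
    for my $label (sort keys %$profiles) {
        my $score = similarity($target, $profiles->{$label});
        ($best, $best_score) = ($label, $score) if $score > $best_score;
    }
    return $best;
}

# Tiny made-up training samples; real use needs far more text per label.
my %profiles = (
    english => trigram_profile('the quick brown fox jumps over the lazy dog and then the cat sat on the mat'),
    french  => trigram_profile('le renard brun saute par dessus le chien paresseux et le chat dort sur le tapis'),
    perl    => trigram_profile('my $x = shift; foreach my $k (keys %h) { print "$k => $h{$k}\n"; } use strict;'),
);

print guess_language('the dog and the cat', \%profiles), "\n";
```

      Because source code and prose are both just labelled text here, the same routine handles either, which matches the original question's point that the procedure would be the same for English, French or Perl.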
Re: how to guess the kind of spoken or programming language a text is written in
by rhesa (Vicar) on Aug 12, 2006 at 00:05 UTC
Re: how to guess the kind of spoken or programming language a text is written in
by Anonymous Monk on Aug 12, 2006 at 00:21 UTC
    What's wrong with the 'file' command? There is an interface for it here if it's really needed: http://search.cpan.org/~pmison/File-Type-0.22/lib/File/Type.pm

      You mean the Unix 'file' command? Whoa, I just tried it. Oh my crackers. You guys have got to check this out if you don't know it. Open a terminal and run # file yourtext.txt ; then put different things in there: some C code, some JavaScript (which by default gets reported as C++), HTML, etc. I had no clue little ol' "file" did that; I thought it just read the MIME type off a header somewhere.

      It's close. Still, it makes an overall deduction, or guess. I'm not sure whether it will accept a true mix of code. It's insanely sexy: if you have a text file with long lines, it will tell you 'english , really long lines'. I'm gonna pick at that some more. You rock, oh Anonymous Monk.

      Looking through the magic numbers file... wow, a lot of work went into this.
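
      For calling the 'file' command discussed above from a script, a minimal sketch might look like this. It assumes a Unix-like system with file(1) on the PATH; the helper name describe_with_file and the sample HTML are made up for illustration. The -b option asks file for the brief description without the filename prefix.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Ask the system file(1) utility for a brief description of a path.
# Returns undef if the command cannot be run or prints nothing.
# (Hypothetical helper; assumes file(1) is installed.)
sub describe_with_file {
    my ($path) = @_;

    # List-form pipe open bypasses the shell, so odd characters in $path are safe.
    open(my $fh, '-|', 'file', '-b', $path) or return;
    my $description = <$fh>;
    close $fh;

    return unless defined $description;
    chomp $description;
    return length $description ? $description : undef;
}

# Write a small sample to a temporary file and see what file(1) makes of it.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "<html><head><title>t</title></head><body><p>hi</p></body></html>\n";
close $tmp_fh;

my $guess = describe_with_file($tmp_name);
print defined $guess ? "$guess\n" : "file(1) is not available here\n";
```

      As noted above, this yields one overall guess per file, so a true mix of languages in one file would still need something finer-grained.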

Re: how to guess the kind of spoken or programming language a text is written in
by perlfan (Parson) on Aug 12, 2006 at 00:00 UTC
    That is actually a very interesting problem, and it might serve you well to experiment. I suppose it would work best with highly structured languages, but then again it would be easily defeated by the likes of Perl, unless of course you train it with *enough* code of sufficient kwalitee.
Re: how to guess the kind of spoken or programming language a text is written in
by artist (Parson) on Aug 12, 2006 at 10:31 UTC
    Maybe you could use AI::Categorizer or similar modules along with other methods.
    --Artist
Re: how to guess the kind of spoken or programming language a text is written in
by mobby_6kl (Novice) on Aug 13, 2006 at 01:11 UTC
    Depends on how good you want the guess to be. For example, something like this might work:
    my @lang = qw/perl c pascal basic english german french/; print $lang[int(rand(@lang))], "\n";