Zadeh has asked for the wisdom of the Perl Monks concerning the following question:

As an ad-hoc method, it's been common in some Perl apps I use to maintain an ever-larger list of extensions (.c, .cpp, .h, .pl, ...) to recognize whether a file is source code. I can think of a number of problems with this approach:

1) All files have to have an extension. It's not uncommon for people to save scripts and makefiles without one.
2) There's an implicit assumption that there is a one-to-one mapping between each unique extension and the kind of content it should have.
3) You have to continually maintain a list of these extensions.

There's got to be a better way. From within *nix I often do something like this:

$ file -s some_file.c

and then see:

some_file.c: ASCII C program text

This brings me to some more questions: Is there a nice, tidy Perl module to accomplish this effect? If not, how best to implement it?

Replies are listed 'Best First'.
Re: How best to identify & Categorize Source Code?
by perrin (Chancellor) on Mar 31, 2008 at 22:50 UTC
      I made a go at this initially, but the only thing it returns so far is "text/plain" which doesn't help much. What am I missing?
        It should be using a technique very similar to the "file" command. Try feeding it your /etc/magic file.
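To make the "similar to the file command" idea concrete, here is a rough sketch of content-based detection in core Perl: read the first few KB of a file and test them against an ordered list of patterns. The rule table, the labels, and the names guess_from_content and guess_from_file are all made up for illustration; they are not taken from any real magic file or module, and real magic databases are far more thorough.

```perl
use strict;
use warnings;

# Illustrative rules only; these are NOT taken from a real magic file.
# Each rule pairs a label with a pattern tested against the file's first bytes.
# Rules are tried in order and the first match wins.
my @RULES = (
    [ 'C source'     => qr/^\s*#\s*include\s*[<"]/m ],
    [ 'Perl source'  => qr/^\s*(?:use\s+strict|package\s+[\w:]+\s*;)/m ],
    [ 'shell script' => qr/\A#![^\n]*\b(?:sh|bash|ksh|zsh)\b/ ],
    [ 'makefile'     => qr/^[\w.%-]+\s*:(?!=)[^\n]*\n\t/m ],
);

# Hypothetical helper: classify a chunk of text taken from the top of a file.
sub guess_from_content {
    my ($head) = @_;
    for my $rule (@RULES) {
        my ($label, $pattern) = @$rule;
        return $label if $head =~ $pattern;
    }
    return 'unknown';
}

# Hypothetical helper: read up to 4 KB from a file and classify it.
# Returns nothing if the file cannot be opened.
sub guess_from_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or return;
    read $fh, my $head, 4096;
    close $fh;
    return guess_from_content(defined $head ? $head : '');
}

# Usage: perl guess.pl some_file.c Makefile ...
for my $path (@ARGV) {
    my $guess = guess_from_file($path);
    print "$path: ", (defined $guess ? $guess : 'unreadable'), "\n";
}
```

Rules this crude will misfire on anything unusual, so treat the output as a hint rather than an answer.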
Re: How best to identify & Categorize Source Code?
by apl (Monsignor) on Apr 01, 2008 at 09:55 UTC
    If you're on *nix, you could read the first line to see what compiler/interpreter/shell is invoked.
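A minimal sketch of the first-line check, using only core modules. The helper names interpreter_from_line and interpreter_from_file are hypothetical, and the handling of "env" (skipping options and VAR=value assignments) is a simplifying assumption rather than a full parser for env's command line.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: extract the interpreter name from a "#!" line.
# Returns nothing when the line is not a "#!" line.
sub interpreter_from_line {
    my ($line) = @_;
    return unless defined $line && $line =~ /^#!\s*(\S.*)$/;
    my @words = split ' ', $1;
    my $prog  = basename(shift @words);
    # "#!/usr/bin/env perl" names the real interpreter as an argument to env.
    if ($prog eq 'env') {
        # Assumption: skip env options (e.g. -S) and VAR=value assignments.
        shift @words while @words && $words[0] =~ /^-|=/;
        return unless @words;
        $prog = basename(shift @words);
    }
    return $prog;
}

# Hypothetical helper: read only the first line of a file and report
# its interpreter, if any.
sub interpreter_from_file {
    my ($path) = @_;
    open my $fh, '<', $path or return;
    my $line = <$fh>;
    close $fh;
    return interpreter_from_line($line);
}

# Usage: perl shebang.pl script1 script2 ...
for my $path (@ARGV) {
    my $prog = interpreter_from_file($path);
    print "$path: ", (defined $prog ? $prog : 'no #! line'), "\n";
}
```

This only helps for scripts; compiled-language sources and most makefiles have no such line, so it works best as one check among several.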
Re: How best to identify & Categorize Source Code?
by Arunbear (Prior) on Apr 01, 2008 at 18:40 UTC
    There is File::Comments, though it is alpha software according to its docs. Alternatively, there is File for Windows, which may be useful if you're on Win32.
Re: How best to identify & Categorize Source Code?
by Errto (Vicar) on Apr 01, 2008 at 19:36 UTC
    Try File::Type. I've used it only a bit, but it at least claims to fix some of the problems with File::MMagic.