chuckd has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for advice on building a file extraction tool. My company currently uses LAW PreDiscovery to extract text and metadata from files such as .msg, .doc, .pdf, .jpg, .gif, etc. It is an out-of-the-box tool with many limitations that cause problems for me and the other engineers in our group, so we have been thinking about writing our own. I've looked at various modules on CPAN and found many that I think might help, but I don't know whether they are any good. Does anyone have experience building or writing tools for extracting text from files? If so, what did you use, how did you do it, how big was your project, did you use modules, and did you use any APIs or DLLs from Microsoft for the Microsoft file formats (.doc, .ppt, .xls, etc.)?

Replies are listed 'Best First'.
Re: advice for a project
by GrandFather (Saint) on Oct 02, 2008 at 02:49 UTC

    How many years have you got to complete this project? CPAN has a large number of modules for dealing with various file formats because there are a large number of file formats and every one of them needs special case code. You are proposing a very large job!


    Perl reduces RSI - it saves typing
Re: advice for a project
by dHarry (Abbot) on Oct 02, 2008 at 09:54 UTC

    I worked on something similar myself in the past. I used a "Pipes and Filters" architecture. I liked the idea of being able to plug in new filters for new file formats along the way. I was playing with the concept of content-based routing of files. I can assure you it was a lot of work! Like most of my projects it was never finished ;-)
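
    A minimal sketch of that idea, under loud assumptions: the stage names (detect, extract, normalize), the %handler table and the two toy formats are all hypothetical and use core Perl only. Real formats such as .doc, .msg or PDF would delegate to a format-specific module inside the extract callback. Each stage takes a document hashref and passes it on, and the detect stage routes by sniffing the content rather than trusting the file extension.

```perl
use strict;
use warnings;

# Hypothetical format handlers for this sketch. Supporting a new format
# means adding one entry here (and a name in @sniff_order).
my %handler = (
    html => {
        sniff   => sub { $_[0] =~ /\A\s*<(?:!DOCTYPE|html)\b/i },
        extract => sub { (my $t = $_[0]) =~ s/<[^>]*>/ /g; $t },
    },
    plain => {
        sniff   => sub { $_[0] !~ /[^\t\n\r\x20-\x7E]/ },   # printable ASCII only
        extract => sub { $_[0] },
    },
);
my @sniff_order = qw(html plain);    # most specific check first

# Filter 1: content-based routing. Choose a format from the bytes themselves.
sub detect {
    my ($doc) = @_;
    for my $name (@sniff_order) {
        if ( $handler{$name}{sniff}->( $doc->{bytes} ) ) {
            $doc->{format} = $name;
            last;
        }
    }
    $doc->{format} //= 'unknown';
    return $doc;
}

# Filter 2: hand the bytes to the handler chosen by detect().
sub extract {
    my ($doc) = @_;
    my $h = $handler{ $doc->{format} };
    $doc->{text} = $h ? $h->{extract}->( $doc->{bytes} ) : '';
    return $doc;
}

# Filter 3: collapse runs of whitespace and trim both ends.
sub normalize {
    my ($doc) = @_;
    $doc->{text} =~ s/\s+/ /g;
    $doc->{text} =~ s/^ | $//g;
    return $doc;
}

# The pipe: feed the output of each stage into the next one.
sub run_pipeline {
    my ( $bytes, @stages ) = @_;
    my $doc = { bytes => $bytes };
    $doc = $_->($doc) for @stages;
    return $doc;
}

my @stages = ( \&detect, \&extract, \&normalize );
my $doc = run_pipeline( "<html><body>Hello <b>world</b></body></html>", @stages );
print "$doc->{format}: $doc->{text}\n";    # prints "html: Hello world"
```

    Content that no handler recognises comes out with the format 'unknown' and empty text, so unsupported files can be logged and skipped instead of stopping the run.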

    Many people have worked on file extraction tools and there are many around, both commercial and non-commercial. You might want to take a look at initiatives like the Metadata Extraction Tool, which supports a lot of formats, including popular image and audio/video formats. It's not a Perl solution though...

    Beware! It’s a *lot* of work to build a tool like this yourself. It’s probably the lesser evil to find something off the shelf. Important criteria for selecting such a tool (IMHO) are that you can:

    • configure it to your own needs
    • override its default behavior
    • extend it with new formats

    HTH

Re: advice for a project
by Corion (Patriarch) on Oct 02, 2008 at 07:09 UTC

    There is no ready-made solution. Apache has a search engine, Lucene, and some indexing program whose name escapes me.

      Also check out KinoSearch, a loose port of the Java search engine library Apache Lucene, written in Perl and C.
Re: advice for a project
by Gavin (Archbishop) on Oct 02, 2008 at 18:18 UTC