Trihedralguy has asked for the wisdom of the Perl Monks concerning the following question:

I was wondering if anyone had any past experience with using Perl to read PDF documents and either index them into a database OR make it so a Perl script can search multiple PDFs in a directory.
I'm very new to Perl, so this is a stretch for me; I just wanted to see if there was anywhere I could start.

Replies are listed 'Best First'.
Re: PDF Indexing / Search
by samtregar (Abbot) on May 30, 2007 at 15:12 UTC
    Take a look at Swish-e. It's a search engine written in Perl, and it can be configured to index PDF files using filters that translate PDFs into text or HTML. Once you've got that set up, check out CGI::Application::Search, a very easy-to-use front-end for Swish-e.
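
    A minimal swish-e.conf along those lines might look like the following sketch. The directory path is made up, and the FileFilter line is from memory of the Swish-e configuration docs, so double-check it against SWISH-CONFIG:

        # index every PDF under /docs, converting each file
        # to plain text with pdftotext before indexing
        IndexDir /docs
        IndexOnly .pdf
        FileFilter .pdf pdftotext "'%p' -"
        IndexContents TXT* .pdf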

    -sam

      Thanks, maybe I can see how they go about storing their information so I can get ideas for my own project.
Re: PDF Indexing / Search
by leocharre (Priest) on May 30, 2007 at 15:11 UTC

    Funny, that's exactly what I'm working on right now.
    I'm using OCR too, though. Making it so you can search for terms, and it will return the page number and where the document is. I'll have a release in a few weeks (open source). It's interface-independent, but I'll include a CLI and a CGI::Application front end.
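
    For the page-number part, here's a minimal sketch that skips the OCR step and assumes pdftotext is on the path; it relies on pdftotext separating pages with form feeds (\f), so splitting on \f gives you per-page text:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my ( $pdf, $term ) = @ARGV;
        die "usage: $0 <pdffile> <searchterm>\n" unless defined $term;

        # slurp the whole converted document in one go
        open( my $cmd, '-|', 'pdftotext', $pdf, '-' )
            or die "cannot run pdftotext on $pdf: $!";
        my $text = do { local $/; <$cmd> };
        close $cmd;

        # pdftotext puts a form feed between pages
        my $page = 0;
        for my $page_text ( split /\f/, $text ) {
            $page++;
            print "$pdf: page $page\n" if $page_text =~ /\Q$term\E/i;
        }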

    If you think you know what you're doing, I wouldn't mind sharing my outline with you; I've already put some planning into it.

    The main point that may be very different for you is that the archive I am dealing with will have at least 20k PDF files, and they are changing at all times. I expect my data to always be at most an hour old or so, once the system is fed and working.

Re: PDF Indexing / Search
by philcrow (Priest) on May 30, 2007 at 14:59 UTC
    I think PDF::API2 might be just what you need, but I've never used it to read existing PDFs, only to make new ones. The docs say it can open an existing file and stringify it.
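
    For what it's worth, the open/stringify round trip from the docs looks roughly like this; note that stringify() returns the raw PDF data, not extracted text, so on its own it won't give you anything searchable:

        use strict;
        use warnings;
        use PDF::API2;

        # open an existing PDF and serialize it back to a string;
        # this is the raw PDF file contents, not plain text
        my $pdf    = PDF::API2->open('existing.pdf');
        my $string = $pdf->stringify();
        print length($string), " bytes\n";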

    Phil

    The Gantry Web Framework Book is now available.
Re: PDF Indexing / Search
by derby (Abbot) on May 30, 2007 at 15:23 UTC

    I'd probably set up some type of conversion process (pdftotext) and then just grep those converted files.

    The way-too-simplistic approach:

        #!/usr/bin/perl
        use strict;
        use warnings;

        die "usage: $0 <pdffile> <searchterm>" unless @ARGV == 2;

        open( CMD, "/usr/bin/pdftotext $ARGV[0] - |" )
            || die "cannot open $ARGV[0]: $!";
        while (<CMD>) {
            print if /$ARGV[1]/;
        }

    -derby

    Update: If I were going to do this for real and it needed to be web-available, I would probably use Solr.
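
    Something like this would push one converted document into Solr over its plain XML update interface (the URL and the field names are assumptions; they depend on your Solr schema):

        use strict;
        use warnings;
        use LWP::UserAgent;

        my $ua = LWP::UserAgent->new;

        # one document in Solr's XML update format; the id and body
        # field names are made up and must match your schema
        my $xml = '<add><doc>'
            . '<field name="id">/path/to/file.pdf</field>'
            . '<field name="body">extracted text goes here</field>'
            . '</doc></add>';

        my $res = $ua->post(
            'http://localhost:8983/solr/update',
            Content_Type => 'text/xml',
            Content      => $xml,
        );
        die $res->status_line unless $res->is_success;

        # commit so the new document becomes searchable
        $ua->post( 'http://localhost:8983/solr/update',
            Content_Type => 'text/xml', Content => '<commit/>' );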

Re: PDF Indexing / Search
by Trihedralguy (Pilgrim) on May 30, 2007 at 17:53 UTC
    I'm not sure how many PDFs we have, but I know we have A LOT, as they fill up something like ten 20" binder-looking things. My plan is to make a script that "indexes" the PDFs in a database and then run something like a cron job every night that will check to see if any new documents have been added or updated since the previous day. If so, it re-indexes that file or adds it to the index. I have already developed a pretty solid search engine geared at searching databases. (Easy script, but hey, I'm a Perl noob. :) ) So basically I think I might just look into converting them to text files and bringing them into the database. How I'm going to store 10-page documents in a database, I don't know yet.
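
    A nightly indexer along those lines might look like the sketch below. The database, table, and column names are made up, and it assumes MySQL (for REPLACE INTO) with pdftotext on the path; a MEDIUMTEXT column holds up to 16 MB, which is far more than 10 pages of extracted text needs:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;
        use File::Find;

        # assumed table: docs( path VARCHAR(255) PRIMARY KEY,
        #                      mtime INT, body MEDIUMTEXT )
        my $dbh = DBI->connect( 'dbi:mysql:pdfindex', 'user', 'pass',
            { RaiseError => 1 } );

        my $get = $dbh->prepare('SELECT mtime FROM docs WHERE path = ?');
        my $put = $dbh->prepare(
            'REPLACE INTO docs (path, mtime, body) VALUES (?, ?, ?)');

        find( sub {
            return unless /\.pdf$/i;
            my $path  = $File::Find::name;
            my $mtime = ( stat $_ )[9];

            # skip files that haven't changed since the last run
            $get->execute($path);
            my ($seen) = $get->fetchrow_array;
            return if defined $seen && $seen == $mtime;

            # convert and store (or replace) the extracted text
            open( my $cmd, '-|', 'pdftotext', $_, '-' )
                or die "cannot run pdftotext on $path: $!";
            my $body = do { local $/; <$cmd> };
            close $cmd;

            $put->execute( $path, $mtime, $body );
        }, '/path/to/pdfs' );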