Trihedralguy has asked for the wisdom of the Perl Monks concerning the following question:

I was wondering if anyone had any past experience with using Perl to read PDF documents and either index them into a database OR make it so a Perl script can search multiple PDFs in a directory.
I'm very new to Perl, so this is a stretch for me; I just wanted to see if there was anywhere I could start.

Replies are listed 'Best First'.
Re: PDF Indexing / Search
by samtregar (Abbot) on May 30, 2007 at 15:12 UTC
    Take a look at Swish-e. It's a search engine written in Perl, and it can be configured to index PDF files using filters that translate PDFs into text or HTML. Once you've got that set up, check out CGI::Application::Search, a very easy-to-use front-end for Swish-e.
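
    A minimal swish-e.conf along those lines might look like the following sketch. The directory path is made up, and the FileFilter line is from memory of the Swish-e configuration docs, so double-check it against SWISH-CONFIG:

        # index every PDF under /docs, converting each file
        # to plain text with pdftotext before indexing
        IndexDir /docs
        IndexOnly .pdf
        FileFilter .pdf pdftotext "'%p' -"
        IndexContents TXT* .pdf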

    -sam

      Thanks, maybe I can see how they go about storing their information so I can get ideas for my own project.
Re: PDF Indexing / Search
by leocharre (Priest) on May 30, 2007 at 15:11 UTC

    Funny, that's exactly what I'm working on right now.
    I'm using OCR too, though. Making it so you can search for terms, and it will return the page number and where the document is. I'll have a release in a few weeks (open source). It's interface-independent, but I'll include a CLI and a CGI::Application front end.
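
    For the page-number part, here's a minimal sketch that skips the OCR step and assumes pdftotext is on the path; it relies on pdftotext separating pages with form feeds (\f), so splitting on \f gives you per-page text:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my ( $pdf, $term ) = @ARGV;
        die "usage: $0 <pdffile> <searchterm>\n" unless defined $term;

        # slurp the whole converted document in one go
        open( my $cmd, '-|', 'pdftotext', $pdf, '-' )
            or die "cannot run pdftotext on $pdf: $!";
        my $text = do { local $/; <$cmd> };
        close $cmd;

        # pdftotext puts a form feed between pages
        my $page = 0;
        for my $page_text ( split /\f/, $text ) {
            $page++;
            print "$pdf: page $page\n" if $page_text =~ /\Q$term\E/i;
        }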

    If you think you know what you're doing, I wouldn't mind sharing my outline with you; I've already put some planning into it.

    The main point that may be very different for you is that the archive I am dealing with will have at least 20k PDF files, and they are changing at all times. I expect my data to always be at most an hour old or so, once the system is fed and working.

Re: PDF Indexing / Search
by philcrow (Priest) on May 30, 2007 at 14:59 UTC
    I think PDF::API2 might be just what you need, but I've never used it to read existing PDFs, only to make new ones. The docs say it can open an existing file and stringify it.
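
    For what it's worth, the open/stringify round trip from the docs looks roughly like this; note that stringify() returns the raw PDF data, not extracted text, so on its own it won't give you anything searchable:

        use strict;
        use warnings;
        use PDF::API2;

        # open an existing PDF and serialize it back to a string;
        # this is the raw PDF file contents, not plain text
        my $pdf    = PDF::API2->open('existing.pdf');
        my $string = $pdf->stringify();
        print length($string), " bytes\n";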

    Phil

    The Gantry Web Framework Book is now available.
Re: PDF Indexing / Search
by derby (Abbot) on May 30, 2007 at 15:23 UTC

    I'd probably set up some type of conversion process (pdftotext) and then just grep those converted files.

    The way-too-simplistic approach:

        #!/usr/bin/perl
        use strict;
        use warnings;

        die "usage: $0 <pdffile> <searchterm>" unless @ARGV == 2;

        open( CMD, "/usr/bin/pdftotext $ARGV[0] - |" )
            || die "cannot open $ARGV[0]: $!";
        while (<CMD>) {
            print if /$ARGV[1]/;
        }

    -derby

    Update: If I were going to do this for real and it needed to be web-available, I would probably use Solr.
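
    Something like this would push one converted document into Solr over its plain XML update interface (the URL and the field names are assumptions; they depend on your Solr schema):

        use strict;
        use warnings;
        use LWP::UserAgent;

        my $ua = LWP::UserAgent->new;

        # one document in Solr's XML update format; the id and body
        # field names are made up and must match your schema
        my $xml = '<add><doc>'
            . '<field name="id">/path/to/file.pdf</field>'
            . '<field name="body">extracted text goes here</field>'
            . '</doc></add>';

        my $res = $ua->post(
            'http://localhost:8983/solr/update',
            Content_Type => 'text/xml',
            Content      => $xml,
        );
        die $res->status_line unless $res->is_success;

        # commit so the new document becomes searchable
        $ua->post( 'http://localhost:8983/solr/update',
            Content_Type => 'text/xml', Content => '<commit/>' );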

Re: PDF Indexing / Search
by Trihedralguy (Pilgrim) on May 30, 2007 at 17:53 UTC
    I'm not sure how many PDFs we have, but I know we have A LOT, as they fill up something like ten 20" binder-looking things. My plan is to make a script that "indexes" the PDFs in a database and then run something like a cron job every night that will check to see if any new documents have been added or updated since the previous day. If so, it re-indexes that file or adds it to the index. I have already developed a pretty solid search engine geared at searching databases. (Easy script, but hey, I'm a Perl noob. :) ) So basically I think I might just look into converting them to text files and bringing them into the database. How I'm going to store 10-page documents in a database, I don't know yet.
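
    A nightly indexer along those lines might look like the sketch below. The database, table, and column names are made up, and it assumes MySQL (for REPLACE INTO) with pdftotext on the path; a MEDIUMTEXT column holds up to 16 MB, which is far more than 10 pages of extracted text needs:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;
        use File::Find;

        # assumed table: docs( path VARCHAR(255) PRIMARY KEY,
        #                      mtime INT, body MEDIUMTEXT )
        my $dbh = DBI->connect( 'dbi:mysql:pdfindex', 'user', 'pass',
            { RaiseError => 1 } );

        my $get = $dbh->prepare('SELECT mtime FROM docs WHERE path = ?');
        my $put = $dbh->prepare(
            'REPLACE INTO docs (path, mtime, body) VALUES (?, ?, ?)');

        find( sub {
            return unless /\.pdf$/i;
            my $path  = $File::Find::name;
            my $mtime = ( stat $_ )[9];

            # skip files that haven't changed since the last run
            $get->execute($path);
            my ($seen) = $get->fetchrow_array;
            return if defined $seen && $seen == $mtime;

            # convert and store (or replace) the extracted text
            open( my $cmd, '-|', 'pdftotext', $_, '-' )
                or die "cannot run pdftotext on $path: $!";
            my $body = do { local $/; <$cmd> };
            close $cmd;

            $put->execute( $path, $mtime, $body );
        }, '/path/to/pdfs' );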