I'm indexing PDFs to provide a quick search across all of our PDF documentation. I'll be putting the extracted text into one column, and I hope none of the rows come close to the 5000-character max, since I'm eliminating common words and duplicate words. I may eventually just limit it to the first x number of words, since if you're looking for a specific document about, say, apples, the word "apples" is going to appear within the first couple of paragraphs at least.
Do you have any other suggestions rather than going this route?
Ultimately I'm just indexing the PDFs so that I can point back to them later. PDF is a good format for storing massive amounts of documentation; I'm just providing the ability to search all of them at once.
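The elimination step described above could be sketched roughly like this (Python used for illustration, since I don't know your toolchain; the stop-word list and the 200-word cutoff are made-up examples, not values from this thread):

```python
# Sketch of the "brute force" indexing step: strip common ("stop") words,
# drop duplicate words, and keep only the first N words that survive.

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def index_text(text, max_words=200):
    """Return a space-joined string of unique, non-stop words."""
    seen = set()
    kept = []
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'()")  # crude punctuation cleanup
        if word and word not in STOP_WORDS and word not in seen:
            seen.add(word)
            kept.append(word)
        if len(kept) >= max_words:
            break
    return " ".join(kept)
```

With a cutoff like this, a document whose subject word appears early (as "apples" would) still makes it into the stored column while the column stays well under the size limit.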
| [reply] |
Yours sounds like an adequate "brute force" method, but if you have the time, you should take a look at RDF (Resource Description Framework), which is the standard for metadata about documents and other things a library might consider a "resource". It's being extended to encompass other things as well, like code and databases, but it started right where you are now.
I suggest it because there are tools to search RDF for matching resources, based on subject and meaning, rather than just the appearance of certain words.
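For a taste of what such metadata looks like, here is a minimal RDF record in Turtle syntax using the standard Dublin Core vocabulary (the file name, title, and subject terms are invented for illustration):

```turtle
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<file:///docs/apple-handling.pdf>
    dc:title   "Apple Handling Procedures" ;
    dc:subject "apples", "produce storage" ;
    dc:format  "application/pdf" .
```

A query against records like these matches on the declared subject of a document, not on whichever words happen to occur in its first few paragraphs.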
| [reply] |
While I haven't gone looking quite yet, do you know if these RDF solutions are Perl-driven?
I'm trying the "brute force" method because we need something quick, easy, and completely automatable. I will at least have to look into this RDF you speak of.
| [reply] |