in reply to Searching many files

Thanks for all of the suggestions. I downloaded the Swish-e engine and am quite pleased with the end result. It's blazingly fast and, with the help of the SWISH::API module, I had a very nice CGI available for my users in no time.

As for all the questions about the OCR, I cannot answer them. Unfortunately for my firm, the documents in question have been OCRed by the opposing counsel and the quality does not seem to be all that great. Despite their daunting entry into the "evidence obfuscation contest", Swish-e is doing its job nicely. Of course, there's still not much I can do about bad OCR...

Now I just need to figure out how to implement the "print all results" button that I've been asked to add to my script ("all results" often seems to entail a few thousand TIFF files). I'm sure there's a node about that here somewhere...

Thanks again,

Ariel

Re: Re: Searching many files
by Anonymous Monk on Feb 10, 2004 at 00:19 UTC
    It sounds like the other firm was trying to either obfuscate or ensure integrity of the emails. Probably both. If you had gotten an ASCII version, they could claim that anyone could have modified the data. If the OCR was poor quality, just re-OCR it yourselves. Use something like Clara OCR or GOCR.
      It sounds like the other firm was trying to either obfuscate or ensure integrity of the emails. Probably both.
      I'd bet money that the firm had to get their evidence through discovery, which means its intermediary was a courthouse. If the courthouse required printouts, that could explain it.

        If you want to re-OCR, the "best" on a PC is ABBYY FineReader. It is the OCR of choice for all the book scanning groups, and far better than the competition. It can handle TIFFs, is fast, and would probably be worth trying out.
        This could be true. I am new to the legal world and have no idea how that stuff works. I did hear, however, that the OCR was done offshore to save money, which I think is mildly interesting. I didn't know that companies did that sort of thing, though I guess it makes sense...