ariel2 has asked for the wisdom of the Perl Monks concerning the following question:

Venerable Monks,

The law firm I work for has been given roughly 600,000 emails as evidence in a case that we are working on. These emails have been given to us in 30 directories full of TIFF images and another 30 directories of text files that were OCR'ed from the TIFFs. (Don't ask me why we couldn't have just gotten the actual emails themselves...)

I need to build a tool that will allow the attorneys to search for an arbitrary string in the text files and then provide them with a list of which TIFFs need to be viewed. I have done all this, but it is VERY slow. As I would like to provide a simple CGI interface for them to use, I will need to speed up the process significantly.

I thought about appending the 600k text files together with something like '###email####' as the delimiter. Then I could just set $/ to the delimiter and read in one email at a time. I don't actually know that this would be faster, though. I do know that fixed-size reads are supposed to be faster, but that would involve extra checks to make sure that the search term was not split between two chunks.
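
For reference, the record-at-a-time idea would look roughly like this minimal sketch (the combined file name and the way matches are reported are just placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read the concatenated file one "email" at a time by setting the
    # input record separator to the delimiter. File name and the
    # record-number reporting are placeholders.
    my $search = shift or die "Usage: $0 <search string>\n";

    local $/ = '###email####';

    open my $fh, '<', 'all_emails.txt' or die "Cannot open all_emails.txt: $!";
    my $record = 0;
    while ( my $email = <$fh> ) {
        $record++;
        chomp $email;    # with $/ changed, chomp strips the delimiter, not "\n"
        print "Match in email record $record\n" if index( $email, $search ) >= 0;
    }
    close $fh;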

Also, we use Windows exclusively here, so I can't just pipe to grep, as I probably would have done normally.

What would the monks do? Any help would be much appreciated.

Replies are listed 'Best First'.
Re: Searching many files
by ariel2 (Beadle) on Feb 10, 2004 at 00:00 UTC
    Thanks for all of the suggestions. I downloaded Swish-e engine and am quite pleased with the end result. It's blazingly fast and, with the help of the SWISH::API module, I had a very nice CGI available for my users in no time.
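
    The core of the search CGI boils down to roughly the following (the index file name and the use of swishdocpath to get back to the TIFF names are simplified placeholders):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use CGI;
        use SWISH::API;

        # Rough outline: query a Swish-e index and print the path of each
        # matching text file, which in turn names its TIFF.
        my $q     = CGI->new;
        my $query = $q->param('q') || '';

        my $swish   = SWISH::API->new('email-index.swish-e');
        my $results = $swish->Query($query);

        print $q->header('text/plain');
        printf "%d hits for '%s'\n\n", $results->Hits, $query;
        while ( my $hit = $results->NextResult ) {
            print $hit->Property('swishdocpath'), "\n";
        }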

    As for all the questions about the OCR, I cannot answer them. Unfortunately for my firm, the documents in question have been OCRed by the opposing counsel and the quality does not seem to be all that great. Despite their daunting entry into the "evidence obfuscation contest", Swish-e is doing its job nicely. Of course, there's still not much I can do about bad OCR...

    Now I just need to figure out how to implement the "print all results" button that I've been asked to add to my script ("all results" often seems to entail a few thousand TIFF files). I'm sure there's a node about that here somewhere...

    Thanks again,

    Ariel

      It sounds like the other firm was trying to either obfuscate or ensure integrity of the emails. Probably both. If you had gotten an ascii version, they could claim that anyone could have modified the data. If the OCR was poor quality, just re-OCR it yourselves. Use something like Clara OCR or GOCR.
        It sounds like the other firm was trying to either obfuscate or ensure integrity of the emails. Probably both.
        I'd bet money that the firm had to get their evidence through discovery, which means its intermediary was a courthouse. If the courthouse required printouts, that could explain it.

        Want to support the EFF and FSF by buying cool stuff? Click here.
Re: Searching many files
by kvale (Monsignor) on Feb 09, 2004 at 18:53 UTC
    If you want to speed up searches over a lot of text, I would recommend creating an index of words and their associated emails. Then break a phrase down into words and search for the intersection of the emails that contain all the words.
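
    A rough sketch of the idea, with a made-up directory layout and naive tokenization:

        use strict;
        use warnings;

        # Build a word => { file => 1 } index, then answer a phrase by
        # intersecting the file sets of its words.
        my %index;
        for my $file ( glob 'textdir*/*.txt' ) {
            open my $fh, '<', $file or next;
            while (<$fh>) {
                $index{ lc $1 }{$file} = 1 while /(\w+)/g;
            }
            close $fh;
        }

        sub lookup {
            my %seen;
            my @words = grep { !$seen{$_}++ } map { lc } split ' ', shift;
            my %count;
            for my $word (@words) {
                $count{$_}++ for keys %{ $index{$word} || {} };
            }
            return grep { $count{$_} == @words } keys %count;
        }

        print "$_\n" for lookup('breach of contract');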

    -Mark

      Do this and don't reinvent the wheel, use Swish-e.


      -Waswas
Use Indexed Search - Re: Searching many files
by metadoktor (Hermit) on Feb 09, 2004 at 22:56 UTC
    You could significantly speed up your end-users' searches by implementing a vector-space search engine.

    If you don't want to build the engine yourself, then perhaps you would want to hire a company specializing in this kind of search. See this BusinessWeek article that talks about a couple of firms specializing in legal search.

    Also, you say that you don't have "grep" on Windows... well, you can download the grep portion of Cygwin and call grep from Perl via system calls.
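
    Something along these lines (the grep path and directory are only examples):

        # Shell out to Cygwin's grep: -l lists matching file names,
        # -r recurses, -i ignores case. Paths below are examples only.
        my $term  = 'settlement';
        my @files = qx{C:\\cygwin\\bin\\grep.exe -lri -- "$term" C:/evidence/text};
        print @files;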

    metadoktor

    "The doktor is in."

Re: Searching many files
by johndageek (Hermit) on Feb 09, 2004 at 22:00 UTC
    Another way to attack this would be to split the data into a couple of files (or at least separate out some of the data) so the user can search by subject, from or to addresses, as well as by delivery date. (Just for kicks, I would change the date to a seconds-since-epoch format for range sorting.)

    This is a sort of indexing, but will minimize the raw volume of data to be searched.

    A point of concern: what character did the OCR insert if there was a recognition failure? Could this invalidate full-string searches?

    This does point to a need for an index of words and the emails that use them. That would allow the results to be scored as to the odds that each is a hit.

    One way would be to create a file named for each word found in an email and append the email's name to that file. Then process the next email against the previous list, building a directory that contains a file named for each word used in the emails, with each file holding the names of the emails (or TIFFs) that use that word. Each file could then be sorted and deduplicated.
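
    A quick sketch of that approach (the directory names are just for illustration):

        use strict;
        use warnings;
        use File::Basename;

        # One file per word under an index directory; each file accumulates
        # the names of the emails that contain that word.
        my $index_dir = 'C:/evidence/word-index';
        mkdir $index_dir unless -d $index_dir;

        for my $email ( glob 'C:/evidence/text/*.txt' ) {
            open my $in, '<', $email or next;
            my %words;
            while (<$in>) {
                $words{ lc $1 } = 1 while /(\w+)/g;
            }
            close $in;

            my $name = basename($email);
            for my $word ( keys %words ) {
                next if length $word > 100;    # skip OCR garbage too long for a file name
                open my $out, '>>', "$index_dir/$word" or next;
                print {$out} "$name\n";
                close $out;
            }
        }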

    This setup gives you a great deal of flexibility in presentation, as well as in search capabilities.

    Or search for a module to do this work for you; the only thing a module would be missing is the ability to handle missed OCR reads.

    Good Luck!
    dageek

Re: Searching many files
by Vautrin (Hermit) on Feb 09, 2004 at 23:13 UTC
    Out of curiosity, are you OCRing the images from Perl (you mentioned TIFFs), or is that background and you already have the emails in text form? Also, would it make sense to store all of the emails in a TEXT column in a database, and then use LIKE to find emails with the words / combination of words? What about multithreading? This kind of problem would definitely work in parallel, since each step doesn't depend on the step before it. Could you spread the work across several computers to speed it up (i.e. each computer takes 1/nth of the data set)? Just some thoughts nobody's mentioned. Other posters' ideas (namely indexing the emails) would work extremely well too.
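
    For the database idea, the search side might look something like this (the DSN, credentials, and table layout are all assumptions):

        use strict;
        use warnings;
        use DBI;

        # Assumes an "emails" table with the OCR text in a TEXT column
        # and a column pointing back at the TIFF file.
        my $dbh = DBI->connect( 'dbi:mysql:database=evidence', 'user', 'password',
                                { RaiseError => 1 } );

        my $term = shift or die "Usage: $0 <search string>\n";
        my $sth  = $dbh->prepare('SELECT tiff_path FROM emails WHERE body LIKE ?');
        $sth->execute("%$term%");

        while ( my ($tiff) = $sth->fetchrow_array ) {
            print "$tiff\n";
        }
        $dbh->disconnect;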

    Want to support the EFF and FSF by buying cool stuff? Click here.
Re: Searching many files
by rlucas (Scribe) on Feb 09, 2004 at 23:25 UTC
    I'm only half kidding when I say "install a *nix box." "grep -r" is the answer to this question.
      If grep -r is the answer, then someone should have mentioned pgrep, which is available here, among other places. The only additional point would be to make it DOS-friendly with File::DosGlob.
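
      For example (the directory names are placeholders):

          # File::DosGlob's glob() handles DOS-style paths; double-quote
          # patterns that contain spaces.
          use File::DosGlob 'glob';
          my @text_files = glob '"C:/Document Evidence/text*/*.txt"';
          print "$_\n" for @text_files;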

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: Searching many files
by inman (Curate) on Feb 10, 2004 at 13:11 UTC
    Don't ask me why we couldn't have just gotten the actual emails themselves

    The reason is that people don't like having their e-mails taken as evidence and used to build a case against them. The TIFFs are probably the result of a bulk scanning of lots of paper that was seized.

    You could use a search engine for this task. It may even be worth getting your law firm to buy one if this is going to be a regular task. Visit http://www.searchtools.com/tools/tools-perl.html for a list of Perl-related search tools and other search-related resources.

    I use the Verity K2 search engine. It is big and very expensive but it works well.