Venerable Monks,

The law firm I work for has been given roughly 600,000 emails as evidence in a case that we are working on. These emails have been given to us as 30 directories full of TIFF images and another 30 directories of text files that were OCR'ed from the TIFFs. (Don't ask me why we couldn't have just gotten the actual emails themselves...)

I need to build a tool that will allow the attorneys to search for an arbitrary string in the text files and then provide them with a list of which TIFFs need to be viewed. I have done all this, but it is VERY slow. As I would like to provide a simple CGI interface for them to use, I will need to speed up the process significantly.

I thought about appending the 600k text files together with something like '###email####' as the delimiter. Then I could just set $/ to the delimiter and read in one email at a time. I don't actually know whether this would be faster, though. I do know that fixed-byte reads are supposed to be faster, but that would involve extra operations to make sure that the search term was not split between two chunks.
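To make the two ideas concrete, here is a rough sketch of both, which I have not benchmarked against the real data. The delimiter string, the sub names search_records and chunked_contains, and the 64 KB default chunk size are all placeholders of my own, and the first sub assumes the delimiter sits on its own line at the end of each email.

```perl
use strict;
use warnings;

# Hypothetical delimiter (assumption): written on its own line after each email.
my $DELIM = "###email####\n";

# Idea 1: set $/ to the delimiter so that each read returns one whole email.
# Returns the 0-based positions of the emails that contain $term.
sub search_records {
    my ($path, $term) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/ = $DELIM;            # record separator is now the delimiter
    my @hits;
    my $n = 0;
    while (my $email = <$fh>) {
        chomp $email;             # strips the trailing delimiter, if present
        push @hits, $n if index($email, $term) >= 0;
        $n++;
    }
    close $fh;
    return @hits;
}

# Idea 2: fixed-size reads. The last length($term) - 1 bytes of each window
# are carried into the next one, so a term split across two chunks is still
# seen whole. Returns 1 if $term occurs anywhere in the file, 0 otherwise.
sub chunked_contains {
    my ($path, $term, $chunk_size) = @_;
    die "Empty search term" unless length $term;
    $chunk_size ||= 64 * 1024;    # assumed default, not tuned
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $keep = length($term) - 1; # bytes to carry over between reads
    my $tail = '';
    while (read($fh, my $buf, $chunk_size)) {
        my $window = $tail . $buf;
        if (index($window, $term) >= 0) {
            close $fh;
            return 1;
        }
        if ($keep == 0) {
            $tail = '';
        }
        elsif (length($window) > $keep) {
            $tail = substr($window, -$keep);
        }
        else {
            $tail = $window;
        }
    }
    close $fh;
    return 0;
}
```

The carried-over tail is always shorter than the search term, so it can never produce a match on its own; it only completes a match that began in the previous chunk. Note that chunked_contains only says whether the term occurs somewhere in the file; mapping a hit back to a particular email (and so to its TIFF) would need extra bookkeeping that is not shown here.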

Also, we use Windows exclusively here, so I can't just pipe to grep, as I probably would have done normally.

What would the monks do? Any help would be much appreciated.


In reply to Searching many files by ariel2
