The law firm I work for has been given roughly 600,000 emails as evidence in a case we are working on. These emails came to us as 30 directories full of TIFF images and another 30 directories of text files that were OCR'ed from the TIFFs. (Don't ask me why we couldn't just get the actual emails themselves...)
I need to build a tool that lets the attorneys search the text files for an arbitrary string and then gives them a list of which TIFFs need to be viewed. I have this working, but it is VERY slow. Since I would like to provide a simple CGI interface for them to use, I will need to speed up the search significantly.
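For reference, here is a stripped-down sketch of the scan I'm doing now. The directory layout and the assumption that each text file shares a basename with its TIFF are just illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Naive scan: walk the OCR text directories, slurp each file whole,
# and report the matching file's path so the corresponding TIFF
# (assumed here to share the same basename) can be pulled up.
sub find_matches {
    my ($term, @dirs) = @_;
    my @hits;
    find(sub {
        return unless -f $_;
        open my $fh, '<', $_ or return;
        local $/;                      # slurp mode: read the whole file
        my $text = <$fh>;
        close $fh;
        push @hits, $File::Find::name if index($text, $term) >= 0;
    }, @dirs);
    return @hits;
}
```

The per-file open/slurp/close is where I suspect most of the time is going, given 600k small files.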
I thought about appending the 600k text files into one big file with something like '###email###' as the delimiter. Then I could just set $/ to the delimiter and read one email at a time. I don't actually know that this would be faster, though. I do know that fixed-byte reads are supposed to be faster, but that would involve added logic to make sure the search term isn't split across two chunks.
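The $/ idea would look something like this — the '###email###' marker and the combined filename are just placeholders, and the record numbers would still need to be mapped back to TIFF names somehow:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read the concatenated file one "record" (email) at a time by setting
# the input record separator $/ to the delimiter, and return the record
# numbers of the emails that contain the search term.
sub search_big_file {
    my ($file, $term) = @_;
    local $/ = "###email###";        # <$fh> now returns one email per read
    open my $fh, '<', $file or die "open $file: $!";
    my @matching;
    my $n = 0;
    while (my $email = <$fh>) {
        $n++;
        push @matching, $n if index($email, $term) >= 0;
    }
    close $fh;
    return @matching;
}
```

index() rather than a regex match on purpose, since the attorneys are searching for literal strings.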
Also, we use Windows exclusively here, so I can't just pipe to grep as I probably would have done otherwise.
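As for the fixed-byte reads, I think the split-term problem could be handled by carrying a small overlap between chunks, along these lines (chunk size is a guess; the term can never straddle a boundary if the last length($term)-1 bytes are kept):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fixed-byte-read search: read the file in large chunks, but prepend
# the tail of the previous chunk so a term straddling a chunk boundary
# is still found.  Returns true if $term occurs anywhere in $file.
sub chunked_search {
    my ($file, $term, $chunk_size) = @_;
    $chunk_size ||= 1 << 20;             # default to 1 MB reads
    my $overlap = length($term) - 1;     # bytes to carry across reads
    open my $fh, '<', $file or die "open $file: $!";
    binmode $fh;
    my $tail = '';
    while (read($fh, my $chunk, $chunk_size)) {
        my $buf = $tail . $chunk;
        if (index($buf, $term) >= 0) {
            close $fh;
            return 1;
        }
        my $keep = $overlap < length($buf) ? $overlap : length($buf);
        $tail = $keep > 0 ? substr($buf, -$keep) : '';
    }
    close $fh;
    return 0;
}
```

Whether the extra bookkeeping beats line- or record-oriented reads here is exactly what I don't know.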
What would the monks do? Any help would be much appreciated.
In reply to Searching many files by ariel2