ariel2 has asked for the wisdom of the Perl Monks concerning the following question:
The law firm I work for has been given roughly 600,000 emails as evidence in a case that we are working on. These emails have been given to us in 30 directories full of TIFF images and another 30 directories of text files that were OCR'ed from the TIFFs. (Don't ask me why we couldn't have just gotten the actual emails themselves...)
I need to build a tool that will allow the attorneys to search for an arbitrary string in the text files and then provide them with a list of which TIFFs need to be viewed. I have done all this, but it is VERY slow. As I would like to provide a simple CGI interface for them to use, I will need to speed up the process significantly.
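For reference, the straightforward version of such a search might look roughly like the sketch below. It is not the code from the post, only a baseline under stated assumptions: the helper name `find_matching_tiffs` is made up, each OCR file is assumed to sit at the same relative path as its TIFF with `.txt` in place of `.tif`, and matching is a case-insensitive literal substring test.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return the TIFF paths whose OCR text contains $term.
# ASSUMPTION: $text_root/NN/name.txt corresponds to $tiff_root/NN/name.tif.
sub find_matching_tiffs {
    my ($term, $text_root, $tiff_root) = @_;
    my $needle = lc $term;             # case-insensitive literal match
    my @hits;
    find({
        no_chdir => 1,                 # keep full paths in $File::Find::name
        wanted   => sub {
            my $path = $File::Find::name;
            return unless -f $path && $path =~ /\.txt\z/i;
            open my $fh, '<', $path
                or do { warn "Skipping $path: $!"; return };
            local $/;                  # slurp mode: one email per file
            my $text = <$fh>;
            close $fh;
            return unless defined $text;
            if (index(lc $text, $needle) >= 0) {
                # Map the text file path onto the assumed TIFF path.
                (my $tiff = $path) =~ s/\A\Q$text_root\E/$tiff_root/;
                $tiff =~ s/\.txt\z/.tif/i;
                push @hits, $tiff;
            }
        },
    }, $text_root);
    @hits = sort @hits;
    return @hits;
}
```

Called as `find_matching_tiffs('merger', 'C:/case/text', 'C:/case/tiff')` (both paths are placeholders), this opens every one of the 600,000 files on every query, which is probably where most of the time goes.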
I thought about appending the 600k text files together with something like '###email####' as the delimiter. Then I could just set $/ to the delimiter and read in one email at a time. I don't actually know that this would be faster, though. I do know that fixed-byte reads are supposed to be faster, but that would involve added operations to make sure that the search term was not split between two chunks.
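Both ideas can be sketched in a few lines. The code below is only an illustration, not a benchmarked answer: the sub names `search_records` and `file_contains_chunked` are made up, the combined file is assumed to have the delimiter between emails (not before the first one), and matching is a case-sensitive literal substring test. The second sub handles the split-term problem by carrying the last `length($term) - 1` bytes of each buffer over into the next read.

```perl
use strict;
use warnings;

# Idea 1: one combined file, one email per read, by setting $/ to the delimiter.
# Returns the 0-based positions of the emails that contain $term.
sub search_records {
    my ($file, $term, $delim) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    local $/ = $delim;                 # each <$fh> now returns one email
    my ($n, @hits) = (0);
    while (my $email = <$fh>) {
        chomp $email;                  # chomp removes the trailing delimiter
        push @hits, $n if index($email, $term) >= 0;
        $n++;
    }
    close $fh;
    return @hits;
}

# Idea 2: fixed-size reads. The tail of each buffer is kept so that a term
# split across two chunks is still found in the next pass.
sub file_contains_chunked {
    my ($file, $term, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;         # arbitrary default, tune as needed
    return 0 unless length $term;      # an empty term matches nothing here
    my $keep = length($term) - 1;      # bytes to carry over between reads
    open my $fh, '<', $file or die "Cannot open $file: $!";
    binmode $fh;                       # no CRLF translation on Windows
    my $buf = '';
    while (read($fh, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        if (index($buf, $term) >= 0) {
            close $fh;
            return 1;
        }
        my $len = length $buf;
        $buf = substr($buf, $len - $keep) if $len > $keep;
    }
    close $fh;
    return 0;
}
```

With the combined file, a position returned by `search_records` still has to be mapped back to a TIFF, so the delimiter (or a separate lookup table) would need to carry the original file name. Whether either approach beats opening 600,000 small files would have to be measured on the real data.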
Also, we use Windows exclusively here, so I can't just pipe to grep, as I probably would have done normally.
What would the monks do? Any help would be much appreciated.
Replies are listed 'Best First'.
Re: Searching many files
by ariel2 (Beadle) on Feb 10, 2004 at 00:00 UTC | |
by Anonymous Monk on Feb 10, 2004 at 00:19 UTC | |
by Vautrin (Hermit) on Feb 10, 2004 at 00:28 UTC | |
by Anonymous Monk on Feb 10, 2004 at 03:36 UTC | |
by ariel2 (Beadle) on Feb 10, 2004 at 00:39 UTC | |
|
Re: Searching many files
by kvale (Monsignor) on Feb 09, 2004 at 18:53 UTC | |
by waswas-fng (Curate) on Feb 09, 2004 at 19:03 UTC | |
|
Use Indexed Search - Re: Searching many files
by metadoktor (Hermit) on Feb 09, 2004 at 22:56 UTC | |
|
Re: Searching many files
by johndageek (Hermit) on Feb 09, 2004 at 22:00 UTC | |
|
Re: Searching many files
by Vautrin (Hermit) on Feb 09, 2004 at 23:13 UTC | |
|
Re: Searching many files
by rlucas (Scribe) on Feb 09, 2004 at 23:25 UTC | |
by QM (Parson) on Feb 10, 2004 at 00:05 UTC | |
|
Re: Searching many files
by inman (Curate) on Feb 10, 2004 at 13:11 UTC |