di has asked for the wisdom of the Perl Monks concerning the following question:
I am working with a large text of about 6.5 MB, the words of which will be indexed to the paragraphs in which they occur. A search on a word through a browser interface will return the matching paragraphs. My question is: what is the optimum number of files in which to store the text from which the paragraphs will be extracted? The text would naturally lend itself to storage in 1, 4, 197, or 1628 files.
A search could return a few paragraphs, or hundreds, even thousands. My guess is that a few returns would be extracted most quickly from a few small files, whereas a large number of returns would be extracted more quickly from one large file. Is this correct? What are the relative impacts of the number and size of files on access speed? What are the criteria for balancing them? Should I simply seek the middle way? Are there other factors I should consider?
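For context, here is a minimal sketch of one way to frame the trade-off: at 6.5 MB the whole text fits comfortably in memory, and an index mapping each word to byte offsets makes retrieval cost depend on the number of hits rather than on how the text is split across files. The file name `paragraphs.txt`, the blank-line paragraph separator, and the `lookup` interface are all assumptions, not anything from the thread; the sketch also presumes Unix newlines and a single-byte encoding so character offsets equal byte offsets.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical layout: all paragraphs in one flat file, separated by
# blank lines.
my $file = 'paragraphs.txt';

# Slurp the text (6.5 MB fits easily in memory) and build the index:
# word => list of [byte offset, length] pairs, one pair per paragraph
# in which the word occurs.
open my $in, '<', $file or die "Cannot open $file: $!";
my $text = do { local $/; <$in> };
close $in;

my %index;
my $pos = 0;
# Split with a capturing group so the blank-line separators are
# returned too and the running byte offset stays exact.
for my $chunk ( split /(\n{2,})/, $text ) {
    if ( $chunk =~ /\S/ ) {    # a paragraph, not a separator
        my %seen;
        for my $word ( map { lc } $chunk =~ /(\w+)/g ) {
            push @{ $index{$word} }, [ $pos, length $chunk ]
                unless $seen{$word}++;
        }
    }
    $pos += length $chunk;
}

# Retrieval: one seek and one short read per hit, so the cost scales
# with the number of matching paragraphs, not with the file's size.
sub lookup {
    my ($word) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my @paras;
    for my $hit ( @{ $index{ lc $word } || [] } ) {
        my ( $offset, $len ) = @$hit;
        seek $fh, $offset, 0 or die "seek failed: $!";
        read $fh, my $para, $len;
        push @paras, $para;
    }
    close $fh;
    return @paras;
}

print "$_\n\n" for lookup('monk');
```

With an offset index like this, each hit costs one seek and one short read, so whether the paragraphs live in 1 file or 1628 matters far less than the number of matches returned.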
Re: Balancing number of files against size of files in optimizing access speed
by GrandFather (Saint) on Jan 09, 2010 at 21:46 UTC
Re: Balancing number of files against size of files in optimizing access speed
by sflitman (Hermit) on Jan 09, 2010 at 22:18 UTC