Dear Monks

I'm in a classic position of having data in a flat file that has grown large over time and is starting to take too long to search. The technology needs to move on. I've searche the databse here and haven't found so much about large constant datbases and hence this...

The existing setup
15,000 records/lines with about 12 pipe delimited fields.
File size - 15 MB
each week we add about 40 new records to the database - it's a growing problem!
It's essentially a constant database and in real time we don't add things in or sort etc.

Of the 12 fields we allow searching on 7 defined fields. Sometimes we just search on one field but much of the time we search on 3 or more fields.

for info the 7 fields we search on occupy about 10Mb of the 15MB total

Searching is done with an old and somewhat modified Salina Sol cgi perl script which iterates over the text file a line at a time.

The script runs under cgi - not mod_perl or any special environment. The webserver is Apache.

Searching on one field takes about 10 seconds to pull up 40 results

and searching on 3 fields takes about 12 secs.

I like the existing setup because we can see the data easily and correct any problems etc.


A new setup

I'd like a system that could handle 45,000 records of the existing size and search them on the same 7 fields in reasonable time - say 5 seconds on my setup.

I'm not too keen to go down a full blown MySQL type route because I belive that searching on multiple fields will be slower then other possible solutions and I don't need the benefits of a relational db particularly. I know some people will be appalled at this view and to them I apologise in advance.

I'm open to other suggestions on the best way forward (and I suppose relational solutions too if they can meet the spec). It occurred to me that I could put all the information in a cdb database, each line having a unique key and have a smaller database, with just the 7 fields, and the key, held in memory. The figures would show that as being about 15Mb in size. If it were done with mod_perl then the searchable bit of the databases would be loaded once from a disk file and thereafter searches should be pretty quick - if with the on-cost of getting the full entries out of the cdb database. But I have never used mod_perl before and my perl experience is pretty low I have to say.

Thanks for your help asnd ideas...


In reply to Large Constant Database in Text File by stephentyler

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.