Re^5: Database in a folder?

Persistent storage in text files - I can understand the motivation behind that.
Shared storage between threads - I'll come back to this.
No internal caching needed or allowed.
$db{ $key }{ $subkey } = 12345; becomes db\key.dat containing subkey=12345

Dealing with the last two first. Those two requirements mean that:

Every read of a value will require:
1. the file to be locked (possibly waiting for the lock);
2. open the file;
3. read and parse the file to find the subkey and extract the value;
4. close the file;
5. release the lock.
Every update, insertion, or deletion of a value will require:
1. the file to be locked (possibly waiting for the lock);
2. open the file;
3. read and parse the entire file to find all the subkeys and values;
4. change, delete or add the subkey/value pair;
5. rewind the file;
6. re-write the entire file (extending or truncating as necessary);
7. close the file;
8. release the lock.
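
The two sequences above can be sketched roughly as follows. This is only an illustration under assumptions: the helper names `parse_record`, `read_value` and `write_value` are hypothetical, the file holds one `subkey=value` pair per line, and advisory `flock` is the locking mechanism. Note that `flock` needs an open handle, so in practice the open comes before the lock.

```perl
use strict;
use warnings;
use Fcntl qw(:DEFAULT :flock :seek);

# Parse "subkey=value" lines from an open handle into a hash ref.
sub parse_record {
    my ($fh) = @_;
    my %rec;
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $subkey, $value ) = split /=/, $line, 2;
        $rec{$subkey} = $value if defined $value;
    }
    return \%rec;
}

# Read: open, take a shared lock, parse the whole file, close
# (closing the handle also releases the lock).
sub read_value {
    my ( $file, $subkey ) = @_;
    open( my $fh, '<', $file ) or return undef;    # no file, no value
    flock( $fh, LOCK_SH ) or die "Cannot lock $file: $!";
    my $rec = parse_record($fh);
    close $fh;
    return $rec->{$subkey};
}

# Update: open, take an exclusive lock, parse everything, change one pair,
# rewind, re-write the entire file, close. Passing undef deletes the subkey.
sub write_value {
    my ( $file, $subkey, $value ) = @_;
    sysopen( my $fh, $file, O_RDWR | O_CREAT ) or die "Cannot open $file: $!";
    flock( $fh, LOCK_EX ) or die "Cannot lock $file: $!";
    my $rec = parse_record($fh);
    if ( defined $value ) { $rec->{$subkey} = $value }
    else                  { delete $rec->{$subkey} }
    seek( $fh, 0, SEEK_SET ) or die "Cannot rewind $file: $!";
    truncate( $fh, 0 ) or die "Cannot truncate $file: $!";
    print {$fh} "$_=$rec->{$_}\n" for sort keys %$rec;
    close($fh) or die "Cannot close $file: $!";
    return;
}
```

With these, `$db{ $key }{ $subkey } = 12345;` would amount to `write_value( "db/$key.dat", $subkey, 12345 )`, and every call pays for the whole lock, parse and re-write cycle.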

I hope it will be clear that whilst your envisaged API would be relatively trivial to implement, even in a thread-safe manner, it would be very slow. Even with the OS filesystem cache working for you--assuming your files are, and will remain, really quite small--every access requires reading and parsing the entire file, and every change requires re-writing it as well.

And whilst filesystem locking is reliable, it imposes considerable wait-states upon the application. Get two or three threads competing for access to (even different keys within) the same file and it could take whole seconds to read/write a single value. I.e. hundreds of thousands of times slower than accessing a variable in memory.

The "obvious" thing to do then, is cache the files in memory. To maintain coherence across threads, this would need to be shared memory, which whilst considerably slower than non-shared, is far faster than (even cached) disk. The problems with this are:

Shared memory (in the threads::shared sense) is limited to the current process.
Changes will not be reflected on disk until the process "flushes its cache to disk". And that slows everything down to disk speed again.
You now also have the problem of deciding how many files to cache in memory; and for how long.
Too many and/or too long and you risk consuming large amounts of memory. Possibly running out.
Too few or too frequent and you're back to the problems of uncached, multiple disk accesses (lock;read;write;unlock) per subkey access or change.
You also have the not inconsiderable task of ensuring that your database remains coherent in the face of aborts, traps and unscheduled system shutdowns.
Not to mention hardware failures, backups, et al.!

So, uncached, it is a relatively trivial thing to implement, but will be very, very slow.

In the face of concurrency--regardless of whether it's processes or threads--life gets very complicated, very quickly. Especially if performance is any kind of factor at all. And if you need to cater for both process and thread, concurrency and coherence, it gets very, very complicated--and slow.

The archetypal solution to these problems is a serialised, client-server architecture--e.g. your typical RDBMS--but they are only truly effective if you perform queries and updates en masse. As soon as you start accessing and updating individual key/value pairs one at a time, you have to factor in the communications, request serialisation, transaction and logging overheads, in addition to the fact that the disk may need to be read (and sometimes written). And of course, along the way you've lost your primary goal of human-editable persistent storage.

The simplest mechanism--if you can guarantee only one, multi-threaded process at a time will ever need to be running--would be to load the files at startup into a shared hash of hashes and only write it back to disk when the program shuts down.

A slightly more sophisticated model--under the same assumptions as above--would be to wrap the existing threads::shared tie-like interface in a second level of tie that demand-loads individual files the first time they are accessed.
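
To make the demand-loading idea concrete, here is a minimal sketch of just that outer layer. It is an assumption-laden illustration, not a finished module: the class name `DemandLoadDB` is hypothetical, records live in `<dir>/<key>.dat` as `subkey=value` lines, and the cache is a plain (unshared) hash, so layering it over threads::shared and writing changes back to disk are both left out.

```perl
use strict;
use warnings;

{
    package DemandLoadDB;    # hypothetical class name

    # Remember the directory; no file is read at tie time.
    sub TIEHASH {
        my ( $class, $dir ) = @_;
        return bless { dir => $dir, cache => {} }, $class;
    }

    # Read <dir>/<key>.dat the first time <key> is touched, then reuse
    # the cached record. A missing file yields an empty record.
    sub _load {
        my ( $self, $key ) = @_;
        return $self->{cache}{$key} if exists $self->{cache}{$key};
        my %rec;
        if ( open my $fh, '<', "$self->{dir}/$key.dat" ) {
            while ( my $line = <$fh> ) {
                chomp $line;
                my ( $subkey, $value ) = split /=/, $line, 2;
                $rec{$subkey} = $value if defined $value;
            }
            close $fh;
        }
        return $self->{cache}{$key} = \%rec;
    }

    sub FETCH { my ( $self, $key ) = @_; return $self->_load($key) }

    sub STORE {
        my ( $self, $key, $rec ) = @_;
        $self->{cache}{$key} = {%$rec};    # keep a copy of the new record
        return;
    }

    sub EXISTS {
        my ( $self, $key ) = @_;
        return exists $self->{cache}{$key} || -e "$self->{dir}/$key.dat";
    }
}
```

After `tie my %db, 'DemandLoadDB', 'db';`, an expression such as `$db{alpha}{subkey}` reads `db/alpha.dat` on first use only; later accesses hit the in-memory copy, though each one still goes through the tie dispatch.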

The problem is that shared hashes are already pretty slow because of their combination of tying and locking. And tie itself isn't quick. Combine the two and you're back to a fairly heavy performance penalty. Though still far, far less than locking, reading and writing entire files for every key/value change.

Not what you'll want to hear, but maybe it'll help you reach a decision as to which way to go.

A general description of your envisaged application--CGI or command line or GUI; short or long running; volumes of data; volumes of keys/subkeys involved--might engender better targeted responses or alternatives for your problem.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"I'd rather go naked than blow up my ass"

Re^6: Database in a folder? by AriSoft (Sexton) on Feb 18, 2010 at 20:20 UTC
"So, uncached, it is a relatively trivial thing to implement, but will be very, very slow."

Thank you for your deep analysis. It solved my problem of finding a suitable tie module. I now understand better why it is not as trivial as I presumed to get a universal solution. A tied system means lots of unnecessary file access, and caching means problems with multitasking.

I decided to go with my own solution. My program keeps practically all data in memory, but when data is updated it writes the whole record to the file after the record is fully updated.

`open (DFH,">",DATADIR.$key) or die; print DFH "$_=>$record{$_}\n" foreach (keys %record); close DFH;`

When the program loads, it scans the directory and reads all records into memory. At this time I do not need sharing between processes, so I am using shared variables for threads. Otherwise I would have to check flocks and timestamps every time I access a record again.

`open (DFH,"<",DATADIR.$key) or die $key; my %record = split(/=>|\n/, do { local $/; <DFH> }); close DFH;`

This way the latest data is always in the file system and is user-modifiable between runs.
Re^7: Database in a folder? by BrowserUk (Patriarch) on Feb 18, 2010 at 23:23 UTC
"when data is updated it writes the whole record to the file"

That's a nice compromise for data integrity purposes. A possible enhancement, as your data is in shared memory, would be to start another background thread with a queue and offload the write-back to disk from your processing threads, by queueing the primary key (filename) of updated records.

"When the program loads it scans the directory and reads all records in memory. At this time I do not need sharing between processes"

As long as your data continues to fit in memory, and you're only operating on it from one process, that seems like an effective strategy.
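
A minimal sketch of that background write-back, assuming a Perl built with ithreads. The helper names `write_record` and `start_writer` are hypothetical, records are shared hashes stored in a shared top-level hash, and the file format follows the `key=>value` lines used above. An `undef` placed on the queue is used as the shutdown signal.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

# Write one record (a reference to a shared hash) to "$dir/$key".
sub write_record {
    my ( $dir, $key, $rec ) = @_;
    my %copy;
    {
        lock(%$rec);        # snapshot under the lock ...
        %copy = %$rec;
    }                       # ... and do the slow disk I/O outside it
    open( my $fh, '>', "$dir/$key" ) or die "Cannot write $dir/$key: $!";
    print {$fh} "$_=>$copy{$_}\n" for sort keys %copy;
    close($fh) or die "Cannot close $dir/$key: $!";
    return;
}

# Start the writer thread. It blocks in dequeue() until a key arrives
# and exits when it dequeues undef.
sub start_writer {
    my ( $dir, $db ) = @_;    # $db: reference to the shared hash of records
    my $queue  = Thread::Queue->new;
    my $writer = threads->create(
        sub {
            while ( defined( my $key = $queue->dequeue ) ) {
                my $rec = $db->{$key} or next;    # record deleted meanwhile
                write_record( $dir, $key, $rec );
            }
        }
    );
    return ( $queue, $writer );
}
```

A processing thread would then call `$queue->enqueue($key)` after finishing an update instead of writing the file itself; at shutdown, `$queue->enqueue(undef); $writer->join;` lets the pending writes drain first. Queueing the same key twice simply writes the record twice, which is harmless but could be coalesced.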