in reply to Re^3: Database in a folder?
in thread Database in a folder?
I am looking for these features:
1) Persistent storage in user-editable form, which means text files indexed by filenames.
2) Shared storage between threads. That means file locking. It could copy the idea of the lock() function.
3) The amount of data is small enough for this kind of lightweight database model. No internal caching is needed, or even allowed.
4) Data will be accessed through a tied variable like ${$key}{$subkey}, which should point to the file \somedatadir\key containing subkeys as "subkey=>data" lines or something similar in a single file (sketched below).
DBM::Deep looks interesting, but I think it does not satisfy requirement no. 1.
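To make requirement 4 concrete, something along these lines is what I have in mind (only a rough sketch; the directory name, file format and helper are placeholders, not an existing module):

```perl
#!/usr/bin/perl
# Rough sketch of the layout in requirement 4 (names are placeholders):
# the file /somedatadir/key holds one "subkey=>data" pair per line, and
# reading it back gives the hash that ${$key}{$subkey} would expose.
use strict;
use warnings;

sub read_record {
    my ($key) = @_;
    my %rec;
    open my $fh, '<', "/somedatadir/$key" or return {};
    while (my $line = <$fh>) {
        chomp $line;
        my ($subkey, $data) = split /=>/, $line, 2;
        $rec{$subkey} = $data if defined $data;
    }
    close $fh;
    return \%rec;
}

my $record = read_record('key');
print "$_ => $record->{$_}\n" for sort keys %$record;
```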
Re^5: Database in a folder?
by BrowserUk (Patriarch) on Feb 17, 2010 at 18:07 UTC
Dealing with the last two requirements first: I hope it will be clear that whilst your envisaged API would be relatively trivial to implement, even in a thread-safe manner, it would be very slow. Even with the OS filesystem cache working for you--assuming your files are, and will remain, really quite small--almost every access requires reading, parsing, and re-writing the entire file. And whilst filesystem locking is reliable, it imposes considerable wait-states upon the application. Get two or three threads competing for access to (even different keys within) the same file and it could take whole seconds to read/write a single value. I.e., hundreds of thousands of times slower than accessing a variable in memory.

The "obvious" thing to do then is cache the files in memory. To maintain coherence across threads, this would need to be shared memory, which, whilst considerably slower than non-shared, is far faster than (even cached) disk. But that approach brings its own set of problems.

So, uncached, it is a relatively trivial thing to implement, but will be very, very slow. In the face of concurrency--regardless of whether it's processes or threads--life gets very complicated, very quickly. Especially if performance is any kind of factor at all. And if you need to cater for both process and thread concurrency and coherence, it gets very, very complicated--and slow.

The archetypical solution to these problems is a serialised, client-server architecture--e.g. your typical RDBMS--but they are only truly effective if you perform queries and updates en masse. As soon as you start accessing and updating individual key/value pairs one at a time, you have to factor in the communications, request serialisation, transaction and logging overheads, in addition to the fact that the disk may need to be read (and sometimes written). And of course, along the way you've lost your primary goal of human-editable persistent storage.

The simplest mechanism--if you can guarantee that only one, multi-threaded process at a time will ever need to be running--would be to load the files at startup into a shared hash of hashes and only write it back to disk when the program shuts down.

A slightly more sophisticated model--under the same assumptions as above--would be to wrap the existing threads::shared tie-like interface in a second level of tie that demand-loaded individual files the first time they are accessed. The problem is that shared hashes are already pretty slow because of their combination of tying and locking. And tie itself isn't quick. Combine the two and you're back to a fairly heavy performance penalty. Though still far, far less than locking, reading and writing entire files for every key/value change.

Not what you'll want to hear, but maybe it'll help you reach a decision as to which way to go. A general description of your envisaged application--CGI or command line or GUI; short or long running; volumes of data; volumes of keys/subkeys involved--might engender better-targeted responses or alternatives for your problem.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
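For illustration, a minimal sketch of that uncached lock/read/parse/rewrite cycle, assuming the one-file-per-key, "subkey=>data" layout from the question (the path and helper name are placeholders, not a real module):

```perl
#!/usr/bin/perl
# Sketch of the uncached approach: every single update locks the key
# file, reads and parses it completely, then rewrites the whole thing.
# The path and "subkey=>data" format are assumptions from the question.
use strict;
use warnings;
use Fcntl qw(:flock :seek);

sub update_subkey {
    my ($file, $subkey, $value) = @_;

    # The file is assumed to already exist.
    open my $fh, '+<', $file or die "Cannot open $file: $!";
    flock $fh, LOCK_EX or die "Cannot lock $file: $!";

    # Read and parse the entire file while holding the lock.
    my %rec;
    while (my $line = <$fh>) {
        chomp $line;
        my ($k, $v) = split /=>/, $line, 2;
        $rec{$k} = $v if defined $v;
    }
    $rec{$subkey} = $value;

    # Rewrite the whole file in place.
    seek $fh, 0, SEEK_SET or die "Cannot seek: $!";
    truncate $fh, 0       or die "Cannot truncate: $!";
    print {$fh} "$_=>$rec{$_}\n" for sort keys %rec;

    close $fh;    # also releases the lock
}

update_subkey('/somedatadir/key', 'subkey', 'newdata');
```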
by AriSoft (Sexton) on Feb 18, 2010 at 20:20 UTC
"So, uncached, it is a relatively trivial thing to implement, but will be very, very slow."

Thank you for your deep analysis. It solved my problem of finding a suitable tie module: I now understand better why a universal solution is not as trivial as I presumed. A tied system means lots of unnecessary file accesses, and caching means problems with multitasking. I decided to go with my own solution. My program keeps practically all data in memory, but when data is updated it writes the whole record back to its file once the record has been fully updated.

When the program loads it scans the directory and reads all records into memory. At this time I do not need sharing between processes, so I am using shared variables for threads. Otherwise I would have to check flocks and timestamps every time I access a record again.

This way the last available data is always in the file system and is user-modifiable between runs.
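In outline, something like this (the directory name, record format and routine names here are only placeholders, not the actual program):

```perl
#!/usr/bin/perl
# Sketch of the approach described above: scan the directory at startup,
# keep everything in a shared hash for the threads, and write a whole
# record file back whenever that record has been fully updated.
# Directory name and "subkey=>value" format are assumptions.
use strict;
use warnings;
use threads;
use threads::shared;

my $datadir = '/somedatadir';
my %db : shared;

# Load every record file into memory at startup.
opendir my $dh, $datadir or die "Cannot open $datadir: $!";
for my $key (grep { -f "$datadir/$_" } readdir $dh) {
    my $rec = &share({});    # fresh shared hash for this record
    open my $fh, '<', "$datadir/$key" or next;
    while (my $line = <$fh>) {
        chomp $line;
        my ($subkey, $value) = split /=>/, $line, 2;
        $rec->{$subkey} = $value if defined $value;
    }
    close $fh;
    $db{$key} = $rec;
}
closedir $dh;

# After a record has been fully updated, write the whole file back.
sub save_record {
    my ($key) = @_;
    my $rec = $db{$key};
    lock %{$rec};    # keep the snapshot consistent while writing
    open my $fh, '>', "$datadir/$key" or die "Cannot write $key: $!";
    print {$fh} "$_=>$rec->{$_}\n" for sort keys %{$rec};
    close $fh;
}
```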
by BrowserUk (Patriarch) on Feb 18, 2010 at 23:23 UTC
"when data is updated it writes the whole record to the file"

That's a nice compromise for data integrity purposes. A possible enhancement, as your data is in shared memory, would be to start another background thread with a queue and offload the write-back to disk from your processing threads, by queueing the primary key (filename) of updated records.

"When the program loads it scans the directory and reads all records in memory. At this time I do not need sharing between processes"

As long as your data continues to fit in memory, and you're only operating on it from one process, that seems like an effective strategy.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
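A minimal sketch of that enhancement, assuming a placeholder save_record() stands in for whatever routine writes a whole record file back to disk:

```perl
#!/usr/bin/perl
# Sketch of the suggested enhancement: processing threads only enqueue
# the primary key (filename) of an updated record; one background thread
# drains the queue and does the disk write-back. save_record() here is
# just a stand-in for the routine that rewrites a whole record file.
use strict;
use warnings;
use threads;
use Thread::Queue;

sub save_record {
    my ($key) = @_;
    warn "would rewrite record file for '$key'\n";    # placeholder
}

my $to_save = Thread::Queue->new();

# A single background writer serialises all disk writes.
my $writer = threads->create(sub {
    while (defined(my $key = $to_save->dequeue())) {
        save_record($key);
    }
});

# In a processing thread, after a record has been fully updated:
$to_save->enqueue('somekey');

# At shutdown: tell the writer to finish, then wait for it.
$to_save->enqueue(undef);
$writer->join();
```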
Re^5: Database in a folder?
by JavaFan (Canon) on Feb 18, 2010 at 23:49 UTC
Consider the following. Your database monitors your flock of sheep and the number of gold coins in your pocket: ${$sheep}{$count} stored in \somedatadir\sheep, and ${$coins}{$gold} stored in \somedatadir\coins. Now you sell a sheep for 15 coins. You've got to subtract 1 from one counter and add 15 to another. What happens if your program fails after finishing one task and before starting the other? Your data is no longer consistent; in the best case you go bankrupt, and in the worst case you'll be arrested after failing the SEC's audit.
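In code form, the hazard looks roughly like this (the write helper is only a stand-in for however the files actually get rewritten):

```perl
#!/usr/bin/perl
# Sketch of the consistency hazard described above: two related updates
# go to two separate files, with nothing making them atomic as a pair.
use strict;
use warnings;

sub write_value {    # stand-in for "rewrite the whole key file"
    my ($file, $subkey, $value) = @_;
    open my $fh, '>', $file or die "Cannot write $file: $!";
    print {$fh} "$subkey=>$value\n";
    close $fh;
}

# Sell one sheep for 15 coins: two files must change together.
write_value('/somedatadir/sheep', 'count', 9);    # sheep updated...

# ...if the program dies here (crash, kill, power loss), the coins
# file still holds the old value and the data is inconsistent.

write_value('/somedatadir/coins', 'gold', 115);
```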
by Anonymous Monk on Feb 19, 2010 at 03:27 UTC
So you don't keep any information in non-ACID-compliant storage, huh? You're a programmer, right, so your source code is pretty important to you.
by AriSoft (Sexton) on Feb 19, 2010 at 17:48 UTC
Atomicity. Consistency. Isolation. Durability. Have you ever seen a database which has lost consistency and durability, giving you great isolation from your data, which practically appears to have been broken into atoms? I think you have :-) There must be some reason why /etc/passwd is not in a database.