- Persistent storage in text files - I can understand the motivation behind that.
- Shared storage between threads - I'll come back to this.
- No internal caching needed or allowed.
- $db{ $key }{ $subkey } = 12345; becomes db\key.dat containing subkey=12345
Dealing with the last two first: those two requirements mean the following (there is a rough code sketch of both sequences after these lists):
- every read of a value will require:
  - locking the file (possibly waiting to acquire the lock);
  - opening the file;
  - reading and parsing the file to find the subkey and extract the value;
  - closing the file;
  - releasing the lock.
- every update, insertion, or deletion of a value will require:
  - locking the file (possibly waiting to acquire the lock);
  - opening the file;
  - reading and parsing the entire file to find all the subkeys and values;
  - changing, deleting, or adding the subkey/value pair;
  - rewinding the file;
  - re-writing the entire file (extending or truncating as necessary);
  - closing the file;
  - releasing the lock.
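To make those costs concrete, here is a rough sketch of what a single uncached read and a single uncached update would have to do. The db/$key.dat filename scheme and the one-"subkey=value"-pair-per-line format are just my reading of your example above, and error handling is all but absent:

    use strict;
    use warnings;
    use Fcntl qw( :flock O_RDWR O_CREAT );

    sub read_value {
        my( $key, $subkey ) = @_;
        open my $fh, '<', "db/$key.dat" or return undef;
        flock( $fh, LOCK_SH ) or die "lock: $!";        # possibly waiting for the lock
        my $value;
        while( my $line = <$fh> ) {                     # read and parse until found
            chomp $line;
            my( $k, $v ) = split /=/, $line, 2;
            if( defined $k and $k eq $subkey ) {
                $value = $v;
                last;
            }
        }
        close $fh;                                      # closing also releases the lock
        return $value;
    }

    sub write_value {
        my( $key, $subkey, $value ) = @_;
        sysopen( my $fh, "db/$key.dat", O_RDWR | O_CREAT ) or die "open: $!";
        flock( $fh, LOCK_EX ) or die "lock: $!";        # other threads/processes wait here
        my %pairs = map { chomp; split /=/, $_, 2 }     # read and parse the whole file
                    grep { /=/ } <$fh>;
        $pairs{ $subkey } = $value;                     # change or add the pair
        seek( $fh, 0, 0 );                              # rewind ...
        truncate( $fh, 0 );                             # ... truncate ...
        print { $fh } "$_=$pairs{ $_ }\n"               # ... and rewrite the lot
            for sort keys %pairs;
        close $fh;                                      # releases the lock
    }

Every one of those steps happens for every single value touched, which is where the time goes.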
I hope it is clear that whilst your envisaged API would be relatively trivial to implement, even in a thread-safe manner, it would be very slow. Even with the OS filesystem cache working for you--and assuming your files are, and will remain, really quite small--almost every access requires reading, parsing, and re-writing an entire file.
And whilst filesystem locking is reliable, it imposes considerable wait-states upon the application. Get two or three threads competing for access to (even different keys within) the same file and it could take whole seconds to read or write a single value. I.e. hundreds of thousands of times slower than accessing a variable in memory.
The "obvious" thing to do then, is cache the files in memory. To maintain coherence across threads, this would need to be shared memory, which whilst considerably slower than non-shared, is far faster than (even cached) disk. The problems with this are:
- that shared memory (in the threads::shared sense) is limited to the current process.
Changes will not be reflected on disk until the process "flushes its cache to disk". And that slows everything down to disk speed again.
- You now also have the problem of deciding how many files to cache in memory, and for how long (a bare-bones sketch of that trade-off follows this list).
Too many and/or too long and you risk consuming large amounts of memory. Possibly running out.
Too few or too frequent and you're back to the problems of uncached, multiple disk accesses (lock; read; write; unlock) per subkey access or change.
- You also have the not inconsiderable task of ensuring that your database remains coherent in the face of aborts, traps and unscheduled system shutdowns.
Not to mention hardware failures, backups, et al.!
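To illustrate just that sizing/eviction decision (ignoring the locking, sharing and crash-coherency issues above; $MAX_FILES, the helper names and the flat "subkey=value" format are all my own assumptions), a bare-bones least-recently-used scheme might look like this:

    use strict;
    use warnings;

    my $MAX_FILES = 64;    # illustrative cap; choosing this number *is* the trade-off
    my %cache;             # $key => { data => \%subkey_values, atime => $epoch_seconds }

    sub load_file {        # cache miss: pay the full read-and-parse cost
        my( $key ) = @_;
        open my $fh, '<', "db/$key.dat" or return {};
        my %pairs = map { chomp; split /=/, $_, 2 } grep { /=/ } <$fh>;
        close $fh;
        return \%pairs;
    }

    sub save_file {        # eviction (or shutdown): pay the full rewrite cost
        my( $key, $data ) = @_;
        open my $fh, '>', "db/$key.dat" or die "db/$key.dat: $!";
        print { $fh } "$_=$data->{ $_ }\n" for sort keys %$data;
        close $fh;
    }

    sub cached_file {
        my( $key ) = @_;
        if( my $entry = $cache{ $key } ) {      # hit: memory speed
            $entry->{ atime } = time;
            return $entry->{ data };
        }
        if( keys( %cache ) >= $MAX_FILES ) {    # too many cached: evict the least-recently-used
            my( $oldest ) = sort { $cache{ $a }{ atime } <=> $cache{ $b }{ atime } } keys %cache;
            save_file( $oldest, delete( $cache{ $oldest } )->{ data } );
        }
        $cache{ $key } = { data => load_file( $key ), atime => time };
        return $cache{ $key }{ data };
    }

Every knob in there ($MAX_FILES, when to write dirty files back, what happens on a crash) is a decision you have to get right, and each one trades memory against disk traffic.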
So, uncached, it is a relatively trivial thing to implement, but will be very, very slow.
In the face of concurrency--regardless of whether it's processes or threads--life gets very complicated, very quickly. Especially if performance is any kind of factor at all. And if you need to cater for both process and thread concurrency, and coherence, it gets very, very complicated--and slow.
The archetypal solution to these problems is a serialised, client-server architecture--e.g. your typical RDBMS--but these are only truly effective if you perform queries and updates en masse. As soon as you start accessing and updating individual key/value pairs one at a time, you have to factor in the communications, request serialisation, transaction and logging overheads, in addition to the fact that the disk may need to be read (and sometimes written). And of course, along the way you've lost your primary goal of human-editable persistent storage.
The simplest mechanism--if you can guarantee that only one multi-threaded process at a time will ever need to be running--would be to load the files at startup into a shared hash of hashes and only write it back to disk when the program shuts down.
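Under that assumption, a minimal sketch of that shape might be as below. The db/*.dat layout and "subkey=value" lines are my reading of your example; no locking is shown and there is no error recovery:

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my %db :shared;

    # Startup: slurp every db/*.dat file into a shared hash of hashes.
    # shared_clone() makes the nested, per-file hashes shared too.
    for my $file ( glob 'db/*.dat' ) {
        my( $key ) = $file =~ m{([^/\\]+)\.dat$};
        open my $fh, '<', $file or die "$file: $!";
        my %pairs = map { chomp; split /=/, $_, 2 } grep { /=/ } <$fh>;
        close $fh;
        $db{ $key } = shared_clone( \%pairs );
    }

    # ... worker threads read and write $db{ $key }{ $subkey },
    # taking lock( %db ) around anything that must be atomic ...

    # Shutdown: write the whole lot back out in one go.
    END {
        for my $key ( keys %db ) {
            open my $fh, '>', "db/$key.dat" or die "db/$key.dat: $!";
            print { $fh } "$_=$db{ $key }{ $_ }\n" for sort keys %{ $db{ $key } };
            close $fh;
        }
    }

The obvious cost is that anything not written back before an abnormal exit is lost, which is the coherency problem again.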
A slightly more sophisticated model--under the same assumptions as above--would be to wrap the existing threads::shared tie-like interface in a second level of tie that demand-loads individual files the first time they are accessed.
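Very roughly, the shape of such a second-level tie might be as below. This is heavily hedged: the file layout and names are mine, only a few of the tie methods are shown, and whether the tie magic behaves itself across thread creation is exactly the sort of complication that makes this only "slightly" more sophisticated on paper:

    package DemandTie;
    use strict;
    use warnings;
    use threads;
    use threads::shared;

    # The object carries one shared cache hash. The first FETCH of a top-level
    # key loads and parses db/$key.dat into it; subsequent FETCHes hit memory.
    sub TIEHASH {
        my( $class ) = @_;
        return bless { cache => shared_clone( {} ) }, $class;
    }

    sub FETCH {
        my( $self, $key ) = @_;
        my $cache = $self->{ cache };
        lock( %$cache );
        unless( exists $cache->{ $key } ) {     # first touch: demand-load the file
            my %pairs;
            if( open my $fh, '<', "db/$key.dat" ) {
                %pairs = map { chomp; split /=/, $_, 2 } grep { /=/ } <$fh>;
                close $fh;
            }
            $cache->{ $key } = shared_clone( \%pairs );
        }
        return $cache->{ $key };
    }

    sub STORE {
        my( $self, $key, $value ) = @_;
        lock( %{ $self->{ cache } } );
        $self->{ cache }{ $key } = shared_clone( $value );
    }

    sub EXISTS {
        my( $self, $key ) = @_;
        return exists $self->{ cache }{ $key };
    }

    package main;

    tie my %db, 'DemandTie';
    my $age = $db{ fred }{ age };   # reads db/fred.dat only on the first touch of 'fred'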
The problem is that shared hashes are already pretty slow because of their combination of tying and locking. And tie itself isn't quick. Combine the two and you're back to a fairly heavy performance penalty. Though still far, far less than locking, reading and writing entire files for every key/value change.
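If you want to put a number on that penalty for your own build, a trivial Benchmark comparison along these lines (variable names are mine; absolute figures will vary wildly by platform) shows the gap between a plain hash and a threads::shared one:

    use strict;
    use warnings;
    use threads;
    use threads::shared;
    use Benchmark qw( cmpthese );

    my %plain           = ( key => 1 );
    my %shared :shared  = ( key => 1 );

    # Compare a read-modify-write of one element in each flavour of hash.
    cmpthese( -2, {
        plain_hash  => sub { my $x = $plain{ key };  $plain{ key }  = $x + 1 },
        shared_hash => sub { my $x = $shared{ key }; $shared{ key } = $x + 1 },
    } );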
Not what you'll want to hear, but maybe it'll help you reach a decision as to which way to go.
A general description of your envisaged application--CGI, command line or GUI; short- or long-running; volumes of data; numbers of keys/subkeys involved--might engender better-targeted responses or alternatives for your problem.