Oh crap, man... you bring tears of joy and insanity to my eyes. This is what I lost the first two weeks of this month to.
I'm making a web app to let people interact with files across the web. Yay!... Suddenly I feel like I just walked into a party with beautiful people and I forgot to put my pants on. Oh, and I'm covered in excrement.
So users have to be able to find files, get info on files... and there are a *lot* of freaking files.
All security issues aside (just because they are out of scope for this node)... I went off on an insane chase to rebuild a filesystem, and I had no idea I was doing it.
I need a MySQL table that stores info on files, just like you say: md5 and everything. An indexer.
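Something like this, I mean (a rough sketch; database name, credentials, and column choices are made up for illustration, not my real schema):

    use strict;
    use warnings;
    use DBI;

    # rough sketch - database name, credentials, and columns are made up
    my $dbh = DBI->connect('DBI:mysql:database=fileindex', 'user', 'pass',
        { RaiseError => 1 });

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS files (
            id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            dir   VARCHAR(255) NOT NULL,  -- containing directory
            path  VARCHAR(255) NOT NULL,  -- full path to the file
            mtime INT UNSIGNED NOT NULL,  -- mod time, unix epoch
            size  INT UNSIGNED NOT NULL,
            md5   CHAR(32),               -- hex digest, NULL until computed
            UNIQUE KEY (path),
            KEY (dir)
        )
    });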
The first thing it did was find every freaking dir and make sure it was in the db table; it used find to get mod times too. Then it queried the db for every dir record and mod time previously recorded. Then it compared mod times and existence: it deleted all db records that were no longer on disk, and inserted all new dirs not yet in the db.
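In rough Perl, that first pass looked something like this (a from-memory sketch, not my real code; the root path, connection info, and the dirs table are made up):

    use strict;
    use warnings;
    use DBI;
    use File::Find;

    my $root = '/var/repo';   # made-up path
    my $dbh  = DBI->connect('DBI:mysql:database=fileindex', 'user', 'pass',
        { RaiseError => 1 });

    # 1) walk the disk, collecting every dir and its mod time
    my %on_disk;
    find(sub { $on_disk{$File::Find::name} = (stat $_)[9] if -d }, $root);

    # 2) pull every dir record and mod time previously stored in the db
    my %in_db = map { @$_ }
        @{ $dbh->selectall_arrayref('SELECT path, mtime FROM dirs') };

    # 3) delete db records for dirs that are no longer on disk
    for my $path (keys %in_db) {
        $dbh->do('DELETE FROM dirs WHERE path = ?', undef, $path)
            unless exists $on_disk{$path};
    }

    # 4) insert dirs on disk that the db has never seen
    for my $path (keys %on_disk) {
        $dbh->do('INSERT INTO dirs (path, mtime) VALUES (?, ?)',
                 undef, $path, $on_disk{$path})
            unless exists $in_db{$path};
    }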
Second part. So now we have a list of all dirs that are out of sync by mod time, and I do this per directory:
Get a listing of all regular files and their mod times from the filesystem, and do the same for all records in the db.
Then compare them: delete db records for files no longer on disk, and insert new records (ones that were not in the db).
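Per directory, roughly (same caveats as above; the files table matches the schema sketch):

    # per-directory sync (sketch; 'files' table as in the schema above)
    sub sync_dir {
        my ($dbh, $dir) = @_;

        # regular files on disk, with mod times
        my %on_disk;
        opendir my $dh, $dir or die "can't opendir $dir: $!";
        for my $name (readdir $dh) {
            my $path = "$dir/$name";
            next unless -f $path;            # regular files only
            $on_disk{$path} = (stat _)[9];   # reuse the -f stat cache
        }
        closedir $dh;

        # the same listing, from the db
        my %in_db = map { @$_ } @{ $dbh->selectall_arrayref(
            'SELECT path, mtime FROM files WHERE dir = ?', undef, $dir) };

        # delete db records for files no longer on disk
        $dbh->do('DELETE FROM files WHERE path = ?', undef, $_)
            for grep { !exists $on_disk{$_} } keys %in_db;

        # insert new files the db has not seen
        $dbh->do('INSERT INTO files (dir, path, mtime) VALUES (?, ?, ?)',
                 undef, $dir, $_, $on_disk{$_})
            for grep { !exists $in_db{$_} } keys %on_disk;
    }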
Since this was a web thing, I do not actually update the db continuously, not exactly. When a user is in a directory, the app tests the mod time on disk for that dir and compares it to the db's mod time for the same dir. If they differ, it runs the indexer, and then you see the data. Worked blazing fast for all kinds of junk, md5sum too, which can be time-eating.
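The check itself is dirt cheap; something like this (sketch):

    # on each page view: re-index only if this dir changed since last look
    sub maybe_reindex {
        my ($dbh, $dir) = @_;
        my $disk_mtime = (stat $dir)[9];
        my ($db_mtime) = $dbh->selectrow_array(
            'SELECT mtime FROM dirs WHERE path = ?', undef, $dir);
        if (!defined $db_mtime or $db_mtime != $disk_mtime) {
            sync_dir($dbh, $dir);   # the per-directory pass from above
            $dbh->do('UPDATE dirs SET mtime = ? WHERE path = ?',
                     undef, $disk_mtime, $dir);
        }
    }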
But... I had a lot of trouble with data corruption, indeed.
I abandoned the whole thing. It's too unreliable to keep up to date. A LOT can go freaking wrong, and the data can become really corrupt. The problem is: the more you do to keep it current, the more taxing it is, and the more processor power you are asking for... and it sort of defeats the entire purpose!
I am really tempted to try it again, only using a db created with the updatedb and locatedb stuff. Any ideas?
How do I import a file database (locatedb) created by updatedb into a MySQL table? Is this something I could do every hour for a repo of 30k files?
What kind of db is one of these, anyway?
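As far as I can tell, a locatedb isn't a relational database at all, just a front-compressed (frcode) sorted list of pathnames, so you can't query it directly; the path of least resistance seems to be letting locate itself decompress it and bulk-loading the paths. A sketch (with GNU findutils, locate -r . prints every path in the db; check your locate's flags, mlocate spells it --regexp; table and credentials are made up):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=fileindex', 'user', 'pass',
        { RaiseError => 1, AutoCommit => 0 });

    # refresh the whole table each run - 30k rows an hour is no big deal
    $dbh->do('TRUNCATE TABLE locate_paths');
    my $sth = $dbh->prepare('INSERT INTO locate_paths (path) VALUES (?)');

    open my $loc, '-|', 'locate', '-r', '.' or die "can't run locate: $!";
    while (my $path = <$loc>) {
        chomp $path;
        $sth->execute($path);
    }
    close $loc;
    $dbh->commit;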
Until I am able to consolidate filesystem to db reliably, I have actually been using a mix of output from find, stat, and locate to get my real-time data for the user.
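Nothing fancier than this kind of thing, really (sketch, made-up paths):

    # real-time fallback, roughly: shell out and parse (made-up paths)
    my @paths = `find /var/repo -type f -name '*.pdf'`;  # or: locate '*.pdf'
    chomp @paths;
    for my $path (@paths) {
        my ($size, $mtime) = (stat $path)[7, 9];   # Perl's stat, no subshell
        print "$path\t$size\t$mtime\n";
    }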
I should post my code somewhere (it really won't fit here); maybe we're *not* re-inventing the wheel trying to do this after all.
Update: I wanted to mention also: making sure the data was inserted for about 20k files took my script about 5 minutes. That's with no data in the db, first run. As for an update, the startup was slow, about 40 seconds (it had to get ALL info for 6k dirs on disk and 6k dirs in the db), and then an average update took maybe 1.5 to 3 minutes on the whole. Keep in mind, this is a repository with files that are constantly being changed by about 10 people.