Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
>NAME1
SEQUENCE1
>NAME2
SEQUENCE2
>NAME3
SEQUENCE3
Re: Efficient way to handle huge number of records?
by BrowserUk (Patriarch) on Dec 11, 2011 at 10:05 UTC
by flexvault (Monsignor) on Dec 11, 2011 at 11:36 UTC
Just to comment on the 32/64-bit question (re. Linux): a 32-bit Linux system can support 64 GB of memory, but each process is limited to 4 GB (usually about 3.7 GB). So each Perl application would have approximately 3.7 GB for the script and its data. In the Windows world, you are absolutely correct. Thank you.
by BrowserUk (Patriarch) on Dec 11, 2011 at 12:11 UTC
Hm. You're stretching several boundaries beyond their limits there:
Win32 can also do Physical Address Extension (PAE), and I believe it was first to do so. It can also extend the default 2 GB user space to 3 GB per process. But just like Linux, these limits are extended through a series of kludges that have drawbacks as well as benefits.
by Marshall (Canon) on Dec 11, 2011 at 12:30 UTC
by BrowserUk (Patriarch) on Dec 11, 2011 at 13:46 UTC
by flexvault (Monsignor) on Dec 11, 2011 at 18:14 UTC
by BrowserUk (Patriarch) on Dec 11, 2011 at 18:22 UTC
by erix (Prior) on Dec 11, 2011 at 14:11 UTC
"Any DB that couldn't handle that few records would not be worthy of the name. Even MySQL or SQLite should easily handle low billions of records without trouble."
I would be quite interested to see SQLite do this (I may even try it myself...). In the past (the last time I tried was, I think, a couple of years ago) SQLite always proved prohibitively slow: loading multimillion-row data was so ridiculously slow (even on fast hardware) that I never bothered with further use. I'd love to hear that this has improved - SQLite is nice, when it works. Does anyone have recent datapoints? (As far as I am concerned, MySQL and BerkeleyDB, as Oracle products, are not a serious option anymore (I am convinced Oracle will make things worse for non-paying users all the time), but I am interested to know how their performance (or Oracle's itself, for that matter) compares to PostgreSQL.)
by baxy77bax (Deacon) on Dec 11, 2011 at 17:01 UTC
So if you need a db engine that is fast and reliable and can deal with lots of data, you will want SQLite. Now, as far as the initial question goes, you can do something similar to what MySQL does: split the file into chunks and index the chunks by line number, so that you know on which line the header of your sequence appears. Once you have done that, you only need to hash those indexes. This reduces the search space in proportion to the number of fragments you split your initial file into.
by BrowserUk (Patriarch) on Dec 11, 2011 at 16:37 UTC
"In the past (last time I tried was, I think, a couple of years ago) SQLite always proved prohibitively slow: loading multimillion-row data was so ridiculously slow"
I said "handle", not "handle well" :) That said, I had SQLite on my old machine and found that .import file table via the sqlite3.exe shell was substantially faster than doing inserts via SQL, whether from the command-line utility or via Perl and DBI. I wish I could get a 64-bit build for my system.
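A minimal sketch of that .import route, driven from Perl. It assumes the sqlite3 binary is on the PATH, a tab-separated records.tsv input, and a proteins.db/seq table; none of those names come from the thread.

    use strict;
    use warnings;

    # Feed dot-commands to the sqlite3 shell over a pipe; .import bypasses
    # per-row INSERT overhead entirely.
    my $commands = <<'SQL';
    CREATE TABLE IF NOT EXISTS seq ( name TEXT, sequence TEXT );
    .separator "\t"
    .import records.tsv seq
    SQL

    open my $pipe, '|-', 'sqlite3', 'proteins.db' or die "can't run sqlite3: $!";
    print $pipe $commands;
    close $pipe or die "sqlite3 exited with status $?";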
by flexvault (Monsignor) on Dec 16, 2011 at 14:25 UTC
erix, sorry to take so long; real work got in the way. Here's the updated info. I used BrowserUk's sub to generate the data. The key part was 40 bytes long, but the data part was 320 bytes, so I 'substr'ed it down to 80 (if 320 is correct, I can run the tests again). I only ran the test for 1_000_000 records, since for your purposes it didn't matter. I also multiplied the times by 1000 to get the results in milliseconds. I generated the random keys at the beginning, so that they wouldn't be in cache. Here it is:
Running it and Output:
Regards...Ed
by BrowserUk (Patriarch) on Dec 16, 2011 at 14:36 UTC
by flexvault (Monsignor) on Dec 16, 2011 at 14:46 UTC
by Marshall (Canon) on Dec 19, 2011 at 11:15 UTC
I showed the framework of some code at Re^3: Efficient way to handle huge number of records?, which is an abbreviated version of some code that I'm currently working on. One table has a million records of 50 fields. So I ran a couple of tests. The first test was with all the speed-up stuff turned off: Now I figure that qualifies as "ridiculously slow!". I actually had to run it twice because I got one of those "Windows Automatic Reboot was required" things! Bummer when that happens after one day of processing!
Using the optimizations, where by FAR and away the biggest effect is to do a single transaction, results in: Or about 3 minutes instead of 2 days! A lot better! This is fast enough for my needs. Using the bulk import utility would probably be faster, but I haven't tested that because 3 minutes doesn't bother my application. I have another app that builds a 500K-record table from 1,000 input files. It takes about half the time, or about 90 seconds. It's not worth my programming effort to emit an intermediate file in whatever format the bulk import utility needs - I just put the data into the DB right away. A reasonable programming tradeoff. Mileage varies.
It should be noted that my machine is an older one, a hyper-threaded one (before the multi-core days), the Prescott stepping - the last one with PGA (pin grid array) - and my disks are only 5K rpm (not 7K+). A more modern machine can run a single process at least 4x this fast, or about 45 seconds instead of 3 minutes (I've benchmarked my machine vs a friend's on similar tasks before). The time scales linearly, so 11M records would take 10x as long. Is that "ridiculously slow"? I don't know. I guess that depends upon the application. I do have a MySQL server running on my machine and in the past I've done some benchmarking vs SQLite. For a complicated query, MySQL is faster, but for my current projects, SQLite "wins" due to admin simplicity (none required!).
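Not Marshall's actual code (which isn't shown above), but a minimal DBD::SQLite sketch of the single-transaction load; the table layout and the tab-separated input file are assumptions.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=proteins.db', '', '',
        { RaiseError => 1, AutoCommit => 1 } );

    $dbh->do('CREATE TABLE IF NOT EXISTS seq ( name TEXT PRIMARY KEY, sequence TEXT )');
    my $ins = $dbh->prepare('INSERT INTO seq ( name, sequence ) VALUES ( ?, ? )');

    $dbh->begin_work;    # one commit at the end instead of one per row
    open my $in, '<', 'records.tsv' or die "open: $!";
    while ( my $line = <$in> ) {
        chomp $line;
        my ( $name, $sequence ) = split /\t/, $line, 2;
        $ins->execute( $name, $sequence );
    }
    $dbh->commit;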
Re: Efficient way to handle huge number of records?
by Marshall (Canon) on Dec 11, 2011 at 09:02 UTC
11,000,000 records is pretty big! If you want to do a lot of searching, then an SQL DB is a good idea. I recommend SQLite, although MySQL is fine albeit with some admin hassles. SQLite has an advantage in that the DB is just a single file, and all of the admin passwords and setting up of accounts is not necessary. However, 11 million records isn't all that big as a hash table. If you load this into an SQLite DB and then generate an index based upon NAME, a name-based query will run fast. What have you tried so far? Either MySQL or SQLite can handle this number of records easily.
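A minimal sketch of that name-indexed lookup with DBD::SQLite; it assumes the records have already been loaded into a seq(name, sequence) table, and the file and table names are made up for illustration.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=proteins.db', '', '', { RaiseError => 1 } );

    # One-off: index the NAME column so lookups don't scan the whole table.
    $dbh->do('CREATE INDEX IF NOT EXISTS seq_name_idx ON seq (name)');

    # Per query: fetch the sequence for each hit name given on the command line.
    my $sth = $dbh->prepare('SELECT sequence FROM seq WHERE name = ?');
    for my $hit (@ARGV) {
        $sth->execute($hit);
        my ($sequence) = $sth->fetchrow_array;
        print ">$hit\n$sequence\n" if defined $sequence;
    }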
by cavac (Prior) on Dec 11, 2011 at 12:07 UTC
Actually, I would be rather cautious when using MySQL, with all the trade-offs and bugs depending on which storage backend you use. These days, I really recommend switching to PostgreSQL. The license is more open (BSD instead of mixed GPL/commercial), the speed is very much OK, and the database is fully ACID compliant - which is getting rather important these days, since sooner or later you'll probably run into the issue of porting your programs and modules to run on a multithreaded webserver (a.k.a. "the cloud"). Making your stuff accessible through the Web seems to be the trend these days. Ask any manager what buzzwords he wants to use in his next speech to the shareholders, and "cloud computing" and "HTML5" seem to top the list ;-) OK, this was my personal point of view on this issue. Here is a (long) comparison of the two databases.
by Marshall (Canon) on Dec 11, 2011 at 12:50 UTC
I think that your point about PostgreSQL is a good one! But I don't think the OP needs PostgreSQL or MySQL. From what has been described, the SELECTs will be only on the NAME, and that is easy; SQLite will do the job.
by cavac (Prior) on Dec 11, 2011 at 13:54 UTC
by Anonymous Monk on Dec 11, 2011 at 09:39 UTC
I wanted to start using MySQL with Perl, but I can't seem to be able to install the MySQL module. I tried CPAN, but the command:
returns:
by Marshall (Canon) on Dec 11, 2011 at 10:43 UTC
I recommend that you start with SQLite. There is no daemon. There is no admin. It does 90% of what MySQL does for a single user, with a lot less hassle. I use the SQLite Manager plug-in for Firefox instead of the command-line interface, which I have never bothered to install on my machine. You will need to install the DBD::SQLite module, and that's about it. Try installing DBD::SQLite and see how it goes.
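For example, a quick smoke test you could run once DBD::SQLite is installed; the database file and table names here are throwaway examples.

    use strict;
    use warnings;
    use DBI;

    # Connecting with the SQLite driver creates test.db if it doesn't exist.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=test.db', '', '', { RaiseError => 1 } );

    $dbh->do('CREATE TABLE IF NOT EXISTS t ( k TEXT, v TEXT )');
    $dbh->do(q{INSERT INTO t ( k, v ) VALUES ( 'hello', 'world' )});

    my ($v) = $dbh->selectrow_array(q{SELECT v FROM t WHERE k = 'hello'});
    print "DBD::SQLite $DBD::SQLite::VERSION works: $v\n";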
Update:
Re: Efficient way to handle huge number of records?
by leuchuk (Novice) on Dec 11, 2011 at 10:10 UTC
Answer to question (I): I don't think it's a problem for Perl to handle this size. The problem could be your computer (and maybe your patience). A hash needs, as far as I remember, a minimum of about 55 bytes per hash key plus the content. 11 million times 55 = 605 million bytes, roughly 576 MB just for the overhead... Do you have enough memory for the whole file + overhead + OS + ...? If your hardware doesn't have enough memory, it will swap, and that can make your solution unacceptably slow.
Question (II): I assume that you are no database expert, are you? 11 million records is no problem at all (depending on your table definition), but your data may be. For example, your sequence might only make sense in a certain context, so that NAME2 is to be interpreted differently depending on the record NAME1 before and/or the record NAME3 after. But if that were the case, a hash wouldn't be the right solution either. Just getting the data into your database may be only half of the solution. The full answer depends on how you find those 100, 10, or 45 records. Your reading algorithm should be good enough to read the database in a rational way, meaning the Perl script doesn't read the records a zillion times... A database is no solution either if you just read the file sequentially (a "SELECT * FROM table;"), compute your results somehow, and pick your 100, 10, or 45 records; the overhead of storing your data in the database and retrieving it could kill any wins from using a database. Marshall proposed SQLite. This is really a good solution, especially if you work with a limited number of processes accessing the database at the same time. Personally I wouldn't take MySQL, I'd prefer PostgreSQL, but this is just my opinion.
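If you want a concrete number for the per-record overhead on your own perl before committing to the all-in-memory plan, a small probe along these lines works, assuming Devel::Size is installed; the key format and 500-byte value are arbitrary stand-ins.

    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    # Build a 100k-record sample hash and measure its total footprint.
    my %h;
    $h{ sprintf 'NAME%07d', $_ } = 'X' x 500 for 1 .. 100_000;

    printf "about %.0f bytes per record (key plus 500-byte value)\n",
        total_size( \%h ) / 100_000;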
by Anonymous Monk on Dec 11, 2011 at 10:22 UTC
I have a large database of records, as I wrote before, in the form of: etc. The NAME is usually 50-60 chars (small), but the SEQUENCE can be from 100 chars to maybe 2-3000 chars, as you also pointed out. What my script does (or what I want it to do) is: the user supplies an unknown query input, then I run a program called BLAST which tells me that this unknown input had 1, 2, 3... 50... 60... 200 hits in my database. So I know the NAME of each hit. My problem is that each time I need to look up the database file, retrieve the respective entries, and then create a new file (let's say an output file) which has the NAME and SEQUENCE of all the hits, to process further. Plus, my ultimate purpose is to create a webserver, so this database file will be accessed quite frequently. That's why I am asking whether a database or a simple local search through a hash (at least that's what I can think of) is more recommended.
by wfsp (Abbot) on Dec 11, 2011 at 13:34 UTC
However, it seems from your description that the "unknown query" would parse/search on the value of each key ("search through a hash"), i.e. look at each value in turn for each query. I always think of a hash as a "lookup table", and it doesn't look like this is what you'll be doing. It may be that an array would be more suitable, with less overhead at least. Something like NAME1|SEQUENCE1. According to the docs, D::D can build disc-based arrays too, although I have never tried. How unknown are your unknowns? Could there be general subsets/groups that could be preprocessed? You say there is additional processing to be done on the results; could that be done upfront? A series of additional lookups that could identify a smaller group of records to search - divide and conquer? If you did choose the RDBMS route, are there indices that could be built that would help find what you're after? If there aren't, it would, imo, likely negate any advantage. If you have to process each record for each search, then whatever method you adopt is going to be slow and possibly not suitable for a website. If you could come up with an outline of any known unknowns, there is a good chance the monks could give you some pointers on how to avoid what seems like a brute-force approach. It would be a challenge many of them would be unable to resist. :-)
by jethro (Monsignor) on Dec 11, 2011 at 14:45 UTC
by jethro (Monsignor) on Dec 11, 2011 at 14:21 UTC
Wait, is the program called BLAST something you already have, or is that the script you want to create? I.e. why would you use BLAST to search and then use a different script for retrieval? Obviously, when the retrieval script creates that output file, the number of hits is available with zero effort. If you want to use your script with a webserver, you will probably observe the following with the different solutions:
A) Perl CGI script, loading the hash into memory: really bad; every search has to load the database into memory. Even if that takes only 4 seconds, it is major disk wear and slow. Same with Storable.
B) Using mod_perl, loading the hash into memory: with a trick (using a BEGIN block to load the hash, as sketched after this list) you could read the data only once into memory. Big improvement in execution speed, but memory could be a problem, as apache usually creates multiple instances of the web server, each running the script and data. It depends on the server engine you use; a threaded engine probably could share the data, but not a forking engine (me no expert here; someone will hopefully correct me if I'm wrong). Same with Storable.
C) Using SQLite, DBM::Deep, ...: disk-based methods. Data is not loaded into memory; an index keeps retrieval reasonably fast. Obviously somewhat slower than a simple hash in memory, but still simple to program. No problem if your database grows in size (within limits), and no noticeable startup time. If the database does mostly reads and seldom writes, arguably the best solution in your case IMHO.
D) Using mysql or postgresql: apart from the web server, you have to install and manage another server (easy on linux (easy for me, not at all easy for my mother ;-) ), not so easy on windows(?)). Reading and changing data in the database is never a performance problem irrespective of size. Complex databases are possible (which you don't seem to need); the database server will get bored to death in your case ;-). Probably a notch slower than solution C, but the only solution if you want 100 simultaneous apache threads reading and writing to that database.
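A sketch of option B's BEGIN-block trick under something like ModPerl::Registry; the file path and the simple FASTA parsing are assumptions.

    use strict;
    use warnings;

    our %SEQ;

    BEGIN {
        # Runs once per interpreter under mod_perl, so the file is parsed
        # only on the first request handled by that interpreter.
        open my $in, '<', '/data/proteins.fasta' or die "open: $!";
        my $name;
        while ( my $line = <$in> ) {
            chomp $line;
            if    ( $line =~ /^>(\S+)/ ) { $name = $1 }
            elsif ( defined $name )      { $SEQ{$name} .= $line }
        }
        close $in;
    }

    # Per-request code then only does cheap lookups, e.g.:
    # print $SEQ{$wanted_name};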
Re: Efficient way to handle huge number of records?
by educated_foo (Vicar) on Dec 11, 2011 at 16:00 UTC
Then later (many times) using this index to retrieve the sequence data (also untested): Sorting your data won't really work so well, since with variable-length records you have to do a linear scan of the file to find the linebreaks.
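educated_foo's original code isn't reproduced above; the following is a rough sketch of the same idea - record each header's byte offset once, then seek straight to it for every later lookup. The file name and record layout are assumptions.

    use strict;
    use warnings;

    my $fasta = 'database.fasta';    # assumed file name
    open my $in, '<', $fasta or die "open $fasta: $!";

    # Pass 1: record the byte offset of every ">NAME" header line.
    my %offset;
    my $pos = tell $in;
    while ( my $line = <$in> ) {
        $offset{$1} = $pos if $line =~ /^>(\S+)/;
        $pos = tell $in;
    }

    # Later, many times: jump straight to a record instead of rescanning the file.
    sub fetch_sequence {
        my ($name) = @_;
        return unless exists $offset{$name};
        seek $in, $offset{$name}, 0;
        <$in>;                                 # skip the ">NAME" header line
        my $seq = '';
        while ( my $line = <$in> ) {
            last if $line =~ /^>/;             # stop at the next record
            chomp $line;
            $seq .= $line;
        }
        return $seq;
    }

    my $s = fetch_sequence('NAME2');
    print "$s\n" if defined $s;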
by Anonymous Monk on Jul 11, 2013 at 22:28 UTC
Re: Efficient way to handle huge number of records?
by erix (Prior) on Dec 11, 2011 at 16:39 UTC
On a slow desktop (which is also busy doing other database stuff), I get the below performance with PostgreSQL (9.2devel git/HEAD). A 10,000,000-sequence fasta file was made (with BrowserUK's script, adapted - thanks). This yielded a 1.2 GB junk.dat file:
Now the interesting part: retrieval (I don't know why others are not posting retrieval results?). I retrieved 10 random rows, and did that 4 times in succession, to show the caching effect.
That's not so bad, is it? And remember, this desktop (2 GB, AMD Athlon(tm) II 160u, 800 MHz -- yes, it's slow) is in the background doing searches (indexed, but also full table scans) of uniprot, which, as you perhaps know, has more than 18M records. (If you cannot get mysql installed, all of the above may also be a bit over your head; nevertheless, you should consider learning database work -- it will often come in handy in bioinformatics. And since I've now done the work, I'll post it, for what it's worth ;) ) (I may update the timings when the machine becomes otherwise idle)
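erix's load script and timing output aren't reproduced above; the following is a hedged DBD::Pg sketch of the same kind of test - COPY for the bulk load, then repeated primary-key lookups. The table layout, the tab-separated seq.tsv input, and the NAME%08d key format are assumptions.

    use strict;
    use warnings;
    use DBI;
    use Time::HiRes qw(time);

    my $dbh = DBI->connect( 'dbi:Pg:dbname=test', '', '', { RaiseError => 1, AutoCommit => 1 } );

    $dbh->do('CREATE TABLE IF NOT EXISTS seq ( name text PRIMARY KEY, sequence text )');

    # Bulk load with COPY, the fast path for multi-million-row imports.
    $dbh->do('COPY seq ( name, sequence ) FROM STDIN');
    open my $in, '<', 'seq.tsv' or die "open: $!";
    $dbh->pg_putcopydata($_) while <$in>;
    $dbh->pg_putcopyend;

    # Retrieval: 10 primary-key lookups, repeated to watch the cache warm up.
    my $sth = $dbh->prepare('SELECT sequence FROM seq WHERE name = ?');
    for my $run ( 1 .. 4 ) {
        my $t0 = time;
        for ( 1 .. 10 ) {
            $sth->execute( sprintf 'NAME%08d', 1 + int rand 10_000_000 );
            my ($sequence) = $sth->fetchrow_array;
        }
        printf "run %d: %.3f s for 10 lookups\n", $run, time - $t0;
    }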
by BrowserUk (Patriarch) on Dec 11, 2011 at 22:26 UTC
Only 5 million records, because I have an image analysis that has been running for 2 days now using over half my memory and I don't want to interrupt it. I retrieved the first 1000 records from the hash -- as near random as makes no difference -- because 10 wasn't representative. Retrieval time is 0.4 millisecs per record. The code:
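The original benchmark code isn't shown here; a minimal Time::HiRes sketch of timing a batch of in-memory hash retrievals might look like this, assuming %seq has already been populated.

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my %seq;    # assumed to already hold the records, keyed by name

    my @names = ( keys %seq )[ 0 .. 999 ];    # "first" 1000 keys, effectively random order
    my $start = time;
    my $bytes = 0;
    $bytes += length $seq{$_} for @names;     # touch each value so the lookup really happens
    printf "%.3f ms per retrieval\n", ( time() - $start ) / @names * 1000;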
by flexvault (Monsignor) on Dec 11, 2011 at 21:40 UTC
If you notice in the code, I commented out 2 lines and changed the 3rd to "A-$cnt", since the poster didn't care about random. I do, since that affects some DBs. If you want, I can run it so that after 30 seconds I select a record, and then after completion I'll read it back 1 time, and then 10 times, with timing. Is that what you want? The system is a 2-core, 2 GB AMD box at 2.5 or 2.8 GHz. When I get you the times, I'll find out exactly what it is, but for comparison I only run on 1 core. Thank you.
Re: Efficient way to handle huge number of records?
by mrguy123 (Hermit) on Dec 11, 2011 at 11:41 UTC
If the text file is the same each time, a good option that is simpler than using SQL is Storable. This way, you can load the data into a hash once and then store it as a serialized file which can be retrieved fairly quickly. I also had a similar bioinformatics problem which was solved this way. Good luck! MrGuy
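A minimal Storable round trip of the kind described; the file name and the sample record are placeholders.

    use strict;
    use warnings;
    use Storable qw(store retrieve);

    # One-off: build the hash from the text file and serialise it to disk.
    my %seq = ( protein1 => 'ASKRKGRKGTTRHTRHTRHTRHTHRHTR' );
    store \%seq, 'proteins.storable';

    # Every later run: load it back, much faster than re-parsing the text file.
    my $seq = retrieve 'proteins.storable';
    print $seq->{protein1}, "\n";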
by Anonymous Monk on Dec 11, 2011 at 12:58 UTC
I always have the same kind of database file, which is:
Apparently I'll create a "big" hash with protein1, protein2, protein3, etc. as keys and ASKRKGRKGTTRHTRHTRHTRHTHRHTR, REHGTRYUYTJUYJYTJUJUJUJYJUYJUY, EWFDEWFEFREFRE, etc. as values. I am a bit confused about the usage of this module, though, even after browsing the respective manual page... Once you create the hash, how do you call it? Is it retrieved each time the script is called? If you can, please supply a brief example of usage! Thanks a bunch!
by Marshall (Canon) on Dec 11, 2011 at 13:47 UTC
Update: Basically, when creating the DB, you want to turn all the ACID stuff off. That means: don't wait for writes to complete, don't do a complete transaction for each "add", and run the cache size up from the default of 20 MB to at least 200 MB for more efficient index creation. Run as "lean and mean" as you can.
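Expressed as DBD::SQLite settings, that recipe looks roughly like this; the values are illustrative, and the "one big transaction" part is the begin_work/commit pair around the bulk INSERTs.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=proteins.db', '', '', { RaiseError => 1 } );

    $dbh->do('PRAGMA synchronous = OFF');     # don't wait for each write to reach disk
    $dbh->do('PRAGMA cache_size  = -200000'); # roughly 200 MB of page cache (negative = KB)

    $dbh->begin_work;
    # ... bulk INSERTs go here ...
    $dbh->commit;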
by leuchuk (Novice) on Dec 11, 2011 at 14:29 UTC
by Marshall (Canon) on Dec 11, 2011 at 14:51 UTC
by mrguy123 (Hermit) on Dec 11, 2011 at 14:37 UTC
Good luck MrGuy
Re: Efficient way to handle huge number of records?
by flexvault (Monsignor) on Dec 11, 2011 at 14:10 UTC
The following code will generate the database using BerkeleyDB. I generated the keys; you would read them in from your file.
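flexvault's original listing isn't reproduced above; a minimal BerkeleyDB.pm sketch of the same idea (generated keys, on-disk B-tree) would be something like this, with the file name and record sizes as assumptions.

    use strict;
    use warnings;
    use BerkeleyDB;

    my $db = BerkeleyDB::Btree->new(
        -Filename => 'proteins.bdb',
        -Flags    => DB_CREATE,
    ) or die "cannot open database: $BerkeleyDB::Error";

    for my $n ( 1 .. 1_000_000 ) {
        my $key   = sprintf 'NAME%07d', $n;    # stand-in for names read from the file
        my $value = 'ACGT' x 20;               # stand-in for an 80-char sequence
        $db->db_put( $key, $value ) == 0 or die "db_put: $BerkeleyDB::Error";
    }

    $db->db_sync;
    $db->db_close;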
And the output would be.
And the directory where the database resides would contain:
And to test that all the data was loaded:
'diff' runs out of memory with files this large, but 'md5sum' has no problem. The extra step to verify that the database is loaded is optional, but it is nice to know you have all of the data. I test different databases all the time, so there is a lot of extra code that you can delete. I did more deleting than adding to get what you needed. Good luck.
Re: Efficient way to handle huge number of records?
by Sewi (Friar) on Dec 11, 2011 at 14:15 UTC
It might be faster or slower than a SQL-based database depending on what you're doing...
Re: Efficient way to handle huge number of records?
by TJPride (Pilgrim) on Dec 11, 2011 at 13:33 UTC
Re: Efficient way to handle huge number of records?
by sundialsvc4 (Abbot) on Dec 12, 2011 at 13:26 UTC
If you’ve got the memory, and it is uncontested memory (i.e. swapping will not occur), then your problem is indeed an easy one: “throw silicon at it,” and be done. Nothing except the CPU itself is faster than silicon. If you don’t have enough memory (but you do have a 64-bit machine), well, how much do those chips cost? Obviously, there is time spent loading all that data into memory, which may or may not be desirable. (It really depends on how much of the data you expect to process at any one time. It would be a one-time cost per run.)
Another possibility is to use (say...) an SQLite database file purely as an index to the file, storing the starting positions of the interesting bits as integer offsets from the beginning of the file. Use the index to find what you’re looking for. Also consider sorting this list of offsets (in memory, of course) into ascending order, so that the hardware can zip right through the file sequentially from front to back, seeking to always-forward positions as needed. It may not make a difference, but it might.
It is easy to get too clever and to spend a lot of time and effort implementing things that really do not matter in the end; things that sound cool in theory but that really do not speed things up vis-à-vis the time spent writing and debugging them. A simple stopwatch comes in handy. It may well be that you keep a hash of what you’re looking for and the program reads the entire file each time (or until it has found everything), and, even though the run time is “larger than it might be,” it is consistent.
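A rough sketch of that offsets-in-SQLite idea, including the forward-sorted reads; the table layout, file names, and FASTA parsing are assumptions rather than anything from the thread.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=offsets.db', '', '', { RaiseError => 1 } );
    $dbh->do('CREATE TABLE IF NOT EXISTS idx ( name TEXT PRIMARY KEY, offset INTEGER )');

    # Build the index once: NAME => byte offset of its header line.
    open my $in, '<', 'database.fasta' or die "open: $!";
    my $ins = $dbh->prepare('INSERT OR REPLACE INTO idx ( name, offset ) VALUES ( ?, ? )');
    $dbh->begin_work;
    my $pos = tell $in;
    while ( my $line = <$in> ) {
        $ins->execute( $1, $pos ) if $line =~ /^>(\S+)/;
        $pos = tell $in;
    }
    $dbh->commit;

    # Later: look up the offsets for the wanted names, sort them ascending so the
    # disk reads only ever move forward through the file, then seek to each record.
    my @offsets = map {
        $dbh->selectrow_array( 'SELECT offset FROM idx WHERE name = ?', undef, $_ )
    } @ARGV;

    for my $off ( sort { $a <=> $b } grep { defined } @offsets ) {
        seek $in, $off, 0;
        print scalar <$in>;                  # the ">NAME" header line
        while ( my $line = <$in> ) {
            last if $line =~ /^>/;           # stop at the next record
            print $line;
        }
    }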
by Anonymous Monk on Dec 13, 2011 at 03:56 UTC