in reply to Efficient way to handle huge number of records?

Answer to question (I):

I don't think it's a problem for Perl to handle this size. The problem could be your computer (and maybe your patience).

A hash needs, as far as I remember, a minimum of about 55 bytes per hash key plus the content. 11 million times 55 = 605 million bytes, i.e. about 576 MByte just for the overhead... Do you have enough memory for the whole file + overhead + OS + ...? If your hardware doesn't have enough memory, it will swap, and that can make your solution unacceptably slow.
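If you want a real number instead of my back-of-the-envelope guess, you could measure a small sample with Devel::Size before committing to the full load. A minimal sketch (the key format and the 500-char dummy sequence are invented, adjust them to your real data):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Devel::Size qw(total_size);   # reports the bytes actually used by a structure

    # Build a sample hash shaped roughly like the real data
    my %sample;
    for my $i (1 .. 10_000) {
        $sample{ sprintf 'NAME%05d', $i } = 'A' x 500;   # 500-char dummy sequence
    }

    my $bytes = total_size(\%sample);
    printf "10k records: %.1f MB, so 11M records would need roughly %.1f GB\n",
        $bytes / 2**20, $bytes * 1_100 / 2**30;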

Question (II): I assume that you are no database expert, are you? 11 million records is no problem at all (depending on your table definition), but your data may be. For example, if your sequence only makes sense in a certain context, so that NAME2 has to be interpreted differently depending on the record NAME1 before and/or the record NAME3 after. But if that were the case, a hash wouldn't be the right solution either.

Getting these data into a database may be only half of the solution. The full answer depends on how you find those 100, 10 or 45 records. Your reading algorithm should be good enough to read the database in a rational way, meaning the Perl script doesn't read the records a zillion times... A database is no solution either if you just read the table sequentially (a "SELECT * FROM table;"), compute your results somehow and then pick your 100, 10 or 45 records. The overhead of storing your data in the database and retrieving it could kill all the gains from using a database.

Marshall proposed SQLite. This is really a good solution, especially if you work with a limited number of processes accessing the database at the same time. Personally I wouldn't take MySQL, I'd prefer PostgreSQL, but this is just my opinion.

Re^2: Efficient way to handle huge number of records?
by Anonymous Monk on Dec 11, 2011 at 10:22 UTC
    Ok, first of all, thanks to you all for answering me. Let me explain my task thoroughly:
    I have a large database of records as I wrote before, in the form of:
    NAME1 SEQUENCE1
    NAME2 SEQUENCE2
    etc. The NAME is usually 50-60 chars (small), but the SEQUENCE can be anywhere from 100 chars to maybe 2-3000 chars, as you also pointed out.
    What my script does (or I want it to do) is:
    The user supplies an unknown query input, then I run a program called BLAST which tells me that this unknown input had 1, 2, 3... 50... 60... 200 hits in my database. So, I know the NAME of each hit. My problem is that each time I need to look up the database file, retrieve the respective entries and then create a new file (let's say an output file) which has the NAME and SEQUENCE of all the hits, to process them further.
    Plus, my ultimate purpose is to create a webserver, so this database file will be accessed quite frequently; that's why I am asking whether a database or a simple local search through a hash (at least that's what I can think of) is more recommended.
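    A minimal sketch of the hash lookup I have in mind (assuming one NAME/SEQUENCE pair per line; the file names and the hit list are placeholders):

        use strict;
        use warnings;

        # Load the flat file into a hash: NAME => SEQUENCE
        my %db;
        open my $in, '<', 'database.txt' or die "Cannot open database.txt: $!";
        while (my $line = <$in>) {
            chomp $line;
            my ($name, $sequence) = split /\s+/, $line, 2;
            $db{$name} = $sequence;
        }
        close $in;

        # NAMEs reported by BLAST for this query (placeholder list)
        my @hits = qw(NAME1 NAME42 NAME1234);

        # Write the matching records to an output file for further processing
        open my $out, '>', 'hits.txt' or die "Cannot write hits.txt: $!";
        for my $name (@hits) {
            print {$out} "$name $db{$name}\n" if exists $db{$name};
        }
        close $out;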
      An alternative to building a hash in memory would be to have a disc-based hash like DBM::Deep. Size is limited by the size of your disc, it is fast, and it is particularly suited to “write seldom, read many”, which appears to be your case.
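      The interface is hash-like; a bare-bones sketch (the file name is only an example):

          use strict;
          use warnings;
          use DBM::Deep;

          # The hash lives on disc; only what you touch is pulled into memory.
          my $db = DBM::Deep->new('sequences.db');

          # Populate once (write seldom)...
          $db->{NAME1} = 'ACGTACGT';

          # ...then read as often as you like, just like an ordinary hash
          my $sequence = $db->{NAME1};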

      However, it seems from your description that the “unknown query” would parse/search on the value of each key (“search through a hash”), i.e. look at each value in turn for each query. I always think of a hash as a “lookup table”, and it doesn't look like that is what you'll be doing. It may be that an array would be more suitable, less overhead at least. Something like NAME1|SEQUENCE1, as in the sketch below. According to the docs D::D can build disc-based arrays too, although I have never tried.
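      For illustration only (placeholder records), scanning such an array would look something like this:

          use strict;
          use warnings;

          my @records = ('NAME1|SEQUENCE1', 'NAME2|SEQUENCE2');   # placeholder data
          for my $record (@records) {
              my ($name, $sequence) = split /\|/, $record, 2;
              # ... test $sequence against the query here ...
          }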

      How unknown are your unknowns? Could there be general subsets/groups that could be preprocessed? You say there is additional processing to be done on the results – could that be done upfront? A series of additional lookups that could identify a smaller group of records to search – divide and conquer? If you did choose the RDBMS route, are there indices that could be built that would help find what you're after? If there aren't, it would, imo, likely negate any advantage.

      If you have to process each record for each search then whatever method you adopt is going to be slow and possibly not suitable for a website.

      If you could come up with an outline of any known unknowns, there is a good chance the monks could give you some pointers on how to avoid what seems like a brute-force approach. It would be a challenge many of them would be unable to resist. :-)

        I think you are reading too much into his use of the word "search", but you have a point there. If he ever wants to do lots of searches for parts of a key (i.e. "contains" instead of "equals"), a hash will be as bad as a simple array.

      Wait, is the program called BLAST something you already have, or is that the script you want to create? I.e. why would you use BLAST to search and then use a different script for retrieval? Obviously, when the retrieval script creates that output file, the number of hits is available with zero effort.

      If you want to use your script with a webserver, you will probably observe the following with the different solutions:

      A) perl CGI script, loading the hash into memory: Really bad, every search has to load the database into memory. Even if that takes only 4 seconds, it is major disk wear and slow. Same with Storable.

      B) Using mod_perl, loading the hash into memory: With a trick (using a BEGIN block to load the hash) you could read the data only once into memory. Big improvement in execution speed, but memory could be a problem, as Apache usually creates multiple instances of the web server, each running the script and holding its own copy of the data. It depends on the server engine you use; a threaded engine could probably share the data, but not a forking engine (me no expert here, someone hopefully will correct me if I'm wrong). Same with Storable.
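      Under something like ModPerl::Registry the trick would look roughly like this (untested sketch; the file path is a placeholder):

          use strict;
          use warnings;

          our %db;

          BEGIN {
              # Runs once per interpreter, not once per request,
              # so the hash stays in memory between hits.
              open my $in, '<', '/path/to/database.txt' or die $!;
              while (my $line = <$in>) {
                  chomp $line;
                  my ($name, $sequence) = split /\s+/, $line, 2;
                  $db{$name} = $sequence;
              }
              close $in;
          }

          # the normal request handling below can now use $db{$name} directly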

      C) Using SQLite, DBM::Deep, ...: Disk-based methods. Data is not loaded into memory; an index keeps retrieval reasonably fast. Obviously somewhat slower than a simple hash in memory, but still simple to program. No problem if your database grows in size (within limits), and no noticeable startup time. If the database does mostly reads and seldom writes, arguably the best solution in your case IMHO.
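      With SQLite via DBI the retrieval side would be along these lines (table, column and file names are made up; note the WHERE clause instead of reading the whole table):

          use strict;
          use warnings;
          use DBI;

          # Connects to (or creates) the SQLite file; no separate server needed.
          my $dbh = DBI->connect('dbi:SQLite:dbname=sequences.db', '', '',
                                 { RaiseError => 1, AutoCommit => 1 });

          # One-time setup: the primary key keeps single-record lookups fast.
          $dbh->do('CREATE TABLE IF NOT EXISTS seqs (name TEXT PRIMARY KEY, sequence TEXT)');

          # Fetch only the records BLAST reported, never the whole table.
          my @blast_hits = ('NAME1', 'NAME42');   # placeholder hit list
          my $sth = $dbh->prepare('SELECT sequence FROM seqs WHERE name = ?');
          for my $name (@blast_hits) {
              $sth->execute($name);
              my ($sequence) = $sth->fetchrow_array;
              print "$name $sequence\n" if defined $sequence;
          }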

      D) Using mysql or postgresql: Apart from the web server you have to install and manage another server (easy on Linux (easy for me, not at all easy for my mother ;-) ), not so easy on Windows(?)). Reading and changing data in the database is never a performance problem, irrespective of size. Complex databases are possible (which you don't seem to need); the database server will get bored to death in your case ;-). Probably a notch slower than solution C, but the only solution if you want 100 simultaneous Apache threads reading and writing to that database.