in reply to Efficient way to handle huge number of records?

This looks like a BioPerl problem. I would look at those Perl modules.

11,000,000 records is pretty big! If you want to do a lot of searching, then an SQL DB is a good idea. I recommend SQLite, although MySQL is fine too, albeit with some admin hassles. SQLite has the advantage that the DB is just a single file, so none of the admin-password and account setup is necessary. That said, 11 million records isn't all that big for a hash table either.
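For scale, here is a minimal sketch of the hash approach. I am assuming, purely as a guess at the format, a tab-separated file with NAME in the first column; adjust the split for the real record layout:

#!/usr/bin/perl
use strict;
use warnings;

# Assumed format: tab-separated, NAME in the first column (adjust as needed).
my %record_by_name;
open my $fh, '<', 'records.txt' or die "Cannot open records.txt: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $name, $rest ) = split /\t/, $line, 2;
    $record_by_name{$name} = $rest;
}
close $fh;

# A lookup is then just a hash access:
print "$record_by_name{'SOME_NAME'}\n" if exists $record_by_name{'SOME_NAME'};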

If you load this into an SQLite DB and generate an index on NAME, a name-based query will run fast.
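Something along these lines, a sketch with DBI and DBD::SQLite. The file name, table name, and column layout are all made up for illustration:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=records.db', '', '',
                        { RaiseError => 1, AutoCommit => 1 } );

$dbh->do('CREATE TABLE IF NOT EXISTS records (name TEXT, data TEXT)');

# One big transaction keeps the bulk insert fast.
$dbh->begin_work;
my $ins = $dbh->prepare('INSERT INTO records (name, data) VALUES (?, ?)');
open my $fh, '<', 'records.txt' or die "Cannot open records.txt: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $name, $rest ) = split /\t/, $line, 2;
    $ins->execute( $name, $rest );
}
close $fh;
$dbh->commit;

# The index on NAME is what makes name lookups fast.
$dbh->do('CREATE INDEX IF NOT EXISTS idx_records_name ON records (name)');

# A name-based query then uses that index:
my ($data) = $dbh->selectrow_array(
    'SELECT data FROM records WHERE name = ?', undef, 'SOME_NAME' );
print "$data\n" if defined $data;

$dbh->disconnect;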

What have you tried so far?

Either MySQL or SQLite can handle this number of records easily.


Replies are listed 'Best First'.
Re^2: Efficient way to handle huge number of records?
by cavac (Prior) on Dec 11, 2011 at 12:07 UTC

    Actually, I would be rather cautious about using MySQL, given all the trade-offs and bugs that depend on which storage backend you use. These days, I really recommend switching to PostgreSQL.

    The license is more open (BSD instead of mixed GPL/Commercial), the speed is very much OK and the database is fully ACID compliant - which is getting rather important these days since sooner or later you'll probably run into the issue of porting your programs and modules to run on a multithreaded webserver (a.k.a "the cloud"). Making your stuff accessible through the Web seems to be the trend these days. Ask any manager what buzzwords he wants to use in his next speech to the shareholders and "cloud computing" and "HTML5" seem to top the list ;-)

    Ok, this was my personal point of view on the issue. Here is a (long) comparison of the two databases.

    Don't use '#ff0000':
    use Acme::AutoColor; my $redcolor = RED();
    All colors subject to change without notice.
      For single user, single process, I figure that SQLite has the edge, and I recommend it for those situations. It is completely Open Source.

      I think that your point about PostgreSQL is a good one! But I don't think the OP needs PostgreSQL or MySQL.

      From what has been described, the SELECTs will only be on NAME; that is easy, and SQLite will do the job.

        But I don't think the OP needs PostgreSQL or MySQL.

        That's true, of course, as long as the project stays small.

        The reason I usually try to "push" people in the direction of big, scalable databases is this: many projects have a way of growing without you noticing at first.

        You start small with a one-off tool, reuse it some time later, maybe add a new feature or two. Then a coworker comes along, sees that useful little tool and wants it too. Since (s)he needs another small feature or two, you add that as well. That even more useful tool spreads to more coworkers, maybe even replacing that old standard tool you've used since the dark ages....

        A few years later that thing has grown into a big mess of code that works around the original design issues, and you have to constantly battle performance problems and race conditions, since the software was never meant to handle this. And you can't easily rewrite the software, because all the logic is in the old (undocumented) code...

        With one of the "big" databases and a clean, hierarchical data layout, you can do a lot of the basic data integrity checks (and business logic) in the database. So writing additional, smaller tools (instead of adding to the big mess) is much easier, since the basic system enforces data integrity instead of it having to be coded up in each and every tool.
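        A small sketch of what "integrity checks in the database" can mean in practice. The schema and connection details are invented for illustration, using PostgreSQL through DBI/DBD::Pg:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # Hypothetical connection parameters -- adjust for your own server.
        my $dbh = DBI->connect( 'dbi:Pg:dbname=mytools', 'someuser', 'somepass',
                                { RaiseError => 1 } );

        # Constraints in the schema mean every tool that writes to these
        # tables gets the same integrity rules for free.
        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS samples (
                sample_id  serial  PRIMARY KEY,
                name       text    NOT NULL UNIQUE,
                length_bp  integer CHECK (length_bp > 0)
            )
        });
        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS measurements (
                sample_id  integer   NOT NULL REFERENCES samples(sample_id),
                taken_at   timestamp NOT NULL DEFAULT now(),
                value      numeric   NOT NULL
            )
        });

        $dbh->disconnect;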

        Also, the big irons scale much better. Who knows, maybe the data layout stays the same, but the amount of data scales up by a factor of a thousand by the end of the project? Or, just maybe, the size of each dataset stays the same, but you get thousands of them, because someone realized that your tool does a nice job rerunning whateveritdoes on archived data sets or comparing changes over time?

        Ok, since I don't really know the original poster's goals, I can only speculate. But I've gotten a bloody nose more than once by assuming a project wouldn't grow or would never be business critical. So I'm very careful about suggesting the small option when you could just as easily use one of the free "big irons" ;-)

        Don't use '#ff0000':
        use Acme::AutoColor; my $redcolor = RED();
        All colors subject to change without notice.
Re^2: Efficient way to handle huge number of records?
by Anonymous Monk on Dec 11, 2011 at 09:39 UTC
    Yes, true, it's bioinformatics, but I don't need any modules; I have my data in the text file...
    I wanted to start using MySQL with Perl, but I can't seem to install the Mysql module. I tried CPAN, but the command:
    perl -MCPAN -e 'install Mysql'

    returns:

      OK. I recommend that you not mess with MySQL. There is more pain beyond just getting the DBI module: you will have to get a MySQL daemon started, and there is admin hassle.

      I recommend that you start with SQLite. There is no daemon. There is no admin. It does 90% of what MySQL does for a single user and with a lot less hassle.

      I use the SQLite Manager plug-in for Firefox instead of the command-line interface, which I have never bothered to install on my machine. You will need to install the DBD::SQLite module, and that's about it.

      Try installing DBD::SQLite and see how it goes.
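      Installation is the same style as the MySQL attempt above, just with the right module name: perl -MCPAN -e 'install DBD::SQLite'. A tiny smoke test (test.db is just a throwaway file name) could then be:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use DBI;

      # Connect (the file is created if it does not exist), ask SQLite
      # for its version, and disconnect again.
      my $dbh = DBI->connect( 'dbi:SQLite:dbname=test.db', '', '',
                              { RaiseError => 1 } );
      my ($version) = $dbh->selectrow_array('SELECT sqlite_version()');
      print "SQLite version: $version\n";
      $dbh->disconnect;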

      Update:
      I do have a MySQL server running right now on my Windows machine.
      I know how to do it and that is why I am advising you not to do it!
      Too much hassle!