hiX0r has asked for the wisdom of the Perl Monks concerning the following question:

I have a huge table (21M records) and use Sphinx search to get me the id of the matching record for each search. That works excellently and is really fast, just nicely :)

Now, to be fast in building the web pages and to be able to answer around 10 page hits/s, each involving about 20 queries to MySQL, I wanted to read the whole table into a simple array (the array index being the record id). There is no write process, just SELECT statements, and fetching from the array in their place seemed to me the fastest way to get it done. But Perl seems to take a lot more RAM than the data does on disk, and I have only 8G.

Any suggestions or hints on how to get the data into memory? Or is that the wrong approach?
Here is what I tried:
my $sql="SELECT * FROM product"; my $sth = $dbh->prepare( $sql ) or die "Couldn't prepare statement: " +. $dbh->errstr; $sth->execute() or die "Could not execute SQL statement:$sql \n" . $st +h->errstr; my $rv = $sth->rows; my $count = 0; my $rowcache = []; # cache for batches of rows # while ( my $row = shift(@$rowcache) || shift(@{$rowcache=$sth->fetch +all_arrayref(undef, 10000) || []}) ){ while ( my @row = $sth->fetchrow_array() ){ # $content[$count] = \@row; # $count++; $count = push(@content, \@row); print "count = $count\n"; }
But after about 7M rows I started to run out of memory... Then I tried processing small chunks (1000 rows at a time), but hit the same problem.

I am fairly new to Perl but don't mind doing the reading & work, so just some hints would be much appreciated!

Thx in advance for any help!

Re: read a huge mysql Table into memory
by CountZero (Bishop) on May 30, 2009 at 17:48 UTC
    There is indeed some overhead in Perl arrays, so it is correct that the array will take more space in memory than the MySQL table does.

    But 10 queries per second is not much and MySQL should be able to keep up with that number of requests, assuming that the record ID you search on is indexed.

    Another thing to consider: as you are speaking of web pages, I assume your script will run on the web server. Is it a CGI script or does it run under mod_perl? If it is a simple CGI script, you will have to build your array for every request and that will take ages. Under mod_perl you will only pay that start-up cost once, but I still think a SELECT against your data table will be the best solution.
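    A minimal sketch of what such a per-id lookup could look like under mod_perl, assuming a persistent $dbh, the product table from the original post and an indexed id column; the fetch_product() helper and its column list are made up for illustration, and prepare_cached() keeps the statement handle alive between requests:

    use DBI;

    # $dbh assumed to be connected once per server child and reused
    sub fetch_product {
        my ( $dbh, $id ) = @_;
        my $sth = $dbh->prepare_cached(
            'SELECT id, description FROM product WHERE id = ?' );
        $sth->execute( $id );
        my $row = $sth->fetchrow_hashref;
        $sth->finish;
        return $row;    # hashref with id and description, or undef
    }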

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Sorry, I was not precise: 10 page hits/s, each involving several MySQL queries (different pages have different tasks, but a good average would be around 15), so we are talking more in the range of 150 to 200 queries/s... and then see my reply above: speed is what counts :)
        That should still be well within MySQL's capabilities.

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: read a huge mysql Table into memory
by zwon (Abbot) on May 30, 2009 at 17:39 UTC

    Just allocate more memory to MySQL and don't complicate things. Are you really experiencing problems with database speed? Then maybe you should review the structure of your database. If you provide more details about your data and database, there's a good chance you'll get a better answer.

      Thx for the fast answer! The reason I did not post the "source data structure" is that I wanted to see if Perl has solutions...

      The data field I am concerned about is a MySQL "text" field called "description" and may contain long text. The other is just the corresponding id. The rest is indexed and fetched via Sphinx.

      And regarding speed: it's never fast enough :) as "first come, first served" is the fixed sorting order and we have to be faster than the others :)

        Well, that's a pretty simple table and I'm sure you have an index on the id column, so 200 queries per second shouldn't be a problem for MySQL if you're just selecting records by id. Anyway, as you've already seen, Perl requires too much memory to store your data, so it seems you're tied to the database solution.

Re: read a huge mysql Table into memory
by perrin (Chancellor) on May 30, 2009 at 22:56 UTC
    MySQL already has a cache. If your queries aren't fast enough, you should figure out why in the database. If they are impossible to cache in the database for some reason, use CHI to cache the data that you actually fetch, not the entire data set.
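    A minimal sketch of that approach, assuming an in-process CHI cache, a connected $dbh and the product table from the original post; get_description(), the cache key and the expiry time are made up for illustration:

    use CHI;

    my $cache = CHI->new( driver => 'Memory', global => 1,
                          expires_in => '10 minutes' );

    sub get_description {
        my ( $dbh, $id ) = @_;
        # compute() returns the cached value, or runs the code and caches the result
        return $cache->compute( "desc:$id", undef, sub {
            my $sth = $dbh->prepare_cached(
                'SELECT description FROM product WHERE id = ?' );
            $sth->execute( $id );
            my ( $desc ) = $sth->fetchrow_array;
            $sth->finish;
            return $desc;
        } );
    }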
Re: read a huge mysql Table into memory
by BrowserUk (Patriarch) on May 31, 2009 at 03:40 UTC
    10 pagehits/s and each containing about 20 queries

    Either your schema is badly screwed, or one or more of those 20 queries are relatively complex. Perhaps full text searches on those large description fields you mentioned?

    Assuming the latter: given 21 million records and approximately 1K of text (average) per record, caching is unlikely to help unless your users are all searching for the same things. For unique or nearly unique queries, results caching is a waste of time and memory that could be better devoted to other things. You'd be better off turning it off.

    I have only 8G.... But after about 7M rows I started to run out of memory

    DBI returns each record as an array, so you are building up an AoA. And an array of 7 million small arrays, each containing 4 empty scalars, will consume 3.5GB! That is, you will consume close to half of your memory just doing this:

    [0] Perl> $#a = 7e6;;
    [0] Perl> $a[ $_ ] = ['','','',''] for 0 .. 7e6 -1;;
    [0] Perl>

    Seems hopeless, until you consider that it means that the actual data of those 7 million records is only consuming ~4.5 GB, i.e. about 700 bytes (average) per record. If, instead of storing each record in its own array, you stored each one as a single string, things look much more promising. This:

    [0] Perl> $#a = 7e6;;
    [0] Perl> $data = 'X' x 700; $a[ $_ ] = $data for 0 .. 7e6 -1;;

    Only requires around 5.2 GB rather than 8 GB with the AoAs. Still seems a lost cause, as this is still only 1/3rd of your data. But what if you replaced each word with a packed 2-byte integer? That allows for a dictionary of up to 64k words--more than enough for most purposes; average English text uses fewer than 10,000 words, and technical and/or highly academic writing won't usually push that past 20k.

    The average English word length (including the trailing space) is 6 (ASCII) characters, so replacing the words with 2-byte integers will reduce the data space requirement (excluding the Perl structural overhead) by ~2/3rds. So 7e6 * 700 / 3 = 1.5GB. Multiply that by 3 again to go from 7e6 to 21e6 records and, allowing for the same per-element overhead as measured above, you're back to around 5.2 GB for the entire dataset. Well within your limits and with enough headroom for some fairly substantial future growth. And searching the packed integer strings is faster than searching the text they replace--both keywords and text are 1/3rd the length--though you do have to remember to discard any hits found on odd byte boundaries.
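    A minimal sketch of the packing side, assuming lowercased whitespace-split "words" and hypothetical %word2id, @dict and @records structures filled while walking the table once:

    my ( %word2id, @dict, @records );

    sub pack_text {
        my ( $text ) = @_;
        my @ids;
        for my $word ( split ' ', lc $text ) {
            unless ( exists $word2id{ $word } ) {
                die "dictionary overflow" if @dict > 0xFFFF;
                $word2id{ $word } = scalar @dict;    # next free slot
                push @dict, $word;
            }
            push @ids, $word2id{ $word };
        }
        return pack 'v*', @ids;                      # 2 bytes per word
    }

    # e.g. while walking the table: $records[ $id ] = pack_text( $description );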

    Re-constituting the text is a very fast process:

    print join ' ', @dict[ unpack 'v*', $records[ $hit ] ];
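    And a sketch of searching the packed form, assuming the %word2id and @records structures from the packing sketch above; index() scans the packed bytes and, as noted, hits at odd byte offsets are discarded:

    sub find_records {
        my ( $keyword ) = @_;
        return () unless exists $word2id{ lc $keyword };
        my $needle = pack 'v', $word2id{ lc $keyword };
        my @hits;
        for my $id ( 0 .. $#records ) {
            my $pos = 0;
            while ( ( $pos = index( $records[ $id ], $needle, $pos ) ) >= 0 ) {
                if ( $pos % 2 == 0 ) { push @hits, $id; last }    # real match
                $pos++;                                           # odd offset: skip
            }
        }
        return @hits;
    }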

    And you could go further. Anything from 1/4 to 1/3rd of English text is made up of the 10 or so most common words (stop words). And of the remaining words, about another third is made up of repetitions. If you eliminated stop words and duplicates, you could probably halve your data again, further reducing memory use and speeding up your searches. The downside is that once you've located the records of interest, you'd have to query the full text back from the DB. Though this can often be mitigated by presenting the matching records in their stopped and de-duped form and only querying the DB for the full records individually as the user selects them.
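    A sketch of that reduction step, with a tiny made-up stop-word list (a real one would be tuned to the data); it would slot in before the packing sketch above:

    my %stop = map { $_ => 1 } qw( the a an and or of to in is it for on );

    sub reduce_words {
        my ( $text ) = @_;
        my %seen;
        # keep each non-stop word only the first time it appears
        return join ' ', grep { !$stop{ $_ } && !$seen{ $_ }++ } split ' ', lc $text;
    }

    # e.g. $records[ $id ] = pack_text( reduce_words( $description ) );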

    If your table is relatively static, you can build your packed record structure as an offline process, save it to disk with Storable for faster loading, and only re-build it once a day or once per week as part of your maintenance restart cycle.
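    A sketch of the save and load steps with Storable, using a made-up file name; nstore() writes the dictionary and the packed records as one structure and retrieve() brings them back at start-up:

    use Storable qw( nstore retrieve );

    # in the offline build script, once @dict and @records are filled:
    # nstore( { dict => \@dict, records => \@records }, 'packed_records.sto' );

    # at web-server start-up:
    my $data    = retrieve( 'packed_records.sto' );
    my $dict    = $data->{dict};       # array ref of dictionary words
    my $records = $data->{records};    # array ref of packed records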


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.