In reply to: read a huge mysql Table into memory

10 pagehits/s and each containing about 20 queries

Either your schema is badly screwed, or one or more of those 20 queries are relatively complex. Perhaps full text searches on those large description fields you mentioned?

Assuming the latter, given 21 million records and approximately 1K of text (average) per record, caching is unlikely to help unless your users are all searching for the same things. For unique or nearly unique queries, results caching is a waste of time and memory that could be better devoted to other things. You'd be better off turning it off.

I have only 8G.... But after about 7M rows I started to run out of memory

DBI returns each record as an array (ref), so you are building up an AoA. And an array of 7 million small arrays, each containing 4 empty scalars, will consume 3.5 GB! That is, you will consume close to half of your memory doing this:

[0] Perl> $#a = 7e6;;
[0] Perl> $a[ $_ ] = [ '', '', '', '' ] for 0 .. 7e6 -1;;

Seems hopeless, until you consider that it means the actual data of those 7 million records is only consuming ~4.5 GB; i.e. about 700 bytes (average) per record. If, instead of storing each record in its own array, you stored it as a single string, things look much more promising. This:

[0] Perl> $#a = 7e6;;
[0] Perl> $data = 'X' x 700; $a[ $_ ] = $data for 0 .. 7e6 -1;;

Only requires around 5.2 GB rather than 8 GB for the AoA. That still seems a lost cause, since those 7 million records are only 1/3rd of your data. But what if you replaced each word with a packed 2-byte integer? That allows for a dictionary of up to 64k words--more than enough for most purposes; average English text uses fewer than 10,000 distinct words, and technical and/or highly academic writing won't usually push that past 20k.

The average English word length (including the trailing space) is about 6 (ASCII) characters, so replacing the words with 2-byte integers reduces the data space requirement (excluding the Perl structural overhead) by roughly 2/3rds: 7e6 * 700 / 3 ≈ 1.6 GB. Multiply that by 3 again to go from 7e6 to 21e6 records and you're back to around 5 GB for the entire dataset. That's well within your limits, with enough headroom for some fairly substantial future growth. And searching the packed integer strings is faster than searching the text they replace--both keywords and text are 1/3rd the length--though you do have to remember to discard any hits found on odd byte boundaries.
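For concreteness, here's a minimal sketch of how that dictionary/packing scheme might look. The names (%dict, @dict, @records, pack_text, find_records) and the inline sample data are invented for illustration; in your code the records would come from your DBI fetch loop rather than the hard-coded strings below.

#! perl -w
use strict;

my( %dict, @dict, @records );

sub pack_text {
    my( $text ) = @_;
    my @codes;
    for my $word ( split ' ', lc $text ) {
        unless( exists $dict{ $word } ) {
            die "Dictionary overflow (>64k words)" if @dict > 0xFFFF;
            push @dict, $word;
            $dict{ $word } = $#dict;
        }
        push @codes, $dict{ $word };
    }
    return pack 'v*', @codes;    # 'v' = unsigned 16-bit integer
}

# Stand-in for: while( my $row = $sth->fetchrow_arrayref ) { ... }
push @records, pack_text( $_ ) for
    'the quick brown fox jumps over the lazy dog',
    'the lazy dog sleeps';

# Pack the keyword the same way and scan with index(), discarding
# any hit that starts on an odd byte offset.
sub find_records {
    my( $keyword ) = @_;
    return () unless exists $dict{ lc $keyword };
    my $needle = pack 'v', $dict{ lc $keyword };
    my @hits;
    for my $id ( 0 .. $#records ) {
        my $pos = 0;
        while( ( $pos = index( $records[ $id ], $needle, $pos ) ) >= 0 ) {
            if( $pos % 2 == 0 ) {    # genuine hit on a 2-byte code boundary
                push @hits, $id;
                last;
            }
            ++$pos;                  # false hit straddling two codes; keep looking
        }
    }
    return @hits;
}

print "@{[ find_records( 'lazy' ) ]}\n";    # 0 1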

Re-constituting the text is a very fast process:

print join ' ', @dict[ unpack 'v*', $records[ $hit ] ];

And you could go further. Anything from 1/4 to 1/3rd of English text is made up of the 10 or so most common words (stop words), and of the remaining words, about another third is made up of repetitions. If you eliminated stop words and duplicates, you could probably halve your data again, further reducing memory use and speeding up your searches. The downside is that once you've located the records of interest, you'd have to query the full text back from the DB. Though this can often be mitigated by presenting the matching records in their stopped and de-duped form, and only querying the DB for the full records individually as the user selects them.
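A rough sketch of that stop-word/de-dup pass, applied per record before packing. The stop-word list here is just a tiny illustrative sample, not a recommendation:

my %stop = map{ $_ => 1 } qw[ the of and a to in is it that was ];

sub strip_and_dedup {
    my( $text ) = @_;
    my %seen;
    return join ' ', grep {
        !$stop{ $_ } and !$seen{ $_ }++    # drop stop words and repeats
    } split ' ', lc $text;
}

# e.g. strip_and_dedup( 'the cat sat on the mat and the cat slept' )
#      gives 'cat sat on mat slept'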

If your table is relatively static, then you can build your packed record structure as an offline process, save it to disk in Storable format for faster loading, and only re-build it once a day or once per week as part of your maintenance restart cycle.
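A minimal sketch of that save/reload cycle using Storable; the file name is a placeholder, and @dict/@records are the structures from the sketch above:

use Storable qw[ nstore retrieve ];

# Offline (maintenance) job -- after building @dict and @records:
nstore [ \@dict, \@records ], 'packed_records.bin';

# At web-app startup:
my( $dictRef, $recsRef ) = @{ retrieve 'packed_records.bin' };

# Rebuild the word-to-code hash if you need to pack search keywords:
my %word2code;
@word2code{ @$dictRef } = 0 .. $#$dictRef;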

