Echoing (and upvoting) Roboticus' suggestions, I have found that there are tremendous improvements to be had when the TCP/IP communication link (no matter how fast it is) is eliminated from the processing picture. If you’ve got a lot of data to be processed, first get it, in table form (or, if necessary, flat-file form), onto that server, or onto another server connected to it by, say, an optical-fiber link, a SAN, or what-have-you. Then do the processing as “locally” as you possibly can, provided of course that you do not overload a server that has been dedicated to and configured for one particular task with a task for which it was never intended.
Then, construct your processing in a way that takes considerations like these into account, for instance:
Note that in the following, as originally writ, I assumed the update file was 30GB. As database tables go, 30GB is merely “moderate size.” The basic principles pontificated :-) here still hold.
-
Most of this is done with SQL, but it is not an arduous task. Use EXPLAIN on the queries that you are contemplating, to see exactly how the system will carry them out. Explore the process you have in mind; map it out on paper before you build it. Avoid “build it just to see if it works,” because then you’re always scrambling, and losing credibility quickly. The best way to stay on top of the ball is to studiously avoid falling off.
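For instance, a minimal sketch (PostgreSQL syntax here; other databases have their own EXPLAIN variants), using purely hypothetical account/staging tables that reappear in the sketches below:

    -- Ask the planner how it would apply a hypothetical bulk update,
    -- without actually running it. Look for full-table scans where an
    -- index lookup was expected.
    EXPLAIN
    UPDATE accounts AS a
    SET    balance = a.balance + s.adjustment
    FROM   staging_adjustments AS s
    WHERE  s.account_id = a.account_id;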
-
Bulk operations are done, as much as possible, with tools and/or database facilities that the vendor designed for the purpose.
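A minimal sketch of what that looks like, assuming a PostgreSQL server and the same hypothetical staging table (MySQL’s LOAD DATA INFILE, or a vendor’s dedicated loader utility, plays the same role); the file path is illustrative only:

    -- Hypothetical staging table for the incoming update file.
    CREATE TABLE staging_adjustments (
        account_id  BIGINT         NOT NULL,
        adjustment  NUMERIC(12,2)  NOT NULL
    );

    -- COPY reads the file directly on the server, avoiding a network
    -- round-trip per row.
    COPY staging_adjustments (account_id, adjustment)
    FROM '/data/updates/adjustments.csv'
    WITH (FORMAT csv, HEADER true);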
-
Pay attention to transactions: use ’em. Transactions tell the server how much data it can keep “in memory” for a few milliseconds and how much must be committed to disk immediately, and disk-I/O delays add up very quickly. But don’t let your transactions become too big. COMMIT regularly throughout, and of course once more at the end. If an operation fails, ROLLBACK. It is desirable to use a “highly exclusive of others” transaction isolation level, but this might need to be coordinated with other simultaneous activity. (You also must consider whether other users/applications should be permitted to see incompletely loaded or incompletely modified data.)
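A minimal sketch of one such batch, again in PostgreSQL syntax with the hypothetical tables above; here the work is chunked by key range, which is only one of several ways to slice it:

    -- One batch = one explicit transaction: small enough to commit
    -- quickly, large enough to amortize the disk flush.
    BEGIN;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

    UPDATE accounts AS a
    SET    balance = a.balance + s.adjustment
    FROM   staging_adjustments AS s
    WHERE  s.account_id = a.account_id
    AND    s.account_id BETWEEN 1 AND 100000;   -- one key-range chunk

    COMMIT;   -- on any failure, issue ROLLBACK; instead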
-
What IBM used to call “checkpoint/restart”: if the process of churning through 30GB of data fails at any point, it should be possible to resume fairly automatically. Likewise, if it is discovered that the data which was applied was incorrect, a strategy must exist for getting the database back to a previous known-good state. Also consider how you intend to identify, catalog, keep, and eventually discard those 30GB data files.
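A minimal sketch of one way to do that, continuing the hypothetical key-range batches above; the table and column names are illustrative only:

    -- Checkpoint table: one row per completed batch.
    CREATE TABLE load_checkpoints (
        batch_start  BIGINT PRIMARY KEY,
        batch_end    BIGINT NOT NULL,
        finished_at  TIMESTAMPTZ NOT NULL DEFAULT now()
    );

    -- Record the batch inside the same transaction that applies it, so
    -- a crash leaves either both or neither on disk.
    INSERT INTO load_checkpoints (batch_start, batch_end)
    VALUES (1, 100000);

    -- On restart, pick up where the last completed batch left off.
    SELECT COALESCE(MAX(batch_end), 0) + 1 AS resume_from
    FROM   load_checkpoints;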