Big database queries

by ezekiel (Pilgrim)
on Aug 30, 2002 at 00:14 UTC

ezekiel has asked for the wisdom of the Perl Monks concerning the following question:

I have a really simple script that queries a database then loops over each row like this:

$sth = $dbh->prepare("select COLUMN_1 from table where COLUMN_2 < 2500 +"); $sth->execute(); while (my ($column_1_value) = $sth->fetchrow_array()) { # do some processing on the column 1 value here }

There is nothing fancy here, except that the query brings back 60 million rows! As a result, when I try to run the script it brings the database, and the machine running it, to their knees during the execution of the SQL statement.

Short of database level things or memory upgrades (that are largely outside my control) does anyone have any suggestions for handling these big queries in a more effective manner?

One thought I had was to break the query up into smaller chunks as such:

$sth = $dbh->prepare("select COLUMN_1 from table where COLUMN_2 betwee +n ? and ?"); for ($i = 0; $i < 2500; $i += 100) { $sth->execute($i, $i + 100); while (my ($column_1_value) = $sth->fetchrow_array()) { # do some processing on the column 1 value here } }
but that seems ugly to me. Surely there must be a more elegant solution?

Thanks.

Replies are listed 'Best First'.
Re: Big database queries
by dreadpiratepeter (Priest) on Aug 30, 2002 at 00:49 UTC
    A couple of thoughts:

    1. have you profiled the code? Is the time in the execute or the iteration?
    2. If it is in the execute, are the tables properly indexed?
    3. You don't say which DB you are using; some have statistics-gathering agents that optimize queries. For example, Informix allows you to update statistics on tables and columns. I've seen this process reduce query time from heat-death-of-the-universe down to 3-5 seconds for a complex join or view.
    4. Most commercial DBs have an explain plan feature, which tells you how the query optimizer will try to quickly return data, and can be tuned.

    5. If it is the iteration that is taking the time, I would suggest using the perl profiler to optimize the code in the loop.

    Hope some of this helps,

    -pete
    "Pain heals. Chicks dig scars. Glory lasts forever."

      Thanks for your response. In answer to your questions:

      1. Yes. The time is in the execution, not the iteration.
      2. Yes.
      3, 4. I am using DB2.

      It seems I will have to spend more time looking at things in the database. This is unfortunate. I like to stay on the Perl side of DBI :-)

        Staying as much in Perl-space as possible is a good idea. One of the in-house apps here had all or most of the business logic implemented in stored procedures. Then the volume increased so much that they needed a better database server. They are still coding away, after a long time. Several teams from the DB vendor have tried, but not succeeded.

        Anyhow,

        UPDATE table_name SET column1=column1 + 3 WHERE column2 > 2500
        might do the trick. It will still be gruelling, though. 60 million transactions is a lot, either way we look at it.
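
        If the per-row processing really can be expressed in SQL, a minimal DBI sketch of that approach (reusing the statement above) would be:

        # do the work server-side in one set-based statement instead of
        # fetching 60 million rows to the client
        my $rows = $dbh->do(
            "UPDATE table_name SET column1 = column1 + 3 WHERE column2 > ?",
            undef, 2500,
        );
        print "updated $rows rows\n";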

        I've also had great success with introducing new indexes. It's incredible what a difference a well-placed index can make when you are dealing with large data sets. Have you tried running your queries through explain?

Re: Big database queries
by gmax (Abbot) on Aug 30, 2002 at 06:35 UTC
    The way I see it, there might be a design problem behind your question. I wonder what you are going to do with 60 million records on the client side ...
    I don't think you are going to show the records to a user: at 50 records per page, that would be 1,200,000 pages, and no user is willing to go through that; even at one page per second, the task would take more than 13 days. :)
    So I could see only two reasons for this behavior:
    • You need to either print a report or store the results somewhere;
    • You need to do some calculation involving the value.
    In the first case, to improve efficiency, you have no choice but to break your query into several chunks and fetch the records to your client in batches. Selecting 6,000 or even 60,000 records at a time is going to be a lot faster than waiting for your DBMS to allocate space for 60,000,000 records and then send them to you one by one.

    The second case, though, is open for comments. Retrieving and sending 60 million records is going to take the database a long time. Is there any way you can do all, or at least part, of the calculation within the database itself? Any database server is capable of some fairly complex calculation that you could exploit before bringing the records to the client. If your calculation ends up with, say, 100,000 records, the burden on the database is going to be a lot less than shipping the whole dataset.
    If, for any reason, the DBMS is not able to do the entire calculation for you, you could at least try to reduce the number of records to fetch, by reviewing your algorithm, considering both the server and the client sides.
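
    For the second case, a sketch of what pushing the work into the database might look like (the aggregate and grouping column here are made up purely for illustration):

    # let the server do the heavy lifting and return only a reduced
    # summary set (hypothetical aggregate; adapt to your real calculation)
    my $sth = $dbh->prepare(q{
        select COLUMN_2, count(*), avg(COLUMN_1)
        from   table
        where  COLUMN_2 < 2500
        group  by COLUMN_2
    });
    $sth->execute();
    while (my ($col2, $count, $avg) = $sth->fetchrow_array()) {
        # at most a few thousand summary rows instead of 60 million
    }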

    Could you tell us more about the nature of your calculation? We might be able to give you some better advice.

    HTH
    _ _ _ _ (_|| | |(_|>< _|
Re: Big database queries
by Zaxo (Archbishop) on Aug 30, 2002 at 00:51 UTC

    You may have a LIMIT qualifier available:

    $sth = $dbh->prepare("select COLUMN_1 from table where COLUMN_2 < ? LIMIT ?, ?")

    I threw in a placeholder for the upper bound just for flexibility's sake. The first argument to LIMIT is the offset, the second, the number of rows to grab.
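
    A sketch of how that might be used to walk the table in batches; the batch size is arbitrary, and the LIMIT offset/count syntax shown is MySQL-style (DB2, for instance, spells this differently, e.g. with FETCH FIRST):

    # walk the table in batches; in practice an ORDER BY would make
    # the paging stable between batches
    my $batch  = 100_000;
    my $offset = 0;
    my $sth = $dbh->prepare(
        "select COLUMN_1 from table where COLUMN_2 < ? LIMIT ?, ?"
    );
    while (1) {
        $sth->execute(2500, $offset, $batch);
        my $seen = 0;
        while (my ($column_1_value) = $sth->fetchrow_array()) {
            $seen++;
            # do some processing on the column 1 value here
        }
        last if $seen < $batch;    # final, partial batch: we are done
        $offset += $batch;
    }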

    After Compline,
    Zaxo

Re: Big database queries
by lhoward (Vicar) on Aug 30, 2002 at 00:52 UTC
    If you need to process a lot of data, it's going to put a lot of load on the server no matter what you do. If you want to lower the apparent load on the servers (at the expense of your query taking longer) you can call sleep periodically (with Time::HiRes you can sleep for a fraction of a second). One thing you can do to improve performance on your side is to use fetchrow_arrayref instead of fetchrow_array. That will at least cut down on one string copy...
      which will be 60 million string copies in our case... and I guess that'll make a difference. I once benchmarked fetchrow_array vs. fetchrow_arrayref in a CGI script fetching some 50,000 rows, without any noticeable difference, but at 60,000,000 rows that's another story. ;)
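
      A minimal sketch combining both suggestions (fetchrow_arrayref plus a throttling sleep); the pause length and interval here are arbitrary:

      use Time::HiRes qw(usleep);

      my $sth = $dbh->prepare("select COLUMN_1 from table where COLUMN_2 < 2500");
      $sth->execute();

      my $count = 0;
      while (my $row = $sth->fetchrow_arrayref()) {
          my $column_1_value = $row->[0];   # no per-row list copy
          # do some processing on the column 1 value here

          # ease the load: pause 10 ms every 10,000 rows
          usleep(10_000) if ++$count % 10_000 == 0;
      }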
Re: Big database queries
by dws (Chancellor) on Aug 30, 2002 at 01:27 UTC
    I have a really simple script that queries a database then loops over each row like this: ... # do some processing on the column 1 value here

    Can you characterize the type of processing that you're doing? Could some or all of that processing be done by the database, either by using aggregate functions, or by using stored procedures?

Re: Big database queries
by screamingeagle (Curate) on Aug 30, 2002 at 05:27 UTC
    I'd suggest using a stored procedure. For one, the SQL in the stored procedure is precompiled, which avoids the need for the database to parse the query and create an execution plan every single time; second, since all the processing is done on the database, you could return only the filtered results, cutting down on network traffic between the database server and the client machine.
    I'd also suggest using recordset paging techniques so that you only need to show 20-25 records to the user at a time, with "PREVIOUS" and "NEXT" buttons. That way, you don't have to fetch millions of records, just 25 at a time. That should help.
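
    A sketch of the stored-procedure route from the DBI side; the procedure name and its parameter are hypothetical, and the exact CALL syntax varies between databases:

    # the procedure does the per-row work inside the database and
    # returns only a small, filtered result set
    my $sth = $dbh->prepare("CALL process_columns(?)");
    $sth->execute(2500);
    while (my ($summary_value) = $sth->fetchrow_array()) {
        # only the filtered/summarized rows come back to the client
    }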
Re: Big database queries
by Moonie (Friar) on Aug 30, 2002 at 01:23 UTC
    This isn't perl-y - but why not a stored procedure?
Re: Big database queries
by richardX (Pilgrim) on Aug 30, 2002 at 17:40 UTC
    What is the purpose of your query? To summarize numeric data? A VLDB (Very Large Database) or DW (Data Warehouse) requires a different set of rules, and most of those rules affect the database. Perl can only do so much before the database processing time overwhelms it. One trick to help with summary data is to pre-aggregate the summarized data in another table; then you only query this smaller table. Your requirements determine which tables and columns are pre-aggregated: you cannot pre-aggregate all columns, so you must decide which ones are the important ones to report or query on.

    Example: 1 Billion rows in a Master table of customer data called CUST. CUST contains the following important columns: zip code, state, year to date purchases.

    Requirement: To summarize all the year to date sales for each state for each month.

    Solution: Create a table called SUMSTATE that contains the fields that you wish to query on from the CUST table. Add a column PERIOD that is defined as date time. PERIOD will contain the month and year.

    Processing Option One: On the fly, create the SUMSTATE table and populate it using a stored procedure that summarizes the data from CUST. This can be done incrementally so that the server or cluster is not impacted.

    Processing Option Two: Each time a row is added or updated in CUST, trigger a stored procedure that updates SUMSTATE.

    Processing Option Two is the preferred method, but each situation can be different.
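
    A rough sketch of Option One from the DBI side, using the table names above; the column names (purchase_date, ytd_purchases, ytd_sales) are guessed for illustration, and the date-truncation function is database-specific:

    # rebuild the summary table in one set-based statement instead of
    # dragging a billion detail rows to the client
    $dbh->do("DELETE FROM SUMSTATE");
    $dbh->do(q{
        INSERT INTO SUMSTATE (state, period, ytd_sales)
        SELECT state,
               DATE_TRUNC('month', purchase_date) AS period,
               SUM(ytd_purchases)
        FROM   CUST
        GROUP  BY state, DATE_TRUNC('month', purchase_date)
    });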

    Richard

    There are three types of people in this world, those that can count and those that cannot. Anon

Re: Big database queries
by Stegalex (Chaplain) on Aug 30, 2002 at 13:56 UTC
    Sounds an awful lot like a Cartesian product (which is an overly fancy way of saying that you have probably underqualified your "where" clause). Just for laughs, do a select count(*) from each of the tables involved. If there are nowhere near 60 million rows, then you probably need to add more join conditions to your "where" clause.

    ~~~~~~~~~~~~~~~
    I like chicken.
Re: Big database queries
by metlhed_ (Beadle) on Aug 30, 2002 at 15:05 UTC

    What are you getting all of this data for? If you are using Perl to do something with all the results after you get them, maybe you could have the database do that work and send you just the results.

Re: Big database queries
by beebware (Pilgrim) on Sep 02, 2002 at 05:18 UTC
    You say it returns 60 million rows - how many rows were you expecting it to return? If you were only expecting around 100, then you've probably got an incorrect JOIN causing a Cartesian product (where it matches up every row in one table with every row in another). The LIMIT statement may be a good idea as well, as may ensuring the columns you are SELECTing on are indexed (OK, it won't reduce the number of rows returned [or it _shouldn't_], but it may decrease the load on the database machine).

    What are you trying to achieve with such a big SELECT anyway? If you really are meant to be handling 60 million rows - have you got enough RAM in _BOTH_ machines? Even at only 1 KB per row of data, you'll be looking at a _bare minimum_ recommendation of 60 million x 1 KB x 3 for the database machine (around 171 _GB_ of RAM). Why those figures? To allow the DB to load in the tables, compile the information, and squirt it out, with temporary space in between. If you have less than that amount of RAM, there will probably be a _hell_ of a lot of disk-thrashing as data is moved to the swap area and back again.

    What is your exact SQL statement?
