in reply to Re: Speeding up Postgres INSERTs
in thread Speeding up Postgres INSERTs

I finally got a result back from one set of INSERTs. I am getting ~3300 insertions per second. I would consider that fairly slow.
--

when small people start casting long shadows, it is time to go to bed

Re^3: Speeding up Postgres INSERTs
by JavaFan (Canon) on Jun 18, 2010 at 14:59 UTC
    3300 insertions/second comes to 30 seconds for 100K. I find 30 fairly large for "several". But is this number for 3300 separate queries, or a single query with 3300 rows? Of course, there are many other factors to consider, none of them mentioned in the original post, nor in any way Perl-related:
    1. How many indices are there? What kind of indices? Do they allow new rows to be written at the end of the table, or must the table be rewritten?
    2. Are there any constraints? Foreign keys? Unique indices? Triggers? Auditing?
    3. How many rows are there already?
    4. How many other processes are using the table while you're doing inserts?
    5. What's the hardware the database is running on?
    6. How much memory does said hardware have?
    7. How "hot" is the table?
    8. Is the table being replicated?
    9. What's the physical layout of where the data is written to? What's the physical layout of where your logs are written to?
      JavaFan> 3300 insertions/second comes to 30 seconds for 100K. 
      JavaFan> I find 30 fairly large for "several". 
      
      Indeed. When I asked the original question, I didn't have the correct number. Now I do, for one instance of the job.
      JavaFan> But is this number for 3300 separate queries, or a single 
      JavaFan> query with 3300 rows?
      
      I am not sure what you mean by "a single query with 3300 rows." I mean $sth->execute() is called for each row, and a $dbh->commit() is called once every 100_000 rows. In the end, I get i rows inserted in j wallclock seconds (via Benchmark), and i / j is equal to approx. 3300.
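
      For concreteness, the loop looks roughly like this (the table name, its columns, and get_next_row() are stand-ins for my actual code):

          use DBI;

          my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'password',
                                 { AutoCommit => 0, RaiseError => 1 });
          my $sth = $dbh->prepare('INSERT INTO my_table (a, b) VALUES (?, ?)');

          my $count = 0;
          while (my $row = get_next_row()) {    # get_next_row() is a stand-in row source
              $sth->execute(@$row);             # one execute() per row
              $dbh->commit if ++$count % 100_000 == 0;  # commit every 100K rows
          }
          $dbh->commit;                         # commit the final partial batch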

      Other answers --

      • How many indices are there? What kind of indices? Do they allow new rows to be written at the end of the table, or must the table be rewritten?
        • No indices except for a primary key. I don't allow or disallow anything; I am simply inserting the rows. I have no idea whether the db is writing them at the end of the table, but I assume it is. If it is rewriting the entire table every time, then it is a very silly db.
      • Are there any constraints? Foreign keys? Unique indices? Triggers? Auditing?
        • none whatsoever for all of the above
      • How many rows are there already?
        • several hundred million
      • How many other processes are using the table while you're doing inserts?
        • No other processes. Except, now I have started experimenting with doing parallel INSERTs by starting new Perl processes inserting different sets of rows. See the next question on hardware.
      • What's the hardware the database is running on?
        • It is a dual 3 GHz Xeon Xserve with two cores per processor. Hence my experimentation with four concurrent Perl processes doing row insertions.
      • How much memory does said hardware have?
        • 12 GB
      Re. the rest of your questions:
      • How "hot" is the table? Don't know what that means.
      • Is the table being replicated? No.
      • What's the physical layout of where the data is written to? Don't know what that means.
      • What's the physical layout of where your logs are written to? Don't know.
      --

      when small people start casting long shadows, it is time to go to bed
        I am not sure what you mean by "a single query with 3300 rows." I mean $sth->execute() is called for each row, and a $dbh->commit() is called once every 100_000 rows.
        As I explained earlier, for many databases there can be a significant difference between creating an insert statement that inserts a single row and calling execute() for each row, versus a single statement that inserts thousands of rows in one shot. Apparently, you haven't tried out that variant.
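
        Something along these lines (my_table, its columns, and fetch_rows() are made up for the example):

            # Build one INSERT carrying many rows, e.g. 3300 (a, b) pairs,
            # and send it to the server in a single round trip.
            my @rows = fetch_rows(3300);        # made-up row source
            my $sql  = 'INSERT INTO my_table (a, b) VALUES '
                     . join ', ', ('(?, ?)') x @rows;
            my $sth  = $dbh->prepare($sql);
            $sth->execute(map { @$_ } @rows);   # bind all values at once
            $dbh->commit;

        That's one statement parse and one round trip instead of 3300.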

        As for the other questions, they are important if you need maximum speed. Giving me the answers is useless; you'd have to consider them all yourself. They are important for determining where your bottleneck is. Disks? The channels to your disks? Memory? CPU? But this all goes way beyond a web-based Perl forum.

        If it is rewriting the entire table everytime then it is a very silly db
        Says you. There are storage engines that keep rows in primary-key order; that's not silly. It makes searching on primary keys fast, and getting ranges fast as well. How one wants a database tuned depends entirely on the application. It often makes sense to tune a database for fast retrieval, even if it makes inserts slower. Of course, 'rewriting' doesn't mean it's writing the entire table for each row: a page will get reordered, or a full page may be split into two.
        How "hot" is the table? Don't know what that means.
        How much of the table is cached somewhere? (In-process pages, filesystem buffers (assuming you have the additional filesystem overhead), disk controller caches, etc.)
        What's the physical layout of where the data is written to? Don't know what that means.
        Are you writing to a single disk? Multiple disks? A SAN? Mirrored disks? RAID? Which RAID level? Are you striping? How is your data storage connected? Copper? Fibre Channel? If you have a 10 TB database, this matters. A lot. Especially since you have insignificant memory compared to the size of the database. You may have to try out several setups to determine what works for you. You said disk space is cheap. Sure, raw disk space is cheap. Getting a 10 TB database to perform well may not be so cheap.

        There's a lot more to databases than just having lots of disk space and a bunch of SQL statements.

        The more the better. How many rows you can fit in a single statement is usually limited by the maximum packet size.
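
        A rough sketch of chunking rows so each statement stays under that limit (my_table, the columns, and the chunk size are illustrative; tune the chunk to your server):

            my $CHUNK = 1_000;                  # illustrative batch size
            while (my @batch = splice @rows, 0, $CHUNK) {
                my $sql = 'INSERT INTO my_table (a, b) VALUES '
                        . join ', ', ('(?, ?)') x @batch;
                $dbh->do($sql, undef, map { @$_ } @batch);
            }
            $dbh->commit;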