Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I realize this is for perl related questions and that my question here might be more of a database issue. I thought someone else might have some experience using perl with an Oracle database.

A co-worker and I have been using perl to load user-provided data into an Oracle 9i database. The user data is variable length, so we split each record and only process a field if it contains data.

Input files can be up to 200,000 lines long and half a gig in size. Process counts show us how much of the input file has been processed. We use bind variables when doing database accesses.

5,000 lines would process in a given amount of time, so we'd use that to estimate how long a particular job might take. But the count of processed lines would climb more and more slowly, and our estimates were way off.

My question is: why would a 65,000-line input file take 1 hour to process the first 30,000 lines but 8 hours to do the rest?

TIA

Re: processing slows way down
by Zaxo (Archbishop) on Jul 10, 2004 at 13:59 UTC

    That sounds like a scaling problem, but munching a sequence of independent records is generally linear in the size of the input. From your description of the problem, I think that's the behaviour you were expecting. Two possibilities come to mind: memory usage and database internals.

    Are you slurping the file or holding all the records in some global data structure? Driving the machine into swap gives a big performance hit. The solution is to keep records in a small structure which goes out of scope as soon as possible.

    Without knowing more details, it's hard to guess what db server operations might be a bottleneck. As a database grows, insertion becomes slower, but your numbers don't seem large enough to make that the issue. It's possible that a store of pending transactions is getting uncomfortably large. If so, try committing your insertions more often.
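
    For what it's worth, a minimal sketch of batched commits via DBI with AutoCommit switched off might look something like this; the DSN, credentials, table, column names and batch size below are placeholders, not your actual code:

    # minimal sketch of batched commits with DBI; DSN, credentials,
    # table and column names are placeholders only
    use DBI;

    my $dbh = DBI->connect( 'dbi:Oracle:mydb', $user, $pass,
                            { AutoCommit => 0, RaiseError => 1 } );
    my $sth = $dbh->prepare( 'INSERT INTO foo (bar, baz) VALUES (?, ?)' );

    my $count = 0;
    while ( my $line = <IN> ) {
        chomp $line;
        my @fields = split /\t/, $line;
        $sth->execute( @fields[ 0, 1 ] );
        $dbh->commit unless ++$count % 5000;    # commit every 5,000 rows
    }
    $dbh->commit;       # pick up any stragglers
    $dbh->disconnect;

    Smaller batches keep the pile of pending work (the undo/rollback segments, in Oracle's case) from ballooning, at the cost of a few more round trips.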

    After Compline,
    Zaxo

      Thank you for responding.

      Each record from the input file is read one at a time in a while loop until end of file and split into fields. The fields are processed only if they contain data.

      Commits are done every 5,000 inserts.

      Like I said before, loading that way 30,000 input records got processed in an hour, and the rest went progressively slower.

        Assuming you have good reasons not to use native sqlload and want to use Perl....

        1. Are you using prepare_cached and bind values?
        2. Are you sure you are not hitting swap because you're running out of RAM? It sounds like it on the surface, as this is a very common reason for a scaling-based slowdown.
        3. Is it due to indexing time? i.e. should you drop the indexes, load, and then rebuild them? Indexing takes time, so it makes sense to do it just the once at the end. If you are doing an update, of course, you want an index on the WHERE field....

        Your load code should look roughly like:

        my $sth_i = $dbh->prepare_cached(
            "INSERT INTO foo (bar, baz, qux) VALUES (?,?,?)" );
        # or
        my $sth_u = $dbh->prepare_cached(
            "UPDATE foo SET bar = ?, baz = ? WHERE qux = ?" );

        # make it easy on perl, i.e. declare loop vars outside the loop to save create/destroy
        # yes, it should be optimised anyway, but it seems to help speed/memory consumption
        my @fields;

        # delete irrelevant indexes here before we start the load

        open IN, $infile or die $!;
        while (<IN>) {
            @fields = split "\t";
            # fail fast if you don't want to do anything - even before you chomp
            # lots of loops, so every little bit of saving helps
            $sth_i->execute( @fields[0..2] );
        }

        # recreate indexes so you are not wasting time doing partial indexing

        Native C based loaders are always faster in my experience. You will have to experiment with the best commit frequency, possibly just one commit, but memory may become an issue. With big loops it goes without saying that optimising everything within the loop reaps benefits: 5 msec times 200K is a long time. I would strongly recommend posting your code. It may be you are making some very basic errors; at the very least there are probably optimisations to be had. I can load well over a million recs an hour into MySQL using pure Perl, so your load rate sounds atrocious.....

        cheers

        tachyon

Re: processing slows way down
by graff (Chancellor) on Jul 10, 2004 at 15:15 UTC
    Whenever you have a really large amount of data to insert into a table, it is much better and faster to use a loading utility that is native to the database server. (In the case of Oracle, this would be a tool called sqlload, or something similar to that -- I don't know the exact spelling of the tool name in Oracle 9i.)

    It would be most effective to use a simple Perl script to create the text stream that will serve as input to the native dbms import tool, so that you have a chance to make sure that the input won't involve any really hazardous mistakes -- i.e. errors in the data that the dbms tool might find acceptable even though they are errors.

    I haven't done a recent systematic benchmark of Perl/DBI vs. sqlload, but the last time I had occasion to compare the two, doing row inserts via Perl/DBI was orders of magnitude slower than using sqlload.

      Thank you for your suggestion.

      Yes we've used sql loader and it's worked fine for us doing straight loads.

      We thought, however, that with the complexity of the user data (having to translate it to valid database data, combine existing input and database data to form new database data, and be able to create descriptive edit reports of the load), perl had a little more flexibility than sql loader.
        We thought, however, that with the complexity of the user data (having to translate it to valid database data, combine existing input and database data to form new database data, and be able to create descriptive edit reports of the load), perl had a little more flexibility than sql loader.

        You might be best off using Perl to process your user input but replacing your insert statements with prints to an output file that can subsequently be processed by the SQL loader.
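
        A rough sketch of that split, assuming tab-separated input and a pipe-delimited staging file (the file names and field positions here are made up for illustration):

        # sketch: do the translation/reporting in Perl, leave the inserts to the loader;
        # file names, delimiter and field positions are illustrative only
        open my $in,  '<', 'userdata.txt' or die "read: $!";
        open my $out, '>', 'load_me.dat'  or die "write: $!";

        while ( my $line = <$in> ) {
            chomp $line;
            my @fields = split /\t/, $line;
            next unless grep { length } @fields;    # skip empty records
            # ... translate/validate fields, write edit-report lines here ...
            print {$out} join( '|', @fields[ 0 .. 2 ] ), "\n";
        }
        close $out or die $!;
        close $in;
        # then feed load_me.dat to the Oracle loader with a matching control file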

Re: processing slows way down
by CountZero (Bishop) on Jul 10, 2004 at 14:23 UTC
    When doing large inserts into a database, you are sometimes better off deleting the indexes on the tables you insert into and rebuilding them once all the inserts have been processed. The idea is that it is much more efficient to do all the indexing at once rather than in little steps.
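
    From DBI that might look roughly like the following; the index, table and column names are invented for the example:

    # drop the index before the bulk load, rebuild it afterwards;
    # index/table/column names are purely illustrative
    $dbh->do( 'DROP INDEX foo_bar_idx' );

    # ... do all the inserts here ...

    $dbh->do( 'CREATE INDEX foo_bar_idx ON foo (bar)' );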

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      Thank you for responding.

      I ran one 32,000-line input file in about an hour. I ran another 32,000-line input file right after, also in about an hour, without rebuilding indexes.

      If what you suggest were true, wouldn't each input file loaded without rebuilding the indexes get progressively slower?
        Not necessarily in such a way that one would see it easily. If you have a database with a million records and you first add 32,000 records and then another 32,000 records, the time difference between the two runs is probably too small to notice.

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: processing slows way down
by BrowserUk (Patriarch) on Jul 10, 2004 at 17:09 UTC

    One possibility is that you have a badly scoped variable or structure that is retaining redundant information from record to record and growing to the point that it becomes harder and harder to allocate enough memory.

    A fairly simple test would be to add a

    my $recno = 0;
    while( <GET RECORD> ) {
        ...
        AddRecordToDB( ... );
        do{ print 'Check memory:'; <STDIN>; } unless ++$recno % 1000;
    }
    at the bottom of the main loop and use ps or a similar tool to monitor and record the task's memory usage.

    If the memory usage grows substantially from one check to the next, you're leaking somewhere. Then it's a case of tracking down where.
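
    If pausing for a manual check is inconvenient, a variation (assuming a Unix-ish ps that understands -o rss=) is to have the script report its own resident size in place of the do { ... <STDIN> } block:

    # report the script's resident set size (KB) every 1,000 records
    # instead of pausing; assumes ps supports the "-o rss= -p pid" form
    unless ( ++$recno % 1000 ) {
        my $rss = qx(ps -o rss= -p $$);
        chomp $rss;
        print "records: $recno   rss: $rss KB\n";
    }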


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
      The fields for the user data are declared with my's at the top of the script. The data is read in a while loop until end of file. In the loop, the fields are interpreted, the table fields are formatted, the insert is done, and the loop repeats.

      We are not perl whizzes. The extent of our perl knowledge is pretty much what I explained above. We pick up new bits of code as we come across new situations.

      The scope of what I described is the whole script because of the my's at the top of the script? If so, how could it be a scope issue?

      Thank you for trying to help.
        how could it be a scope issue?

        Without seeing the code, there really is no way to tell where the problem might lie. That's why I offered the suggestion on how you could check the problem out.

        It could be that there is a leak in one of the components you're using to access Oracle. The simple test there is to comment out the DB code within the loop and re-run the script. If that stops the memory growth, you know where to look. If it doesn't, you also know where.

        All of which is pure speculation until you perform the test to see if memory leakage is the cause of the slowdown.

        Another possibility is that you are misusing DBI. There have been examples of code here at PM where people have been preparing their statements each time around the loop, rather than once outside and then executing them inside.
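
        To illustrate the difference (the statement here is just a placeholder, not your code):

        # slow: re-preparing the same statement on every pass
        while ( my $line = <$in> ) {
            chomp $line;
            my $sth = $dbh->prepare( 'INSERT INTO foo (bar) VALUES (?)' );
            $sth->execute( $line );
        }

        # better: prepare once, execute many times
        my $sth = $dbh->prepare( 'INSERT INTO foo (bar) VALUES (?)' );
        while ( my $line = <$in> ) {
            chomp $line;
            $sth->execute( $line );
        }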

        Again, without sight of the code, pure speculation.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon