Re: processing slows way down
by Zaxo (Archbishop) on Jul 10, 2004 at 13:59 UTC
That sounds like a scaling problem, but munching a sequence of independent records is generally linear in the size. From your description of the problem I think you are expecting that. Two possibilities come to mind: memory usage and database internals.
Are you slurping the file or holding all the records in some global data structure? Driving the machine into swap gives a big performance hit. The solution is to keep records in a small structure which goes out of scope as soon as possible.
Without knowing more details, it's hard to guess which db server operations might be a bottleneck. As a database grows, insertion becomes slower, but your numbers don't seem large enough for that to be the issue. It's possible that a store of pending transactions is getting uncomfortably large. If so, try committing your insertions more often.
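A minimal sketch of what that can look like; the DSN, file name, table and column names are all invented for the example, and the 5,000-row commit interval is arbitrary:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Sketch: read one record at a time, keep per-record data in lexicals
# that go out of scope each pass, and commit in modest batches.
my $dbh = DBI->connect( 'dbi:Oracle:mydb', 'user', 'pass',
                        { AutoCommit => 0, RaiseError => 1 } );
my $sth = $dbh->prepare('INSERT INTO foo (bar, baz, qux) VALUES (?,?,?)');

open my $in, '<', 'input.txt' or die "open: $!";
my $count = 0;
while ( my $line = <$in> ) {               # no slurping - one line at a time
    chomp $line;
    my @fields = split /\t/, $line;        # lexical to the loop body
    next unless grep { length } @fields;   # skip empty records
    $sth->execute( @fields[0..2] );
    $dbh->commit unless ++$count % 5000;   # keep the pending transaction small
}
$dbh->commit;                              # flush whatever is left
$dbh->disconnect;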
| [reply] |
Thank you for responding.
Each record from the input file is read one at a time, processed in a while loop until end of file, and split into fields. The fields are processed if they contain data.
Commits are done every 5,000 inserts.
As I said before, processing that way got through 30,000 input records in an hour; the rest got progressively slower.
| [reply] |
my $sth_i = $dbh->prepare_cached( "INSERT INTO foo (bar, baz, qux) VALUES (?,?,?)" );
# or
my $sth_u = $dbh->prepare_cached( "UPDATE foo SET bar = ?, baz = ? WHERE qux = ?" );

# make it easy on perl, ie declare loop vars outside the loop to save create/destroy
# yes it should be optimised but it seems to help speed/memory consumption
my @fields;

# delete irrelevant indexes here before we start the load
open IN, $infile or die $!;
while (<IN>) {
    # fail fast here if there is nothing to do for this line - even before you chomp;
    # lots of loops, so every little bit of saving helps
    chomp;
    @fields = split /\t/;
    $sth_i->execute( @fields[0..2] );
}
# recreate indexes so you are not wasting time doing partial indexing
Native C-based loaders are always faster in my experience. You will have to experiment with the best commit frequency; possibly just one commit, but then memory may become an issue. With big loops it goes without saying that optimising everything within the loop reaps benefits: 5 msec times 200K is a long time. I would strongly recommend posting your code. It may be that you are making some very basic errors; at the very least there are probably optimisations to be had. I can load well over a million recs an hour into MySQL using pure Perl, so your load rate sounds atrocious...
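If you want hard numbers rather than a feeling for where the time goes, a rough sketch like the following can be bolted onto the loop above (Time::HiRes has shipped with Perl since 5.8; $sth_i, the IN filehandle and the tab-separated input are the ones assumed in the code above):

use Time::HiRes qw(gettimeofday tv_interval);

my $t0   = [gettimeofday];
my $rows = 0;
while (<IN>) {
    chomp;
    $sth_i->execute( ( split /\t/ )[ 0 .. 2 ] );
    next if ++$rows % 10_000;
    # if rows/sec keeps dropping, the per-row cost is growing as the load runs
    printf "%7d rows  %6.0f rows/sec overall\n", $rows, $rows / tv_interval($t0);
}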
| [reply] [d/l] |
Re: processing slows way down
by graff (Chancellor) on Jul 10, 2004 at 15:15 UTC
Whenever you have a really large amount of data to insert into a table, it is much better and faster to use a loading utility that is native to the database server. (In the case of Oracle, this is the SQL*Loader utility, invoked as sqlldr.)
It would be most effective to use a simple Perl script to create the text stream that will serve as input to the native dbms import tool; that gives you a chance to make sure the input won't contain any really hazardous mistakes, i.e. errors in the data that the dbms tool might accept even though they are errors.
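A minimal sketch of that idea; the field names, delimiter and validation rules here are invented for the example, and the output file is what you would then point the native loader at (via its control file, in SQL*Loader's case):

#!/usr/bin/perl
use strict;
use warnings;

# Clean the user data and write a delimited file for the native loader.
open my $in,  '<', 'user_data.txt' or die "read: $!";
open my $out, '>', 'load_me.dat'   or die "write: $!";

while ( my $line = <$in> ) {
    chomp $line;
    my ( $bar, $baz, $qux ) = split /\t/, $line;

    # reject rows the loader would happily swallow but we know are wrong
    next unless defined $qux and $qux =~ /^\d+$/;
    $baz = uc $baz;    # stand-in for whatever translation the data needs

    print {$out} join( '|', $bar, $baz, $qux ), "\n";
}
close $out or die "close: $!";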
I haven't done a recent systematic benchmark of Perl/DBI vs. SQL*Loader, but the last time I had occasion to compare the two, doing row inserts via Perl/DBI was orders of magnitude slower than using SQL*Loader. | [reply] |
Thank you for your suggestion.
Yes, we've used SQL*Loader and it has worked fine for us for straight loads.
We thought, however, that Perl had a little more flexibility than SQL*Loader given the complexity of the user data: we have to translate the users' data into valid database data, combine existing input and database data to form new database data, and be able to create descriptive edit reports of the load.
| [reply] |
Re: processing slows way down
by CountZero (Bishop) on Jul 10, 2004 at 14:23 UTC
When doing large inserts into a database, it is sometimes better to delete the indexes on the tables you insert into and rebuild them once all the inserts have been processed. The idea is that it is much more efficient to do all the indexing at once rather than in little steps.
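Through DBI, that could look something like this sketch (the index and table names are placeholders; the exact CREATE INDEX options would depend on your schema):

# drop the index, run the bulk insert, then build the index once at the end
$dbh->do('DROP INDEX foo_qux_idx');

# ... insert loop goes here ...

$dbh->do('CREATE INDEX foo_qux_idx ON foo (qux)');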
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
| [reply] |
Thank you for responding.
I ran one 32,000-line input file in about an hour. I ran another 32,000-line input file right after, also in about an hour, without rebuilding indexes.
If what you suggest were true, wouldn't each successive input file load, without rebuilding the indexes, get progressively slower?
| [reply] |
Not necessarily in such a way that one would see it easily. If you have a database with a million records and you first add 32,000 records and then again 32,000 records, the time-difference between the two runs is probably too small to notice.
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
| [reply] |
Re: processing slows way down
by BrowserUk (Patriarch) on Jul 10, 2004 at 17:09 UTC
One possibility is that you have a badly scoped variable or structure that is retaining redundant information from record to record and growing to the point where it becomes harder and harder to allocate enough memory.
A fairly simple test would be to add a
my $recno = 0;
while( <GET RECORD> ) {
    ...
    AddRecordToDB( ... );
    do {
        print 'Check memory:';
        <STDIN>;
    } unless ++$recno % 1000;
}
at the bottom of the main loop and use ps or a similar tool to monitor and record the task's memory usage.
If the memory growth increases substantially from one check to the next, you're leaking somewhere. Then it's a case of tracking down where.
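If you would rather log the numbers than stop and eyeball ps by hand, something like this could stand in for the do { ... } <STDIN> block above; it assumes a Unix-ish ps that accepts -o rss= -p:

unless ( ++$recno % 1000 ) {
    my $rss = `ps -o rss= -p $$`;   # resident set size in KB on most Unixes
    $rss =~ s/\s+//g;
    print "records: $recno  rss: ${rss}KB\n";
}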
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
| [reply] [d/l] [select] |
The fields for the user data are defined by my's at the top of the script. The data is read one record at a time in a while loop until end of file. In the loop, the fields are interpreted, the table fields are formatted, the insert is done, and the loop repeats.
We are not Perl whizzes. The extent of our Perl knowledge is pretty much what I explained above. We pick up new bits of code as we come across new situations.
The scope for what I described is the whole script, because of the my's at the top of the script? If so, how could it be a scope issue?
Thank you for trying to help.
| [reply] |
how could it be a scope issue?
Without seeing the code, there really is no way to tell where the problem might lie. That's why I offered the suggestion on how you could check the problem out.
It could be that there is a leak in one of the components you're using to access Oracle. The simple test there is to comment out the DB code within the loop and re-run the script. If that stops the memory growth, you know where to look. If it doesn't, you also know where.
All of which is pure speculation until you perform the test to see if memory leakage is the cause of the slowdown.
Another possibility is that you are misusing DBI. There have been examples of code here at PM where people have been preparing their statements each time around the loop, rather than once outside and then executing them inside.
Again, without sight of the code, pure speculation.
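For the record, the difference looks like this; the two fragments are alternatives, and the table, columns and filehandle are only examples:

# Slow: a fresh prepare for every record
while ( my $line = <$in> ) {
    chomp $line;
    my $sth = $dbh->prepare('INSERT INTO foo (bar, baz, qux) VALUES (?,?,?)');
    $sth->execute( split /\t/, $line );
}

# Better: prepare once, execute many times
my $sth = $dbh->prepare('INSERT INTO foo (bar, baz, qux) VALUES (?,?,?)');
while ( my $line = <$in> ) {
    chomp $line;
    $sth->execute( split /\t/, $line );
}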
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
| [reply] [d/l] |