in reply to Automating data loading into Oracle

I do this sort of thing quite a lot, but generally using perl/java/javascript as a component within a framework like Maxware DSE or Critical Path IMD. One thing you might want to consider (depending on your speed requirements) is consolidating the data into a single location before pushing it out to the various tables in Oracle.

Gotchas to look out for:

Testing:

Data::Sync was written as an alternative to commercial tools that do this sort of job, but it's quite new and might as yet be too basic for your needs.

One trick I've occasionally used for complex transformations is the following:

    read from source to temporary database
    for (defined operations) {
        read from temp database x
        transform
        write to temp database y
    }
    write to target

Although obviously that has a significant impact on the speed of the whole operation.
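
If it helps, here's a minimal, self-contained sketch of that pattern in Perl/DBI. DBD::SQLite is used as the scratch store purely for convenience, and the table layout (id, name, email), the sample rows and the two transforms are all invented for the example; in real use the load and final write steps would read your source feed and push to Oracle.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Scratch database for the intermediate steps. SQLite is used here purely
    # for convenience; any DBI-supported database would do as the temp store.
    my $tmp = DBI->connect( 'dbi:SQLite:dbname=staging.db', '', '',
                            { RaiseError => 1, AutoCommit => 1 } );

    # 1. Read from source into temp table stage_0 (faked with literal rows here).
    $tmp->do('DROP TABLE IF EXISTS stage_0');
    $tmp->do('CREATE TABLE stage_0 (id INTEGER, name TEXT, email TEXT)');
    my $load = $tmp->prepare('INSERT INTO stage_0 VALUES (?,?,?)');
    $load->execute(@$_) for ( [ 1, ' Alice ', 'ALICE@EXAMPLE.COM' ],
                              [ 2, 'Bob',     'bob@example.com'   ] );

    # 2. Each defined operation reads temp table x, transforms, writes temp table y.
    my @operations = (
        sub { my $r = shift; $r->{name}  =~ s/^\s+|\s+$//g; $r },   # trim names
        sub { my $r = shift; $r->{email} = lc $r->{email};  $r },   # lowercase emails
    );
    my $step = 0;
    for my $op (@operations) {
        my $from = "stage_$step";
        my $to   = 'stage_' . ++$step;
        $tmp->do("DROP TABLE IF EXISTS $to");
        $tmp->do("CREATE TABLE $to AS SELECT * FROM $from WHERE 0");
        my $read  = $tmp->prepare("SELECT id, name, email FROM $from");
        my $write = $tmp->prepare("INSERT INTO $to VALUES (?,?,?)");
        $read->execute;
        while ( my $row = $read->fetchrow_hashref ) {
            my $out = $op->($row);
            $write->execute( @{$out}{qw(id name email)} );
        }
    }

    # 3. The final temp table is what you'd push out to Oracle; just print it here.
    print join( "\t", @$_ ), "\n"
        for @{ $tmp->selectall_arrayref("SELECT id, name, email FROM stage_$step") };

One nice side effect of staging like this is that every intermediate table can be inspected (or the run restarted from it) when a transformation misbehaves, which goes some way towards offsetting the speed cost.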

--------------------------------------------------------------

"If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing."

John Brunner, "The Shockwave Rider".

Re^2: Automating data loading into Oracle
by punkish (Priest) on Oct 24, 2005 at 22:20 UTC
    This is good! I will check out Data::Sync, but I might have to stick to DBI, various logging modules, and testing modules.

    One important point you bring up is "recovery." What _should_ I do if something goes kerbonk? Do I log an error and merrily go on doing what I was doing? Do I ring an alarm bell? Do I abort that task?

    Of course, logging is going to be very important, not just to cover my arse but also to help the client cover theirs.

    Speed is not that critical a factor, and getting all data in one place may also be do-able.

    --

    when small people start casting long shadows, it is time to go to bed
      What you do about your script keeling over depends on a number of things: what stage it's at when the error occurs, how critical the target data consistency is, whether you're implementing deletions, etc.

      Two examples of things that can go horribly, horribly wrong:

      • You have an application which assumes that field X in a source file (e.g. email address) is unique. Your application therefore uses that as a primary key in your consolidation and transformations. Your application also (for safety's sake) doesn't delete entries that are no longer in the file. The supplier of the file makes a mistake in generation and supplies you with a file with an extra field at the beginning of the record. Bang! Your data is now all over the place (this has happened on several projects I've been on).
      • Conversely (and this has happened to me several times as well) you have implemented record deletion - if a record is no longer in the file, you delete it. So one day the application generating the source file dies partway through, you get a partial file, and you end up deleting a chunk of the data in the target. The cold thrill you get when you suddenly notice your target database is a tenth the size it should be is quite memorable.
      The way I've got around the former in the next version of Data::Sync is to implement a 'validation' method. You can specify a pattern match for every field - before writing, it goes through the candidate record set, checking that every field matches its pattern in every record. If any field fails, it doesn't write. Implementing a check like that is one of the reasons I suggested pulling everything together into a single consolidated view before writing to the target(s).
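
      To be clear, that's the idea rather than Data::Sync's actual interface - a hand-rolled version with plain regexes might look something like this (the field names and patterns are invented for the example):

        use strict;
        use warnings;

        # Per-field pattern checks, applied to the whole candidate record set
        # before anything is written to the target. Field names and patterns
        # are invented for the example.
        my %validation = (
            uid   => qr/^\w+$/,
            email => qr/^[^@\s]+@[^@\s]+$/,
            phone => qr/^[\d\s()+-]*$/,
        );

        sub records_are_valid {
            my @records = @_;
            for my $record (@records) {
                for my $field ( keys %validation ) {
                    my $value = defined $record->{$field} ? $record->{$field} : '';
                    if ( $value !~ $validation{$field} ) {
                        warn "validation failed on $field ('$value') - not writing\n";
                        return 0;
                    }
                }
            }
            return 1;
        }

        # e.g. write_records(@candidates) if records_are_valid(@candidates);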

      Setting AutoCommit => 0 on your database handle might be useful for this - if your script fails anywhere short of the line $dbh->commit(), the uncommitted changes are rolled back and nothing in the target is damaged.
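
      A minimal sketch of that transaction pattern, assuming DBD::Oracle, with connection details, table/column names and data as placeholders:

        use strict;
        use warnings;
        use DBI;

        # Connection details are placeholders. RaiseError makes any failure die,
        # so the eval below catches it before anything is committed.
        my $dbh = DBI->connect( 'dbi:Oracle:mydb', 'user', 'password',
                                { RaiseError => 1, AutoCommit => 0 } );

        my @records = ( [ 1, 'Alice' ], [ 2, 'Bob' ] );   # dummy data for the sketch

        eval {
            my $sth = $dbh->prepare('INSERT INTO target_table (id, name) VALUES (?,?)');
            $sth->execute(@$_) for @records;
            $dbh->commit;          # nothing is visible in the target until this line
        };
        if ($@) {
            warn "load failed, rolling back: $@";
            $dbh->rollback;        # target left exactly as it was before the run
        }
        $dbh->disconnect;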

      Another approach to record deletion/mangling is to allow a configured maximum number of changes in an operation: if the number of changes is greater than x% of the total record set, it stops with an error. You may want to implement something similar.
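
      Something along these lines, say - the subroutine name and the 10% threshold below are just example choices:

        use strict;
        use warnings;

        # Refuse to apply a suspiciously large batch of changes - e.g. a truncated
        # source file showing up as thousands of "deletions". The 10% figure is an
        # example; pick whatever your data's normal churn rate suggests.
        my $max_change_fraction = 0.10;

        sub check_change_volume {
            my ( $pending_changes, $total_records ) = @_;
            die "target appears empty - refusing to guess\n" unless $total_records;
            my $fraction = $pending_changes / $total_records;
            if ( $fraction > $max_change_fraction ) {
                die sprintf "refusing to apply %d changes (%.0f%% of %d records); "
                          . "check the source file before rerunning\n",
                          $pending_changes, 100 * $fraction, $total_records;
            }
            return 1;
        }

        # e.g. check_change_volume( scalar @deletions, $target_row_count );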

      How critical an error is can depend on all sorts of things, but is probably best assessed with a (semi) formal risk assessment. E.g. if an error renders the data in your company phone directory missing or incorrect (I'm using user information as an example because identity management systems are what I'm most familiar with) it makes you look foolish and causes some annoyance. OTOH, if you push that out to the login database, you suddenly have x% of the company sat twiddling their thumbs, unable to log in until you fix it. This costs the company a lot of money, and can cause severe boss's-shoe-induced bruising on your behind.

      There's lots of information around on risk assessment, and really it's the job of the (project) manager, but if you find yourself doing it, you should take into account at least the following:

      • Likelihood of a problem happening
      • Probable impact on the system if it does happen
      • Criticality of the system
      • Likely time to fix (can be included in impact)
      • Cost (in terms of extra coding in your case) of mitigating the risk

      --------------------------------------------------------------

      "If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing."

      John Brunner, "The Shockwave Rider".