in reply to Automating data loading into Oracle

I do this sort of thing quite a lot, but generally using perl/java/javascript as a component within a framework like Maxware DSE or Critical Path IMD. One thing you might want to consider (depending on your speed requirements) is consolidating the data into a single location before pushing it out to the various tables in Oracle.

Gotchas to look out for:

Testing:

Data::Sync was written as an alternative to commercial tools that do this sort of job, but it's quite new and might as yet be too basic for your needs.

One trick I've occasionally used for complex transformations is the following:

    read from source to temporary database
    for (defined operations) {
        read from temp database x
        transform
        write to temp database y
    }
    write to target

Although obviously that has a significant impact on the speed of the whole operation.
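
If it helps, here's a minimal, self-contained sketch of that pattern in Perl/DBI. DBD::SQLite is used as the scratch store purely for convenience, and the table layout (id, name, email), the sample rows and the two transforms are all invented for the example; in real use the load and final write steps would read your source feed and push to Oracle.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Scratch database for the intermediate steps. SQLite is used here purely
    # for convenience; any DBI-supported database would do as the temp store.
    my $tmp = DBI->connect( 'dbi:SQLite:dbname=staging.db', '', '',
                            { RaiseError => 1, AutoCommit => 1 } );

    # 1. Read from source into temp table stage_0 (faked with literal rows here).
    $tmp->do('DROP TABLE IF EXISTS stage_0');
    $tmp->do('CREATE TABLE stage_0 (id INTEGER, name TEXT, email TEXT)');
    my $load = $tmp->prepare('INSERT INTO stage_0 VALUES (?,?,?)');
    $load->execute(@$_) for ( [ 1, ' Alice ', 'ALICE@EXAMPLE.COM' ],
                              [ 2, 'Bob',     'bob@example.com'   ] );

    # 2. Each defined operation reads temp table x, transforms, writes temp table y.
    my @operations = (
        sub { my $r = shift; $r->{name}  =~ s/^\s+|\s+$//g; $r },   # trim names
        sub { my $r = shift; $r->{email} = lc $r->{email};  $r },   # lowercase emails
    );
    my $step = 0;
    for my $op (@operations) {
        my $from = "stage_$step";
        my $to   = 'stage_' . ++$step;
        $tmp->do("DROP TABLE IF EXISTS $to");
        $tmp->do("CREATE TABLE $to AS SELECT * FROM $from WHERE 0");
        my $read  = $tmp->prepare("SELECT id, name, email FROM $from");
        my $write = $tmp->prepare("INSERT INTO $to VALUES (?,?,?)");
        $read->execute;
        while ( my $row = $read->fetchrow_hashref ) {
            my $out = $op->($row);
            $write->execute( @{$out}{qw(id name email)} );
        }
    }

    # 3. The final temp table is what you'd push out to Oracle; just print it here.
    print join( "\t", @$_ ), "\n"
        for @{ $tmp->selectall_arrayref("SELECT id, name, email FROM stage_$step") };

One nice side effect of staging like this is that every intermediate table can be inspected (or the run restarted from it) when a transformation misbehaves, which goes some way towards offsetting the speed cost.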

--------------------------------------------------------------

"If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing."

John Brunner, "The Shockwave Rider".

Re^2: Automating data loading into Oracle
by punkish (Priest) on Oct 24, 2005 at 22:20 UTC
    This is good! I will check out Data::Sync, but I might have to stick to DBI, various logging modules, and testing modules.

    One important point you bring up is "recovery." What _should_ I do if something goes kerbonk? Do I log an error and merrily go on doing what I was doing? Do I ring an alarm bell? Do I abort that task?

    Of course, logging is going to be very important, not just to cover my arse but also to help the client cover theirs.

    Speed is not that critical a factor, and getting all data in one place may also be do-able.

    --

    when small people start casting long shadows, it is time to go to bed
      What you do about your script keeling over depends on a number of things: what stage it's at when the error occurs, how critical the target data consistency is, whether you're implementing deletions, etc.

      Two examples of things that can go horribly, horribly wrong:

      • You have an application which assumes that field X in a source file (e.g. email address) is unique. Your application therefore uses that as a primary key in your consolidation and transformations. Your application also (for safety's sake) doesn't delete entries that are no longer in the file. The supplier of the file makes a mistake in generation and supplies you with a file with an extra field at the beginning of the record. Bang! Your data is now all over the place (this has happened on several projects I've been on).
      • Conversely (and this has happened to me several times as well) you have implemented record deletion - if a record is no longer in the file, you delete it. So one day the application generating the source file dies partway through, you get a partial file, and you end up deleting a chunk of the data in the target. The cold thrill you get when you suddenly notice your target database is a tenth the size it should be is quite memorable.
      The way I've got around the former in the next version of Data::Sync is to implement a 'validation' method. You can specify a pattern match for every field - before writing, it goes through the candidate record set, checking that every field matches its pattern in every record. If any field fails, it doesn't write. Implementing a check like that is one of the reasons I suggested pulling everything together into a single consolidated view before writing to the target(s).
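
      To be clear, that's the idea rather than Data::Sync's actual interface - a hand-rolled version with plain regexes might look something like this (the field names and patterns are invented for the example):

        use strict;
        use warnings;

        # Per-field pattern checks, applied to the whole candidate record set
        # before anything is written to the target. Field names and patterns
        # are invented for the example.
        my %validation = (
            uid   => qr/^\w+$/,
            email => qr/^[^@\s]+@[^@\s]+$/,
            phone => qr/^[\d\s()+-]*$/,
        );

        sub records_are_valid {
            my @records = @_;
            for my $record (@records) {
                for my $field ( keys %validation ) {
                    my $value = defined $record->{$field} ? $record->{$field} : '';
                    if ( $value !~ $validation{$field} ) {
                        warn "validation failed on $field ('$value') - not writing\n";
                        return 0;
                    }
                }
            }
            return 1;
        }

        # e.g. write_records(@candidates) if records_are_valid(@candidates);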

      Setting AutoCommit => 0 on your database handle might be useful for this - if your script fails anywhere short of the line $dbh->commit(), the uncommitted changes are rolled back and nothing in the target is damaged.
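
      A minimal sketch of that transaction pattern, assuming DBD::Oracle, with connection details, table/column names and data as placeholders:

        use strict;
        use warnings;
        use DBI;

        # Connection details are placeholders. RaiseError makes any failure die,
        # so the eval below catches it before anything is committed.
        my $dbh = DBI->connect( 'dbi:Oracle:mydb', 'user', 'password',
                                { RaiseError => 1, AutoCommit => 0 } );

        my @records = ( [ 1, 'Alice' ], [ 2, 'Bob' ] );   # dummy data for the sketch

        eval {
            my $sth = $dbh->prepare('INSERT INTO target_table (id, name) VALUES (?,?)');
            $sth->execute(@$_) for @records;
            $dbh->commit;          # nothing is visible in the target until this line
        };
        if ($@) {
            warn "load failed, rolling back: $@";
            $dbh->rollback;        # target left exactly as it was before the run
        }
        $dbh->disconnect;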

      Another approach to record deletion/mangling is to allow a configured maximum number of changes in an operation: if the number of changes is greater than x% of the total record set, it stops with an error. You may want to implement something similar.
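
      Something along these lines, say - the subroutine name and the 10% threshold below are just example choices:

        use strict;
        use warnings;

        # Refuse to apply a suspiciously large batch of changes - e.g. a truncated
        # source file showing up as thousands of "deletions". The 10% figure is an
        # example; pick whatever your data's normal churn rate suggests.
        my $max_change_fraction = 0.10;

        sub check_change_volume {
            my ( $pending_changes, $total_records ) = @_;
            die "target appears empty - refusing to guess\n" unless $total_records;
            my $fraction = $pending_changes / $total_records;
            if ( $fraction > $max_change_fraction ) {
                die sprintf "refusing to apply %d changes (%.0f%% of %d records); "
                          . "check the source file before rerunning\n",
                          $pending_changes, 100 * $fraction, $total_records;
            }
            return 1;
        }

        # e.g. check_change_volume( scalar @deletions, $target_row_count );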

      How critical an error is can depend on all sorts of things, but is probably best assessed with a (semi) formal risk assessment. E.g. if an error renders the data in your company phone directory missing or incorrect (I'm using user information as an example because identity management systems are what I'm most familiar with) it makes you look foolish and causes some annoyance. OTOH, if you push that out to the login database, you suddenly have x% of the company sat twiddling their thumbs, unable to log in until you fix it. This costs the company a lot of money, and can cause severe boss's-shoe-induced bruising on your behind.

      There's lots of information around on risk assessment, and really it's the job of the (project) manager, but if you find yourself doing it, you should take into account at least the following:

      • Likelihood of a problem happening
      • Probable impact on the system if it does happen
      • Criticality of the system
      • Likely time to fix (can be included in impact)
      • Cost (in terms of extra coding in your case) of mitigating the risk

      --------------------------------------------------------------

      "If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing."

      John Brunner, "The Shockwave Rider".