Re: Automating data loading into Oracle
by g0n (Priest) on Oct 22, 2005 at 09:51 UTC
I do this sort of thing quite a lot, but generally using perl/java/javascript as a component within a framework like Maxware DSE or Critical Path IMD. One thing you might want to consider (depending on your speed requirements) is converging the data into a single consolidated location before pushing it out to the various tables in Oracle.
Gotchas to look out for:
- Authority of data. If you are writing to a single column or record from several data sources, which one has priority?
- Mismatch in column length between data source & data target - get a list of columns for source and target and compare them (there's a rough sketch of this below, after the list)
- Data type mismatches - perhaps a column that has been defined in the target as NUM, but there's some rogue data in the source containing strings, or the target is BOOL and the source is a single character VARCHAR that's been populated with a mixture of Y/y/N/n/0/1 etc.
- How to detect and handle deletion of records/column entries
- Logging: make sure your day to day logging contains enough information to identify a problem record in the source data.
- What happens if your script(s) barf in the middle of the job and end unexpectedly?
- Character encoding & international character sets: do you need to handle them, and if so which ones and how
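On the column length and type mismatches above, a rough DBI comparison looks something like this (untested sketch - it assumes both DBD drivers implement column_info(), and the DSNs, credentials and table names are placeholders):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Untested sketch: compare column definitions between source and target.
# Assumes both DBD drivers implement column_info(); DSNs, credentials and
# table names are placeholders.
my $src = DBI->connect('dbi:ODBC:source_dsn', 'user', 'pass', { RaiseError => 1 });
my $tgt = DBI->connect('dbi:Oracle:target_db', 'user', 'pass', { RaiseError => 1 });

sub columns_of {
    my ($dbh, $table) = @_;
    my $sth = $dbh->column_info(undef, undef, $table, '%');
    my %cols;
    while (my $row = $sth->fetchrow_hashref) {
        $cols{ uc $row->{COLUMN_NAME} } = {
            type => $row->{TYPE_NAME},
            size => $row->{COLUMN_SIZE},
        };
    }
    return \%cols;
}

my $src_cols = columns_of($src, 'EMPLOYEES_IN');
my $tgt_cols = columns_of($tgt, 'EMPLOYEES');

for my $name (sort keys %$src_cols) {
    my $s = $src_cols->{$name};
    my $t = $tgt_cols->{$name};
    if (!$t) {
        warn "$name exists in source but not in target\n";
        next;
    }
    warn "$name type differs: $s->{type} vs $t->{type}\n"
        if $s->{type} ne $t->{type};
    warn "$name may truncate: source size $s->{size}, target size $t->{size}\n"
        if ($s->{size} || 0) > ($t->{size} || 0);
}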
Testing:
- I'd suggest using Test modules, and comparing the resulting output data with what you are expecting (see the sketch after this list). Use something like Devel::Cover to make sure your tests cover all possible transformations.
- If at all possible, take a copy of live data to test against, so when you come to run in live there are no nasty surprises waiting for you. When constructing artificial test data it's always tempting to use 'test user 1', 'test user 2' etc, which may not test every scenario. To give a simple example, simplistic test data could leave character quoting problems lying in wait if you have live users with names like "O'Connor". Probably the easiest way, since you are replacing an existing procedure, is to take a copy of live source and target data, run your scripts to transform the data, and compare results with the live target data.
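For the comparison itself, something along these lines with Test::More is usually enough (a sketch - transform_record() and the sample data stand in for your real transformation code and a copy of the live target):

#!/usr/bin/perl
use strict;
use warnings;
use Test::More;

# Sketch only: transform_record() and the two data sets stand in for your
# real transformation code and a copy of the live target data.
my @source   = ( { name => "O'Connor", active => 'y' } );
my @expected = ( { NAME => "O'Connor", ACTIVE => 1 } );

sub transform_record {
    my ($rec) = @_;
    return {
        NAME   => $rec->{name},
        ACTIVE => ( $rec->{active} =~ /^[Yy1]$/ ? 1 : 0 ),
    };
}

for my $i (0 .. $#source) {
    is_deeply(transform_record($source[$i]), $expected[$i],
        "record $i transforms as expected");
}

done_testing();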
Data::Sync was written as an alternative to commercial tools that do this sort of job, but it's quite new and might as yet be too basic for your needs.
One trick I've occasionally used for complex transformations is the following:
read from source to temporary database
for (defined operations)
{
    read from temp database x
    transform
    write to temp database y
}
write to target
Although obviously that has a significant impact on the speed of the whole operation.
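In Perl/DBI terms the sketch above might look roughly like this (illustrative only - it assumes an SQLite scratch database for the staging tables, and the table names and transform subs are made up):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Illustrative only: everything lands in a scratch SQLite database first,
# each defined operation reads one staging table and writes the next, and
# only the final result goes to the real target. Table names and transform
# subs are made up.
my $stage = DBI->connect('dbi:SQLite:dbname=stage.db', '', '',
    { RaiseError => 1, AutoCommit => 1 });

# 1. read from source into the first staging table (loading code omitted)

# 2. run each defined operation against the previous staging table
my @operations = (
    { from => 'stage_0', to => 'stage_1', transform => \&normalise_names },
    { from => 'stage_1', to => 'stage_2', transform => \&map_booleans   },
);

for my $op (@operations) {
    my $rows = $stage->selectall_arrayref("SELECT * FROM $op->{from}",
        { Slice => {} });
    for my $row (@$rows) {
        my $out  = $op->{transform}->($row);
        my @cols = sort keys %$out;
        my $sql  = sprintf 'INSERT INTO %s (%s) VALUES (%s)',
            $op->{to}, join(',', @cols), join(',', ('?') x @cols);
        $stage->do($sql, undef, @{$out}{@cols});
    }
}

# 3. write the last staging table out to the real target (omitted)

sub normalise_names { my ($r) = @_; $r->{NAME}   = uc $r->{NAME};                     return $r }
sub map_booleans    { my ($r) = @_; $r->{ACTIVE} = $r->{ACTIVE} =~ /^[Yy1]/ ? 1 : 0;  return $r }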
--------------------------------------------------------------
"If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing."
John Brunner, "The Shockwave Rider".
This is good! I will check out Data::Sync, but I might have to stick to DBI, various logging modules, and testing modules.
One important point you bring up is "recovery." What _should_ I do if something goes kerbonk? Do I log an error and merrily go on doing what I was doing? Do I ring an alarm bell? Do I abort that task?
Of course, logging is going to be very important, not just to cover my arse but also to help the client cover theirs.
Speed is not that critical a factor, and getting all data in one place may also be do-able.
--
when small people start casting long shadows, it is time to go to bed
What you do about your script keeling over depends on a number of things: what stage it's at when the error occurs, how critical the target data consistency is, whether you're implementing deletions, etc.
Two examples of things that can go horribly, horribly wrong:
- You have an application which assumes that field X in a source file (e.g. email address) is unique. Your application therefore uses that as a primary key in your consolidation and transformations. Your application also (for safety's sake) doesn't delete entries that are no longer in the file. The supplier of the file makes a mistake in generation and supplies you with a file with an extra field at the beginning of each record. Bang! Your data is now all over the place (this has happened on several projects I've been on).
- Conversely (and this has happened to me several times as well) you have implemented record deletion - if a record is no longer in the file, you delete it. So one day the application generating the source file dies partway through, you get a partial file, and you end up deleting a chunk of the data in the target. The cold thrill you get when you suddenly notice your target database is a tenth the size it should be is quite memorable.
The way I've got around the former in the next version of Data::Sync is to implement a 'validation' method. You can specify a pattern match for every field - before writing, it goes through the candidate record set, checking that every field in every record matches its pattern. If anything fails, it doesn't write. Implementing a check like that is one of the reasons I suggested pulling everything together into a single consolidated view before writing to the target(s).
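Outside Data::Sync you can hand-roll the same check before writing anything. A minimal sketch (field names, patterns and sample records are just examples):

#!/usr/bin/perl
use strict;
use warnings;

# Minimal sketch of a pre-write validation pass: nothing gets written unless
# every field of every candidate record matches its pattern. Field names,
# patterns and sample records are just examples.
my %valid = (
    EMPLOYEE => qr/^\d+$/,
    MAIL     => qr/^[^\@\s]+\@[^\@\s]+$/,
);

my @candidate_records = (
    { EMPLOYEE => '1001', MAIL => 'jo@example.com' },
    { EMPLOYEE => '1002', MAIL => 'not-an-address' },   # this one will fail
);

my $ok = 1;
for my $rec (@candidate_records) {
    for my $field (keys %valid) {
        next if defined $rec->{$field} && $rec->{$field} =~ $valid{$field};
        warn "validation failed on $field for employee $rec->{EMPLOYEE}\n";
        $ok = 0;
    }
}

die "refusing to write: candidate data failed validation\n" unless $ok;
# ...otherwise carry on and write to the target here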
Setting AutoCommit => 0 on your DBI connection might be useful for this - if your script dies anywhere short of the $dbh->commit call, the uncommitted changes are rolled back and nothing is damaged.
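In DBI terms that looks something like this (a sketch - the DSN, credentials, table and sample data are placeholders):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Sketch: run the whole load as one transaction so a mid-job failure leaves
# the target untouched. DSN, credentials, table and sample data are
# placeholders.
my @records = ( { id => 1, name => "O'Connor" } );   # stand-in for the consolidated data

my $dbh = DBI->connect('dbi:Oracle:target_db', 'user', 'pass',
    { RaiseError => 1, AutoCommit => 0 });

eval {
    my $sth = $dbh->prepare('INSERT INTO target_table (id, name) VALUES (?, ?)');
    $sth->execute($_->{id}, $_->{name}) for @records;
    $dbh->commit;    # only reached if every insert worked
};
if ($@) {
    my $err = $@;
    eval { $dbh->rollback };   # explicit, though a dropped session rolls back anyway
    die "load failed, nothing committed: $err";
}

$dbh->disconnect;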
Another approach to record deletion/mangling is to allow a configured maximum number of changes in an operation. If the number of changes is greater than x% of the total record set, it stops with an error. You may want to implement something similar.
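A rough version of that sanity check (the threshold and counts are placeholders for whatever your diff of source against target produces):

#!/usr/bin/perl
use strict;
use warnings;

# Sketch: refuse to run if this load would change more than a configured
# fraction of the target. The counts are placeholders for whatever your
# diff of source against target produces.
my $max_change_fraction = 0.10;      # allow at most 10% of records to change

my $total_records   = 50_000;        # rows currently in the target
my $pending_changes = 12_000;        # adds + updates + deletes this run would make

if ($total_records && $pending_changes / $total_records > $max_change_fraction) {
    die sprintf "refusing to run: %d changes against %d records exceeds the %d%% threshold\n",
        $pending_changes, $total_records, $max_change_fraction * 100;
}
# ...otherwise go ahead and apply the changes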
How critical an error is can depend on all sorts of things, but is probably best assessed with a (semi) formal risk assessment. E.g. if an error renders the data in your company phone directory missing or incorrect (I'm using user information as an example because identity management systems are what I'm most familiar with), it makes you look foolish and causes some annoyance. OTOH, if you push that out to the login database, you suddenly have x% of the company sat twiddling their thumbs, unable to log in until you fix it. That costs the company a lot of money, and can cause severe boss's-shoe-induced bruising on your behind.
There's lots of information around on risk assessment, and really it's the job of the (project) manager, but if you find yourself doing it, you should take into account at least the following:
- Likelihood of a problem happening
- Probable impact on the system if it does happen
- Criticality of the system
- Likely time to fix (can be included in impact)
- Cost (in terms of extra coding in your case) of mitigating the risk
Re: Automating data loading into Oracle
by pg (Canon) on Oct 22, 2005 at 03:53 UTC
I did something similar before. Try to delegate as much as possible to Oracle SQL*Loader, and write as little code as you can.
SQL*Loader is a great tool for bulk loading data into an Oracle database, and it helps you migrate data from the file system or from other database systems.
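For example, you can keep the Perl down to generating a control file and invoking sqlldr (a sketch - the table, columns, file names and connect string are placeholders):

#!/usr/bin/perl
use strict;
use warnings;

# Sketch: let SQL*Loader do the bulk load; Perl only generates the control
# file and invokes sqlldr. Table, columns, file names and connect string
# are placeholders.
my $ctl_file = 'employees.ctl';

open my $ctl, '>', $ctl_file or die "cannot write $ctl_file: $!";
print {$ctl} <<'END_CTL';
LOAD DATA
INFILE 'employees.csv'
APPEND
INTO TABLE employees
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
( employee_id, surname, given_name, mail )
END_CTL
close $ctl or die "cannot close $ctl_file: $!";

my $rc = system 'sqlldr', 'user/pass@target_db',
    "control=$ctl_file", 'log=employees.log', 'bad=employees.bad';
die "sqlldr failed, check employees.log and employees.bad\n" if $rc != 0;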
Re: Automating data loading into Oracle
by poqui (Deacon) on Oct 24, 2005 at 18:42 UTC
Wow! I have joked about this kind of thing, but never thought anyone would actually consider it.
Where I work, we use Informatica for our ETL. It works quite nicely, and fairly efficiently.
A year or so ago, my team decided that the current set of marts that had evolved were not truly a data warehouse, and that it was time to design a true data warehouse that could more efficiently produce the sorts of marts that more people in our company would find useful.
It was at *that* time that the question of doom was laid upon us: "What tools do you need to make your project successful?"
I guess my main point in all of this is that ETL itself is not the biggest issue one faces when doing data warehousing; just about any tool can move data effectively from one platform to another. Metadata, when addressed from the outset, can impact your success much more, and is directly affected by your choice of ETL tools.
No jokes here. This is serious. The tool is already chosen -- the client wants Shell/Perl (I am the one who is whispering "shell" and raising my volume when saying "Perl").
Fortunately, there are resources available to gather the metadata, do an assessment, and study all the existing data flows, sources, transformations, and targets. So I do have some say in determining the effort. It is hard, though, to guesstimate how much time a particular task will take, so all my numbers right now are very SWAG-ish.
When you say "resources available to gather the metadata", what exactly do you mean?
Do you mean that someone is going in after you write the transformations in perl and translating that into English? Or are you simply referring to the existing process that perl is replacing?
How are you capturing the metadata about the transformations you are putting the data through? In Informatica, for instance, the transformations are saved in an RDBMS, but still in their code form. If I want English translations for those transformations, I have to do that myself; plus, in my Metadata Repository, I want the English explanations from the requirements document that drove the development of the code.
In order for all of that to make sense across anything other than a small company, I need to capture definitions of terms, and who defined the terms and by what authority.
There is also metadata generated by the process of ETL itself: when the last load ran, whether it was complete, whether it was successful, how many rows it loaded, whether it was on time, and if it wasn't complete, why...
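Capturing that run-level metadata can be as simple as an audit table that the load script writes to at the start and end of every run. A sketch (the ETL_RUN_LOG table, the ETL_RUN_SEQ sequence and the column names are made up for illustration):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Sketch: record per-run ETL metadata (start/end time, row count, outcome)
# in an audit table. DSN, credentials, the ETL_RUN_LOG table and the
# ETL_RUN_SEQ sequence are made up for illustration.
my $dbh = DBI->connect('dbi:Oracle:target_db', 'user', 'pass',
    { RaiseError => 1, AutoCommit => 1 });

my ($run_id) = $dbh->selectrow_array('SELECT etl_run_seq.NEXTVAL FROM dual');
$dbh->do('INSERT INTO etl_run_log (run_id, job_name, started_at, status)
          VALUES (?, ?, SYSDATE, ?)',
    undef, $run_id, 'employee_load', 'RUNNING');

my ($rows_loaded, $status) = (0, 'FAILED');
eval {
    # ... the actual load goes here, incrementing $rows_loaded ...
    $status = 'COMPLETE';
};
my $error = $@ || '';

$dbh->do('UPDATE etl_run_log
             SET finished_at = SYSDATE, rows_loaded = ?, status = ?, error_text = ?
           WHERE run_id = ?',
    undef, $rows_loaded, $status, $error, $run_id);

$dbh->disconnect;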