in reply to Automating data loading into Oracle

Wow! I have joked about this kind of thing, but never thought anyone would actually consider it.

Where I work, we use Informatica for our ETL. It works quite nicely, and fairly efficiently.
A year or so ago, my team decided that the current set of marts that had evolved was not truly a data warehouse, and that it was time to design a true data warehouse that could more efficiently produce the sorts of marts that more people in our company would find useful.
It was at *that* time that the question of doom was laid upon us: "What tools do you need to make your project successful?"

We set out on a quest: to find the tools that would make us successful. With a consultant in tow to help us, we came up with some criteria.
* Parallel execution
* Compatibility with an external scheduler (we use CA7)
* Changed Data Capture (with DB2)
* Integrated data profiler
* Integrated metadata gathering and repository

We weighted the criteria and evaluated several components. We arrived at the decision that Ab Initio was the best tool in most categories.
We put our recommendations together in a presentation and went to management, who promptly said, "That's too expensive."

It was at that point that I joked (out of earshot of the VP of course) that we should write our own ETL with Perl.

Let me explain the criteria that we used, in hopes of helping you avoid some of our problems.

First, and most importantly for us, was the matter of Metadata. The existing marts were developed for a small subset of the company, which was originally a separate company purchased by a much larger entity. They worked fine for the smaller company, but had serious problems when the information was to be presented to other parts of the larger company.

Metadata was proposed as a stepping stone from effectiveness for the subset to effectiveness for the superset. If we could document the business rules used to collect and present the data in the marts, we would have the information we needed either to convert or explain the data in the mart for general consumption, or the ammunition needed to justify re-development of the mart for a wider audience.

Without a metadata plan in place, we spent a lot of time researching what had been done to the data on its way to the mart so we could explain why it didn't look the way the business partners expected.
We liked Ab Initio's Metadata plan a bit better than Informatica's because the metadata model was more accessible and extensible, and the level of detail was greater.

Ab Initio's parallel execution is better integrated into the tool than Informatica's. Informatica offers parallelism at the server level; that is, if you want parallelism, you run on more than one server. Ab Initio, by contrast, builds parallelism into the maps/graphs/programs themselves.
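
Since the whole thread is about rolling your own ETL in Perl: if you do go that route, coarse-grained parallelism is not hard to bolt on yourself. Here is a minimal sketch using Parallel::ForkManager from CPAN; the table names and the load_table() routine are made up for illustration.

    # Run one loader process per target table, a few at a time.
    use strict;
    use warnings;
    use Parallel::ForkManager;

    sub load_table {                  # stand-in for the real extract/transform/load
        my ($table) = @_;
        print "loading $table in pid $$\n";
    }

    my @tables = qw(customers orders order_lines);   # illustrative names
    my $pm     = Parallel::ForkManager->new(3);      # at most 3 loads at once

    for my $table (@tables) {
        $pm->start and next;          # parent: spawn a child and move on
        load_table($table);           # child: do the work for this table
        $pm->finish;                  # child exits
    }
    $pm->wait_all_children;           # parent waits for every load to finish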

Both tools offer Changed Data Capture, which for us is essential because there are applications on the source system that allow changes to the data without requiring the change timestamp to be updated, as well as users with direct access to the data making unaudited changes. Both tools use DB2's standard log interface to search for change entries on the requested tables.
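
If you have to fake Changed Data Capture by hand in Perl (i.e. without a log reader), one common fallback is a snapshot compare: checksum each row by key and diff against the previous run. This is only a sketch, and the DSN, table, and columns are placeholders; it is not the DB2 log interface the commercial tools use, and it costs a full read of the source table on every run.

    use strict;
    use warnings;
    use DBI;
    use Digest::MD5 qw(md5_hex);
    use Storable qw(retrieve nstore);

    # Placeholder connection and table -- adjust for the real source.
    my $dbh = DBI->connect('dbi:DB2:SAMPLE', 'user', 'pass', { RaiseError => 1 });

    my %current;
    my $sth = $dbh->prepare('SELECT cust_id, name, status FROM customers');
    $sth->execute;
    while (my ($id, @rest) = $sth->fetchrow_array) {
        $current{$id} = md5_hex(join "\0", map { defined $_ ? $_ : '' } @rest);
    }

    # Previous run's checksums, if any.
    my $previous = -e 'customers.snapshot' ? retrieve('customers.snapshot') : {};

    my @changed = grep { !exists $previous->{$_} || $previous->{$_} ne $current{$_} }
                  keys %current;
    my @deleted = grep { !exists $current{$_} } keys %$previous;

    print "changed or new keys: @changed\n";
    print "deleted keys: @deleted\n";

    nstore \%current, 'customers.snapshot';   # baseline for the next run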

Finally, the last two criteria, scheduling via external tools and profiler integration, were fairly even. Ab Initio generates Korn shell scripts that can be scheduled by just about anything. Informatica has similar hooks, but they are not as accessible as a Korn shell script. Both tools also have data profiling tools available, but Ab Initio's seem to run better.
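
For what it's worth, a hand-rolled Perl job can be made just as schedulable: CA7, cron, and friends mostly care about the exit status. A rough sketch, where run_load() is a stand-in for the real work and the messages are illustrative:

    use strict;
    use warnings;
    use POSIX qw(strftime);

    sub run_load {                 # stand-in for the real extract/transform/load
        # die on failure; return the row count on success
        return 42;
    }

    my $stamp = strftime('%Y-%m-%d %H:%M:%S', localtime);
    my $rows  = eval { run_load() };

    if (defined $rows) {
        print "$stamp load OK, $rows rows\n";
        exit 0;                    # the scheduler sees success
    }
    warn "$stamp load FAILED: $@";
    exit 1;                        # non-zero so the scheduler flags the job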

I guess my main point in all of this is that ETL itself is not the biggest issue one faces when doing data warehousing; just about any tool can move data effectively from one platform to another. Metadata, when addressed from the outset, can impact your success far more, and it is directly affected by your choice of ETL tools.

Re^2: Automating data loading into Oracle
by punkish (Priest) on Oct 24, 2005 at 22:24 UTC
    No jokes here. This is serious. The tool is already chosen -- the client wants Shell/Perl (I am the one who is whispering "shell" and raising my volume when saying "Perl").

    Fortunately, there are resources available to gather the metadata, do an assessment, and study all the existing data flows, sources, transformations, and targets. So I do have some say in determining the effort. It is hard, though, to guesstimate how much time a particular task will take, so all my numbers right now are very SWAG-ish.

    --

    when small people start casting long shadows, it is time to go to bed
      When you say "resources available to gather the metadata", what exactly do you mean?

      Do you mean that someone is going in after you write the transformations in Perl and translating that into English? Or are you simply referring to the existing process that Perl is replacing?

      How are you capturing the metadata about the transformations you are putting the data through? In Informatica, for instance, the transformations are saved in an RDBMS, but still in their code form. If I want English translations for those transformations, I have to do that myself; plus, in my Metadata Repository, I want the English explanations from the requirements document that drove the development of the code.

      In order for all of that to make sense across anything other than a small company, I need to capture definitions of terms, and who defined the terms and by what authority.
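
      If the transformations end up as hand-written Perl anyway, one (sketchy) option is to keep the plain-English rule, who defined it, and the code that implements it in one structure, so none of that has to be reverse-engineered later. Everything in this example -- the rule, the column names, the owner -- is made up:

          use strict;
          use warnings;

          # Sketch: keep the business-rule wording, its owner, and the coderef
          # that implements it together, so the repository can be dumped for
          # the business side without decoding the transformation code.
          my %transformations = (
              order_status => {
                  description => 'Map legacy status codes to the corporate set; '
                               . 'anything unrecognized becomes UNKNOWN.',
                  defined_by  => 'Finance BA team (illustrative)',
                  code        => sub {
                      my ($row) = @_;
                      my %map = ( O => 'OPEN', C => 'CLOSED', X => 'CANCELLED' );
                      $row->{status} = $map{ $row->{status} } // 'UNKNOWN';
                      return $row;
                  },
              },
          );

          # The same structure doubles as plain-English documentation:
          for my $name (sort keys %transformations) {
              my $t = $transformations{$name};
              print "$name: $t->{description} (defined by $t->{defined_by})\n";
          }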

      There is also metadata generated by the process of ETL itself: when the last load ran, was it complete, was it successful, how many rows did it load, was it on time, and if it wasn't complete, why...
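
      A hand-rolled Perl load can capture that kind of operational metadata with a simple audit table it writes on every run. A rough sketch -- the DSN, table, columns, and values are all placeholders:

          use strict;
          use warnings;
          use DBI;

          # Placeholder connection and table name -- adjust for the real warehouse.
          my $dbh = DBI->connect('dbi:Oracle:DW', 'etl', 'secret', { RaiseError => 1 });

          sub audit_run {
              my (%run) = @_;
              $dbh->do(
                  'INSERT INTO etl_run_audit
                       (job_name, started_at, ended_at, rows_loaded, status, note)
                   VALUES (?, ?, ?, ?, ?, ?)',
                  undef,
                  @run{qw(job started ended rows status note)},
              );
          }

          # e.g. called once per load, whether it succeeded or died.
          # Timestamps are passed as strings here for brevity; a real audit
          # table would use DATE columns and TO_DATE (or bound DATE values).
          audit_run(
              job     => 'load_customers',
              started => '2005-10-24 02:00:00',
              ended   => '2005-10-24 02:07:13',
              rows    => 12843,
              status  => 'OK',
              note    => undef,
          );
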
        When you say "resources available to gather the metadata", what exactly do you mean? Do you mean that someone is going in after you write the transformations in Perl and translating that into English? Or are you simply referring to the existing process that Perl is replacing?
        I mean the latter... understanding and "decoding" the current procedures (that someone else wrote... they are mainly DTS packages and the information inside one guy's head) into plain English.

        Once I recreate the ETL processes in Perl, they will all be well documented, and I will leave a conspicuous log trail. So, together that will make for good metadata anyway.

        --

        when small people start casting long shadows, it is time to go to bed