in reply to ETL in Perl

I recently assisted on some ETL work using Microsoft’s toolsets ... and soon enough wound up piecing together a little bit of Perl code to work with the underlying (XML-based) definition files of that system.   (I also wound up writing quite a bit of Visual Basic code ... (yuck!) ... to do a great many things that the pure-visual environment could not “quite” do.)

As the process became bigger, it also became more complex, and it became increasingly difficult to sit down and feel like you actually understood what was going to happen when you mashed the Start button.   I felt very uncomfortable with that.   A graph is only comprehensible when it fits on a single uncluttered page.

Maybe I am just an old Luddite, but I really do embrace having source code as the basic way of defining to the computer what I want the computer to do.   I know how to diff such files.   I know how to work with them easily in version-control.

Having said that ... I, too, would like to avoid having to write and maintain “large amounts of code” by hand in any language, including Perl.   I would look for (or build?) some system that allowed me to define the processes, the data relationships and so on, and which would then just-in-time generate the necessary (Perl, of course...) source code.
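As a minimal sketch of that idea (the spec format, names, and steps here are purely illustrative, not any particular tool): a declarative description of a pipeline, plus a generator that emits Perl source just in time and compiles it with string `eval`.

```perl
use strict;
use warnings;

# Hypothetical, illustrative spec format: an ordered list of map/grep steps.
my %spec = (
    name  => 'clean_and_upcase',
    steps => [
        { op => 'grep', code => 'length $_' },  # drop empty rows
        { op => 'map',  code => 'uc $_'     },  # normalize case
    ],
);

# Generate Perl source text from the spec; the caller compiles it with eval.
sub generate_pipeline {
    my ($spec) = @_;
    my @src = ('sub {', '    my @rows = @_;');
    for my $step (@{ $spec->{steps} }) {
        push @src, "    \@rows = $step->{op} { $step->{code} } \@rows;";
    }
    push @src, '    return @rows;', '}';
    return join "\n", @src;
}

my $src      = generate_pipeline(\%spec);
my $pipeline = eval $src or die "bad generated code: $@";
my @out      = $pipeline->('foo', '', 'bar');   # ('FOO', 'BAR')
```

The generated text is itself ordinary Perl source, so it can be dumped to a file, diffed, and checked into version control like anything else written by hand.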

Dispatching the resulting definitions for parallel execution is a different problem, and a relatively easy one to handle “generically.”   (Where I am parked right now, a rather archaic version of Tivoli Workload Scheduler is performing that task quite well.   It smells bad but it works.)   The advantage here is that any sort of workload can be dispatched in this way... “ETL” or otherwise.   It is very limiting if “ETL works this-way but nothing else does,” or when the system that is doing ETL has no way to balance itself against other work that might be going on upon the same machine at the same time.
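The “generically” point can be illustrated with nothing more than fork(): the toy dispatcher below knows nothing about ETL, it just runs any list of jobs (code refs here, though they could as easily exec external commands) with a concurrency cap. This is a sketch of the idea, not a workload scheduler.

```perl
use strict;
use warnings;

# Run @jobs (code refs), at most $max at a time; return the failure count.
# Each child exits 0 if its job returned true. Any workload fits this shape.
sub dispatch {
    my ($max, @jobs) = @_;
    my ($running, $failed) = (0, 0);
    for my $job (@jobs) {
        if ($running >= $max) {           # at capacity: reap one child first
            wait();
            $failed++ if $?;
            $running--;
        }
        defined(my $pid = fork) or die "fork failed: $!";
        if ($pid == 0) {                  # child: run the job, report via exit
            exit($job->() ? 0 : 1);
        }
        $running++;
    }
    while ($running-- > 0) {              # reap the stragglers
        wait();
        $failed++ if $?;
    }
    return $failed;
}

my $failed = dispatch(2, sub { 1 }, sub { 1 }, sub { 0 });
```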

Re^2: ETL in Perl
by runrig (Abbot) on Sep 16, 2010 at 20:24 UTC
    Ditto on the ability to diff and version control the source. I've worked with Informatica, where you work with GUI tools that on the backend store your mappings and workflows in a database (which you really don't want to look at, but sometimes do anyway). There is a built-in version control that sucks, and you cannot easily see the changes you make or the changes between versions. You can export your work as XML, and if you sort and filter it just right (which I did with Perl and XSLT), you can diff the XML files. You miss the ability to grep for something; instead you have to click and click (and click...) to see the bit you want to see, and there are so many levels of places to override settings that it's sometimes a challenge to figure out why something is behaving the way it is. The one (and maybe only) thing I like about it is that you get logging with little to no effort, and (ok, two things) realtime control and monitoring of the processes (in a pretty and fairly useful GUI).
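    For what it's worth, the "sort and filter it just right" step can be surprisingly small for well-behaved exports. The toy normalizer below only sorts the attributes inside each start tag, which is enough to stop attribute-order churn from polluting diffs; the tag and attribute names are made up for illustration, and for real Informatica exports you would want a proper parser such as XML::Twig or XML::LibXML (not used here so the sketch stays dependency-free).

```perl
use strict;
use warnings;

# Toy canonicalizer: sort the attributes inside each start tag so that
# attribute-order differences disappear from diffs. Assumes simple,
# double-quoted attributes; a real tool should use an XML parser.
sub normalize_attrs {
    my ($xml) = @_;
    $xml =~ s{<(\w+)((?:\s+[\w:.-]+="[^"]*")+)\s*(/?)>}{
        my ($tag, $attrs, $slash) = ($1, $2, $3);
        my @pairs = $attrs =~ /([\w:.-]+="[^"]*")/g;
        '<' . $tag . ' ' . join(' ', sort @pairs) . $slash . '>';
    }ge;
    return $xml;
}

my $canon = normalize_attrs('<MAPPING VERSION="1" NAME="m_load"/>');
# '<MAPPING NAME="m_load" VERSION="1"/>'
```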