Giant proxy DBD for multiple databases

baku has asked for the wisdom of the Perl Monks concerning the following question:

Long have I travailled in the world of MVS, and I return with stupid questions, seeking enlightenment.

This is a very long question about SQL::Statement and doing a JOIN across multiple databases, so I'm tacking a READMORE here...

Here's the troubles: We have a set of disparate (geographically, logically, operating system, file structure) sets of files stored in ~50 systems, almost exclusively MVS. (We're looking at a lot of VSAM sets, which are similar to Berkeley DB, i.e. if you know the key, getting the value is easy, but if you want "column 4," you're going to have to scan the entire table; also some M204 stuff, which the NSA developed and nobody uses...) Our job is to put a query interface that can traverse multiple sources as if they were all in one big, happy database.

Think of it like:

SELECT a.users.name,c.users.year_of_birth
  FROM vsam://mvs.a.biz/USERDATA.APPNAME.FILEFMMF(XXXXXXXX)
         AS a,
       m204://mvs.b.biz/USERDATA.APPNAME.FILEFMMF(XXXXXXXX)
         AS b,
       db2mvs://mvs.c.biz/DB2/APPNAME/FILEFMMF/XXXXXXXX
         AS c
 WHERE     a.users.id = b.users.id
       AND b.users.x  = c.users.x
[download]

(Yes, not real SQL, but just imagine.) In fact, we can expect just table.column_name syntax coming in, with an explicit mapping of table => source/(table|file).

I told you there was a question coming, right? Disclaimer: I am not married to the idea of using Perl exclusively here... in fact, being able to present an alternative solution alongside Perl would probably bolster the case for using Perl to the bosses and customer both. (Both parties are pretty sure this Unix thing is going to fizzle and we'll all be learning SAS and COBOL in a few years. There's a profound distrust of anything that doesn't require a cleanroom. Thus, the only Unix they see is the AIX tape library attached to the 'frame that takes up 30' of wall space.) However:

It seems to me that this should be possible with SQL::Statement or DBD::AnyData or something similar, and a lot of work creating DBD::AnyData drivers for each of the back-end types (VSAM, flatfile dataset, M204, PDS, ...) that submit a simplified version of the query to the COBOL driver routines on the various hosts. (For reference: COBOL's memory allocation is essentially all static, fixed-length; as a result, it's easiest to parse things out into COBOLese fixed-length records, convert them to EBCDIC, and shove them down a TCP/IP pipe to the very exciting world of CICS+TCP/IP, which is basically a fixed-formatting version of ye olde Berkeley Sockets.) We'll have to create these query interfaces, so I'm not worried about that bit, but DBD::Proxy gives me the willies. Heavy magic.

Also, we have to keep in mind that one query could very easily return many, many gigabytes of data to the middle-level DBD:: to do the JOIN across. That means no in-core processing allowed, we have to stick to the disc for this. Also, parallelizing the query is an absolute necessity.

Am I even looking in the right place? Is there a commercial widget out there that can do the hard bits of this (performing the "join")?

As an aside, I'd love to hear about it if anyone's ever tried accessing MVS datasets from Perl on the 'frame itself (preferably, running in batch-perl under JES/MVS, but even running in a Linux LPAR would be useful.)

For reference, our "small" test dataset is about 100 billion rows in DB/2 on the 'frame.

Missing the *n*x universe more every day I have to reallocate a dataset to allow it to grow...

Baku Kurae

Comment on Giant proxy DBD for multiple databases Select or Download Code

Replies are listed 'Best First'.
Re: Giant proxy DBD for multiple databases by dws (Chancellor) on Mar 08, 2002 at 18:34 UTC
It seems to me that this should be possible with SQL::Statement or DBD::AnyData or something similar, and a lot of work creating DBD::AnyData drivers for each of the back-end types (VSAM, flatfile dataset, M204, PDS, ...) that submit a simplified version of the query to the COBOL driver routines on the various hosts. If I were unable to find something off-the-shelf, subclassing or modifying SQL::Statement the direction I'd go. It's relatively easy to parse an extended SQL grammar. It's harder to build an execution plan that handles heterogenous, distributed data. That's where I suspect you'll be spending the bulk of your time. For reference, our "small" test dataset is about 100 billion rows in DB/2 on the 'frame. I'm saving that line as a reminder to have people quantify what they mean by "small". :)	[reply]
Re: Giant proxy DBD for multiple databases by derby (Abbot) on Mar 08, 2002 at 18:46 UTC
baku, derby scratches his head and tries to remember all that mainframe crap he was forced to learn early in his career. We had a similar set-up (M204, VSAM, MVS, JCL, etc etc). I got damn tired of login into TSO, tweaking some JCL, submitting etc. so I wrote a perl script (4.x) that would (no DBI back then) merge a bunch of JCL, submit it to the mainframe and then wait for the data to be deposited in his nix account. Now I had some major differences from your set-up. First, our M204 database were backed up to VSAM files on a nightly basis and there were existing JCL snippets (what were those called exactly) that could query M204 directly if necessary (rarely was). So having to go against VSAM only (via JCL) simplified things immensely. The second difference was we had snippets of JCL that could download data to a nix box. That was fantastic. My perl script could just spin it's wheels waiting for the file to show up. I would then do some follow on processing in nix. Now for your application, you don't have to necessarily download all the data but you would need something to let you know the newly created dataset was available (which given the size of you're data, you may want to asynch that somehow). Once that temp dataset is there, you could retrieve it in chunks. Sorry if this isn't much help. Just wanted to let you know I feel your pain. I think if you can get as much possible into a VSAM dataset and then use JCL (or perl on the mainframe) you should be able to easily develop a DBD driver for fetching. -derby update:* I also remember being excited about EDA/SQL from ibi but it never made it out of our labs. I think that product has been rolled into a new company iwaysoftware. But I cannot comment one way or another about their products.	[reply]
Re: Re: Giant proxy DBD for multiple databases by baku (Scribe) on Mar 19, 2002 at 14:53 UTC
Our super-COBOL-expert people said we'd have to write custom COBOL to do each query, since the VSAM's are indexed only on the "primary key," like DB_File's, and we need to query against any "key" in the table. We did look into doing a retrieval by FTP but ran into the asynch notify "problem," which we "solved" (on paper) by passing a "job completed code" to the COBOL driver when we began the query, and have it "hit" a TCP port back on our side with this code to indicate completion. `Unix: EBCDIC structured data JOB-ID PIC XXXXXXXXXXXXXXXX SUB-DATE-TIME 9999999999999999 SUB-DATE 99999999 SUB-DATE-YR 9999 SUB-DATE-MO 99 SUB-DATE-DY 99 SUB-TIME 99999999 SUB-TIME-H 99 SUB-TIME-M 99 SUB-TIME-S 99 SUB-TIME-CS 99 REPLY-IP 99999999999999999 REPLY-IP-H1 999 REPLY-IP-H2 999 REPLY-IP-H3 999 REPLY-IP-H4 999 REPLY-IP-PORT 99999 QUERY-1ST-COL 999 QUERY-LAST-COL 999 ... yadda, yadda ...` [download] Really interesting trying to express TCP streams in fixed-length fields like this. Very similar concepts to C's "union"s or "struct"s; but no binary data, because (many/most) interfaces will mangle anything outside of the basic EBCDIC printable character-set. (We had to use "special" functions just to preserve lowercase letters. God only knows what happens to Iñigo and Réné as they go through the mangler... Unfortunately, the contract was yanked at the last minute by the agency offering it. Apparently they thought this entire project would take 900 man-hours to complete and were a mite surprised at a 7-digit bid. Thanks to everyone for your contributions, though!	[reply] [d/l]