in reply to Re^3: Finding Minimum Value
in thread Finding Minimum Value

I might have over-reacted, but, true, you said "think database", not "use database". Sorry if I interpreted wrongly what you said.

My work is to guarantee data consistency between several hundreds applications of the Information System of a Tier One telecommunication operator (cell phones, landlines, Internet, TV, VOIP, VOD, etc.). The various applications run on MVS, AS400, AIX, HP-UX, Solaris, VMS, at least half a dozen distributions of Linux, etc. The database systems used are perhaps even more varied.

Now, how to you compare data from a DB2 database running on MVS with an Oracle database running on AIX or a RMS database running on VMS? As you might imagine, this is not straight forward. In most cases, we just extract flat CSV data files from the various applications and compare those files. There a few cases where we load the data from two applications into the same database (different tables), say Oracle, and compare the tables, using SQL queries or some other mean. But that's usually when we also need to run manual SQL queries to investigate further.

But in most cases, though, this is too inefficient (we have about 35 million customers and the extraction files commonly have tens of gigabytes, sometimes several hundreds of GB). It is most of the time much much faster to simply sort the data from both application according to the same sorting key (using for example the Unix sort utility), and to compare them line by line with Perl to detect duplicates, orphans (that's how we call a piece of data present in one application and not in the other) and discrepancies in the details. We've done a lot of benchmarking. In most cases, the complete comparison method described just above takes less time than just loading the data into a database and constructing the indexes, without having even started comparing anything.

I explained all this just because I wanted to let you know why I sometimes react strongly when someone suggest to use a database for such one-off processes. Sure, it is often easier when the data is small or medium, but for really large data, it simply scales much worse than the sort and compare strategy.

Now, of course, I admit that our activity might not be typical, especially because our activity does not imply persistent data. We compare different databases at a given point in time once a week, or once a month, or according to some other calendars, but we don't need to keep data from one time to another (well, this is sort of a lie or oversimplification, because when we do a comparison, we most of the time compare its results with the results of the previous identical comparison, in order to check that the corrections we submitted previously were duly applied, or, if they were not, to investigate why not).

Again sorry if I overreacted, I hope that my description of my job gave you the reasons for it.

Replies are listed 'Best First'.
Re^5: Finding Minimum Value
by BrowserUk (Patriarch) on Sep 24, 2014 at 01:44 UTC
    In most cases, the complete comparison method described just above takes less time than just loading the data into a database and constructing the indexes, without having even started comparing anything.

    Ah! Someone who understands to pick up the baton :)

    As (seemingly) the lone voice making that exact point for the 10+ years, it's good to see.

    Warning: be careful with your examples. erix has had the delightful habit of proving me wrong if my 'slow DB' examples are poorly chosen.

    Also be(a)ware of Oracle's latest hardware/DB monsters. Not that I would ever use anything that company makes -- assuming I had any influence -- but if you can afford it (the lock-in as much as the price), Oracle's latest hardware/in-memory DB is capable of some pretty amazing performance as they recently demonstrated to an old client of mine.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.