Dear monks,

I'm in the early stages of designing a system for collecting data from various disparate sources, processing it, and storing the results in a RDBMS (most likely PostgreSQL, but possibly MySQL). I'm very new to the brave new world of RDBs; I'd appreciate your comments on the following.

The primary sources for this system (which are entirely outside our control) are updated frequently, so we want to have regular (monthly?) updates for our derived database.

I'm debating between two different approaches to this updating. One is to re-build the database from scratch once a month. The other is to do the updates incrementally.

For the first approach, which I find by far the simplest, I was thinking to have a Perl script build an intermediate SQL file, analogous to a DB dump, and then have Pg read that in (I've found that this two-stage method is much faster than building the database directly from a Perl script via DBI+DBD::Pg). The one concern I have about this is that, most of my tables have serial fields that have no intrinsic meaning, and whose sole purpose is to relate tables to each other. Re-building from scratch every month would replace all these fields with new values. In other words, the values of these fields would not be meaningful across builds. Would this be a problem down the line?

The second approach, incremental updates, I find formidable, to be honest. It sounds reasonable at first blush; after all, why rebuild an entire database when only 1% of it has changed since the last update? This reasoning would be on target if the updates were always additions, but that's not the case: updates will also include deletions and modifications. So an incremental update would have to compare each record in the old version with each record in the new one, and somehow detect when records have been deleted. I suppose there may be situations in which this approach, as painstaking as it sounds, may be faster than a clean rebuild, but whatever they are I can't see them in the horizon. Moreover, I don't know how to do this other than through a Perl script, which I expect to be much slower than the two-stage clean rebuild described above. But again, I'm very green at this, so please feel free to bonk me with the clue-by-four.

I realize my questions here are very vague. That's another symptom of my cluelessness on this.

the lowliest monk


In reply to RDB updates: rebuild from scratch vs. incremental by tlm

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.