comment on

If your 12 terabyte estimate comes from Re: Catching Cheaters and Saving Memory, ...

No. I hadn't seen your post at that point. It came from

CREATE VIEW vote_histogram AS
  SELECT t1.uid AS uid, t2.uid AS voted_for, sum(t1.voted) AS count
    FROM rawdata AS t1, rawdata AS t2
    WHERE t1.thread_id = t2.thread_id AND t1.uid != t2.uid
    GROUP BY t1.uid, t2.uid
    ORDER BY t1.uid;
[download]

The view contains 3 4-byte fields: uid INTEGER, voted_for INTEGER count INTEGER

So a full cross-product would be ~ 1e6 * 1e6 * ( 3* 4-bytes) = 12 Terabytes.

The where clause will limit the number of rows appearing in the final view, but along the way, (as you point out), most of that 12 TB has to be generated. You're right, depending how well spec'd the server is for memory, it would be possible to perform much of the reduction in ram, but on typical Intel box with a memory limit of 2 or 3GB, there would be a substantial cost in repeated effort doing it in 2GB chunks, as each chunk will require another pass of the raw data.

It will depend upon how the server is configured, but on smaller hardware I could well see the server attempting to trade diskspace for extra passes.

Either way, it would require some very seriously specified (expensive) hardware to be able to run that create view in anything like a reasonable amount of time. Your rent-a-cluster idea would probably be the only option for many small and medium sized businesses.

That said, databases aren't magic, and this problem is one which would show how non-magical they are.

Sit down and brace yourself. We agree :)

Probably my biggest problem with SQL is that it allows people to write apparently simple and straight forward solutions to problems, without being made aware of the scale of the task they encapsulate. Of course, that's good for DB vendors cos when the query takes forever to run, and you call them for advice, they get to (try to) upgrade you to an distributed solution, for which they charge even bigger bucks.

Years ago, before the dominance of the RDBMS, I trialed a new product from DEC called (from memory) DataEase. It was a TUI (terminal UI) build-a-query-by-example thing, backed by (I think) a hierachical DB. One of several nice things about it was that when you'd built your query it went through various steps prompting you to re-order it for efficiency. As a part of that process, it would display a guessimate of how long the query would take to run.

One memorable example of using this (that I think I've mentioned here before), was a query that joined over 7 tables. By simply reversing the order of the joins, the runtime fell from an estimated 69 hours to 5 or 6. The difference being the new ordering meant that the data on disk was process mostly sequencially, rather than leaping all over the diskpack. Modern optimisers probably do this kind of thing automatically, but in some ways that is a shame. Having the numbers shown to you and seeing the effect that various apparently minor changes had upon them was a salient learning experience.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^4: Catching Cheaters and Saving Memory by BrowserUk
in thread Catching Cheaters and Saving Memory by hgolden

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.