in reply to Catching Cheaters and Saving Memory
I would like to offer a different solution, which may be completely off-topic. Since you are dealing with large datasets, IMHO SQL is the perfect tool here. Moreover, any decent SQL backend is optimized to handle memory constraints as well as query efficiency. For illustration I will use PostgreSQL syntax, but this generally applies to other backends as well.
Suppose you have created a database named test. I would start by creating one table:
    % cat <<_EOC | psql test
    CREATE TABLE rawdata (
        uid INTEGER,
        thread_id INTEGER,
        voted INTEGER DEFAULT 0
    );
    _EOC
Then I would populate it like this:
    % perl -nle 'printf "INSERT INTO rawdata VALUES (%d, %d, %d);\n", split' < data.txt | psql test
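For a large data.txt, one INSERT per row can be slow, so it may be worth feeding the rows through a single COPY statement instead. The following is only a sketch under assumptions: gen_copy is a hypothetical helper name, and data.txt is assumed to hold three whitespace-separated integers per line (uid, thread_id, voted), which is what the printf above implies. Unlike the Perl one-liner, it silently skips lines that do not match that shape.

```shell
# Hypothetical helper: read "uid thread_id voted" lines on stdin and print
# one COPY statement followed by its inline, tab-separated data.
gen_copy() {
    printf 'COPY rawdata (uid, thread_id, voted) FROM STDIN;\n'
    # Keep only lines with exactly three integer fields; malformed lines are dropped.
    awk 'NF == 3 && $1 ~ /^-?[0-9]+$/ && $2 ~ /^-?[0-9]+$/ && $3 ~ /^-?[0-9]+$/ {
        printf "%s\t%s\t%s\n", $1, $2, $3
    }'
    # A line containing only backslash-dot ends the inline COPY data.
    printf '\\.\n'
}

# Usage (assumes the database "test" and the table rawdata already exist):
#   gen_copy < data.txt | psql test
```

The output is plain text, so it can be inspected with a pager before it is piped into psql.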
Now, to the fun part:
    % cat <<_EOSQL | psql test
    -- Use a transaction so all temporary views are destroyed after rollback.
    BEGIN TRANSACTION;

    -- Create view with voting history per account
    CREATE VIEW vote_histogram AS
        SELECT t1.uid AS uid, t2.uid AS voted_for, sum(t1.voted) AS count
        FROM rawdata AS t1, rawdata AS t2
        WHERE t1.thread_id = t2.thread_id AND t1.uid != t2.uid
        GROUP BY t1.uid, t2.uid
        ORDER BY t1.uid;

    -- Create view of suspected accounts whose votes for others exceed the set threshold
    CREATE VIEW suspects AS
        SELECT uid, voted_for, count
        FROM vote_histogram
        WHERE count > $THRESHOLD;

    -- Build cross-referenced view of accounts who vote for each other very often
    SELECT s.uid AS suspect, h.uid AS voted_for,
           s.count AS suspect_votes, h.count AS returned_votes
    FROM suspects AS s, vote_histogram AS h
    WHERE s.voted_for = h.uid AND s.count <= h.count
    ORDER BY s.uid, s.count DESC;

    ROLLBACK;
    _EOSQL
Voila! You get your suspects in a nicely formatted table. Of course, you can limit the number of rows displayed as well as adjust the THRESHOLD parameter. I wrapped the SQL statements in a transaction so that all the temporary objects are destroyed automatically once the process is finished. However, if you have a lot of space available, you can let them live a bit longer so that you can run different queries against them.
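One detail worth spelling out: because the here-document delimiter is unquoted, the shell substitutes $THRESHOLD before psql ever sees the text, so an unset variable yields the invalid SQL "WHERE count > ;". A minimal sketch of one way to guard against that and to add a row limit follows. The name suspects_sql and the default limit of 20 are assumptions for illustration, not part of the original script; the output is meant to replace the last two statements inside the same transaction, after vote_histogram has been created.

```shell
# Hypothetical helper: print the "suspects" view and the final query for a
# given threshold and optional row limit (default 20, an arbitrary choice).
suspects_sql() {
    threshold=$1
    limit=${2:-20}
    # Accept only plain non-negative integers, so nothing unexpected is
    # substituted into the SQL text.
    for n in "$threshold" "$limit"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "usage: suspects_sql THRESHOLD [LIMIT]" >&2
                return 1
                ;;
        esac
    done
    # Unquoted delimiter: $threshold and $limit are expanded by the shell.
    cat <<_EOSQL
CREATE VIEW suspects AS
    SELECT uid, voted_for, count
    FROM vote_histogram
    WHERE count > $threshold;

SELECT s.uid AS suspect, h.uid AS voted_for,
       s.count AS suspect_votes, h.count AS returned_votes
FROM suspects AS s, vote_histogram AS h
WHERE s.voted_for = h.uid AND s.count <= h.count
ORDER BY s.uid, s.count DESC
LIMIT $limit;
_EOSQL
}

# Usage: print the statements for a threshold of 5, showing at most 10 rows.
#   suspects_sql 5 10
```

Generating the text separately also makes it easy to check exactly what will be sent to the database before running it.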
BR
Replies are listed 'Best First'.
Re^2: Catching Cheaters and Saving Memory
by BrowserUk (Patriarch) on Oct 13, 2006 at 08:58 UTC
by tilly (Archbishop) on Oct 13, 2006 at 18:32 UTC
by BrowserUk (Patriarch) on Oct 13, 2006 at 19:29 UTC
by tilly (Archbishop) on Oct 13, 2006 at 19:55 UTC
by caelifer (Scribe) on Oct 15, 2006 at 05:50 UTC