I would like to offer a different solution, which may be completely off-topic. Since you are dealing with large datasets, IMHO SQL is the perfect tool here. Moreover, any decent SQL backend is optimized to handle memory issues as well as query efficiency. For illustration I will use PostgreSQL syntax, but this generally applies to other backends as well.

Suppose you have created a database named test. I would start by creating one table:

% cat <<_EOC | psql test
CREATE TABLE rawdata (
    uid       INTEGER,
    thread_id INTEGER,
    voted     INTEGER DEFAULT 0
);
_EOC

Then I would populate it like this:

% perl -nle 'printf "INSERT INTO rawdata VALUES (%d, %d, %d);\n", split' < data.txt | psql test

Now, to the fun part:

% cat <<_EOSQL | psql test
-- Use a transaction so all temporary views are destroyed after rollback.
-- THRESHOLD is a shell variable expanded by the here-document; set it first.
BEGIN TRANSACTION;

-- Create view with voting history per account
CREATE VIEW vote_histogram AS
    SELECT t1.uid AS uid, t2.uid AS voted_for, sum(t1.voted) AS count
    FROM rawdata AS t1, rawdata AS t2
    WHERE t1.thread_id = t2.thread_id AND t1.uid != t2.uid
    GROUP BY t1.uid, t2.uid
    ORDER BY t1.uid;

-- Create view of suspected accounts whose votes for others exceed the set threshold
CREATE VIEW suspects AS
    SELECT uid, voted_for, count
    FROM vote_histogram
    WHERE count > $THRESHOLD;

-- Build cross-referenced view of accounts who vote for each other very often
SELECT s.uid AS suspect, h.uid AS voted_for, s.count AS suspect_votes, h.count AS returned_votes
    FROM suspects AS s, vote_histogram AS h
    WHERE s.voted_for = h.uid
      AND h.voted_for = s.uid   -- match only the reciprocal pair
      AND s.count <= h.count
    ORDER BY s.uid, s.count DESC;

ROLLBACK;
_EOSQL

Voila! You get your suspects in a nicely formatted table. Of course, you can limit the number of rows displayed as well as adjust the THRESHOLD parameter. I wrapped the SQL statements in a transaction so all the temporary objects are destroyed automatically once the process finishes. However, if you have a lot of space available, you can let them live a bit longer so you can construct different queries on them.
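
If you want to try the idea on a toy dataset without a PostgreSQL server, here is a minimal sketch using Python's built-in sqlite3 module. It is a stand-in under stated assumptions, not the pipeline above: SQLite replaces PostgreSQL, the count column is renamed votes, the threshold is bound as a query parameter instead of being baked into a suspects view (SQLite does not allow parameters inside views), and the cross-reference is assumed to mean the reverse pair (h.voted_for = s.uid). The function name find_suspects and the sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical sample data: (uid, thread_id, voted).
# Users 1 and 2 share three threads; user 3 shares only one with them.
SAMPLE_ROWS = [
    (1, 10, 1), (2, 10, 1), (3, 10, 1),
    (1, 11, 1), (2, 11, 1),
    (1, 12, 1), (2, 12, 1),
]


def find_suspects(rows, threshold):
    """Return (suspect, voted_for, suspect_votes, returned_votes) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rawdata ("
        "uid INTEGER, thread_id INTEGER, voted INTEGER DEFAULT 0)"
    )
    conn.executemany("INSERT INTO rawdata VALUES (?, ?, ?)", rows)

    # Voting history per pair of accounts that appear in the same thread.
    conn.execute("""
        CREATE TEMP VIEW vote_histogram AS
        SELECT t1.uid AS uid, t2.uid AS voted_for, SUM(t1.voted) AS votes
        FROM rawdata AS t1
        JOIN rawdata AS t2 ON t1.thread_id = t2.thread_id
        WHERE t1.uid != t2.uid
        GROUP BY t1.uid, t2.uid
    """)

    # Pairs above the threshold whose votes are returned at least as often.
    result = conn.execute("""
        SELECT s.uid, s.voted_for, s.votes, h.votes
        FROM vote_histogram AS s
        JOIN vote_histogram AS h
          ON h.uid = s.voted_for AND h.voted_for = s.uid
        WHERE s.votes > ? AND s.votes <= h.votes
        ORDER BY s.uid, s.votes DESC
    """, (threshold,)).fetchall()

    conn.close()  # the in-memory database and its views vanish here
    return result


if __name__ == "__main__":
    for row in find_suspects(SAMPLE_ROWS, threshold=2):
        print(row)
```

With the sample rows and a threshold of 2, only the pair of users 1 and 2 should come back, once in each direction. Closing the in-memory connection plays the role of the ROLLBACK: nothing is left behind.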

BR

In reply to Re: Catching Cheaters and Saving Memory by caelifer
in thread Catching Cheaters and Saving Memory by hgolden
