Given those numbers, it's time to be a little smart about what you need to know.

Since you're looking for cheaters, your first task is to decide at what level a mutually beneficial voting pattern can be attributed to cheating rather than coincidence. For example, if two users have voted once each, in the only thread they have both posted to, you're probably not going to divine that as a pattern of malicious intent.
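One way to make that decision concrete is to require both a minimum amount of shared history and a minimum vote ratio before a pair gets flagged. A minimal sketch: the function name and both cut-off values are hypothetical placeholders, not figures derived from the data.

```perl
use strict;
use warnings;

## Hypothetical cut-offs; tune both against your own data.
use constant {
    MIN_SHARED_THREADS => 10,   ## ignore pairs seen together fewer times than this
    MIN_VOTE_RATIO     => 0.8,  ## fraction of shared threads in which a vote was cast
};

## Returns true only when a pair has enough history AND a high enough ratio.
sub looksSuspicious {
    my( $shared, $votes ) = @_;
    return 0 if $shared < MIN_SHARED_THREADS;
    return ( $votes / $shared ) >= MIN_VOTE_RATIO ? 1 : 0;
}

## One shared thread is never enough; 36 votes in 40 shared threads is.
print looksSuspicious(  1,  1 ) ? "flag" : "ignore", "\n";
print looksSuspicious( 40, 36 ) ? "flag" : "ignore", "\n";
```

Checking the history count first means a single coincidence can never produce a 100% ratio that looks damning.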

With a million users, a billion threads and 5 billion posts, each user will have an average of 5 posts. Set a threshold for the minimum number of votes a user must have cast before you start vetting them for cheating (say, 20 votes), then make your first pass over the data, accumulating vote counts against users. This will require a hash of 1 million numerical scalar values, around 50 MB.

You then delete any entries in the hash that have fewer than your minimum_votes_cast threshold. Set at 20, this is likely to discard 80 or 90% of the users. You can then make your second pass, accumulating a hash of pairs showing for whom each of the qualifying voters cast their votes. As suggested elsewhere, if you use the pairs of userids as the keys in a single hash, and skip any voters that no longer exist in your first-pass hash, then this will likely take less space than the original hash did before the discard step.
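To illustrate why a single hash keyed on the pair is compact: two 32-bit userids pack into one 8-byte string, which serves directly as the hash key and unpacks again when you come to report. A small sketch, assuming userids are unsigned integers that fit in 32 bits; pairKey is a hypothetical helper and the ids are made up.

```perl
use strict;
use warnings;

## Hypothetical helper: build one 8-byte key from two 32-bit userids.
## The order is kept, so ( voter, recipient ) differs from ( recipient, voter ).
sub pairKey { return pack 'LL', @_[ 0, 1 ] }

my %pairs;
++$pairs{ pairKey( 1234, 99 ) } for 1 .. 3;   ## user 1234 voted for user 99 three times
++$pairs{ pairKey( 99, 1234 ) };              ## user 99 voted for user 1234 once

## Recover the two userids from each key when reporting.
for my $key ( keys %pairs ) {
    my( $voter, $recipient ) = unpack 'LL', $key;
    printf "%d -> %d: %d vote(s), key is %d bytes\n",
        $voter, $recipient, $pairs{ $key }, length $key;
}
```

A nested hash of hashes would carry the overhead of one extra hash per voter; the flat packed key avoids that.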

It might look something like this (untested):

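use strict;
use warnings;

## Field indices for each entry in %pairs, plus the two tuning thresholds.
## MIN_VOTES_THRESHOLD uses the figure of 20 suggested above;
## POST_VOTE_THRESHOLD is only a placeholder, to be tuned against your data.
use constant {
    POSTED => 0, VOTED => 1, RATIO => 2,
    MIN_VOTES_THRESHOLD => 20,
    POST_VOTE_THRESHOLD => 0.5,
};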

## Accumulate users only if they voted, and count their votes.
my %users;
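## Assumes the data file is named on the command line.
open BIGFILE, '<', $ARGV[0] or die "Cannot open '$ARGV[0]': $!";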
while( <BIGFILE> ) {
    my( $user, $thread, $voted ) = split;
    ++$users{ $user } if $voted;
}
close BIGFILE;
## Discard all users who have voted fewer times than a sensible threshold.
$users{ $_ } < MIN_VOTES_THRESHOLD and delete $users{ $_ } for keys %users;

## Re-scan the file, accumulating counts of posts and votes 
## *for those userids remaining in %users only*
## Assumes file ordered by threadid.

my %pairs;
## Assumes the data file is named on the command line.
open BIGFILE, '<', $ARGV[0] or die "Cannot open '$ARGV[0]': $!";
my $line = <BIGFILE>;

while( defined $line ) {
    ## Accumulate users/votes in each thread.
    ## Reading stops at the first line of the next thread, or at end of file,
    ## so the final thread is processed like any other.
    my @posters;
    my $thisThread;
    while( defined $line ) {
        my( $user, $thread, $voted ) = split ' ', $line;
        $thisThread = $thread unless defined $thisThread;
        last if $thread != $thisThread;
        push @posters, [ $user, $voted ];
        $line = <BIGFILE>;
    }

    ## Generate every pair of posters in the thread.
    for my $i ( 0 .. $#posters - 1 ) {
        for my $j ( $i + 1 .. $#posters ) {
            my( $user1, $voted1 ) = @{ $posters[ $i ] };
            my( $user2, $voted2 ) = @{ $posters[ $j ] };

            ## Skip if either is not in the 'high voters' list.
            next unless exists $users{ $user1 } and exists $users{ $user2 };

            ## Otherwise increment the coincident pair count (and vote if applicable).
            my $key = pack 'LL', $user1, $user2;
            ++$pairs{ $key }[ POSTED ];
            ++$pairs{ $key }[ VOTED ] if $voted1;
        }
    }
}
close BIGFILE;

## Scan the pairs generating a ratio of votes to posts.
my( $totalRatio, $maxRatio ) = ( 0 ) x 2;
for ( keys %pairs ) {
    my $pair = $pairs{ $_ };
    $pair->[ RATIO ] = ( $pair->[ VOTED ]||0 ) / $pair->[ POSTED ];
    $totalRatio += $pair->[ RATIO ];
    $maxRatio = $pair->[ RATIO ] if $maxRatio < $pair->[ RATIO ];
}

## The average ratio of pairwise votes to posts might form the basis for discrimination.
my $averageRatio = $totalRatio / ( keys( %pairs ) || 1 );

printf "The voted/posted ratios averaged to %f; with a maximum of %f\n",
    $averageRatio, $maxRatio;

## Display those pairs with a vote/post ratio above the threshold.
for ( keys %pairs ) {
    my( $posted, $voted, $ratio ) = @{ $pairs{ $_ } }[ POSTED, VOTED, RATIO ];
    next unless $ratio > POST_VOTE_THRESHOLD;
    printf "Pair %s had %d votes in %d coincident posts, a vote/post ratio of %f\n",
        join( '/', unpack 'LL', $_ ), $voted || 0, $posted, $ratio;
}

On my hardware, the two passes would take around 20 hours. The memory consumed will depend upon the level at which you set MIN_VOTES_THRESHOLD, but should be under 300 MB if you set it at a reasonable level.
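Since the threshold drives the memory use, it may be worth checking how many users survive at a few candidate levels before committing to the long second pass. A rough sketch: the vote counts here are made up purely for illustration, and survivors is a hypothetical helper; in practice you would pass it the hash built by your first pass.

```perl
use strict;
use warnings;

## Made-up vote counts per userid, for illustration only.
my %votes = ( 1 => 3, 2 => 25, 3 => 19, 4 => 140, 5 => 1, 6 => 20 );

## Hypothetical helper: how many users have cast at least $threshold votes?
sub survivors {
    my( $threshold, $counts ) = @_;
    return scalar grep { $_ >= $threshold } values %$counts;
}

## Report the surviving user count at each candidate threshold.
printf "threshold %3d keeps %d of %d users\n",
    $_, survivors( $_, \%votes ), scalar( keys %votes )
    for 5, 20, 50;
```

The fewer users that survive, the fewer pairs the second pass has to hold in memory.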

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: Catching Cheaters and Saving Memory by BrowserUk
in thread Catching Cheaters and Saving Memory by hgolden
