Re: Catching Cheaters and Saving Memory

Given those numbers, it's time to be a little smart about what you need to know.

Since you're looking for cheaters, your first task is to decide at what level a mutually beneficial voting pattern can be attributed to cheating rather than coincidence. For example, if two users have voted once each, in the only thread they have both posted to, you're probably not going to divine that as a pattern of malicious intent.

With a million users, a billion threads and 5 billion posts, each user will have an average of 5 posts. Set a threshold for the minimum number of votes a user has to have cast before you will start vetting them for cheating--say 20 votes?--then make your first pass of the data, accumulating vote counts against users. This will require a hash of 1 million numerical scalar values, around 50 MB.

You then delete any entries in the hash that have fewer than your minimum_votes_cast threshold. Set at 20, this is likely to discard 80 or 90% of the users. You can then make your second pass, accumulating a hash of pairs showing for whom each of the qualifying voters cast their votes. As suggested elsewhere, if you use the pairs of userids as the keys in a single hash, skipping any voters that no longer exist in your first-pass hash, then this will likely take less space than the original hash before the discard step.

It might look something like this (untested):

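use constant { POSTED => 0, VOTED => 1, RATIO => 2 };

## Thresholds used below. 20 is the example figure from above; the ratio
## cut-off is only a placeholder, to be tuned against the printed average.
use constant { MIN_VOTES_THRESHOLD => 20, POST_VOTE_THRESHOLD => 0.5 };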

## Accumulate users only if they voted, and count their votes.
my %users;
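## One-argument open: the filename is taken from the package variable $BIGFILE.
open BIGFILE or die "Cannot open '$BIGFILE': $!";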
while( <BIGFILE> ) {
    my( $user, $thread, $voted ) = split;
    ++$users{ $user } if $voted;
}
close BIGFILE;
## Discard all users who have voted less times than a sensible threshold.
$users{ $_ } < MIN_VOTES_THRESHOLD and delete $users{ $_ } for keys %users;

## Re-scan the file, accumulating counts of posts and votes 
## *for those userids remaining in %users only*
## Assumes file ordered by threadid.

my %pairs;
open BIGFILE or die "Cannot reopen '$BIGFILE': $!";
my( $user, $thread, $voted ) = split ' ', <BIGFILE>;
my $lastThread = $thread;

while( defined $thread ) {
    my @users;
    while( defined $thread and $thread == $lastThread ) {
        ## Accumulate users/votes in each thread
        push @users, "$user:$voted";
        ## At end of file the read yields nothing, $thread becomes undef,
        ## and both loops finish after the last thread has been processed.
        my $line = <BIGFILE>;
        ( $user, $thread, $voted ) = defined $line ? split( ' ', $line ) : ();
    }
    $lastThread = $thread;

    ## Combine them to generate pairs. Cnr is not defined in this snippet;
    ## it is assumed to return each 2-element combination as an array ref.
    for my $pair ( Cnr( 2, @users ) ) {
        my( $user1, $voted1 ) = split ':', $pair->[ 0 ];
        my( $user2, $voted2 ) = split ':', $pair->[ 1 ];

        ## Skip if either is not in the 'high voters' list
        next unless exists $users{ $user1 } and exists $users{ $user2 };

        ## Otherwise increment the coincident pair count (and vote if applicable).
        my $key = pack 'LL', $user1, $user2;
        ++$pairs{ $key }[ POSTED ];
        ++$pairs{ $key }[ VOTED ] if $voted1;
    }
}

## Scan the pairs generating a ratio of votes to posts.
my( $totalRatio, $maxRatio ) = ( 0 ) x 2;
for ( keys %pairs ) {
    my $pair = $pairs{ $_ };
    $pair->[ RATIO ] = ( $pair->[ VOTED ]||0 ) / $pair->[ POSTED ];
    $totalRatio += $pair->[ RATIO ];
    $maxRatio = $pair->[ RATIO ] if $maxRatio < $pair->[ RATIO ];
}

## The average ratio of pairwise votes to posts might form the basis for discrimination
my $averageRatio = $totalRatio / ( scalar( keys %pairs ) || 1 );

printf "The voted/posted ratios averaged to %f; with a maximum of %f\n",
    $averageRatio, $maxRatio;

## Display those pairs with a vote/post ratio above the threshold.
$pairs{ $_ }[ RATIO ] > POST_VOTE_THRESHOLD
    and print "Pair @{[ unpack 'LL', $_ ]} had posts, votes and vote/post ratio of @{ $pairs{ $_ } }[ POSTED, VOTED, RATIO ]\n"
    for keys %pairs;

On my hardware, the two passes would take around 20 hours. The memory consumed will depend upon the level at which you set MIN_VOTES_THRESHOLD, but should be under 300 MB if you set it at a reasonable level.
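
The listing above calls Cnr without defining it. Assuming it is meant to return every k-element combination of a list as an array reference (an assumption; the post does not say), a minimal stand-in for the k = 2 case might look like the sketch below. The name pairs_of is hypothetical.

```perl
use strict;
use warnings;

## Hypothetical stand-in for Cnr( 2, @list ): returns every unordered
## pair of list elements as a two-element array reference.
sub pairs_of {
    my @items = @_;
    my @pairs;
    for my $i ( 0 .. $#items - 1 ) {
        for my $j ( $i + 1 .. $#items ) {
            push @pairs, [ $items[ $i ], $items[ $j ] ];
        }
    }
    return @pairs;
}

## Three posters in one thread, in the "user:voted" form used above.
my @pairs = pairs_of( '17:1', '42:0', '99:1' );
print join( ' ', @$_ ), "\n" for @pairs;
```

A thread with n posters yields n*(n-1)/2 pairs, so a handful of very large threads can dominate both the run time and the size of %pairs.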

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.

Re^2: Catching Cheaters and Saving Memory
by pengvado (Acolyte) on Oct 16, 2006 at 17:24 UTC

With a million users, a billion threads and 5 billion posts, each user will have an average of 5 posts.

Doesn't that come out to an average of 5000 posts per user? Which would be far above a sane minimum for cheating.
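
The arithmetic behind that objection can be checked directly. The snippet below only divides the figures quoted from the parent post; the variable names are illustrative.

```perl
use strict;
use warnings;

## Figures quoted from the parent post.
my $users   = 1e6;    # a million users
my $threads = 1e9;    # a billion threads
my $posts   = 5e9;    # 5 billion posts

my $posts_per_user   = $posts / $users;
my $posts_per_thread = $posts / $threads;

printf "Average posts per user:   %d\n", $posts_per_user;
printf "Average posts per thread: %d\n", $posts_per_thread;
```

So 5 is the average number of posts per thread, while the per-user average is 5000.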

That not only eliminates much of the filtering (although the distribution of posts per user might still be uneven enough to filter out some users), but it also brings the (non-filtered) number of hash entries up to anywhere between 5e6 (an even 5 posts per thread, and everyone is cheating) to 2.5e10 (an even 5 posts per thread, but no user pair is repeated) to 1e12 (some threads are sufficiently large, and everyone meets everyone else).

You could reduce the memory requirements a little by running it in two passes. One to count votes, and the second to count coincident posts only for user pairs that voted for each other enough times to be suspicious.
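
A rough sketch of that two-pass idea, on small in-memory data rather than the real file. The input shapes (a list of [ voter, recipient ] records and a map from thread id to posters), the threshold, and all names are assumptions for illustration, since the actual record format isn't given here.

```perl
use strict;
use warnings;

## Assumed input: one [ voter, recipient ] record per vote cast.
my @votes = ( [ 1, 2 ], [ 2, 1 ], [ 1, 2 ], [ 2, 1 ], [ 3, 1 ] );
## Assumed input: thread id => users who posted in it (each listed once).
my %thread_posters = ( 10 => [ 1, 2, 3 ], 11 => [ 1, 2 ], 12 => [ 2, 3 ] );
my $min_mutual_votes = 2;    # hypothetical threshold

## Pass 1: count votes per ordered (voter, recipient) pair.
my %votes_for;
++$votes_for{ "$_->[0]:$_->[1]" } for @votes;

## Keep only pairs where each side voted for the other often enough.
## Keys are stored as "lower:higher" so each pair appears once.
my %suspect;
for my $key ( keys %votes_for ) {
    my( $u, $v ) = split ':', $key;
    next unless $u < $v;
    my $back = $votes_for{ "$v:$u" } || 0;
    $suspect{ $key } = 0
        if $votes_for{ $key } >= $min_mutual_votes and $back >= $min_mutual_votes;
}

## Pass 2: count coincident threads, but only for the suspect pairs.
for my $posters ( values %thread_posters ) {
    my @sorted = sort { $a <=> $b } @$posters;
    for my $i ( 0 .. $#sorted - 1 ) {
        for my $j ( $i + 1 .. $#sorted ) {
            my $key = "$sorted[ $i ]:$sorted[ $j ]";
            ++$suspect{ $key } if exists $suspect{ $key };
        }
    }
}

print "Pair $_ shared $suspect{ $_ } thread(s)\n" for sort keys %suspect;
```

The second hash only ever holds pairs that survived the vote filter, which is where the memory saving would come from.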
