in reply to Catching Cheaters and Saving Memory
Given those numbers, it's time to be a little smart about what you need to know.
Since you're looking for cheaters, your first task is to decide at what level a mutually beneficial voting pattern can be attributed to cheating rather than coincidence. For example, if two users have voted once each, in the only thread they have both posted to, you're probably not going to divine that as a pattern of malicious intent.
With a million users, a billion threads and 5 billion posts, each thread will average 5 posts and each user about 5,000. Set a threshold for the minimum number of votes a user has to have cast before you will start vetting them for cheating (say 20 votes?), then make your first pass of the data accumulating votes against users. This will require a hash of 1 million numerical scalar values, around 50 MB.
You then delete any entries in the hash that have fewer votes than your MIN_VOTES_THRESHOLD. Set at 20, this is likely to discard 80 or 90% of the users. You can then make your second pass, accumulating a hash of pairs showing for whom each of the qualifying voters cast their votes. As suggested elsewhere, if you use the pairs of userids as the keys in a single hash, and skip any voters that no longer exist in your first-pass hash, then this will likely take less space than the original hash did before the discard step.
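As an illustration of the single-hash idea, here is a minimal sketch of folding a pair of userids into one key with pack. It assumes userids are integers that fit in 32 bits; the helper names pair_key and pair_ids are made up for this example.

```perl
use strict;
use warnings;

## Hypothetical helpers: fold two 32-bit userids into one 8-byte hash key
## and recover them again. Assumes userids are integers below 2**32.
sub pair_key { return pack 'LL', @_[ 0, 1 ] }
sub pair_ids { return unpack 'LL', $_[ 0 ] }

my %pairs;
++$pairs{ pair_key( 123456, 987654 ) } for 1 .. 3;   ## same pair seen three times
++$pairs{ pair_key( 987654, 123456 ) };              ## reverse order is a distinct key

for my $key ( sort keys %pairs ) {
    printf "%d -> %d: %d (key is %d bytes)\n",
        pair_ids( $key ), $pairs{ $key }, length $key;
}
```

Each key is a fixed 8 bytes however large the userids are, which is where the saving over string keys or nested hashes would come from.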
It might look something like this (untested):
    use constant {
        POSTED => 0, VOTED => 1, RATIO => 2,
        MIN_VOTES_THRESHOLD => 20,    ## the '20 votes' suggested above
        POST_VOTE_THRESHOLD => 0.9,   ## placeholder only; tune it against the averages printed below
    };

    ## Accumulate users only if they voted, and count their votes.
    ## (One-argument open takes the filename from the package variable $BIGFILE.)
    my %users;
    open BIGFILE or die "Cannot open: $!";
    while( <BIGFILE> ) {
        my( $user, $thread, $voted ) = split;
        ++$users{ $user } if $voted;
    }
    close BIGFILE;

    ## Discard all users who have voted fewer times than a sensible threshold.
    $users{ $_ } < MIN_VOTES_THRESHOLD and delete $users{ $_ } for keys %users;

    ## Re-scan the file, accumulating counts of posts and votes
    ## *for those userids remaining in %users only*
    ## Assumes file ordered by threadid.
    my %pairs;
    open BIGFILE or die "Cannot open: $!";
    my( $user, $thread, $voted ) = split ' ', <BIGFILE>;
    my $lastThread = $thread;
    while( defined $thread ) {
        my @users;
        while( defined $thread and $thread == $lastThread ) {
            ## Accumulate users/votes in each thread
            push @users, "$user:$voted";
            ## Read ahead one line; at end of file $thread becomes undef,
            ## which ends both loops once the final thread has been paired.
            my $line = <BIGFILE>;
            ( $user, $thread, $voted ) = defined $line ? split( ' ', $line ) : ();
        }
        $lastThread = $thread;

        ## Permute them to generate pairs.
        ## Cnr is not defined here; it is assumed to return every 2-element
        ## combination of @users as an array ref.
        for my $pair ( Cnr 2, @users ) {
            my( $user1, $voted1 ) = split ':', $pair->[ 0 ];
            my( $user2, $voted2 ) = split ':', $pair->[ 1 ];

            ## Skip if either is not in the 'high voters' list
            next unless exists $users{ $user1 } and exists $users{ $user2 };

            ## Otherwise increment the coincident pair count (and vote if applicable).
            my $key = pack 'LL', $user1, $user2;
            ++$pairs{ $key }[ POSTED ];
            ++$pairs{ $key }[ VOTED ] if $voted1;
        }
    }
    close BIGFILE;

    ## Scan the pairs generating a ratio of votes to posts.
    my( $totalRatio, $maxRatio ) = ( 0 ) x 2;
    for ( keys %pairs ) {
        my $pair = $pairs{ $_ };
        $pair->[ RATIO ] = ( $pair->[ VOTED ] || 0 ) / $pair->[ POSTED ];
        $totalRatio += $pair->[ RATIO ];
        $maxRatio = $pair->[ RATIO ] if $maxRatio < $pair->[ RATIO ];
    }

    ## The average ratio of pairwise votes to posts might form the basis for discrimination
    my $averageRatio = $totalRatio / ( scalar( keys %pairs ) || 1 );
    printf "The voted/posted ratios averaged to %f; with a maximum of %f\n",
        $averageRatio, $maxRatio;

    ## Display those pairs with a vote/post ratio above the threshold.
    $pairs{ $_ }[ RATIO ] > POST_VOTE_THRESHOLD
        and print "Pair @{[ unpack 'LL', $_ ]} had a vote/post ratio of @{ $pairs{ $_ } }[ POSTED, VOTED, RATIO ]\n"
            for keys %pairs;
On my hardware, the two passes would take around 20 hours. The memory consumed will depend upon the level at which you set MIN_VOTES_THRESHOLD, but should be under 300 MB if you set it at a reasonable level.
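The listing calls Cnr, which the post does not define. What follows is a minimal sketch of a stand-in, assuming Cnr is meant to return every $n-element combination of its list as an array reference. The name and calling convention are taken from the call site; the body is only an assumption.

```perl
use strict;
use warnings;

## Stand-in for the undefined Cnr: returns all $n-element combinations of
## @list, each as an array ref, keeping input order within a combination.
sub Cnr {
    my( $n, @list ) = @_;
    return [] if $n == 0;
    my @combos;
    while( @list >= $n ) {
        my $first = shift @list;
        ## Every combination that starts with $first, built from the rest.
        push @combos, map { [ $first, @$_ ] } Cnr( $n - 1, @list );
    }
    return @combos;
}

## Three posters in one thread, in the "user:voted" form the listing builds.
print join( ' + ', @$_ ), "\n" for Cnr( 2, qw( 11:1 22:0 33:1 ) );
```

The sub would need to be defined ahead of the main loop for the paren-less call `Cnr 2, @users` to parse. Combinations yield each pair once, in one order only, so a thread with n posters contributes n(n-1)/2 pairs.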
Re^2: Catching Cheaters and Saving Memory
by pengvado (Acolyte) on Oct 16, 2006 at 17:24 UTC