Thus far in my Perl career, I’ve never had to worry about memory. Equipped with a nice machine at the office, and problems rich in interest but low in memory demands, I’ve been writing Perl for legibility, and occasionally for speed. But now my memory-loose ways are catching up with me.
I can't talk about my problem domain, except to say that we're trying to catch cheaters. However, I can explain the specific problem with an analogy to a fictional version of PerlMonks. Suppose that:
1. Everyone is an XP whore, and all votes are ++.
2. You can only vote on threads that you participated in, and you can only participate once in a given thread.
3. When you vote on a thread, everyone who posted there gets the ++.
4. Voting behavior is public.
Under those rules, we can figure out how often each monk votes ++, and we can identify pairs of monks who vote each other up with suspicious regularity. The data I get looks like this:
UserID  ThreadID  Voted
2       3         1
5       3         0
17      3         1
8       3         1
9       4         1
5       4         0
23      4         0
2       4         1
3       4         1
43      4         1
8       5         0
...
In this made-up data, there were four participants in thread 3, and three of them voted ++. There were six participants in thread 4, and four of them voted ++. A few weeks ago, I got a dataset like this with about 3.8 million entries spread across ~800K threads. The catch is that you really want to walk back and forth along the data (at least within each thread), so if it were a tiny dataset, I would simply have slurped it. Since it was fairly large, I wrote a little script that goes through the data, slurps one thread at a time, and increments a pair of 2-D hashes.
I rewrote the program to conform to this fictional problem domain, so a typo might have slipped in, but I'm concerned with concepts here.
#!/usr/bin/perl
use warnings;
use strict;

open(IN, "data.txt");
my (%seen_each_other_count, %voted_each_other_count, $old_thread_id,
    @current_user_ids, @current_votes);

my $newline = <IN>;
while ($newline) {
    chomp($newline);
    (my $user_id, my $thread_id, my $vote) = split(/\t/, $newline);
    $old_thread_id = $thread_id;
    push @current_user_ids, $user_id;
    push @current_votes,    $vote;

    $newline = <IN>;
    chomp($newline);
    ($user_id, $thread_id, $vote) = split(/\t/, $newline);
    while ($thread_id == $old_thread_id) {
        push @current_user_ids, $user_id;
        push @current_votes,    $vote;
        $newline = <IN>;
        chomp($newline);
        ($user_id, $thread_id, $vote) = split(/\t/, $newline);
    }

    for (my $i = 0; $i <= $#current_user_ids; $i++) {
        for (my $j = $i; $j <= $#current_user_ids; $j++) {
            # When $j==$i this just counts how many times you appear,
            # and how many times you vote
            $seen_each_other_count{$current_user_ids[$i]}{$current_user_ids[$j]}++;
            if ($current_votes[$i] == 1 && $current_votes[$j] == 1) {
                $voted_each_other_count{$current_user_ids[$i]}{$current_user_ids[$j]}++;
            }
        }
    }

    $old_thread_id = $thread_id;
    @current_user_ids = ();
    @current_votes    = ();
    push @current_user_ids, $user_id;
    push @current_votes,    $vote;
    $newline = <IN>;
    chomp($newline);
}

# Since the data is read down, the total number of times that
# one saw two is equal to the number of times that we counted one seeing two,
# plus the number of times we counted two seeing one
sub numerically { $a <=> $b }

open(OUT, ">PMTEST.txt");
print OUT "user1\tuser2\tsaw_count\tvote_count\n";
foreach my $first (sort numerically (keys %seen_each_other_count)) {
    my $ref = $seen_each_other_count{$first};
    foreach my $second (sort numerically (keys %$ref)) {
        my ($final_count, $final_votes);
        if ($first == $second) {
            $final_count = $seen_each_other_count{$first}{$second};
            $final_votes = $voted_each_other_count{$first}{$second};
        } else {
            $final_count = $seen_each_other_count{$first}{$second}
                         + $seen_each_other_count{$second}{$first};
            $final_votes = $voted_each_other_count{$first}{$second}
                         + $voted_each_other_count{$second}{$first};
        }
        $final_votes ||= 0;
        print OUT "$first\t$second\t$final_count\t$final_votes\n";
    }
}
That worked very well for this dataset, but my bosses so enjoyed the results that they're getting a new dataset which, instead of being 3.8 million lines, is about 5 billion lines (and 55 gigabytes).
So, last time it was enough not to slurp the data, but now I'm worried that the hashes are going to blow up (have any of you ever had a million-by-million 2-D hash?). I've been pondering how to deal with this, and the first thing I've come up with is to pick a number of users (say 100), and first just count which people those 100 users see and vote for. Then I can write that out to disk and do the next 100 users.
Doing this in practice would mean only incrementing a hash entry if the first index is in the set of users I'm currently working on, and skipping any thread that contains no user of interest (a rough sketch follows below). However, I'm worried that this approach will take a month.
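To make that concrete, here's a rough, untested sketch of what one pass over the data would look like. The batch of user IDs, the batch size, and the file names are just placeholders:

#!/usr/bin/perl
# Rough, untested sketch of the "N users at a time" idea.
use warnings;
use strict;

my @batch     = (2, 5, 17);                      # user IDs handled on this pass (placeholder)
my %is_target = map { $_ => 1 } @batch;

my (%seen_count, %voted_count);
my ($old_thread_id, @ids, @votes);

open(my $in, '<', 'data.txt') or die "Can't open data.txt: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($user_id, $thread_id, $vote) = split /\t/, $line;
    if (defined $old_thread_id && $thread_id != $old_thread_id) {
        process_thread(\@ids, \@votes);
        @ids   = ();
        @votes = ();
    }
    $old_thread_id = $thread_id;
    push @ids,   $user_id;
    push @votes, $vote;
}
process_thread(\@ids, \@votes) if @ids;          # don't forget the final thread
close $in;

sub process_thread {
    my ($ids, $votes) = @_;
    # Skip the whole thread unless someone in the current batch posted in it
    return unless grep { $is_target{$_} } @$ids;
    for my $i (0 .. $#$ids) {
        next unless $is_target{ $ids->[$i] };    # only keep pairs whose first index is in the batch
        for my $j (0 .. $#$ids) {
            $seen_count{ $ids->[$i] }{ $ids->[$j] }++;
            $voted_count{ $ids->[$i] }{ $ids->[$j] }++
                if $votes->[$i] == 1 && $votes->[$j] == 1;
        }
    }
}

# After the pass, write %seen_count and %voted_count out for this batch,
# clear them, and start over with the next batch of users.

Since $j runs over everyone in the thread, each pass counts {batch_user}{other} directly, so there's no need to sum the two key orders at output time the way the first script does.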
Here's the issue: I don't have the data yet, but my bosses want the results ASAP once we do. That means I have to be ready with a couple of scripts for this job, so that if one takes too much memory, I can switch to another, and if that one is going to take a month, I can switch to a third.
When I read SOPW, I'm always impressed by the different and clever approaches people take to solving problems. Do any of you have a different way of dealing with this that will be reasonably fast and not take too much memory? I'm happy to work out the details of the code; I'm mostly looking for ideas.
Thanks so much,
Hays
Update: One of the problems with not yet having the dataset is that I don't have the exact numbers on how many users and threads there are. I think a fair (i.e. leaning towards worst case) assumption is that those numbers scale linearly from the smaller dataset (the number of participants in a given thread is capped in my real problem domain). That would give ~1 million users and ~1 billion threads {plus or minus a few orders of magnitude :( }.
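(For what it's worth, the thread estimate is just simple scaling from the numbers above: 5 billion lines / 3.8 million lines ≈ 1,300×, and 800K threads × 1,300 ≈ 1 billion threads.)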