vili has asked for the wisdom of the Perl Monks concerning the following question:

Greetings o enlightened ones,
A new problem has presented itself: apparently some users register more than once and use the extra accounts to bypass restrictions and abuse our site. I'd like to score similarities in user data by comparing the values for each userid and assigning points for each match or similarity, like SpamAssassin does. For instance, if the secret question/answer combo matched, that would be 2 points; likewise for the same password and a similar username, and so forth. I'm not really sure how to handle comparing the values of each userid to the rest; it just sounds very resource intensive and clumsy. This, give or take, is the format that I'd like to use, but what next?
    #!/usr/bin/perl

    use warnings;
    use strict;
    use DBI;

    my ($db,$sql,%shady,$query,$c,$count,$userid);

    $sql = "select userid, firstname, lastname, password, username, dob, zipcode, age, remotehost, ut, secret_question, secret_answer from users where status='ok' limit 5;";
    $db = DBI->connect("DBI:mysql:database=my_user_table;host=db_server","mysqluser","",{RaiseError=>0});
    if ( !$db ) {
        print "Cannot contact db_server, bypassing.";
    }
    $count=0;
    $query = $db->prepare($sql);
    $query->execute();
    while ($c = $query->fetchrow_hashref) {
        $count++;
        $userid=$c->{"userid"};
        $shady{$userid}->{firstname}=$c->{"firstname"};
        $shady{$userid}->{lastname}=$c->{"lastname"};
        $shady{$userid}->{password}=$c->{"password"};
        $shady{$userid}->{username}=$c->{"username"};
        $shady{$userid}->{dob}=$c->{"dob"};
        $shady{$userid}->{zipcode}=$c->{"zipcode"};
        $shady{$userid}->{age}=$c->{"age"};
        $shady{$userid}->{remotehost}=$c->{'remotehost'};
        $shady{$userid}->{ut}=$c->{"ut"};
        $shady{$userid}->{secret_question}=$c->{"secret_question"};
        $shady{$userid}->{secret_answer}=$c->{"secret_answer"};
    }
    $db->disconnect();


Of course, I'd also be getting info from other tables, etc. Any ideas, and/or suggestions would be appreciated.

Regards:
~vili

Replies are listed 'Best First'.
Re: Identifying fraudulent users, by comparing values in database. with a hash..?
by Roger (Parson) on Sep 24, 2003 at 01:29 UTC
    Observation 1
    Comparing username/firstname/lastname might not be essential. Look, an abuser is not going to use the same user name (or the real user name) twice, right?

    Observation 2
    Comparing passwords could score +++. The hacker might use the same password for different accounts.

    Observation 3
    Secret question/answer - could work if these were typed by hand and not selected from a drop-down list. Again, if the answers from two accounts match, then there is a big chance that it's a duplicate.

    Observation 4
    Age/zipcode/dob etc. are irrelevant, as the hacker will most certainly conceal his/her identity.

    Observation 5
    Remote host - is this an IP number? It might be a good idea to compare this if two accounts have similar/same secret answer, similar/same password. See if two accounts are from the same sub-net, etc.
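
The same-subnet check can be sketched in a few lines. This is only an illustration: it assumes `remotehost` holds a dotted-quad IPv4 address (the thread does not say), it treats "same sub-net" as "same first three octets", and `same_subnet24` is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: true if two dotted-quad IPv4 addresses share the
# same /24 prefix (first three octets). Anything that does not look like
# an IPv4 address, e.g. a hostname or undef, counts as "no match".
sub same_subnet24 {
    my ($ip_a, $ip_b) = map { defined $_ ? $_ : '' } @_;
    my $quad = qr/^(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}$/;
    my ($prefix_a) = $ip_a =~ $quad or return 0;
    my ($prefix_b) = $ip_b =~ $quad or return 0;
    return $prefix_a eq $prefix_b ? 1 : 0;
}

print same_subnet24('192.168.1.10', '192.168.1.77'), "\n";   # prints 1
print same_subnet24('192.168.1.10', '192.168.2.10'), "\n";   # prints 0
```

Users behind the same proxy or ISP will share a prefix, so a match like this is probably best used as a small bonus on top of other matches rather than as evidence on its own.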

    Suggestions
    Check out the CPAN module String::Similarity to compare two similar strings.
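
To show what a "similar username" score might look like without requiring a CPAN install, here is a rough stand-in that uses plain edit distance (Levenshtein) scaled to the 0..1 range. It is not the algorithm String::Similarity uses, so the numbers will differ; `edit_distance` and `rough_similarity` are names invented for this sketch.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Classic dynamic-programming edit distance, keeping one row at a time.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1,          # deletion
                           $cur[$j - 1] + 1,       # insertion
                           $prev[$j - 1] + $cost); # substitution
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# 1 means identical (ignoring case), 0 means nothing in common.
sub rough_similarity {
    my ($s, $t) = map { lc $_ } @_;
    my $longest = max(length($s), length($t));
    return 1 if $longest == 0;
    return 1 - edit_distance($s, $t) / $longest;
}

printf "%.3f\n", rough_similarity('vili2003', 'vili2004');   # prints 0.875
```

With String::Similarity installed, its `similarity($string1, $string2)` function also returns a value between 0 and 1 and could be dropped in where `rough_similarity` is called.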

    Efficiency-wise, comparing every user against every other takes n(n-1)/2 comparisons at best, i.e. O(n^2), and most likely O(n^3) if you do additional table lookups. In other words, it's going to be processor intensive. So it would be a good idea to buffer all the data before the compare. And try to avoid named hashes to store values, because they are relatively slow to look up. So if you want to speed things up more, use pseudo-hashes instead.
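
For exact matches (same password, same secret answer) the full pairwise pass can be avoided. The sketch below swaps in a different technique from the one described above: it groups userids by field value in a hash, one pass per field, and only scores pairs that land in the same group. The sample rows, the weights, and the name `score_exact_matches` are all made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample rows standing in for the buffered result of the users query.
my %users = (
    101 => { password => 'hunter2', secret_answer => 'fluffy' },
    102 => { password => 'hunter2', secret_answer => 'fluffy' },
    103 => { password => 'xyzzy',   secret_answer => 'rex'    },
    104 => { password => 'hunter2', secret_answer => 'spot'   },
);

# Hypothetical weights: points added for each pair sharing a field value.
my %weight = ( password => 1, secret_answer => 2 );

sub score_exact_matches {
    my ($users, $weight) = @_;
    my %score;    # "lower_id,higher_id" => points
    for my $field (sort keys %$weight) {
        # One pass over the users: bucket userids by this field's value.
        my %bucket;
        for my $id (keys %$users) {
            my $value = $users->{$id}{$field};
            next unless defined $value && length $value;
            push @{ $bucket{$value} }, $id;
        }
        # Only users inside the same bucket are compared with each other.
        for my $group (values %bucket) {
            my @ids = sort { $a <=> $b } @$group;
            for my $i (0 .. $#ids - 1) {
                for my $j ($i + 1 .. $#ids) {
                    $score{"$ids[$i],$ids[$j]"} += $weight->{$field};
                }
            }
        }
    }
    return \%score;
}

my $score = score_exact_matches(\%users, \%weight);
print "$_ => $score->{$_}\n" for sort keys %$score;
# 101,102 => 3
# 101,104 => 1
# 102,104 => 1
```

A value shared by a great many users (a very common password, say) still produces a large bucket and a quadratic number of pairs inside it, so it may be worth skipping or capping buckets above some size. Fuzzy matches such as similar usernames do not fit this scheme and still need a pairwise comparison.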

    And most of all, good luck!
      Most likely O(n^3)

      Oh, O(n^k) isn't so bad. Sure, it's a bit slow, but for reasonable values of n the process *will* finish. I presume this is only being run once per user, though n will be the number of users, which could be a bit on the high side. Still... might not be too bad. Computers are pretty fast these days.

      OTOH, the other day I wrote a script that was horrible. I knew it was brute force when I wrote it, but I only needed to run it a couple of times, and n was never going to exceed 15 or so, so I didn't worry about efficiency. For n=3 it ran fine. Took a minute, but I knew it was an inefficient implementation. So I set n to 4...

      Some of you may know where this is going. Windows Me told me I was running low on disk space, so I stopped the process and discovered that the swapfile was over 180GB. (It was a recursive algorithm...) So I analysed the algorithm, and it turned out that it was approximately O((n^2)!), and using an amount of RAM proportional to running time. Yeah, that's a factorial. Guess I have to come up with a slightly more clever algorithm.


      $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
Re: Identifying fraudulent users, by comparing values in database. with a hash..?
by eric256 (Parson) on Sep 24, 2003 at 02:09 UTC

    You might also want to add a setup that requires an email address. I know it's not hard to have more than one email address, but combine that with some form of email verification and it would make it harder to create duplicate accounts.


    ___________
    Eric Hodges
Re: Identifying fraudulent users, by comparing values in database. with a hash..?
by adrianh (Chancellor) on Sep 24, 2003 at 10:10 UTC

    I can't see what you're proposing working well. Abusers are going to lie. It's trivial to find valid names and addresses with little effort. Tools like SpamAssassin work because it is hard to send, for example, spam about credit cards without mentioning the words "credit card". If there is no reason to input meaningful data then it's not going to work well. Dealing with false positives may also annoy your valid users.

    Any ideas, and/or suggestions would be appreciated.

    Depends on what business you're trying to run.

    • One way of very quickly cutting down duplicates is to charge for registration.
    • Another way is to use some piece of unique data that is hard to duplicate or make up. Credit card numbers come to mind (you don't have to charge them money - just check that it validates with the card company).
    • Have a manual process. Require them to appear in person to pick up login details, or have login details posted to their address.
    • Although it is also easy to duplicate, making them give a valid email address is another mechanism that may help cut things down.
Re: Identifying fraudulent users, by comparing values in database. with a hash..?
by vili (Monk) on Sep 24, 2003 at 17:29 UTC
    Thanks to all who replied; there has been a change of plans. Instead of using hashes and trying to come up with a list of all the users with their fraud_factor, I'll just have the user specify an id and compare its values to all the rest of the userids in the database, one row at a time. Multiple tables will be queried. This is bound to slim the process down. (On a side note, I came to find out, while trying to merge a month's worth of Apache logs, that the Intel limit for a process in RAM is 2.8 gigs, with an increase to 3.6 gigs expected with the next release.) I'm not short on criteria to check for. It is a comparison shopping/auction website, and one of the good signs that there is something fishy is that the sellers have the same products in the same price range. We get email and credit card, so that has already been taken care of in the registration portion of it: if you registered, you must have given a correct name/lastname/credit card info, and so forth. This is what I'll try to do:
    #!/usr/bin/perl
    use warnings;
    #use strict;
    use DBI;

    my ( );
    #put my variables ^ there

    $testuserid=$ARGV[0];
    # the ? placeholder keeps the command-line value out of the SQL text
    $test_sql="select firstname, lastname, password, username, dob, zipcode, age, remotehost, ut, secret_question, secret_answer from users where userid=?;";
    $db_test = DBI->connect("DBI:mysql:database=my_userdb;host=db_host","master","",{RaiseError=>0});
    if ( !$db_test ) {
        print "Cannot contact aladdin, bypassing.";
    }
    $query_test = $db_test->prepare($test_sql);
    $query_test->execute($testuserid);
    while ($d = $query_test->fetchrow_hashref()) {
        # $testuserid=$d->{"testuserid"};
        $test4firstname=$d->{"firstname"};
        $test4lastname=$d->{"lastname"};
        $test4password=$d->{"password"};
        $test4username=$d->{"username"};
        $test4dob=$d->{"dob"};
        $test4zipcode=$d->{"zipcode"};
        $test4age=$d->{"age"};
        $test4remotehost=$d->{'remotehost'};
        $test4ut=$d->{"ut"}; #cookie
        $test4secret_q=$d->{"secret_question"};
        $test4secret_a=$d->{"secret_answer"};
        $fraud_factor=0;
        # more queries, more variables
        #...........................
        #...........................
    }
    $db_test->disconnect();

    $sql = "select userid, firstname, lastname, password, username, dob, zipcode, age, remotehost, ut, secret_question, secret_answer from users where status='ok' limit 5;";
    $db = DBI->connect("DBI:mysql:database=my_userdb;host=mydbhost","master","",{RaiseError=>0});
    if ( !$db ) {
        print "Cannot contact aladdin, bypassing.";
    }
    $count=0;
    $query = $db->prepare($sql);
    $query->execute();
    while ($c = $query->fetchrow_hashref) {
        $count++;
        $userid=$c->{"userid"};
        $firstname=$c->{"firstname"};
        $lastname=$c->{"lastname"};
        $password=$c->{"password"};
        $username=$c->{"username"};
        $dob=$c->{"dob"};
        $zipcode=$c->{"zipcode"};
        $age=$c->{"age"};
        $remotehost=$c->{'remotehost'};
        $ut=$c->{"ut"};
        $secret_q=$c->{"secret_question"};
        $secret_a=$c->{"secret_answer"};
        # more queries, more variables
        #...........................
        #...........................
        # do some creative tests, while incrementing
        # fraud_factor for combinations of matches
    }
    $db->disconnect();
    print "$testuserid\t $fraud_factor";
    We have about 300k registered users, so this won't be too bad, I hope.
    As always, your feedback is appreciated.
    Regards:
    ~vili
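
The "creative tests" step in the plan above could look something like the following. It is a minimal sketch working on in-memory hashes rather than live DBI rows. Only the 2 points for a matching secret question/answer combo come from the thread; the other weights, the sample records, and the helper names `same_value`, `same_text`, and `fraud_factor` are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical rules for exact matches: [ field => points ].
my @exact_rules = (
    [ password   => 1 ],
    [ remotehost => 1 ],
    [ ut         => 2 ],    # the cookie value
);

# True only when both values are present, non-empty and identical.
sub same_value {
    my ($x, $y) = @_;
    return defined $x && defined $y && length($x) && $x eq $y ? 1 : 0;
}

# Same as above, but ignoring case (for hand-typed text).
sub same_text {
    my ($x, $y) = @_;
    return same_value(map { defined $_ ? lc $_ : undef } $x, $y);
}

# Points for how closely $other resembles the user under test.
sub fraud_factor {
    my ($test, $other) = @_;
    my $points = 0;
    for my $rule (@exact_rules) {
        my ($field, $value) = @$rule;
        $points += $value if same_value($test->{$field}, $other->{$field});
    }
    # The secret question and answer only count as a combination.
    $points += 2
        if same_text($test->{secret_question}, $other->{secret_question})
        && same_text($test->{secret_answer},   $other->{secret_answer});
    return $points;
}

# Made-up records standing in for rows from fetchrow_hashref.
my $test = { userid => 7, password => 'hunter2', remotehost => '10.0.0.5',
             ut => 'abc', secret_question => 'Pet?', secret_answer => 'Fluffy' };
my @others = (
    { userid => 8, password => 'hunter2', remotehost => '10.0.0.9',
      ut => 'zzz', secret_question => 'pet?', secret_answer => 'fluffy' },
    { userid => 9, password => 'other', remotehost => '10.0.0.5',
      ut => 'qqq', secret_question => 'Pet?', secret_answer => 'Rex' },
);

for my $other (@others) {
    next if $other->{userid} == $test->{userid};
    my $points = fraud_factor($test, $other);
    print "$test->{userid}\t$other->{userid}\t$points\n" if $points;
}
# userid 8 scores 3 (password + secret combo), userid 9 scores 1 (remotehost)
```

Scoring each candidate separately, instead of adding everything into one `$fraud_factor`, makes it possible to see which accounts look related to the tested one. In the real script the loop body would sit inside the second `while` loop, with `$c` passed as `$other`.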