in reply to Database Clean up?

The search criteria would need to be refined further. We'd want to dump users that haven't posted anything since we don't want nodes floating around whose authors no longer exist -- assuming this could even happen without violating foreign key contraints. But I don't think it's uncommon that people create an account in order to ask a question, and then leave.

We'd also have to choose a somewhat arbitrary cutoff date for delinquent users; if the creation date and last-here date match but occurred last week, dropping that user would be uncalled for. More likely we'd want to look at users that haven't visited in over a year or so.

Looking at users that have never created a node, and whose creation/last-here date occurred over a year ago would (I *speculate*) result in a rather short list. We might have a few hundred users but that's a trivial amount to store in a database, unlike your situation.

As to the safety issue of altering a production system, adding and deleting entries is just normal use for a database -- it's not like we're dropping tables or changing schemas, so I truly doubt that any havoc would result.

Update: Cool link theorbtwo. That's more users than I'd expected, although compared to the number of writeups (156786, not counting Q&A presumeably), even say 5,000 or so nonparticipating users is unlikely to be a drain on the system.

Replies are listed 'Best First'.
Re: Re: Database Clean up?
by theorbtwo (Prior) on Aug 24, 2002 at 03:40 UTC

    Actualy, it's not as trivial a number of users as you'd think. There are 3,833 users who have never logged in, not even once. There are an additional 9,668 who have logged in, but have no nodes. (These numbers are straight off of the frontpage of Crack for Statisticians.) Even just the never-logged-ins are 20% of all users. (BTW, the second number is "Logged In Once, No Write Ups", but I'd be careful with it -- number-of-logins isn't stored, only last-logged-in, and the number-of-writeups is often wrong, and doesn't count Q&A nodes, IIRC.)


    Confession: It does an Immortal Body good.

      Thanks for the numbers.

      Lets assume that each user takes up 1K, or at least that much when all the overhead of there record is included.
      use strict; my $never_logged_in = 3_833; my $logged_in_no_nodes = 9_668; my $aka_bad_users = $never_logged_in + $logged_in_no_nodes; my $disk_bytes = 1_024 * $aka_bad_users; print "Extra records to wade through = $aka_bad_users\n"; print "Wasted disk space = $disk_bytes\n";
      It appears to be around 13MB of space wasted by having "bad users" on file, I don't find that number large enough to introduce any headaches at this time, but I think it does deserve some thought on how to deal with these records in the future. Even if the removal was done for just the space considerations on the HD I don't think it would generate any great savings. There is also the option of using some form of compressed table format if we don't already, this would make all of the records more space efficent. One future plan might be to just add a column to the user table such as 'is_active' as a key so that searches could be done using it in the where critera where only "active" users are important to the results. A user could be considered inactive until the first login and then be switched active. Inactivity could be triggered by no login for over X months.

      I have only dug around slightly in the Everything engine and I think removing node association may be a risky if not deadly exercise.

      A properly indexed database table can side step the performance impact of numerous "extra" entries in a database.