We had a run of duplicate submissions recently. Some of these are from people who've been around for a while, so I suspect that better documentation isn't the complete answer, and that some sort of mechanism is needed.

In thinking about the problem of duplicate submissions, I first thought along the lines of doing some sort of textual analysis of a submission against the most recent N submissions. The simplest analysis would be a plain "eq" comparison. A more sophisticated one would allow for a small number of differences, to catch people who post once, then back up to correct a typo and post again.
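To make the two comparisons concrete, here is a minimal sketch in Python (not the site's own code). The function names, the 0.95 threshold, and the use of difflib's similarity ratio as the measure of "a small number of differences" are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def is_exact_duplicate(new_text, recent_texts):
    """The simple "eq" check: true if new_text equals any recent submission."""
    return any(new_text == old for old in recent_texts)


def is_near_duplicate(new_text, recent_texts, threshold=0.95):
    """True if new_text is at least `threshold` similar to any recent submission.

    The threshold is a hypothetical tuning knob; it would need adjusting
    against real posts before anyone relied on it.
    """
    return any(
        SequenceMatcher(None, new_text, old).ratio() >= threshold
        for old in recent_texts
    )
```

A post resubmitted after a one-word typo fix would fail the exact check but pass the near-duplicate check, which is the case described above.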

But I wonder now if a simpler check might be sufficient. On submitting a new, non-reply node, a quick check of the N most recent titles, for some small N (like 1, 2, or 3), would have caught the majority of the duplicate posts made in the last month.

If/when I accidentally double-submit, I'd rather be told

You posted a node with the same title 47 seconds ago.
than discover that I'd goofed after the fact.
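The title check might look something like the following sketch, written in Python for illustration. The function name, the shape of `recent_nodes` (a newest-first list of title and timestamp pairs), and the default of N = 3 are assumptions; which nodes get passed in (site-wide or only the poster's own) is left to the caller.

```python
from datetime import datetime


def check_recent_titles(title, recent_nodes, now, n=3):
    """Compare a new root node's title against the n most recent titles.

    recent_nodes: list of (title, posted_at) tuples, newest first.
    Returns a warning string on a match, or None if the title is new.
    """
    for old_title, posted_at in recent_nodes[:n]:
        if old_title == title:
            seconds = int((now - posted_at).total_seconds())
            return f"You posted a node with the same title {seconds} seconds ago."
    return None
```

Because only N short strings are compared, the cost per submission stays small.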

Does this sound workable, or am I missing something?

Re: Double double Post post
by mowgli (Friar) on Feb 15, 2003 at 09:14 UTC

    I suspect that duplicate submissions are, to a considerable extent at least, caused by people accidentally clicking on Submit twice, or by two post requests otherwise being generated and sent. A simple solution for this would be, I think, to add a hidden input field to the form containing a randomly-generated value that would then be used to determine whether that form had been submitted already or not. A hash function like MD5 or SHA-1 could (and should) be used for this; that way, it would be reasonable to assume at the very least that there would be no false rejections.

    The advantage of this approach would be that it would actually be possible to post nodes with the same title in rapid succession - this could very well be needed with comments, too, where node titles like "Re: Re: some question" could well belong to nodes with different contents even if they are posted only 47 seconds or so apart from each other.
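A minimal sketch of the hidden-field idea, in Python for illustration. The class and method names are hypothetical, the set of outstanding tokens is kept in memory rather than in a database, and the token comes from Python's `secrets` module instead of an MD5 or SHA-1 digest; the principle is the same either way.

```python
import secrets


class FormTokens:
    """Issues one-time tokens for forms and rejects a token seen twice."""

    def __init__(self):
        # Tokens that have been issued but not yet used by a submission.
        self._outstanding = set()

    def issue(self):
        """Create a token to embed in the form's hidden input field."""
        token = secrets.token_hex(16)
        self._outstanding.add(token)
        return token

    def redeem(self, token):
        """Return True the first time a valid token is submitted, else False."""
        if token in self._outstanding:
            self._outstanding.remove(token)
            return True
        return False
```

Two posts with identical titles, made from two separately rendered forms, carry different tokens and are both accepted; the same form submitted twice is accepted only once.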

    --
    mowgli

Re: Double double Post post
by Coruscate (Sexton) on Feb 15, 2003 at 09:19 UTC

    If a user does indeed try to stop the submission process in order to fix the node up a bit more, then it means that the first one posted is more likely to be less-approved of (by the author and others). In which case, wouldn't it be better for a double-submit to actually replace the first submission with the second?

    As for how to detect a double-submission, there are a few ways I can think of: do a database query based on author, title, and perhaps content length? It's the last part that makes me itch. You can't really match on the actual text, because it might change if the user modifies the node before re-posting. You can't just count on author and title, because the same person might post two replies within the same thread in a short period of time. Also, anonymous monk replies would be harder to detect as well.

    So I propose... why not just another form value (a simple auto-incrementing value in the database)? Once a node has been posted with that integer value, don't allow the same form value to be posted again. Still, this will frustrate the heck out of those who are simply trying to fix a mistake or two (spelling, grammar, code errors, etc.). So another proposal: why not also allow editing of all root nodes? I'm sure that such power wouldn't be abused... in fact, it might make things look nicer around here :)
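Here is a rough sketch of the auto-incrementing form value, in Python for illustration, including the replace-on-resubmit option raised earlier in this reply. `NodeStore`, its method names, and the `replace` flag are all hypothetical, and an in-memory dict stands in for the database.

```python
import itertools


class NodeStore:
    """Stores nodes keyed by the form value they were submitted with."""

    def __init__(self):
        self._next_form_id = itertools.count(1)  # stand-in for a DB sequence
        self.nodes = {}  # form_id -> (title, text)

    def new_form_id(self):
        """Value to embed in a freshly rendered submission form."""
        return next(self._next_form_id)

    def submit(self, form_id, title, text, replace=False):
        """Store a node; a repeated form_id is rejected unless replace is set."""
        seen = form_id in self.nodes
        if seen and not replace:
            return "rejected"
        self.nodes[form_id] = (title, text)
        return "replaced" if seen else "stored"
```

With `replace=False` the second submission is refused outright; with `replace=True` the corrected version overwrites the first, which is the friendlier behaviour for someone fixing a typo.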

    On a final note, I do agree that over the last week or so, there have been way too many duplicate posts. I imagine we are cluttering the database up with many unnecessary reaped nodes. Sometimes I wonder why the 'reaped nodes' idea was invented, rather than the 'delete the actual database entry'. I'm not completely stupid: I'm sure there's a reason the node contents are still kept around... is there a chance that anybody would enlighten me on this subject?


    Update: It seems that mowgli touched on most of the same thoughts as I presented :)


    If the above content is missing any vital points or you feel that any of the information is misleading, incorrect or irrelevant, please feel free to downvote the post. At the same time, reply to this node or /msg me to tell me what is wrong with the post, so that I may update the node to the best of my ability. If you do not inform me as to why the post deserved a downvote, your vote does not have any significance and will be disregarded.

Re: Double double Post post
by robartes (Priest) on Feb 15, 2003 at 12:31 UTC
    For one class of duplicate posts, the 'hitting submit twice' post, a hash / MAC based system, as suggested by mowgli, might indeed work, and a text analysis of some kind might even catch slightly deviating double posts. It will be up to the pmdevils to judge the validity of this approach, especially when one takes the extra DB and CPU load into account.

    There is a second class of duplicates though, as evidenced by this, its cousin, his brother and their father, conveniently posted by anonymonk, so an analysis based on author to find duplicates is out of the window. This kind of duplicate post cannot be attributed to hitting submit twice, but seems to be simply a product of the 'give me the solution, and give it to me quick' mentality and the slightly misguided idea that repetition and restatement are replacements for patience and restraint. I think this type of duplicate post is a bigger problem than the occasional accidental double post.

    To a certain extent, this type of post can be dealt with by the standard voting and consideration systems, but as most of them seem to be by anonymonks (for a good reason, perhaps), that only goes a part of the way to dealing with them. Unfortunately, I don't quite see a solution to it, except to grin and bear it. Ignoring these posts seems only to produce more of them...

    Do any other monks have any thoughts on this?

    CU
    Robartes-

Re: Double double Post post
by impossiblerobot (Deacon) on Feb 15, 2003 at 13:41 UTC

    We also get duplicate posts because of new visitors who don't understand the node approval system, and re-post because their question doesn't show up in the appropriate section (or on the front page) immediately.

    Unfortunately, I can't think of a good solution to this particular problem. There's no way to automatically check for this, and most visitors to an online forum don't lurk long enough to find out how a particular site works.


    Impossible Robot
      We also get duplicate posts because of new visitors who don't understand the node approval system, and re-post because their question doesn't show up in the appropriate section (or on the front page) immediately.

      An opportunity for education. The AnonyMonk doesn't see his or her initial post, then tries again. If they use the same title, they would see something like:

      You submitted a node with that title 1 minute ago, but it isn't visible because it hasn't been approved yet. All new submissions must be approved before they're visible. This usually doesn't take too long unless the submission is off-topic, offensive, or incomprehensible. Please be patient, and check back later.
Re: Double double Post post
by Ryszard (Priest) on Feb 15, 2003 at 11:17 UTC
    I think that idea could be workable as long as the match is constrained to the same user.

    Automation is generally better than documentation: you only have to get automation right once, but you've got to get manual processes correct every time.

Re: Double double Post post
by demerphq (Chancellor) on Feb 16, 2003 at 01:44 UTC

    or am I missing something?

    Well, you are missing that there is some form of duplicate checking already in place. I haven't reviewed the code responsible but a little experimentation shows you can't submit the same node contents with the same node title twice.

    This means that the duplicates are not really duplicates at all, which makes trapping them a lot harder. Perhaps some kind of normalizing could be applied so that near dupes are caught and a warning with an override is generated. Something like stripping out non-alphanumerics and spaces and lowercasing the lot might work, though I think whatever technique is used probably needs to be pretty fast. So long as the user could still force the submission through, a fuzzy matcher wouldn't be that bad.
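As a sketch of that normalizing step, in Python for illustration: the function names and the `force` flag are hypothetical, and the pattern assumes ASCII text only, so accented or non-Latin characters would be stripped along with the punctuation.

```python
import re


def normalize(text):
    """Lowercase the text and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def is_probable_dupe(new_text, old_texts):
    """True if new_text matches any earlier text once both are normalized."""
    wanted = normalize(new_text)
    return any(wanted == normalize(old) for old in old_texts)


def submit(new_text, old_texts, force=False):
    """Warn on a probable dupe unless the user forces the submission through."""
    if not force and is_probable_dupe(new_text, old_texts):
        return "warning"
    return "accepted"
```

Normalizing is a single pass over each string, so it stays cheap, and the `force` path keeps a false match from blocking a legitimate post.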

    However, even that isn't going to stop either of the dupes that I made recently. In one I didn't realize I had hit submit, then went back, added a paragraph, and submitted again. The other was when I did the same but changed the title. How do you stop that happening accidentally when it is quite possible that someone could post the same node in two different threads?

    Ultimately I think PM already has simple dupe blocking, and the utility of making it noticeably smarter is outweighed by diminishing returns. Doesn't mean I won't have a nosey around the source when I next get a chance, but you can see my point. :-)

    ---
    demerphq


Re: Double double Post post
by grantm (Parson) on Feb 15, 2003 at 20:21 UTC

    I also like mowgli's suggestion of a unique id in a hidden form field. One common case it wouldn't catch, though, is when someone posts, realises they weren't logged in, logs in, and reposts under their real name. Perhaps the preview screen could ask anonymous monks if they wish to log in before posting?