http://qs321.pair.com?node_id=120136

In light of the recent rash of duplicate posts I started thinking about auto-reaping of duplicate nodes. It seems to me that it wouldn't be too hard, when a node is submitted, to have it automatically diff'd against all recent (e.g. < 1 hr) posts by the same author (or maybe the last 5 posts or something) and to reject it if it's a duplicate. The node could be automagically reaped, which would save the editors and moderators a bit of time as well as solve the problem (as happened in the example above) where different reply threads end up under the different duplicate posts.
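Something along these lines is what I have in mind (just a sketch, not actual Everything engine code; the table and column names here are invented):

    use strict;
    use warnings;
    use DBI;

    # Sketch only: before a new node is created, compare its text against the
    # same author's most recent posts and refuse exact duplicates. The schema
    # (a 'node' table with author_user, createtime, doctext) is hypothetical.
    sub is_recent_duplicate {
        my ($dbh, $author_id, $text) = @_;
        my $recent = $dbh->selectcol_arrayref(
            'SELECT doctext FROM node
              WHERE author_user = ?
                AND createtime > NOW() - INTERVAL 1 HOUR
              ORDER BY createtime DESC LIMIT 5',
            undef, $author_id,
        );
        for my $old (@$recent) {
            return 1 if $old eq $text;    # exact duplicate: auto-reap it
        }
        return 0;
    }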

Replies are listed 'Best First'.
Re: Auto-reaping of duplicates
by tommyw (Hermit) on Oct 20, 2001 at 00:41 UTC

    I'm sure this must have been suggested before, but I wasn't watching then...

    When a comment form is generated, include a "magic number" as a hidden field. The only condition is that the number be unique. Then track the numbers that are submitted.

    Depending on performance implications, either note the generation of the number and strike it when the comment is submitted (generating an error if the number is not found because it has already been struck), or simply record every number that is submitted (generating an error if the number has been seen before). With either mechanism, flush entries from the cache after a certain time to keep it down to size.
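    A rough sketch of the second variant (recording every submitted number), assuming some shared store for the seen tokens; none of this is actual PerlMonks code:

        use strict;
        use warnings;
        use CGI;

        my %seen;    # in practice this would live in the DB or a shared cache

        # Embed a unique token in the comment form as a hidden field.
        sub form_token_field {
            my $token = join '-', time(), $$, int rand 1_000_000;
            return qq{<input type="hidden" name="magic" value="$token">};
        }

        # On submission, refuse anything whose token has already been seen.
        sub handle_submit {
            my $q     = CGI->new;
            my $token = $q->param('magic') or die "missing magic number\n";
            die "duplicate submission\n" if $seen{$token}++;
            # ... create the node as usual ...
        }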

(crazyinsomniac) Re: Auto-reaping of duplicates
by crazyinsomniac (Prior) on Oct 20, 2001 at 00:30 UTC
    Either automagically delete the reaped nodes, or at least prevent replies to the potential duplicates (as well as consideration), until a wise editor can determine what to do.

    We have plenty of developers now, and hopefully one of them will see this, and submit a patch...

    update: apparently I, among others, am part of the pmdev group as opposed to the developers group, go figure... ;D

     
    ___crazyinsomniac_______________________________________
    Disclaimer: Don't blame. It came from inside the void

    perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

Re: Auto-reaping of duplicates
by demerphq (Chancellor) on Oct 20, 2001 at 16:42 UTC
    Well, if it's exact duplicates you're worried about, then why not set up a unique index containing the MD5 checksum of each post and prevent duplicates from ever getting into the DB in the first place? That would be pretty simple to calculate and very fast, with pretty low memory overhead as well (OTOH I haven't looked into the Everything code). If it was configured to quietly ignore the dupes, I would guess it would be an easy fix.
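    Something like the following, maybe (MySQL-style INSERT IGNORE assumed; the 'digest' column and its UNIQUE index are hypothetical):

        use strict;
        use warnings;
        use DBI;
        use Digest::MD5 qw(md5_hex);

        # Hypothetical schema: the node table carries a digest CHAR(32) column
        # with a UNIQUE index, so INSERT IGNORE quietly drops exact duplicates.
        sub insert_node {
            my ($dbh, $author_id, $title, $text) = @_;
            my $rows = $dbh->do(
                'INSERT IGNORE INTO node (author_user, title, doctext, digest)
                 VALUES (?, ?, ?, ?)',
                undef, $author_id, $title, $text, md5_hex($text),
            );
            return $rows && $rows > 0;    # false when the dupe was ignored
        }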

    Yves
    --
    You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

      It's important to remember that MD5 produces a hash of the content. Just because two items produce the same checksum doesn't mean their content is identical (if it did, then we would never need the infinite supply of monkeys, as there would only be 2^128 possible texts). It's almost a certainty, but not quite. It'd certainly be highly embarrassing to block somebody's 5-page thesis because it happened to have the same MD5 checksum as an already existing "me too!" post.

      So MD5 can be used as a first cut for uniqueness, but still has to be followed up with a more precise check if the checksums do turn out to be the same. Just in case...
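      In other words, something along these lines (a hypothetical helper, just to show the shape of the check):

          use strict;
          use warnings;
          use Digest::MD5 qw(md5_hex);

          # Compare the cheap checksums first; only if the digests match do we
          # fall back to comparing the full text, to guard against a collision.
          sub is_duplicate {
              my ($new_text, $old_text, $old_digest) = @_;
              return 0 if md5_hex($new_text) ne $old_digest;
              return $new_text eq $old_text;
          }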

          only be 2^128 possible texts

        only? Only!? Do you have any idea how big 2^128 is?
        2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 > 3*10^38

        Which is bigger than the number of cups of water in all the oceans (6*10^21)
        Bigger than the distance from one end of the universe to the other in inches...(2*10^28)
        Bigger than the volume of the sun in cubic inches...(8*10^31)
        Bigger than the area of the galaxy in square miles...(3*10^35)
        Approaching the number of atoms in our atmosphere.....(2*10^44)

        (from bignum)

        Perhaps a secondary check is in order, but I'd hardly use 'only' when talking about 2^128 hash buckets.
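        (If you want to check the figure yourself, the core Math::BigInt module will do it:)

            use strict;
            use warnings;
            use Math::BigInt;

            # 2**128 overflows a native integer, so use arbitrary precision.
            my $buckets = Math::BigInt->new(2)->bpow(128);
            print "$buckets\n";    # 340282366920938463463374607431768211456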

        Update:
        Ok, let's play with the numbers some more:

        Let's assume perlmonks has 300,000 nodes (3*10^5) and has 3*10^38 buckets in its hashing algorithm. The ratio of nodes to buckets is 3*10^5 : 3*10^38, or 1 : 10^33.

        Now, consider this lottery where you pick six different numbers from 1-49. Get all six right and you win the jackpot. As the page above notes, the chances of winning with one ticket are:

        1 : 13,983,816 ( (49*48*47*46*45*44)/(6*5*4*3*2*1) ) or about:
        1 : 10^7

        Let's buy one ticket a week for four weeks... the odds of winning *all* four lotteries with our four tickets are 1 : (10^7)^4, or 1 : 10^28.

        That *still* doesn't get you there... after winning your four lotteries, we'll take you to one of the new huge NFL stadiums being built, and you have to gamble all your winnings on picking a specific, randomly-chosen seat (1 : 10^5).

        So the chances of my next post colliding with a node already in the database (1 : 10^33) are about the same as you winning four lotteries on four tickets, then picking the single correct seat out of a gigantic stadium (1 : 10^28 * 10^5).
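        (And the lottery figure is just the binomial count, e.g.:)

            use strict;
            use warnings;

            # 49-choose-6 ways to pick the six winning numbers:
            my $tickets = (49 * 48 * 47 * 46 * 45 * 44) / (6 * 5 * 4 * 3 * 2 * 1);
            printf "1 in %d\n", $tickets;    # 1 in 13983816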

        -Blake

        Well, I have quite a bit of trouble believing that two posts with different authors and different names would generate the same MD5. I suppose it's possible, but I guess the posts would have to be very, very long indeed. I seriously doubt that it's possible to get the same MD5 from different data when the data is small, especially as small as a post would be. But then I don't know the full workings of MD5...

        Of course, the extra check is cheap, so why not...

        :-)

        Yves
        --
        You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

Re: Auto-reaping of duplicates
by Zecho (Hermit) on Oct 24, 2001 at 05:36 UTC
    My suggestion would be: 5+ delete votes with 0 edit or keep votes = autoreap!