PerlMonks  

MD5-based Unique Session ID Generator

by radiantmatrix (Parson)
on Aug 19, 2004 at 13:42 UTC ( [id://384287]=CUFP )

This generates session IDs that are unique but of a fixed length (32 hex digits), making it ideal for storage in databases. The ID is based on hostname, time, and some pseudo-random data. I've run a test with this to generate 50,000 IDs as fast as possible and check for collisions -- I didn't get any. (Not really a strong test, but good enough for most uses.)

Update: I'm seeing a lot of people making the assumption that these are for Web Sessions, or similarly short-life-span situations. I should have been more clear. I'm using these for a database application that is exported to a proprietary application -- so Apache anything is useless for this. Also, some of these sessions involve loading several hundred GB of data and can stay open for days, so lowering collisions over a span of time is important.

Please scan through the replies for specific details.

use strict;
use Digest::MD5;
use Sys::Hostname;

sub md5sid {
    my $hostname = hostname();
    my $serial = Digest::MD5->new;
    $serial->add($hostname);
    # Having the time in hex form is useful
    $serial->add(sprintf "%X", time());
    # This is sort of slow, but strong. Reducing
    # the param for rand() will speed things, but
    # make collisions more likely.
    for (my $i=1; $i < rand(2345678); $i++ ) {
        $serial->add(chr(int(rand(223))+32));
    }
    my $session = $serial->hexdigest();
    return $session;
}#-sub md5sid

The following alternate was suggested by simonm. I haven't personally tested it, but it seems like a Better Way.

use Data::UUID;
use constant IDGenerator => Data::UUID->new();
sub new_sid { IDGenerator->create() }

See the Data::UUID module on CPAN.
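
The collision test described above (generate a batch of IDs and check for repeats) could be sketched roughly as follows. This is not the harness actually used for the 50,000-ID run; `count_collisions` and the counter-based demo generator are hypothetical names introduced for illustration only.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: call $generator $count times and report how many
# of the returned IDs had already been seen earlier in the run.
sub count_collisions {
    my ($generator, $count) = @_;
    my %seen;
    my $collisions = 0;
    for (1 .. $count) {
        my $id = $generator->();
        $collisions++ if $seen{$id}++;
    }
    return $collisions;
}

# Stand-in generator for the demo (an assumption, not the md5sid above):
# a per-process counter makes the text fed to MD5 differ on every call.
my $counter = 0;
my $demo_generator = sub { md5_hex(join ':', time(), $$, $counter++) };

my $collisions = count_collisions($demo_generator, 50_000);
print "collisions in 50,000 IDs: $collisions\n";
```

To exercise a real generator, pass a reference to it instead, for example `count_collisions(\&md5sid, 50_000)`.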

Replies are listed 'Best First'.
Re: MD5-based Unique Session ID Generator
by stvn (Monsignor) on Aug 19, 2004 at 14:21 UTC

    I would think hostname is a pretty hefty operation for generating a session ID. I'm not sure, but I think it does a DNS lookup.

    The ID is based on hostname, time, and some pseudo-random data. I've run a test with this to generate 50,000 IDs as fast as possible and check for collisions -- I didn't get any.

    I use this for session ids (which I took from one of the Apache::Session modules)

    use Digest::MD5 qw(md5_hex);
    $session_id = substr(md5_hex(md5_hex(time() . {} . rand() . $$)), 0, 32);

    I ran it within the same process over 100,000 times with no collisions.

    This is sort of slow, but strong. Reducing the param for rand() will speed things, but make collisions more likely.

    I am no crypto expert, but from what I know, it's not really any stronger than if you didn't do it this way. Using MD5 and different text each time, it is highly unlikely that you will actually find a collision; that is just the nature of MD5 and hashing algorithms in general.

    -stvn
      Only the middle part of your expression, (time() . {} . rand() . $$), helps make the session IDs unique.

      pelagic

        Very true, but the double md5_hex() doesn't hurt (as far as I know).

        As I said, I am no crypto expert, and my knowledge of these things is limited. But I would think that hashing a reasonably unique string to produce a pretty-darn-close-to-unique string, and then hashing it again to get (what I would assume is) an even closer to truly unique string, is a good thing when generating session IDs. If I am wrong, though, and the double hash provides no benefit, please let me know why, as I would be interested in knowing.

        -stvn
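
One way to look at the double-hash question: md5_hex is a deterministic function, so a second pass cannot pull apart two inputs that were already identical. The sketch below only demonstrates that property (plus the fact that an MD5 hex digest is already 32 characters long, so the substr(..., 0, 32) keeps the whole digest). It is an illustration, not a cryptographic analysis, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my $input = 'same input';

# Hash the same text twice, once with a single pass and once with two passes.
my $once_a  = md5_hex($input);
my $once_b  = md5_hex($input);
my $twice_a = md5_hex(md5_hex($input));
my $twice_b = md5_hex(md5_hex($input));

print "single hash repeats: ", ($once_a  eq $once_b  ? "yes" : "no"), "\n";
print "double hash repeats: ", ($twice_a eq $twice_b ? "yes" : "no"), "\n";
print "digest length: ", length($twice_a), "\n";
```

If two calls ever build the same string from time(), {}, rand() and $$, both the single and the double hash will return the same ID for them.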
      I hadn't thought of doubling the md5_hex operations -- nice tip, thank you.
      I am no crypto expert, but from what I know, it's not really any stronger than if you didn't do it this way. Using MD5 and different text each time, it is highly unlikely that you will actually find a collision; that is just the nature of MD5 and hashing algorithms in general.
      It's not MD5 use that causes issues -- it's the random data that one is hashing. If the text is always different, great -- but on systems with poor PRNGs (Win2k springs to mind), I have gotten MD5 collisions because the outputs weren't random enough: MD5 the same text twice, and you get the same digest each time. With the same algorithm as above, except s/2345678/2345/, I had 11 collisions in 20,000 generated sessions. Not Good™.

      Again, though, I will have to try your much faster (and shorter) method and see if I get good results with a poor PRNG -- thanks!
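
The failure mode described here can be imitated on any system by reseeding Perl's PRNG, which forces it to repeat its output. The sketch below is only a simulation of a weak PRNG, not a reproduction of the Win2k behaviour; `random_only_id` and `random_plus_counter_id` are hypothetical helpers, and the counter is one possible mitigation rather than anything proposed in the thread.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: an ID built from nothing but PRNG output.
sub random_only_id {
    my ($length) = @_;
    my $text = join '', map { chr(int(rand(223)) + 32) } 1 .. $length;
    return md5_hex($text);
}

# Hypothetical helper: the same random text plus a value that never repeats
# within this process.
my $serial = 0;
sub random_plus_counter_id {
    my ($length) = @_;
    my $text = join '', map { chr(int(rand(223)) + 32) } 1 .. $length;
    return md5_hex($text . ':' . ++$serial);
}

# Reseeding with the same value makes rand() repeat itself exactly.
srand(42);
my $first  = random_only_id(64);
srand(42);
my $second = random_only_id(64);

srand(42);
my $third  = random_plus_counter_id(64);
srand(42);
my $fourth = random_plus_counter_id(64);

print "random only:    ", ($first eq $second ? "collision" : "distinct"), "\n";
print "random+counter: ", ($third eq $fourth ? "collision" : "distinct"), "\n";
```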

        Again, though, I will have to try your much faster (and shorter) method and see if I get good results with a poor PRNG -- thanks!

        Just FYI, see my reply/discussion above with pelagic regarding the use of the added "{}". This bit of it may or may not provide any benefit.

        -stvn
Re: MD5-based Unique Session ID Generator
by dragonchild (Archbishop) on Aug 19, 2004 at 14:00 UTC
    How much stronger is this than md5_hex( time, $$, time ) where $$ is spread over 150 Apache child processes?

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      I haven't tested that case, but it really doesn't apply to what this is used for. Please see my update note in the description...
Re: MD5-based Unique Session ID Generator
by simonm (Vicar) on Aug 19, 2004 at 16:20 UTC

    If you don't mind using 128 bits rather than 32, Data::UUID guarantees that you won't get duplicates, ever.

    use Data::UUID;
    use constant IDGenerator => Data::UUID->new();
    sub new_sid { IDGenerator->create() }

    Update: Duh, they're both the same size, 128 bits and 32 hex digits.

      MD5 is 128 bits. It's 32 hex digits.

      I do prefer Data::UUID for this task, myself. Note that while it guarantees uniqueness, it doesn't guarantee unpredictability, which may or may not be a problem for a given application. Most MD5/SHA1/whatever session generators have the same issue.

      "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

      If you don't mind using 128 bits rather than 32, Data::UUID guarantees that you won't get duplicates, ever.

      Thanks! I won't be able to test this immediately, but if it works (and it seems like it will), it will be most helpful.

      One point though, MD5 generates 32 hex digits representing 4 bits each - that's already 128 bits. Sorry if I was unclear about that.
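
The size arithmetic is easy to check with Digest::MD5 itself: the raw digest is 16 bytes, and the hex form spends one digit per 4 bits. The variable names below exist only for this example.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex);

my $raw = md5('example');      # raw binary digest
my $hex = md5_hex('example');  # the same digest written as hex text

my $bits_from_bytes = length($raw) * 8;  # 8 bits per byte
my $bits_from_hex   = length($hex) * 4;  # 4 bits per hex digit

printf "%d bytes, %d hex digits, %d bits\n",
    length($raw), length($hex), $bits_from_hex;
# prints "16 bytes, 32 hex digits, 128 bits"
```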

      Data::UUID guarantees that you won't get duplicates, ever.

      While Data::UUID is a good solution, it doesn't guarantee that "you won't get duplicates, ever" (heck - there are only 128 bits after all :-)

      As the docs say...

      A UUID is 128 bits long, and is guaranteed to be different from all other UUIDs/GUIDs generated until 3400 CE.

      ...

      It provides reasonably efficient and reliable framework for generating UUIDs and supports fairly high allocation rates -- 10 million per second per machine -- and therefore is suitable for identifying both extremely short-lived and very persistent objects on a given system as well as across the network.

      So, it wouldn't be suitable if you were coding something up for The Long Foundation - or needed to allocate UUIDs really, really, really quickly :-)

      The full gory detail can be found in this IETF draft.

      All this complexity is, of course, why I like Data::UUID. People who are experts have taken the time to look hard at the algorithm, and I can have some confidence in it working well.

        While Data::UUID is a good solution, it doesn't guarantee that "you won't get duplicates, ever" (heck - there are only 128 bits after all :-) ... A UUID is 128 bits long, and is guaranteed to be different from all other UUIDs/GUIDs generated until 3400 CE.

        I didn't say "it will never produce duplicates" -- just that "YOU won't get duplicates" (unless you live for over a thousand years).

Re: MD5-based Unique Session ID Generator
by guha (Priest) on Aug 19, 2004 at 14:22 UTC

    I'm definitely not an expert on cryptos and related issues, but the loop looks suspicious in my eyes.

    Do you realize that you push anywhere between zero and 2 Mbytes through the MD5 routine? No wonder that it sometimes, I guess, takes time to generate a key.

      Do you realize that you push anywhere between zero and 2 Mbytes through the MD5 routine? No wonder that it sometimes, I guess, takes time to generate a key.

      Considering that, it's actually remarkably speedy. If I'm just generating one key at a time, it is basically instantaneous. 50,000 keys took about 5 minutes (not bad, all things considered). In the application I have (see Update, please), I'm not generating more than 50 keys in a given 1s interval, but they absolutely must not duplicate.

      Still, I will be exploring some of the tips in this thread regarding faster ways to accomplish the same thing; hopefully I will remember to update my snippet when I get around to testing them!
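
One detail that may help explain the speed: the loop condition `$i < rand(2345678)` calls rand() again on every pass, so the loop ends the first time a fresh random value falls at or below $i. Under that reading, 2 Mbytes is only a theoretical ceiling and the typical run is far shorter; a back-of-the-envelope estimate puts the average at roughly sqrt(pi * 2345678 / 2), on the order of a couple of thousand bytes. The sketch below just counts passes of the same loop without hashing anything; `loop_iterations` is a hypothetical helper written for this check.

```perl
use strict;
use warnings;

# Hypothetical helper: run the same loop header as md5sid and count how many
# times the body would execute (one byte hashed per pass in the original).
sub loop_iterations {
    my ($limit) = @_;
    my $count = 0;
    for (my $i = 1; $i < rand($limit); $i++) {
        $count++;
    }
    return $count;
}

my $runs = 200;
my ($total, $max) = (0, 0);
for (1 .. $runs) {
    my $n = loop_iterations(2_345_678);
    $total += $n;
    $max = $n if $n > $max;
}
my $average = $total / $runs;
printf "average bytes hashed: %.0f, largest run: %d\n", $average, $max;
```

Picking the length once before the loop (for example `my $n = int(rand(2345678));`) would give the zero-to-2-Mbyte spread instead, at a correspondingly higher cost per ID.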

Re: MD5-based Unique Session ID Generator
by pelagic (Priest) on Aug 19, 2004 at 14:12 UTC
    Why do you use time 2 times in your list?
    It will be the same both times.

    pelagic
      Why do you use time 2 times in your list? It will be the same both times.

      I assume you are referring to dragonchild's code, since the OP doesn't have time in there twice.

      It will not matter if the time is the same; the idea is to generate a (sorta) unique string, and it will do that. Once put through md5_hex, it won't much matter after that. MD5 will give you the true uniqueness; all you really need is a bit of entropy to get it started.

      -stvn
        Adding "time" a second time does not make the string any more unique than using "time" just once.
        It makes the theoretical entropy higher, but that's not the target here, as we are not defending against hackers. We just want to avoid collisions. The uniqueness of the IDs must be achieved before feeding them through MD5.

        pelagic
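
The point about uniqueness having to exist before the MD5 step can be shown with dragonchild's expression: two calls made in the same second by the same process hash identical text, whether time appears once or twice. The fixed `$now` below stands in for two calls landing in the same second, and the counter is just one hypothetical way to make the inputs differ; neither is taken from the thread.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Freeze the clock value to imitate two calls within the same second.
my $now = time();

# dragonchild's form: md5_hex concatenates its arguments before hashing.
my $id_a = md5_hex($now, $$, $now);
my $id_b = md5_hex($now, $$, $now);

# Same second, same process, but with a per-process counter in the input.
my $serial = 0;
my $id_c = md5_hex(join ':', $now, $$, ++$serial);
my $id_d = md5_hex(join ':', $now, $$, ++$serial);

print "time twice:   ", ($id_a eq $id_b ? "collision" : "distinct"), "\n";
print "with counter: ", ($id_c eq $id_d ? "collision" : "distinct"), "\n";
```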
        If we're talking about getting entropy, why don't we go with a better entropy source than the minor disparity between the two calls to time, which at MOST will vary by one digit and is not very entropic? Why don't you just call hotbits and grab some radioactive decay data in hex format, break it apart, and loop over it to give us some real entropy? That WILL decidedly minimize the chance of collisions. Since you're already acting on data returned by Sys::Hostname, this should be right up the alley of what you're doing.
