Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

the -other- shuffling question

by mstone (Deacon)
on Jun 13, 2005 at 02:55 UTC ( [id://466040]=perlquestion: print w/replies, xml ) Need Help??

mstone has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone know of a good analysis on shuffled random number generators?

I'm not talking about things like the Fisher-Yates shuffle. I'm talking about the idea mentioned -- almost as an afterthought -- in Seminumerical Algorithms: you seed the output of an RNG into an array, then use the RNG's sequence to select and replace items from the array. It does wonders for the RNG's period, as well as improving its spectral behavior.

Sample code is below the fold.

To demonstrate the idea, let's start with a crappy little LCRNG whose behavior is easy to see:

my $M = 8; my $A = 5; my $X = 0; sub lcrng { $X = (($X * $A) + 1) % $M; return ($X); }

Its period is 8, and the output sequence is: 1 6 7 4 5 2 3 0. Larger generators work exactly the same way, they just use bigger values for $M and $A.

LCRNGs are the workhorses of non-crypto random number generation, but they're too weak for applications that need to be secure. It's too easy to extract $M and $A from a sequence of output values. If you take the items N at a time and graph them in an N-dimensional space, the resulting points tend to fall 'on the lines'. The two-dimensional graph for the sequence above looks like this, for instance:

+---------------+ | + | | + | | + | | +| | + | | + | |+ | | + | +---------------+

To shuffle the sequence, we feed it through an array like so:

my $Y = 0; my @T = (0) x $M; sub shuffle { ($Y, $T[$Y]) = ($T[$Y], lcrng()); return ($Y); }

The double-assignment there is basically a one-step Fisher-Yates shuffle.. we replace $Y with an item from the un-shuffled part of @T (i.e.: all of it), and replace the selected item with a new value from lcrng().

This new sequence performs.. quite a bit better than lcrng(). There's a little turbulence as the array loads itself, then the output runs about 2200 numbers before settling into a periodic sequence 60,440 numbers long. And here's the graph for 16 numbers from the periodic sequence:

+---------------+ | + +| |+ + + | | + + | | + + | | + +| | + + | |+ | | + + | +---------------+

It's much less tidy than the previous version.

If we increase $M to 16, the period of the repeating sequence weighs in at 55,470,480. Not bad for what are arguably two of the worst RNGs out there.

Thing is, those numbers fall far below the theoretical capacity of the system. We have $M possible values coming out of lcrng(). We have $M stored values in the array, and another $M possible values stored in $Y. In theory, that means there are $M^($M+1) possible configurations for the system. For $M=8, that means 8^9 (2^27) (134,217,728) is the longest possible period, and for $M=16 it's 16^17 (2^68) (2.95e20).

Obviously, there are configurations this version can never reach. Given the short period of lcrng() for $M=8, there's no way to reach the configuration:

$Y == 1 @T == ( 1, 1, 1, 1, 1, 1, 1, 1 )

and there will also be inefficiencies as some configurations fall into orderly patterns which don't use the entire array.

Thing is, I've only been able to find a handful of papers that even mention shuffling, and none of those offer anything in the way of analysis. Given the huge boost to performance and the low cost of implementation, I should think someone has devoted some head-time to the subject.

Does anyone know of any resources on the subject, or is this just one of the dead-ends that no one has ever bothered to consider?

Replies are listed 'Best First'.
Re: the -other- shuffling question
by kaif (Friar) on Jun 13, 2005 at 04:07 UTC

    I personally have not heard of this approach to "extending" random number generators. However, it is possible that this extension is not used because it is unclear whether or not many of the "randomness" properties hold using this extension.

    For example, is the resulting sequence "well-distributed"? In the case of the transformed $M=8 sequence, it is. Are consecutive pairs of the resulting sequence well-distributed? In the case of the same sequence, no they are not (for example, two identical consecutive digits is 20% less likely than it should be). In this particular case, this is perhaps expected, because the original sequence didn't have this property. But in general, if we take a sequence that has some "good randomness property", will the transformed sequence have the property too? In short, period length isn't always the only thing to consider.

      As I understand it, shuffling doesn't decrease any desirable qualities. It only leaves them unchanged or increases them. It's vaguely like a sorting algorithm that preserves existing order.. sorting on B then sorting on A gives you a sequence sorted with A as the major key and B as the minor key. In this case, shuffling took a sequence that had no consecutive doubles and introduced a few.

      If there are weaknesses, I'd like to know about those, too. I've found that some values for array size perform very badly (@T=7 leads to a period of 424, despite being relatively prime to $M), and adding a second shuffle doesn't seem to do much of anything. And clocking the generator:

      $M = 8; $A = 5; $X = 0; $C = 0; sub lcrng { $X = (($X * $A) + 1) % $M; $C++; if ($C = $M-1) { $C = 0; $X = lcrng(); } return ($X); }

      frankly sucks.

      This all follows Knuth's admonition that RNGs are delicate creatures, and tend to be sensitive to 'minor' adjustments. I'd like to learn the theory behind these things so I know what's safe to touch.

Re: the -other- shuffling question
by jdhedden (Deacon) on Jun 13, 2005 at 13:15 UTC
    What you want is the Mersenne Twister. I've had an interest in RNGs for years, and this is the best I've seen (and probably will be the best for some time to come). There are two CPAN modules that implement it: Math::Random::MT (which appears to be the lastest implementation), and Rand::MersenneTwister.

    With Math::Random::MT, the best starting seed would consist of 624 long ints. You can use Net::Random to get the random starting seed from Random.org using the URL:
    http://www.random.org/cgi-bin/randbyte?nbytes=2496&format=h

    There's always one more bug.

      I'm familiar with the MT.. I have Matsumoto & Nishmura's paper on it right here, in fact. Yes, it has a huge period (on the order of 2^19937), its spectral distribution is good up to 263 dimensions, and the analysis looks great. But its mechanism is totally different from the one I'm talking about here.

      I'm not looking for a long-period RNG per se. If I want one of those, yeah, I'll probably use the Twister. The itch I'm trying to scratch here is theoretical. This particular machine strikes me as interesting, and I want to learn what makes it tick.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://466040]
Approved by davidrw
Front-paged by neniro
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (1)
As of 2024-04-19 00:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found