http://qs321.pair.com?node_id=1131315


in reply to Random shuffling

The "element" I am identifying in the genome has two ends. These two ends are each around 50-100 long. The separation between these ends is between 200 and 20,000. The length of the ends, and the separation between them is NOT known a priori. So, I wonder if there are two size ranges or periodicities whose non randomness needs to be broken down, with two successive shuffles, one in the ~ 100 length window and again in a ~1000 length window? Or some other window lengths?

No.

Let's start with the minimum length:

50 + 200 + 50 = 300 characters. There are 300! = 3.0605751221644063603537046129727e+614 possible outcomes from the shuffle.

For one of your 50-char headers or trailers to have survived the shuffle intact -- i.e. to have either remained where it was, or been 'reconstructed' at some other position -- the calculation goes something like this: at any given position, there are 50! = 3.0414093201713378043612608166065e+64 possible ways the original 50 chars could have been rearranged; which makes the probability that an exact 50-char sequence re-appears somewhere after one shuffle something like 50! / 300! = 9.9373784297785355338568640895281e-551. Which is as close to 'impossible' as you could wish for.
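
If you want to sanity-check those magnitudes yourself, a quick log-scale sketch in plain Perl reproduces them; no big-number modules are needed, since only the exponents matter, and the log10_fac helper exists only for this post:

#! perl -slw
use strict;

# log10(n!) via a simple sum -- precise enough for order-of-magnitude checks.
sub log10_fac {
    my $n = shift;
    my $sum = 0;
    $sum += log( $_ ) / log( 10 ) for 2 .. $n;
    return $sum;
}

printf "log10(  50! )     = %.2f\n", log10_fac(  50 );   # ~  64.48, i.e.  50! ~ 3.04e64
printf "log10( 300! )     = %.2f\n", log10_fac( 300 );   # ~ 614.49, i.e. 300! ~ 3.06e614
printf "log10( 50!/300! ) = %.2f\n",
       log10_fac( 50 ) - log10_fac( 300 );               # ~ -550.00, i.e. ~ 9.94e-551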

And that's for the shortest lengths. For your average lengths, you are talking about all the computers that have ever existed, do exist, and will ever exist, processing flat out until the heat death of the Universe.

Of course, it could also happen tomorrow -- that's the nature of randomness -- but duplicating the shuffle doesn't make it any more or less likely to happen.

understand any differences between List::Util qw(shuffle) and Array::Shuffle qw(shuffle_array)

They both use the Fisher-Yates method. But ...

List::Util::shuffle takes the array, flattens it to a list on the stack, shuffles that list on the stack, and returns it; the result is then assigned back to an array.

Array::Shuffle takes a reference to the array and shuffles it in place.

The latter is faster because it avoids the array-to-list and list-to-array conversions at either end.

Either -- over a suitable sample size -- will produce similar results. The latter gets you there more quickly.
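
If you want to see the gap on your own hardware, a rough Benchmark sketch along these lines should show it. It assumes Array::Shuffle is installed and that shuffle_array takes the array directly and shuffles it in place; adjust the call to your version's documented interface if it differs:

#! perl -slw
# Rough timing sketch: list-copy shuffle vs in-place shuffle.
# ASSUMES Array::Shuffle is installed and shuffle_array( @array ) shuffles in place.
use strict;
use Benchmark qw[ cmpthese ];
use List::Util qw[ shuffle ];
use Array::Shuffle qw[ shuffle_array ];

my @data = 1 .. 100_000;

cmpthese( -3, {
    list_util     => sub { @data = shuffle @data  },   # array -> list -> shuffle -> reassign
    array_shuffle => sub { shuffle_array( @data ) },   # shuffle the array where it sits
} );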



Re^2: Random shuffling
by Anonymous Monk on Jun 21, 2015 at 02:31 UTC

    Thank you for that clarification. I think I now realize I might have missed another point that is also important, so please look out for UPDATE 2 under the original post.

      UPDATE 2: The identification of the two ends is NOT deterministic, but based on finding a pattern, where each character in the pattern could be A or T or G or C, or some combination of those 2, 3 or all 4 letters. When you search for a perfect match and even 1 character gets randomly shuffled, you cannot detect it any more -- that is straightforward, right? But here we are NOT looking for JUST perfect matches; there is degeneracy at each matching position, so that when you shuffle out one letter and replace it with another, it might still be a match, albeit a different one that "scores" better or worse. That makes it a bit more complicated IMO, doesn't it? This is the actual scenario. I had forgotten to mention this earlier, apologies! How does this change things when it comes to the shuffling and periodicities I was referring to in Update 1? THANK YOU! :)

      Does it change the probabilities: yes.

      Does it mean that shuffling twice is better than shuffling once: no.

      Any mix that can result from shuffling twice could also result from shuffling once, and vice versa. So for every time you would get a false hit after the first shuffle and not after a second, there is another case where you wouldn't get a false hit after the first shuffle and would after a second.

      I'll try to word that a different way: for any given set of data, there are a huge number of possible reorderings, a (relatively) small number of which would contain a false hit. Whether you shuffle once or 10 times, the odds of producing one of those arrangements that contains a false hit remain exactly the same.

      The more fuzziness you allow, the higher the proportion of reorderings that will contain a false hit. But with 3 or 4 mismatches allowed in a 50-char pattern, the ratio remains vanishingly small -- see the rough numbers below.
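
      To put a rough number on 'vanishingly small': counting only exact positions (real degenerate positions widen the numerator somewhat, but not the conclusion), the strings within k mismatches of one fixed 50-char pattern number sum( C(50,i) * 3**i ) for i = 0..k, out of 4**50 possible strings. A quick sketch -- the choose helper exists only for this post:

#! perl -slw
# How many 50-char DNA strings sit within $k mismatches of one fixed pattern,
# versus the 4**50 possible strings? Plain Perl; doubles are plenty here.
use strict;

sub choose {                     # n choose r
    my( $n, $r ) = @_;
    my $c = 1;
    $c = $c * ( $n - $_ + 1 ) / $_ for 1 .. $r;
    return $c;
}

my( $len, $k ) = ( 50, 4 );

my $near = 0;
$near += choose( $len, $_ ) * 3 ** $_ for 0 .. $k;   # $_ mismatched positions, 3 wrong letters each

my $all = 4 ** $len;

printf "within %d mismatches: %.3g of %.3g strings => ratio %.3g\n",
       $k, $near, $all, $near / $all;                # ~ 1.9e7 of ~ 1.3e30 => ~ 1.5e-23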

      But statistically, there is still no benefit at all to multiple shuffles.

      If my explanation isn't clear enough, perhaps running this might convince you:

#! perl -slw
use strict;
use Data::Dump qw[ pp ];
$Data::Dump::WIDTH = 1000;
use List::Util qw[ shuffle ];

my @data = 'a' .. 'd';

my %once;
++$once{ join '', shuffle @data } for 1 .. 1e6;
pp \%once;

my %twice;
++$twice{ join '', shuffle shuffle @data } for 1 .. 1e6;
pp \%twice;

my %ten;
++$ten{ join '', shuffle shuffle shuffle shuffle shuffle shuffle shuffle shuffle shuffle shuffle @data } for 1 .. 1e6;
pp \%ten;

__END__
C:\test>shuffleStats.pl
{ abcd => 41717, abdc => 41646, acbd => 41468, acdb => 41646, adbc => 41673, adcb => 42050, bacd => 41883, badc => 41624, bcad => 41523, bcda => 41667, bdac => 41775, bdca => 41282, cabd => 41674, cadb => 41598, cbad => 41587, cbda => 41892, cdab => 41650, cdba => 41706, dabc => 41452, dacb => 41859, dbac => 41638, dbca => 41895, dcab => 41600, dcba => 41495 }
{ abcd => 41481, abdc => 41541, acbd => 41422, acdb => 41699, adbc => 41601, adcb => 41502, bacd => 41610, badc => 42086, bcad => 41860, bcda => 41864, bdac => 41537, bdca => 41770, cabd => 41669, cadb => 42034, cbad => 41649, cbda => 41568, cdab => 41802, cdba => 41802, dabc => 41745, dacb => 41742, dbac => 41405, dbca => 41441, dcab => 41628, dcba => 41542 }
{ abcd => 41723, abdc => 41512, acbd => 41613, acdb => 41633, adbc => 41587, adcb => 41547, bacd => 42015, badc => 41615, bcad => 41706, bcda => 41752, bdac => 41903, bdca => 41539, cabd => 41306, cadb => 42037, cbad => 41673, cbda => 41579, cdab => 41767, cdba => 41582, dabc => 42219, dacb => 41463, dbac => 41228, dbca => 41659, dcab => 41892, dcba => 41450 }

      No matter how many times you shuffle the data, statistically the results remain identical. Do stddev, chi-squared or any other analysis you like, and the results will be the same. Use more data, more repetitions, more shuffles; they will remain the same.
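
      For instance, a bare-bones chi-squared check against a uniform expectation looks like this; the chi_squared helper is named here just for illustration, and the snippet is meant to be pasted above the __END__ line of the script above, where the %once, %twice and %ten hashes are still in scope:

# A bare-bones chi-squared statistic against a uniform expectation.
sub chi_squared {
    my $counts   = shift;                 # hashref: permutation => count
    my @obs      = values %$counts;
    my $total    = 0;  $total += $_ for @obs;
    my $expected = $total / @obs;
    my $chi2     = 0;
    $chi2 += ( $_ - $expected ) ** 2 / $expected for @obs;
    return $chi2;                         # 23 degrees of freedom here, so ~23 is about as uniform as it gets
}

printf "chi2( once  ) = %.2f\n", chi_squared( \%once  );
printf "chi2( twice ) = %.2f\n", chi_squared( \%twice );
printf "chi2( ten   ) = %.2f\n", chi_squared( \%ten   );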

