Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re^2: Opening random files then reading random lines from file.

by JavaFan (Canon)
on Apr 27, 2012 at 09:18 UTC ( [id://967563]=note: print w/replies, xml ) Need Help??


in reply to Re: Opening random files then reading random lines from file.
in thread Opening random files then reading random lines from file.

Exactly how this works is complicated.
No, it isn't. How it works is as follows: we read a file, line by line. For the kth line read, we decide with probability 1/k whether we remember the line, forgetting any previous one.

That's fairly trivial. It takes just a little analysis to see it's fair. Fair means that any line will be the final pick. So, what's the probability of line k to be the final pick? First what needs to be happen is that it gets remembered when the line is read, which has a probability of 1/k. Then, any next lines must *not* be picked. For each of the next lines, this happens with probability (j -1)/j, where j is the line number. Since all those probabilities are independent of each other, we can multiply them:

           n
        -------
    1    |   |    j-1        1     k      1 
   ---   |   |    ---   ==  --- * --- == ---
    k    |   |     j         k     n      n
         j=k+1
Too bad we can't really do math in HTML.
  • Comment on Re^2: Opening random files then reading random lines from file.

Replies are listed 'Best First'.
Re^3: Opening random files then reading random lines from file.
by Marshall (Canon) on Apr 27, 2012 at 11:56 UTC
    I think you made my point that it is "complicated"
    (not easily explained and understood)..
      You and I must have a wildly different idea what is complicated.

      If you find a one-liner using a bog standard file iteration loop complicating, I wonder what you think of sort.

        1/k is certainly not complicated in the least.

        The proof that it is correct is trivial for a proof.

        The only part I can see that could be called complex is the fact that the algorithm is not immediately obvious and needs a proof. (Even if just an engineering "proof"-by-example.)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://967563]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (3)
As of 2024-04-20 01:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found