PerlMonks  

Re^8: Random sampling a variable length file.

by BrowserUk (Patriarch)
on Dec 27, 2009 at 11:41 UTC


in reply to Re^7: Random sampling a variable length file.
in thread Random sampling a variable record-length file.

My intuition wants to say that if there is no correlation between the lengths of adjacent records, then it doesn't matter that you are selecting records that follow long records preferentially, because following long records doesn't correlate with anything. Put another way, if all of your records have an equal chance of following a long record (or more generally, any other particular record), then the sampling method is as valid as any other.

Thank you! That's what my intuition is telling me. I was hoping one of the math guys around these parts (the set of whom you may or may not be a member, I have no way of knowing :) would be able to put some semi-formal buttressing behind that intuition.

But in the absence of that, the fact that at least one other person has a similar intuition and can define the logic for it in their own words, and that no strong counter-argument has been stated, gives me a good enough feeling to make it worthwhile pursuing it to the next level, i.e., coding up something crude and attempting to define a test scenario to substantiate it.

Any thoughts on a test scenario that might avoid the mistake of inherently confirming what I'm looking for?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^9: Random sampling a variable length file.
by bellaire (Hermit) on Dec 27, 2009 at 13:35 UTC
    Not sure. You'd probably need some way to estimate whether your sampling distribution is uniform with respect to the index of the sample. Also, you could see whether the average length of the records in your sample jibes with the average length of records in the entire population.
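The two checks suggested above could be sketched roughly as follows. This is only an illustration in Python, not code from the thread: the function names are hypothetical, and it assumes the test harness generates the file itself, so that every record's index and length are known independently of the sampler under test.

```python
from statistics import mean

def index_bucket_counts(sample_indices, n_records, n_buckets=10):
    """Count how many sampled record indices fall into each equal-width bucket."""
    counts = [0] * n_buckets
    for i in sample_indices:
        counts[i * n_buckets // n_records] += 1
    return counts

def chi_square_uniform(counts):
    """Chi-square statistic of the bucket counts against a flat expectation."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

def mean_length_ratio(sample_indices, lengths):
    """Mean record length in the sample divided by the population mean."""
    return mean(lengths[i] for i in sample_indices) / mean(lengths)
```

A small chi-square value and a length ratio close to 1 would be consistent with an unbiased sample. Because the known indices and lengths come from the generator rather than from the sampler, the check does not lean on the method it is trying to validate.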

    My other thoughts overnight had to do with the pathological case presented by bobf:

    • To avoid the scenario where you pick the same record 90% of the time if one record is 90% of the file, you need to avoid already-selected records.
    • To give the large record itself a fair chance of being selected, you need to perform the wrapping suggested by bcrowell2, that is, selecting the first record if you land inside the last.

    Taken together, these make even the extreme case just as amenable to this method as any other. If you remember which records you've hit and do not re-sample them, you're simply omitting a segment of the number line from a uniform distribution. The distributions on either side are still uniform, i.e., random.

    So even if you are hitting the big record 90% of the time, you ignore it after the first time, and then the other 10% of the hits select records as normal. Since any record at all can follow the 90% length record, that's fair. And since the length of the last record has nothing to do with the length of the first, the first record has the same likelihood of being selected as any other.
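To make the extreme case concrete, here is a small sketch of the first-draw arithmetic. It assumes the scheme as described in this thread (a uniformly random byte offset selects the record after the one it lands in, wrapping from the last record to the first); the function name and the example lengths are made up for illustration.

```python
from fractions import Fraction

def first_pick_probabilities(lengths):
    """Chance that each record is chosen on a single draw.

    Record i is chosen when the random byte lands in record i - 1;
    index -1 is the last record, which gives the wrap to the first.
    """
    total = sum(lengths)
    return [Fraction(lengths[i - 1], total) for i in range(len(lengths))]

# One record holds 90% of a 100-byte file.
probs = first_pick_probabilities([90, 2, 3, 5])
```

With these lengths the record after the big one is chosen with probability 9/10 on a given draw, and the big record itself with probability 1/20 (the share of the last record). Which record gets the 9/10 depends only on what happens to follow the big one, which is the point made above.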
      Taken together, these make even the extreme case just as amenable to this method as any other. If you remember which records you've hit and do not re-sample them, you're simply omitting a segment of the number line from a uniform distribution. The distributions on either side are still uniform, i.e., random.

      Thank you again! That makes a great deal of sense.

      My first reaction was that remembering whether I had already picked a record was an awkward prospect, given that I only have the byte position and no knowledge of how long the record is. Then it dawned on me that querying the file offset once I've read the partial record makes for a perfect signature.
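A minimal sketch of how these pieces might fit together, written in Python rather than Perl purely for illustration. It assumes newline-terminated records in a file opened in binary mode; `sample_records`, its parameters, and the `max_tries` guard are hypothetical and not from the thread.

```python
import io
import random

def sample_records(f, size, k, rng=random, max_tries=None):
    """Pick k distinct records from binary file f, which is `size` bytes long."""
    if max_tries is None:
        max_tries = 100 * k          # arbitrary guard against k > record count
    seen = set()                     # start offsets of records already picked
    picked = []
    for _ in range(max_tries):
        if len(picked) == k:
            break
        f.seek(rng.randrange(size))  # uniformly random byte position
        f.readline()                 # discard the partial record we landed in
        start = f.tell()             # offset of the next record: its signature
        if start >= size:            # landed in the last record: wrap to first
            start = 0
            f.seek(0)
        if start in seen:            # already selected, so draw again
            continue
        seen.add(start)
        picked.append((start, f.readline()))
    if len(picked) < k:
        raise RuntimeError("could not find k distinct records")
    return picked

data = b"a\nbb\nccc\n"
result = sample_records(io.BytesIO(data), len(data), 3, random.Random(1))
```

The set of start offsets is the "signature" idea: two hits anywhere inside the same record resolve to the same following offset, so no record lengths need to be known in advance.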

