Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re^4: Random sampling a variable length file.

by BrowserUk (Patriarch)
on Dec 26, 2009 at 18:39 UTC ( [id://814438]=note: print w/replies, xml ) Need Help??


in reply to Re^3: Random sampling a variable length file.
in thread Random sampling a variable record-length file.

  1. There is no meaningful correlation in the ordering of the records.

    The problem with the picking random (byte) positions, is that with varible length records, longer records have a greater chance of being picked than shorter ones.

    But maybe that is negated to some extent because you would be using the next record--which might be longer or shorter--rather than the one picked?

  2. The file is static. It is only processed once.
  3. It is often huge. Time is of the essence.
  4. Reading the whole file to pick a sample negates the purpose of picking a sample.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
  • Comment on Re^4: Random sampling a variable length file.

Replies are listed 'Best First'.
Re^5: Random sampling a variable length file.
by bobf (Monsignor) on Dec 26, 2009 at 22:05 UTC

    Disclaimer: I am not a statistician. I don't even play one on TV.

    The problem with the picking random (byte) positions, is that with varible length records, longer records have a greater chance of being picked than shorter ones.
    But maybe that is negated to some extent because you would be using the next record--which might be longer or shorter--rather than the one picked?

    I had the same concern. Intuitively, I do not see how the bias would be negated by reading the next record, since there is a positional dependence between the two. For example, if one record was 90% of the entire file, then seeking to a random position in the file would result in landing in that record about 90% of the time and whatever record followed it would be chosen each time.

    If you want a random sample from the count of records, it may be difficult to use a selection method that is based on length.

    The file is static. It is only processed once.

    Would it be possible to generate an index in parallel with the creation of the file? If not, would it be possible to scan the file for record delimiters as a pre-processing step to generate the index? A list of offsets would be sufficient to accomplish this task and the approach would be very straightforward (think maintenance).

      Would it be possible to generate an index in parallel with the creation of the file?

      No. The application is a generalised file utility aimed at text-based records. Think csv, tsv, log files etc.

      If not, would it be possible to scan the file for record delimiters as a pre-processing step to generate the index?

      No. Because creating an offset index requires reading the entire file and negates the purpose of taking a random sample.

      I do not see how the bias would be negated by reading the next record, ... For example, if one record was 90% of the entire file, then seeking to a random position in the file would result in landing in that record about 90% of the time and whatever record followed it would be chosen each time.

      Agreed. In extremis, it doesn't.

      But in the general case, the application is for huge files (many GBs), with relatively short records (10s to 100s of bytes), and record length variations of a (say) maximum of 50% of the maximum, typically less.

      The (unsupported) notion is, that as the length of the next (picked) record is uncorrolated to the length of the record containing the picked position, the bias is reduced if not eliminated. Ie. The probability of a given record being picked is not corrolated to its length.

      I realise that it is corrolated to the length of its positional predecessor, but does that matter?

      If you have a 10GB file containing 100e6 records that average 100 bytes +- 50, then the maximum probability of a record being picked is 0.0000015%; and the minimum 0.0000005%. Is that difference enough to invalidate using (say) 1000 random (byte positions), to choose a representative sample of the entire file?

      The type of information being inferred (estimated) from the sample:

      • min/ave/max record length.
      • The number of records within the file.
      • The ascii-betical (or numerical) distribution of some field (or fields) within the file.

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        creating an offset index requires reading the entire file and negates the purpose of taking a random sample
        Yes, creating the index would require reading the whole file (although I would imagine this could be done very quickly in C). If the point of the random sample was to avoid reading the whole file, then I would also agree that creating the index would negate the purpose of the random sample. However, without trying to split hairs or begin a thread that in the grand scheme of things is moot, I would contend that creating an index may be important in other use cases (not necessarily yours)*. For example, if a completely unbiased random sample were needed, or if performing the statistical calculation(s) on the whole file would be prohibitive (so the cost of creating the index was small in comparison), etc.

        for huge files (many GBs), with relatively short records (10s to 100s of bytes), and record length variations of a (say) maximum of 50% of the maximum
        Ah, this helps. In this case I would expect the bias introduced by the variation in the length of the record to be relatively small.

        Is that difference enough to invalidate using (say) 1000 random (byte positions), to choose a representative sample of the entire file?
        The answer to that question is dependent on the use case. If you are not bothered by it, I certainly won't argue otherwise. :-)

        Given this information, I would try the "seek to a random position" approach. I would also suggest that under these conditions the bias due to record length is probably so small that taking the next record (instead of the one selected by the seek) may not be necessary (although programmatically it may be more convenient to do so).

        In my very non-statistical opinion sampling about 1/10,000th of the file might be enough to infer the average record length, but I would want to know something about the distribution of record lengths before calculating the predicted min and max length because the shape of the curve will impact the number of samples that are required to obtain an estimate with an accuracy below a given threshold (set by you, of course). Therefore, depending on how critical the estimates need to be, you might want to employ an empirical approach that keeps taking samples until the error in the calculated values is below your given threshold.

        *Rationale provided only to provide context of my line of thought.

        You could try a generative approach to answering this question. Generate random data, use your method to sample the data. Then scrutinize your samples to see whether they are, in fact, randomly selected, or if there appears to be a bias.

        My intuition wants to say that if there is no correlation between the lengths of adjacent records, then it doesn't matter that you are selecting records that follow long records preferentially, because following long records doesn't correlate with anything. Put another way, if all of your records have an equal chance of following a long record (or more generally, any other particular record), then the sampling method is as valid as any other.

      That's a good point, bobf. I think there are two possibilities.

      (1) He only needs to get a random sample from this file once.

      (2) He needs to get random samples from this file more than once, and needs each one to be random not only in and of itself but also in the sense of being uncorrelated with the other samples.

      If it's #1, then I think it works to take a random byte position and then read the next record. If it's #2, then he can't use that method, and I think he clearly would be better off creating in index (or using the facilities of a database or filesystem).

Re^5: Random sampling a variable length file.
by bcrowell2 (Friar) on Dec 26, 2009 at 18:43 UTC
    Yeah, if they're uncorrelated, then there's no bias introduced. You may need to special-case the situation where you randomly land on the final record, in which case you have to pick the first record.
Re^5: Random sampling a variable length file.
by eye (Chaplain) on Dec 27, 2009 at 07:02 UTC
    1. There is no meaningful correlation in the ordering of the records.

    How certain are you that this is true? If there is no correlation between any characteristic of interest in a record and the record's position within the file, then taking a sequential sample from an arbitrary location in the file (like the beginning) is entirely unbiased by record size. It's also a very efficient way (computationally, not statistically) to sample the file.

    You ask a number of highly technical questions, like "[h]ow many records should you pick?" Answers to this typically range from rules of thumb to equations for computing a sample size that meets some specification. What to use is highly dependent on what you are trying to do. Meeting regulatory requirements is very different from monitoring operations. Can you say more about what you are trying to do?

      How certain are you that this is true? If there is no correlation between any characteristic of interest in a record

      Simple. I cannot know what will be inside the file! Because the user may apply the process to any file of their choosing.

      So, Just as the polster might discover that the "random selection" they make of the populous, happens to coincidentally consist of the entire membership of some extremist political organisation, they cannot know it will until they take the sample.

      Put another way, there may be all manner of correlations, but none of them are known, and so cannot be utilised.

      So, at this point, the problem is how to take a statistically valid, random sample of records of any file, without resorting to reading the entire file. I've described the inferences to be drawn elsewhere.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        ...Just as the polster might discover that the "random selection" they make of the populous, happens to coincidentally consist of the entire membership of some extremist political organisation, they cannot know it will until they take the sample.

        A valid sample may contain every member of an extremist group or only contain records whose lengths are even numbers. What makes the sample valid is the plan for selecting the sample (and adherence to that plan). Developing a sampling plan is non-trivial. It can be aided by things we know theoretically and by observational data. But sometimes you just get atypical samples.

        In this case, I was confused by your statement in another response that "there is no meaningful correlation in the ordering of the records." This sounds like a positive assertion that there is no correlation based on order. This is quite different from the negative assertion that you don't know whether there are order related correlations.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://814438]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (9)
As of 2024-04-23 09:36 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found