http://qs321.pair.com?node_id=814456


in reply to Re^4: Random sampling a variable length file.
in thread Random sampling a variable record-length file.

Disclaimer: I am not a statistician. I don't even play one on TV.

The problem with picking random (byte) positions is that, with variable-length records, longer records have a greater chance of being picked than shorter ones.
But maybe that is negated to some extent because you would be using the next record--which might be longer or shorter--rather than the one picked?

I had the same concern. Intuitively, I do not see how the bias would be negated by reading the next record, since there is a positional dependence between the two. For example, if one record made up 90% of the entire file, then seeking to a random position in the file would land in that record about 90% of the time, and whatever record followed it would be chosen on each of those occasions.
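To make that dependence concrete, here is a minimal sketch of the seek-and-take-the-next-record approach being discussed, assuming newline-delimited records, a hypothetical file name data.txt, and an arbitrary 10,000 trials (none of which are specified in the thread). Tallying what comes back shows the bias: each record is returned roughly in proportion to the length of the record that precedes it, and the first record in the file can never be returned at all.

    use strict;
    use warnings;

    my $file = 'data.txt';              # placeholder name, not from the thread
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $size = -s $fh;

    my %tally;
    for ( 1 .. 10_000 ) {                   # arbitrary number of trials
        seek $fh, int( rand $size ), 0;     # jump to a uniformly random byte offset
        <$fh>;                              # discard the (probably partial) record we landed in
        my $rec = <$fh>;                    # take the next complete record
        next unless defined $rec;           # landed in the last record: nothing follows it
        $tally{$rec}++;
    }
    close $fh;

    # Records that follow long records show inflated counts.
    printf "%6d  %s", $tally{$_}, $_
        for sort { $tally{$b} <=> $tally{$a} } keys %tally;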

If you want a sample that is uniform across records, it may be difficult to use a selection method that is effectively weighted by record length.

The file is static. It is only processed once.

Would it be possible to generate an index in parallel with the creation of the file? If not, would it be possible to scan the file for record delimiters as a pre-processing step to generate the index? A list of offsets would be sufficient to accomplish this task, and the approach would be very straightforward (think maintenance).
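For what it's worth, the offset-index idea might look something like this, again assuming newline-delimited records, a hypothetical data.txt, and an arbitrary sample size of 100. tell() before each read gives the byte offset at which that record starts; sampling the index uniformly then makes every record equally likely, regardless of its length.

    use strict;
    use warnings;
    use List::Util qw(shuffle min);

    my $file    = 'data.txt';    # placeholder name, not from the thread
    my $samples = 100;           # arbitrary sample size

    open my $fh, '<', $file or die "Cannot open $file: $!";

    # Pre-processing pass: note the byte offset at which every record starts.
    my @offsets;
    my $pos = tell $fh;                  # 0: start of the first record
    while ( my $line = <$fh> ) {
        push @offsets, $pos;             # offset where the record just read began
        $pos = tell $fh;                 # start of the next record
    }

    # Uniform sample over records. Seeking clears the EOF flag, so the same
    # handle can be reused for the reads.
    my $n = min( $samples, scalar @offsets );
    for my $off ( ( shuffle @offsets )[ 0 .. $n - 1 ] ) {
        seek $fh, $off, 0;
        print scalar <$fh>;
    }
    close $fh;

Storing one integer per record keeps the memory cost modest even for a large file, and the same list of offsets could just as easily be written out alongside the file when it is created, as suggested above.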