comment on

creating an offset index requires reading the entire file and negates the purpose of taking a random sample

Yes, creating the index would require reading the whole file (although I would imagine this could be done very quickly in C). If the point of the random sample was to avoid reading the whole file, then I would also agree that creating the index would negate the purpose of the random sample. However, without trying to split hairs or begin a thread that in the grand scheme of things is moot, I would contend that creating an index may be important in other use cases (not necessarily yours)^*. For example, if a completely unbiased random sample were needed, or if performing the statistical calculation(s) on the whole file would be prohibitive (so the cost of creating the index was small in comparison), etc.

for huge files (many GBs), with relatively short records (10s to 100s of bytes), and record length variations of a (say) maximum of 50% of the maximum

Ah, this helps. In this case I would expect the bias introduced by the variation in the length of the record to be relatively small.

Is that difference enough to invalidate using (say) 1000 random (byte positions), to choose a representative sample of the entire file?

The answer to that question is dependent on the use case. If you are not bothered by it, I certainly won't argue otherwise. :-)

Given this information, I would try the "seek to a random position" approach. I would also suggest that under these conditions the bias due to record length is probably so small that taking the next record (instead of the one selected by the seek) may not be necessary (although programmatically it may be more convenient to do so).

In my very non-statistical opinion sampling about 1/10,000th of the file might be enough to infer the average record length, but I would want to know something about the distribution of record lengths before calculating the predicted min and max length because the shape of the curve will impact the number of samples that are required to obtain an estimate with an accuracy below a given threshold (set by you, of course). Therefore, depending on how critical the estimates need to be, you might want to employ an empirical approach that keeps taking samples until the error in the calculated values is below your given threshold.

^*Rationale provided only to provide context of my line of thought.

In reply to Re^7: Random sampling a variable length file. by bobf
in thread Random sampling a variable record-length file. by BrowserUk

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


We don't bite newbies here... much
	PerlMonks