comment on

Would it be possible to generate an index in parallel with the creation of the file?

No. The application is a generalised file utility aimed at text-based records. Think csv, tsv, log files etc.

If not, would it be possible to scan the file for record delimiters as a pre-processing step to generate the index?

No. Because creating an offset index requires reading the entire file and negates the purpose of taking a random sample.

I do not see how the bias would be negated by reading the next record, ... For example, if one record was 90% of the entire file, then seeking to a random position in the file would result in landing in that record about 90% of the time and whatever record followed it would be chosen each time.

Agreed. In extremis, it doesn't.

But in the general case, the application is for huge files (many GBs), with relatively short records (10s to 100s of bytes), and record length variations of a (say) maximum of 50% of the maximum, typically less.

The (unsupported) notion is, that as the length of the next (picked) record is uncorrolated to the length of the record containing the picked position, the bias is reduced if not eliminated. Ie. The probability of a given record being picked is not corrolated to its length.

I realise that it is corrolated to the length of its positional predecessor, but does that matter?

If you have a 10GB file containing 100e6 records that average 100 bytes +- 50, then the maximum probability of a record being picked is 0.0000015%; and the minimum 0.0000005%. Is that difference enough to invalidate using (say) 1000 random (byte positions), to choose a representative sample of the entire file?

The type of information being inferred (estimated) from the sample:

min/ave/max record length.
The number of records within the file.
The ascii-betical (or numerical) distribution of some field (or fields) within the file.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"I'd rather go naked than blow up my ass"

In reply to Re^6: Random sampling a variable length file. by BrowserUk
in thread Random sampling a variable record-length file. by BrowserUk

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


P is for Practical
	PerlMonks