Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Disclaimer: I am not a statistician. I don't even play one on TV.

The problem with the picking random (byte) positions, is that with varible length records, longer records have a greater chance of being picked than shorter ones.
But maybe that is negated to some extent because you would be using the next record--which might be longer or shorter--rather than the one picked?

I had the same concern. Intuitively, I do not see how the bias would be negated by reading the next record, since there is a positional dependence between the two. For example, if one record was 90% of the entire file, then seeking to a random position in the file would result in landing in that record about 90% of the time and whatever record followed it would be chosen each time.

If you want a random sample from the count of records, it may be difficult to use a selection method that is based on length.

The file is static. It is only processed once.

Would it be possible to generate an index in parallel with the creation of the file? If not, would it be possible to scan the file for record delimiters as a pre-processing step to generate the index? A list of offsets would be sufficient to accomplish this task and the approach would be very straightforward (think maintenance).


In reply to Re^5: Random sampling a variable length file. by bobf
in thread Random sampling a variable record-length file. by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others learning in the Monastery: (5)
As of 2024-04-19 15:02 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found