I am faced with a problem that is a variation of recipe "8.6. Picking a Random Line from a File" from the venerable Perl Cookbook (clever use of rand($.)).
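For anyone who doesn't have the book handy, the rand($.) trick can be sketched roughly as below. This is my own paraphrase, not the recipe's text: the sub name pick_random_line is made up for illustration, and it assumes the filehandle has not been read from yet, so that $. starts counting at 1.

```perl
use strict;
use warnings;

# Hypothetical helper (name is illustrative): return one line chosen
# uniformly at random, reading the file only once.
# Assumes $fh is freshly opened, so $. is the 1-based line number.
sub pick_random_line {
    my ($fh) = @_;
    my $chosen;
    while ( my $line = <$fh> ) {
        # Line number $. replaces the current pick with probability 1/$. ;
        # the first line is always taken because rand(1) < 1.
        $chosen = $line if rand($.) < 1;
    }
    return $chosen;    # undef if the file was empty
}
```

Only one line is held in memory at a time, which is what makes the idea attractive for a big file.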

I have a file with "n" sets of "m" rows. Let's assume the rows are sorted by the token that makes them into a set: if the rows hold attributes about people, the first token in each row is the name of the person, and there are "m" rows for, say, 'punkish', another "m" rows for 'paco', and so on. I want to grab "j" random rows from each set and write "n" sets of "j" rows out to another file.

I apologize that I can't even offer pseudocode as a starting point. It would be trivial to do in a db, but I wouldn't mind knowing how to do this with just a file and the magic of Perl.

Oh! Did I mention that (n * m) is a very large number? We are talking about a file with around 8 million rows.

Update: on second glance, this post should really be titled grabbing random "j" rows from a file... oh well.

Update 2: (after bonking himself on the head for not providing a "compleat" problem the first time)

  • Type of file: It is a delimited (say, CSV) file
  • Are the lines fixed length?: No, but each row has the same number of fields, just like a CSV file
  • Are there a fixed number of lones per "record"? Dunno what a "lone" is.
  • Is any of this stuff indexed? It is a text file. How could it be indexed?
  • Is this something that you need to do one off (or occasionally)? Occasionally... that is why the need for a program. But, would prefer to not use a database such as SQLite.
  • Does the "data base" change over time? Yes, periodically. But, for every run, it is one, immutable file.
  • If it changes can "records" be inserted? Can't change the input file.
  • Why isn't this in a real database? Well, too long to answer here. Eventually it ends up in a database, so this is just the preprocessing part, but it is preferred to not do the preprocessing in a database.
  • Does j change for each set, or do you want to print the same number of lines for each set? "j" doesn't change; however, the number of rows in a set may change. Incoming rows are supposed to be, say, 100 per set, and "j" is fixed at, say, 90, but it is possible that a set might have only 80 rows, in which case all 80 will be chosen. In other words, choose "j" random rows out of "m" if (j < m), else choose all "m".
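To make the requirements above concrete, here is a minimal sketch of one way they could be met in a single pass, using reservoir sampling (the same idea as rand($.), extended from one line to "j" lines) and restarting the reservoir at every set boundary. Treat it as an illustration under assumptions, not a finished answer: the sub name sample_sets is hypothetical, rows of a set are assumed to be contiguous, every row is assumed to end in a newline, and the set key is assumed to be whatever precedes the first comma (so no quoted commas in the first field).

```perl
use strict;
use warnings;

# Hypothetical helper: write up to $j randomly chosen rows of each set
# from $in to $out. Assumes sets are contiguous and the key is the text
# before the first comma.
sub sample_sets {
    my ( $in, $out, $j ) = @_;
    my $current;          # key of the set being read
    my $seen = 0;         # rows seen so far in the current set
    my @reservoir;        # at most $j rows kept for the current set

    # Emit the finished set and reset the per-set state.
    my $flush = sub {
        print {$out} @reservoir;
        @reservoir = ();
        $seen      = 0;
    };

    while ( my $row = <$in> ) {
        my ($key) = split /,/, $row, 2;
        if ( !defined $current or $key ne $current ) {
            $flush->();
            $current = $key;
        }
        $seen++;
        if ( @reservoir < $j ) {
            push @reservoir, $row;    # first $j rows always go in
        }
        else {
            # Row number $seen survives with probability $j / $seen.
            my $slot = int rand $seen;
            $reservoir[$slot] = $row if $slot < $j;
        }
    }
    $flush->();    # last set
    return;
}
```

A set with fewer than "j" rows simply never fills its reservoir, so all of its rows are written. Memory use is bounded by "j" rows no matter how many million rows the file has. One side effect to be aware of: replaced slots mean the chosen rows of a set are not necessarily written in their original order. Usage would be along the lines of opening the input for reading, the output for writing, and calling sample_sets($in, $out, 90).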
--

when small people start casting long shadows, it is time to go to bed

In reply to grabbing random n rows from a file by punkish
