I am faced with a problem that is a variation of recipe "8.6. Picking a Random Line from a File" from the venerable Cookbook (the one with the clever use of rand($.)).
I have a file with "n" sets of "m" rows. Let's assume the rows are sorted by the token that makes them into a set: if the rows hold some attributes about people, the first token in each row is the name of the person, so there are "m" rows for, say, 'punkish', another "m" rows for 'paco', and so on. I want to grab "j" random rows from each set and write the resulting "n" sets of "j" rows out to another file.
I apologize that I am even unable to offer pseudo code to try and figure it out. It would be trivial to do it in a db, but I wouldn't mind knowing how to do this with just a file and the magic of Perl.
Oh! Did I mention that (n * m) is a very large number? We are talking about a file with around 8 million rows.
Update: on second glance, this post should really be titled grabbing random "j" rows from a file... oh well.
Update 2: (after bonking himself on the head for not providing a "compleat" problem the first time)
- Type of file: It is a delimited (say, CSV) file
- Are the lines fixed length?: No, but each row has the same number of fields, just like a CSV file
- Are there a fixed number of lones per "record"? Dunno what a "lone" is.
- Is any of this stuff indexed? It is a text file. How could it be indexed?
- Is this something that you need to do one off (or occasionally)? Occasionally... that is why the need for a program. But, would prefer to not use a database such as SQLite.
- Does the "data base" change over time? Yes, periodically. But, for every run, it is one, immutable file.
- If it changes can "records" be inserted? Can't change the input file.
- Why isn't this in a real database? Well, too long to answer here. Eventually it ends up in a database, so this is just the preprocessing part, but it is preferred to not do the preprocessing in a database.
- Does j change for each set, or do you want to print the same number of lines for each set? "j" doesn't change; however, the number of rows in a set may. Incoming sets are supposed to have, say, 100 rows each, and "j" is fixed at, say, 90, but a set might have only 80 rows, in which case all 80 will be chosen. In other words, choose random "j" out of "m" if (j < m), else choose all "m".
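For what it's worth, that last rule looks like a per-set reservoir sample, which is the rand($.) idea stretched from one line to "j" lines. Below is a minimal sketch, not run against the real file. The sub name sample_sets is made up for illustration, and it assumes that the rows of a set are contiguous and that the set token is everything before the first comma (no quoted commas in that field).

```perl
use strict;
use warnings;

# Hypothetical helper: copy at most $j randomly chosen rows of every set
# from filehandle $in to filehandle $out.
# Assumptions: rows of a set are contiguous, and the set key is the text
# before the first comma.
sub sample_sets {
    my ($in, $out, $j) = @_;
    my ($current, $seen, @keep);

    # Print the rows kept for the set that just ended, in file order.
    my $flush = sub {
        print {$out} map { $_->[1] } sort { $a->[0] <=> $b->[0] } @keep;
        @keep = ();
    };

    while (my $line = <$in>) {
        my ($key) = split /,/, $line, 2;
        if (!defined $current or $key ne $current) {
            $flush->();                       # a new set starts here
            ($current, $seen) = ($key, 0);
        }
        $seen++;
        if (@keep < $j) {
            push @keep, [$seen, $line];       # first j rows fill the reservoir
        }
        else {
            # Row number $seen replaces a kept row with probability j/$seen.
            my $slot = int rand $seen;
            $keep[$slot] = [$seen, $line] if $slot < $j;
        }
    }
    $flush->();                               # do not forget the last set
}
```

Called as sample_sets(\*STDIN, \*STDOUT, 90), this would read the file once and hold at most "j" rows in memory at a time, so 8 million rows should not be a problem. A set with fewer than "j" rows never reaches the replacement branch, so all of its rows come out.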
--
when small people start casting long shadows, it is time to go to bed