perlquestion
punkish
I am faced with a problem that is a variation of recipe "8.6. Picking a Random Line from a File" from the venerable Perl Cookbook (which makes clever use of <code>rand($.)</code>).
<p>
I have a file with "n" sets of "m" rows. Let's assume the rows are sorted by the token that groups them into a set: if the rows hold attributes about people, the first token in each row is the name of the person, and there are "m" rows for, say, 'punkish', another "m" rows for 'paco', and so on. I want to grab "j" random rows from each set and write the resulting "n" sets of "j" rows out to another file.
<p>I apologize that I am unable to offer even pseudocode to try and figure it out. It would be trivial to do it in a db, but I wouldn't mind knowing how to do this with just a file and the magic of Perl.
<p>
Oh! Did I mention that <code>(n * m)</code> is a very large number? That is, we are talking about a file with around 8 million rows.
<p><b>Update:</b> on second glance, this post should really be titled <i>grabbing random "j" rows from a file</i>... oh well.
<p>
<b>Update 2:</b> <i>(after bonking himself on the head for not providing a "compleat" problem the first time)</i>
<ul>
<li><b>Type of file:</b> It is a delimited (say, CSV) file</li>
<li><b>Are the lines fixed length?:</b> No, but each row has the same number of fields, just like a CSV file</li>
<li><b>Are there a fixed number of lones per "record"?</b> Dunno what a "lone" is.</li>
<li><b>Is any of this stuff indexed?</b> It is a text file. How could it be indexed?</li>
<li><b>Is this something that you need to do one off (or occasionally)?</b> Occasionally... that is why the need for a program. But, would prefer to not use a database such as SQLite.</li>
<li><b>Does the "data base" change over time?</b> Yes, periodically. But, for every run, it is one, immutable file.</li>
<li><b>If it changes can "records" be inserted?</b> Can't change the input file. </li>
<li><b>Why isn't this in a real database?</b> Well, too long to answer here. Eventually it ends up in a database, so this is just the preprocessing part, but it is preferred to not do the preprocessing in a database.</li>
<li><b>Does j change for each set, or do you want to print the same number of lines for each set?</b> "j" doesn't change; however, the number of rows in a set may change. Incoming rows are supposed to be, say, 100 per set, and "j" is fixed at, say, 90, but it is possible that a set might have only 80 rows, in which case all 80 will be chosen. In other words, choose "j" random rows out of "m" if (j &lt; m), else choose all "m".</li>
</ul>
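<p>To make the requirement above concrete, here is a minimal sketch of one possible approach, not a definitive answer: a per-set reservoir sample, which extends the <code>rand($.)</code> idea from one line to "j" lines. It assumes the set token is the first comma-separated field, that the token itself contains no quoted commas, and that the rows really are sorted by that token. The name <code>sample_sets</code> is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read rows (sorted by their first field) from $in and
# write at most $j randomly chosen rows per set to $out.
# Only one set's reservoir (at most $j rows) is held in memory at a time.
sub sample_sets {
    my ($in, $out, $j) = @_;
    my ($current, $seen, @reservoir);

    # Write out the finished set and reset the per-set state.
    my $flush = sub {
        print {$out} @reservoir;
        @reservoir = ();
        $seen      = 0;
    };

    while (my $line = <$in>) {
        # Assumption: the set token is the first comma-separated field.
        my ($key) = split /,/, $line, 2;

        if (!defined $current or $key ne $current) {
            $flush->();
            $current = $key;
        }

        $seen++;
        if (@reservoir < $j) {
            # The first $j rows of a set always go in, so a set with
            # fewer than $j rows is kept whole.
            push @reservoir, $line;
        }
        else {
            # Row number $seen replaces a random slot with probability $j/$seen.
            my $r = int rand $seen;
            $reservoir[$r] = $line if $r < $j;
        }
    }
    $flush->();    # emit the last set
    return;
}
```

<p>It could be called as <code>sample_sets(\*STDIN, \*STDOUT, 90)</code>. The file is read once, front to back, so 8 million rows should not be a memory problem. Note that the chosen rows of a set are not guaranteed to come out in their original order.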
<!-- Node text goes above. Div tags should contain sig only -->
<div class="pmsig"><div class="pmsig-231169">
--
<br><br><i>
when small people start casting long shadows, it is time to go to bed</i>
</div></div>