http://qs321.pair.com?node_id=577433

This problem was originally presented to me by a co-worker as follows:

I have a set of 100 files, and I want to randomly choose 5 of them. However, I want to weight the selection of each file based upon the number of words in the file. More words == greater chance it will be randomly selected. The files could contain between 100 and 3,000 words. What's a good way to do this?

First of all, the approach I'm going to detail here is specific to this example, but can easily be adapted to any random selection of weighted values. It should scale very easily to very large data sets (number of files, in this case), with very large weighting information (number of words, in this case). This post is mostly pseudo-code and explanation, not a functional piece of code. This solution has probably been created by other people before.

Okay, for sake of example, we're going to start off with 5 files.

my @files = qw( file_a.txt file_b.txt file_c.txt file_d.txt file_e.txt );

And we want to randomly choose 2 of the files. But we want to weight our selections based on the number of words in the file. More words in a file == more likely the file will be chosen.

Your first thought might be to build an array of all of the words in all of the files, then pick a random index, and determine which file the word is in. Note - you would need to pre-cache which word at which index is associated with which file. For example, the word "the" could appear in file_a.txt or file_b.txt. So you can't just randomly choose index 3, see "the" there, and know which file it's in. You have to know that index 3 => the => file_a.txt.

This is the first optimization. The words don't matter, only which file they're in. So instead of storing the word at each point, you can just store the file name. At this point, you'll need to be able to count the number of words in each file to get your weighting information. This is left as an exercise for the reader - use your favorite word counting widget. For example's sake, we'll say you end up with this structure:

my %words_in_files = (
    'file_a.txt' => 10,
    'file_b.txt' => 1,
    'file_c.txt' => 3,
    'file_d.txt' => 5,
    'file_e.txt' => 10,
);
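
If you want something concrete for that counting exercise, here's a minimal sketch. It assumes the files are plain text and treats a "word" as any whitespace-separated token - your word counting widget may define things differently.

my %words_in_files;
for my $file (@files) {
    open my $fh, '<', $file or die "Can't open $file: $!";
    $words_in_files{$file} = 0;
    while ( my $line = <$fh> ) {
        my @tokens = split ' ', $line;      # split on any run of whitespace
        $words_in_files{$file} += @tokens;  # add this line's word count
    }
    close $fh;
}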

Now you can build up an array where the first 10 elements are "file_a.txt", the next one is "file_b.txt" and so on. For simplicity's sake, we'll display each file as its trailing letter ("file_a.txt" becomes "a"). This way, we can see our data:

aaaaaaaaaabcccdddddeeeeeeeeee
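
Just to make the idea concrete, here's a quick sketch of that naive approach, assuming the %words_in_files hash from above:

# Naive approach: one array element per word, each holding its file's name.
my @flat;
for my $file ( keys %words_in_files ) {
    push @flat, ($file) x $words_in_files{$file};
}

# One weighted pick: every word-slot is equally likely,
# so files with more words come up more often.
my $choice = $flat[ int rand @flat ];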

This approach is fully functional, but doesn't scale well. In our original problem, we had 100 files, with up to 3,000 words each. This is potentially a 300,000 element array, and it's just gonna get bigger if you add more files or words within the files. Don't get me wrong - perl can do it, but there's a better way.

The key is to realize that most of the information in that array is redundant. We're storing "a" 10 times. Do we really need to? Instead, we'll build a different data structure. In this structure, we'll store the file name, and the index at which the file begins. Externally, we'll also store a count of the total number of words. We end up with a structure along these lines:

my @indexes_of_files = (
    # terminology: index 0 == "file offset", index 1 == "file name"
    [ qw( 0  file_a.txt ) ],
    [ qw( 10 file_b.txt ) ],
    [ qw( 11 file_c.txt ) ],
    [ qw( 14 file_d.txt ) ],
    [ qw( 19 file_e.txt ) ],
);
my $total_number_of_words = 29;

Feel free to use hashrefs instead of arrayrefs; they may be easier to work with. I used arrayrefs here for simplicity of code display in the example. Note that the order of the files in this data structure is arbitrary. Whatever order the files are assigned in this array is irrelevant, so long as their file offsets are computed accordingly.
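
As a sketch, building that structure from the %words_in_files hash might look something like this (the key order is arbitrary, which, as noted, doesn't matter):

# Build [ offset, filename ] pairs plus a running total of words.
my @indexes_of_files;
my $total_number_of_words = 0;
for my $file ( keys %words_in_files ) {
    push @indexes_of_files, [ $total_number_of_words, $file ];
    $total_number_of_words += $words_in_files{$file};
}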

We now have a much more compact data structure. Our algorithm is easy - generate a random integer between 0 and $total_number_of_words - 1 (0..28, in this case). Let's say that we generated "15".

Next, you need to search through the @indexes_of_files array to find the entry with the greatest file offset that's less than or equal to our generated number. Since the file information is in sorted order, a binary search can zip through the data in no time. Implementing the binary search (or whatever) algorithm is another exercise left to the reader.
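
For the curious, here's one possible binary search sketch, assuming the @indexes_of_files and $total_number_of_words structures above; find_file_index() is just an illustrative name.

# Return the array index of the entry whose file offset is the
# greatest one <= $target.
sub find_file_index {
    my ( $entries, $target ) = @_;
    my ( $lo, $hi ) = ( 0, $#{$entries} );
    my $found = 0;
    while ( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $entries->[$mid][0] <= $target ) {
            $found = $mid;      # this entry fits; try for a later one
            $lo    = $mid + 1;
        }
        else {
            $hi = $mid - 1;
        }
    }
    return $found;
}

my $target = int rand $total_number_of_words;    # e.g. 15
my $index  = find_file_index( \@indexes_of_files, $target );
my $chosen = $indexes_of_files[$index];          # [ 14, 'file_d.txt' ] for 15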

However you find your data, you'll discover that you're looking at array element 3, which has file offset 14, corresponding to "file_d.txt". Note that if you count to index 15 in the "aaaa...b...cc..." array drawn out above, you'll also land on a "d", corresponding to file_d.txt.

You have now successfully chosen your first file, so you need to set up for subsequent ones. This is a 3-step op. One is easy, one is expensive, and one is tedious. First, the easy step.

Subtract from $total_number_of_words the length of the file you just chose. In this case, file_d.txt has a length of 5 words, so $total_number_of_words becomes 24. You can re-calculate this length now using the index of the element you were at and the index of the next element, you can cache it in the data structure, or you can look it up in the hash created earlier. Dealer's choice. But you need the length.

The "expensive" operation is simply to splice out the element at index [4].

Finally, for all elements at or after the position you just removed ([3]), subtract from their file offset the length of the file just removed (5, in this case). You'll end up with this data structure when you're done:

my @indexes_of_files = (
    # terminology: index 0 == "file offset", index 1 == "file name"
    [ qw( 0  file_a.txt ) ],
    [ qw( 10 file_b.txt ) ],
    [ qw( 11 file_c.txt ) ],
    # THIS FILE WAS REMOVED: [ qw( 14 file_d.txt ) ],
    [ qw( 14 file_e.txt ) ],   # this file offset was 19, is now 14
);
my $total_number_of_words = 24;   # previously 29
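
Here's a sketch of those three steps rolled together, assuming $index is the array position found by the search (3, in our example). The length lookup here uses the neighboring offset, but the %words_in_files hash works just as well.

# Work out the chosen file's word count from the offsets.
my $next_offset = $index < $#indexes_of_files
    ? $indexes_of_files[ $index + 1 ][0]
    : $total_number_of_words;
my $length = $next_offset - $indexes_of_files[$index][0];

# Easy step: shrink the total.
$total_number_of_words -= $length;

# Expensive step: splice out the chosen element.
splice @indexes_of_files, $index, 1;

# Tedious step: shift every later offset down by the removed length.
$_->[0] -= $length
    for @indexes_of_files[ $index .. $#indexes_of_files ];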

And blam-o, you're set up to choose your next file. It's as if file_d.txt never existed, and you can repeat the process until you've selected as many files as you need.

Considerations

  • With tremendously long lists of data (for this example, say thousands or millions of files), you need to do a splice on a large array, and then run through all higher elements to do a subtraction. You're just iterating over an array and doing a subtraction on an integer, but it's still O(n). There may be fancier ways to do this w/o splicing or changing offsets, but I was unable to come up with an elegant one. They all seemed fragile and complicated relative to just decrementing the offsets. YMMV.
  • This can be applied to any set of data that you need to randomly select a value from based on a weighting value.
  • For smaller data sets, it may be simpler to just use the "aaa...b...cc..." approach of an array that lists all the file names. But it doesn't scale as well.
  • There may be something that does this efficiently already on CPAN.