in reply to How to maximise the content of my data CD
amaguk,
This is like the knapsack problem, but not quite. The difference is that you don't need to hit an exact target, you just need to not waste any more CDs then an exact target.
For instance, you have 2GB worth of files. A perfect solution would have that fitting on 3 CDs with room to spare. As long as your solution doesn't require 4 CDs  you have sufficiently solved the problem. I got into a heated debate on this exact same problem in IRC some time ago and could have swore that I posted about it here  but can't find it. There is a recent similar thread (Burning ISOs to maximize DVD space), which mentions Algorithm::Bucketizer which I haven't tried myself. I would attempt the following:
 $buckets = ($total_size / 700) + 1
 Order files by size in descending order
 Round robin files (1 per bucket)
 When you encounter first file that will not fit, stay with that bucket but continue down the list until you find one that fits
 On the next bucket, start back at the top of the file list
 Wash, rinse, repeat
I am pretty sure the method will work. I was going to test it but the person complaining in IRC wouldn't provide a list of file sizes for me to try it out on and I wasn't motivated enough to make some up.
Update: As pointed out below, perfect solutions that exactly match (or even very nearly match) a whole number of CDs will wind up costing you 1 extra CD. That's why the +1 as the first bullet.
Re^2: How to maximise the content of my data CD
by Eimi Metamorphoumai (Deacon) on Feb 25, 2005 at 14:28 UTC

Won't always work, though I can't prove how far off it'll be. Mainly, it should work pretty well when you have lots of extra space, but an approximate solution won't be "close enough" if you end up too close to exactly filling all the discs. Consider files of size 350, 349, 233, 232, and 231. You can fit them on 2 700 Mb discs (350+349, 233+232+231), but your algorithm will use 3 dics. (If you tried to use only 2, you'd end up with 350+233, 349+232, and the 231 wouldn't fit on either).
What I can't prove without a lot more thought is whether you're ever going to be off by more than a single disc, and what can't be proven at all is whether that's close enough for real world purposes. (Since what's "acceptable" in the real world has to do with how long you're willing to wait for an answer vs. how much you care about that extra disc, and other factors.) But just know that the greedy approach won't only be suboptimal in theory, but it will also, sometimes, bleed over into an actual difference.  [reply] 

Eimi Metamorphoumai,
Won't always work,...
What I didn't say explicitly, but was implied by my bullet points was that 1 disc is being added to account for perfect (or even near perfect fits). The knapsack problem is hard but we aren't trying to break encryption we are trying to save a few pennies on CDs. I don't think (though I could be wrong) that it will ever waste more than 1 disk. Too make matters more difficult, we aren't talking about a handful of files but more likely hundreds if not thousands. Let's say that the total size is an exact multiple of 1 CD. That means every single CD needs to be an exact match (which may not even be possible). Proving it can or can not might take a while (extreme sarcasm). Why not just go with a "good enough" solution?
Update 20081126: It turns out that this heuristic approach can be is much as 11/9 OPT + 1 bin (according to bin packing). While my experience has been that 1 extra is all you will ever need, it is possible to need more.
 [reply] 
Re^2: How to maximise the content of my data CD
by MidLifeXis (Monsignor) on Feb 25, 2005 at 17:57 UTC

How about file sizes of ($bucketsize / 2) + 1. That would be the worst case scenerio, and would take N buckets (where N is the number of files).
Although the OP is not talking about files that large (350Mb PDF files are a little large :), your algorithm can fail in the case of worst case data.
 [reply] [d/l] 

That would be the worst case scenerio, but I would not say that the algorithm would fail. It would be as accurate as already indicated. it would predict n+1 but in reality you would only need n.
Zero
 [reply] 

Just for thought:
If it is for backup,
then spread the data,
and add an additional CD as the checksum CD unit
ie: Write a perl based CD RAID 5
then you could recover from a bad or missing CD...
how many CD burners/PC's do you have available ??
 [reply] 

MidLifeXis,
If each file is ($bucketsize / 2) + 1, it means only 1 file can fit per CD with either method so mine still only wastes the 1 extra CD. I am failing to see how your worst case scenario would make my solution use more than 1 extra CD?
 [reply] 

My issue is not that you are wasting too many CDs, my issue is that you are not estimating enough CDs.
Maybe I misunderstood you originally when you said ...
$buckets = ($total_size / 700) + 1
... but this code (I think) implements your code. The test is then run against a set of buckets to verify that the bucket estimation is, in fact, correct. In this worst case, the number of files should be the number of buckets (or within one).
sub numbuckets {
my $numfiles = shift;
my $cdsize = 700 * 1024 * 1024;
my $total_size = (($cdsize / 2) + 1) * $numfiles;
my $buckets = ($total_size / $cdsize) + 1;
return $buckets;
}
printf("Bucket estimation of %d buckets is %s for %d files.\n",
numbuckets($_),
($_ <= numbuckets($_)) ? "valid" : "not valid",
$_)
foreach (0 .. 5);
__END__
# My output...
Bucket estimation of 1 buckets is valid for 0 files.
Bucket estimation of 1 buckets is valid for 1 files.
Bucket estimation of 2 buckets is valid for 2 files.
Bucket estimation of 2 buckets is not valid for 3 files.
Bucket estimation of 3 buckets is not valid for 4 files.
Bucket estimation of 3 buckets is not valid for 5 files.
The answer is somewhere between $numbuckets (or $numbuckets  1) and $numfiles. Where in that range depends on how densely you can pack the files.
 [reply] [d/l] 


