How to maximise the content of my data CD

amaguk has asked for the wisdom of the Perl Monks concerning the following question:

Re: How to maximise the content of my data CD by brian_d_foy (Abbot) on Feb 25, 2005 at 12:20 UTC
You want a "multiple knapsack" or "multiple subset sum" algorithm. Algorithm::Knapsack may be useful. Google for those and you're on your way. :) -- brian d foy <bdfoy@cpan.org>
Re^2: How to maximise the content of my data CD by amaguk (Sexton) on Feb 25, 2005 at 13:20 UTC
Thanks, I've downloaded Algorithm::Knapsack, and I've found some links on the web.
Re^3: How to maximise the content of my data CD by amaguk (Sexton) on Feb 25, 2005 at 13:35 UTC
And Algorithm::Knapsack comes with a tool named filesack, which does this: "The filesack program finds one or more subsets of files or directories with the maximum total size not exceeding a given size." That's exactly what I want. Thank you guys!!!
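For readers curious what "maximum total size not exceeding a given size" involves, here is a minimal subset-sum sketch. It is written in Python to stay self-contained, it assumes whole-number sizes in Mb, and the function name `best_subset` is made up for illustration. It is not how filesack itself is implemented.

```python
def best_subset(sizes, capacity):
    """Return (total, chosen): the largest total <= capacity and one subset reaching it."""
    # reachable maps every achievable total to one tuple of file indices producing it
    reachable = {0: ()}
    for i, size in enumerate(sizes):
        # Iterate over a snapshot so each file is used at most once
        for total, picked in list(reachable.items()):
            new_total = total + size
            if new_total <= capacity and new_total not in reachable:
                reachable[new_total] = picked + (i,)
    best = max(reachable)
    return best, [sizes[i] for i in reachable[best]]


if __name__ == "__main__":
    print(best_subset([350, 349, 233, 232, 231], 700))
```

This fills a single disc as full as possible. Filling several discs well is the harder "multiple" variant mentioned above.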
Re: How to maximise the content of my data CD by Limbic~Region (Chancellor) on Feb 25, 2005 at 13:56 UTC
amaguk, This is like the knapsack problem, but not quite. The difference is that you don't need to hit an exact target; you just need to avoid using more CDs than an exact fit would. For instance, you have 2GB worth of files. A perfect solution would have that fitting on 3 CDs with room to spare. As long as your solution doesn't require 4 CDs, you have sufficiently solved the problem.

I got into a heated debate on this exact same problem in IRC some time ago and could have sworn that I posted about it here, but I can't find it. There is a recent similar thread (Burning ISOs to maximize DVD space), which mentions Algorithm::Bucketizer, which I haven't tried myself. I would attempt the following:

- `$buckets = int($total_size / 700) + 1`
- Order files by size in descending order
- Round robin files (1 per bucket)
- When you encounter the first file that will not fit, stay with that bucket but continue down the list until you find one that fits
- On the next bucket, start back at the top of the file list
- Wash, rinse, repeat

I am pretty sure the method will work. I was going to test it, but the person complaining in IRC wouldn't provide a list of file sizes for me to try it out on and I wasn't motivated enough to make some up. Cheers - L~R

Update: As pointed out below, perfect solutions that exactly match (or even very nearly match) a whole number of CDs will wind up costing you 1 extra CD. That's why the +1 is in the first bullet.
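The bulleted steps above can be sketched roughly as follows. This is one reading of the description, not tested code from the post. It is in Python for self-containment, sizes are assumed to be in Mb, and `round_robin_pack` is a hypothetical name. Files that fit in none of the precomputed buckets are returned as leftovers, not moved to a new disc.

```python
def round_robin_pack(sizes, capacity=700):
    """Deal files out one per bucket per pass, largest first; return (buckets, leftovers)."""
    # One bucket more than the whole number of discs the total would fill
    count = int(sum(sizes) // capacity) + 1
    buckets = [[] for _ in range(count)]
    free = [capacity] * count
    remaining = sorted(sizes, reverse=True)

    placed = True
    while remaining and placed:
        placed = False
        for b in range(count):
            # Start at the top of the list; skip down to the first file that fits
            for i, size in enumerate(remaining):
                if size <= free[b]:
                    buckets[b].append(size)
                    free[b] -= size
                    del remaining[i]
                    placed = True
                    break
    return buckets, remaining


if __name__ == "__main__":
    # prints ([[400, 200], [300, 100]], [])
    print(round_robin_pack([400, 300, 200, 100]))
```

A non-empty leftover list means the heuristic needed more discs than it budgeted for.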
Re^2: How to maximise the content of my data CD by Eimi Metamorphoumai (Deacon) on Feb 25, 2005 at 14:28 UTC
Won't always work, though I can't prove how far off it'll be. Mainly, it should work pretty well when you have lots of extra space, but an approximate solution won't be "close enough" if you end up too close to exactly filling all the discs. Consider files of size 350, 349, 233, 232, and 231. You can fit them on two 700 Mb discs (350+349, 233+232+231), but your algorithm will use 3 discs. (If you tried to use only 2, you'd end up with 350+233, 349+232, and the 231 wouldn't fit on either.) What I can't prove without a lot more thought is whether you're ever going to be off by more than a single disc, and what can't be proven at all is whether that's close enough for real world purposes. (What's "acceptable" in the real world has to do with how long you're willing to wait for an answer vs. how much you care about that extra disc, among other factors.) But just know that the greedy approach won't only be suboptimal in theory; it will also, sometimes, bleed over into an actual difference.
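The claim that two discs suffice for this example can be checked by exhaustive search. The sketch below is a plain backtracking search in Python; `min_discs` is an illustrative name and it assumes no single file exceeds the disc capacity. It is only practical for a handful of files, which is the whole reason heuristics are being discussed.

```python
def min_discs(sizes, capacity=700):
    """Smallest number of discs that holds every file, found by backtracking."""
    sizes = sorted(sizes, reverse=True)
    best = len(sizes)  # one file per disc always works

    def place(i, free):
        nonlocal best
        if len(free) >= best:
            return  # cannot improve on the best packing found so far
        if i == len(sizes):
            best = len(free)
            return
        # Try the file in every disc already opened
        for b in range(len(free)):
            if sizes[i] <= free[b]:
                free[b] -= sizes[i]
                place(i + 1, free)
                free[b] += sizes[i]
        # Or open a new disc for it
        free.append(capacity - sizes[i])
        place(i + 1, free)
        free.pop()

    place(0, [])
    return best


if __name__ == "__main__":
    print(min_discs([350, 349, 233, 232, 231]))  # prints 2
```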
Re^3: How to maximise the content of my data CD by Limbic~Region (Chancellor) on Feb 25, 2005 at 14:40 UTC
Eimi Metamorphoumai, "Won't always work,..." What I didn't say explicitly, but was implied by my bullet points, was that 1 disc is being added to account for perfect (or even near perfect) fits. The knapsack problem is hard, but we aren't trying to break encryption; we are trying to save a few pennies on CDs. I don't think (though I could be wrong) that it will ever waste more than 1 disc. To make matters more difficult, we aren't talking about a handful of files but more likely hundreds if not thousands. Let's say that the total size is an exact multiple of 1 CD. That means every single CD needs to be an exact match (which may not even be possible). Proving it can or cannot might take a while (extreme sarcasm). Why not just go with a "good enough" solution?

Update 2008-11-26: It turns out that this heuristic approach can use as many as 11/9 OPT + 1 bins (according to bin packing). While my experience has been that 1 extra is all you will ever need, it is possible to need more. Cheers - L~R
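For comparison, the textbook bin-packing heuristic for which bounds of the form 11/9 OPT plus a small constant are usually quoted is first-fit decreasing. A minimal Python sketch follows; the name `first_fit_decreasing` and the 700 Mb default are assumptions for illustration. It differs from the round-robin description in that it opens a new disc only when a file fits in none of the existing ones, and whether the round-robin variant shares the same bound is not established in this thread.

```python
def first_fit_decreasing(sizes, capacity=700):
    """Place each file, largest first, on the first disc with room; open a new disc if none has."""
    discs = []  # list of lists of file sizes
    free = []   # remaining space per disc, parallel to discs
    for size in sorted(sizes, reverse=True):
        for b, room in enumerate(free):
            if size <= room:
                discs[b].append(size)
                free[b] -= size
                break
        else:
            # No existing disc had room for this file
            discs.append([size])
            free.append(capacity - size)
    return discs


if __name__ == "__main__":
    # prints [[350, 349], [233, 232, 231]]
    print(first_fit_decreasing([350, 349, 233, 232, 231]))
```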
Re^2: How to maximise the content of my data CD by MidLifeXis (Monsignor) on Feb 25, 2005 at 17:57 UTC
How about file sizes of `($bucketsize / 2) + 1`? That would be the worst case scenario, and would take N buckets (where N is the number of files). Although the OP is not talking about files that large (350Mb PDF files are a little large :), your algorithm can fail in the case of worst case data. --MidLifeXis
Re^3: How to maximise the content of my data CD by Anonymous Monk on Feb 25, 2005 at 22:22 UTC
Just a thought: if it is for backup, then spread the data and add an additional CD as the checksum CD, i.e. write a Perl-based CD RAID 5. Then you could recover from a bad or missing CD... How many CD burners/PCs do you have available?
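The recovery idea rests on XOR parity: the parity block is the byte-wise XOR of all data blocks, so any one missing block equals the XOR of the parity with the survivors. A minimal sketch in Python, with hypothetical function names and short byte strings standing in for disc images. Strictly speaking, a single dedicated parity disc is closer to RAID 4 than RAID 5, which rotates parity across the discs.

```python
def xor_parity(blocks):
    """Byte-wise XOR of all blocks; shorter blocks are treated as zero-padded."""
    length = max(len(block) for block in blocks)
    parity = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    discs = [b"abc", b"de", b"fghi"]
    parity = xor_parity(discs)
    # The rebuilt block comes back zero-padded to the longest length,
    # so the original lengths would have to be recorded separately.
    print(recover([discs[0], discs[2]], parity))
```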
Re^3: How to maximise the content of my data CD by Zero_Flop (Pilgrim) on Feb 26, 2005 at 05:58 UTC
That would be the worst case scenario, but I would not say that the algorithm would fail. It would be as accurate as already indicated: it would predict n+1, but in reality you would only need n. Zero
Re^3: How to maximise the content of my data CD by Limbic~Region (Chancellor) on Feb 27, 2005 at 18:07 UTC
MidLifeXis, If each file is `($bucketsize / 2) + 1`, it means only 1 file can fit per CD with either method, so mine still only wastes the 1 extra CD. I am failing to see how your worst case scenario would make my solution use more than 1 extra CD. Cheers - L~R
Re^4: How to maximise the content of my data CD by MidLifeXis (Monsignor) on Feb 28, 2005 at 18:53 UTC
Re^5: How to maximise the content of my data CD by Limbic~Region (Chancellor) on Feb 28, 2005 at 19:12 UTC
Re: How to maximise the content of my data CD by blazar (Canon) on Feb 25, 2005 at 12:56 UTC
This question tends to come up quite often lately. As has already been pointed out to you, it's basically the knapsack problem, which is known to be a "hard" problem in general. However, a practical answer may depend on the actual average file size: if you only have files whose size is about, say, 1Mb or less, or at least you have a good wealth of such files along with potentially larger ones, then you may be content with a suboptimal solution given by filling up the space with as many of those files as possible. As a side note, outside of France (as far as I know) Mo is spelled Mb...
Re^2: How to maximise the content of my data CD by amaguk (Sexton) on Feb 25, 2005 at 13:02 UTC
Sorry for the Mb, I'm not French but Belgian, and a French speaker. And thank you for your answer!
Re: How to maximise the content of my data CD by inman (Curate) on Feb 25, 2005 at 12:58 UTC
Sounds like homework to me ... You could always use an archiving tool to compress all of the files and span them to the desired media size. I brushed off a version of pkzipc (on Windows) and had a play. The following command compresses the data and creates a number of 700Mb files suitable for dropping onto CD. `pkzipc -add -span=1.44 c:save.zip *.doc` Similar opportunities exist with tar on UNIX systems.
Re^2: How to maximise the content of my data CD by amaguk (Sexton) on Feb 25, 2005 at 13:47 UTC
I've already thought about archive tools, but the problem comes when I want to read a specific file on one disc: I must rebuild the archive and extract it. That's too much effort for one file. And it's not homework ;) Just a practical problem (I have a lot of PDF files of articles, book excerpts, etc. and I want to store all these files on CD), and my curiosity about how to do this efficiently.
Re^3: How to maximise the content of my data CD by Anonymous Monk on Feb 26, 2005 at 03:56 UTC
You want Burn to the brim.
Re^2: How to maximise the content of my data CD by zakzebrowski (Curate) on Feb 25, 2005 at 17:53 UTC
FYI, see also hjsplit, which is a graphical splitter tool... Updated: typo. ---- Zak - the office