Re: [OT] A measure of 'sortedness'?

Replies are listed 'Best First'.
Re^2: [OT] A measure of 'sortedness'? by BrowserUk (Patriarch) on Mar 18, 2015 at 22:48 UTC
Depending on the actual size of your buffer in comparison to, say, the number of indexes you have (I don't know what you mean by 10%, exactly) As an indication: the two buffers might be 2GB each. The 10% was to indicate that I don't have enough physical memory to allocate another 4GB in which to do a normal n-way merge. When merging in place, a typical requirement is that there are two segments, 1 in each of the buffers, that need to be interchanged: `ab cd ef gh [..............xxxxxxx....................][......yyyyyyyyyyy......... +...............]` [download] Above, the two buffers are in 'mid merge'. All the valued f..g are > a and < d; and all the values b..c are > e and < h. Typically, this would be done by exchanging all the xs with ys: `af d eb c gh [..............yyyyyyy....................][......xxxxxxxyyyy......... +...............]` [download] Then rotating the whole section d..c forward 1 at a time: `af d eb c gh [..............yyyyyyyy....................][......xxxxxxxyyy......... +...............] af d eb c gh [..............yyyyyyyyy...................][.......xxxxxxxyy......... +...............] af d eb cgh [..............yyyyyyyyyy..................][........xxxxxxxy......... +...............] af gd eb ch [..............yyyyyyyyyyy.................][.........xxxxxxx......... +...............]` [download] Which means all the records between original position of d, and the final position of c, have to move forward 1, 4 times. Which doesn't look so bad when the difference is just 4 and length is 20 or so, but if the difference is a few hundred or more, and the length (say) 1/2GB, it is very costly. n00 x 1/2GB. If instead, you can copy the 4 (or few hundred) to an out-of-line buffer, move the 20+ (or 1/2GB) forward by the length of that buffer just once, then put the ool buffer back, it saves lots of time. And if the out-of-line buffer (say 2MB) is smaller than the residual for the move (which will be a very rare event), then you do the move in two (or more) chunks. You've still divided the total number of record swaps by 2 million. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply] [d/l] [select]
Re^3: [OT] A measure of 'sortedness'? by marinersk (Priest) on Mar 18, 2015 at 23:27 UTC
I know you said "call them numbers", but if they really are bytes of disk I/O data, the one-byte value range restriction does leave open the possibility of a very small index -- you could mark the first occurrence of each value in each of the buffers during the sequential scans. Other metadata could be computed as well, if they will be revisited frequently, like run-length of a particular value. During the merge operation, you might be able to speed up merge point decisions into what's left of the etherical 10% side buffer based on knowing where to go to get the next chunk of data that needs resequencing. In fact, a run-length table in lieu of actual data might buy you the space you need for a one-pass optimization. Forget saving index offset. Save the value and the number of them. (WARNING: Corner case where lots of length `< sizeof(integer)` runs causes the working buffer to actually be larger than the source buffers should be considered.). Then just overwrite the buffer with the resulting data. This would, of course, be an application-specific solution and not a general case proof for all airless merges, but sometimes that's the call the Engineer has to make. Okay, I'm done spamming you for the moment and heading home. This one sounds basic but fun; I'll be following this thread closely.	[reply] [d/l]
Re^4: [OT] A measure of 'sortedness'? by BrowserUk (Patriarch) on Mar 18, 2015 at 23:53 UTC
I think this has triggered a workable idea. Given a 2GB buffer of (minimum) 8-byte records = 268,435,456 records; using 1-bit per record requires 32MB. Well within my 10%. So, what can I record with 1-bit? My first though is that I can record whether this record is contiguous with the preceding one. Ie. Is is equal to it; or exactly one greater. In effect, contiguous runs of 1-bits become chunks of the buffer(s) that can be translocated to their final position en-masse. And (I think) by processing down those bitmaps in parallel (though not lock-step), and knowing the first value in each of the two buffers, it gives me a a straight forward map of how things have to be moved into their final positions. Does that convert the original O(N²) process into an O(2(or 3?)N) process? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply]
Re^5: [OT] A measure of 'sortedness'? by marinersk (Priest) on Mar 19, 2015 at 00:31 UTC
Re^6: [OT] A measure of 'sortedness'? by BrowserUk (Patriarch) on Mar 19, 2015 at 03:31 UTC
Some notes below your chosen depth have not been shown here
Re^3: [OT] A measure of 'sortedness'? by sauoq (Abbot) on Mar 19, 2015 at 15:51 UTC
The more I've thought about this, the more I doubt the first pass is really going to buy much at all that'll make your merge more efficient. In the long run, you still have to do the same comparisons and the same moves. In terms of performance, it doesn't matter much which loop they're in. My original thought above, if it's feasible at all, only really adds overhead in almost every case. The only little chance of saving anything, I think, is if you have a chunk of data at the beginning or end of the combined buffer that doesn't have to move. You might be able to identify that on your first past and shorten your second pass to the data in between. That's all I got. `-sauoq "My two cents aren't worth a dime.";`	[reply]
Re^4: [OT] A measure of 'sortedness'? by BrowserUk (Patriarch) on Mar 19, 2015 at 21:18 UTC
The only little chance of saving anything, I think, is if you have a chunk of data at the beginning or end of the combined buffer that doesn't have to move. That's exactly what the pre-scan is doing. "Topping and tailing" the two buffers for bits that either don't need to move, or only need to be block-moved rather than merged: `\|---\| ---\| \| ___ --- \| ___--- --- \| ___--- --- \|__--- ---\| \| --- \| \|` [download] Don't read too much into the ascii art, but there are obviously two chunks, the left and right ends of the left buffer, that can be either left where they are or just block-moved into place, respectively. There are 5 other possibilities: 1 requires no further processing: all of the left buffer is less than all of the right buffer. ~~(Full scan required.)~~ 1 requires a single block swap; all of the left buffer is greater than all of the right buffer. ~~(Full scan required.~~ 4 (including the above) that either leave in place or block-move two smaller chunks and need to recurse on the other two. (Partial scan if left in place or full if moved.) What I was looking for was something discoverable or doable during that first scan, that abetted the next stage. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply] [d/l]
Re^4: [OT] A measure of 'sortedness'? by marinersk (Priest) on Mar 19, 2015 at 19:04 UTC
I'm coming to the same conclusion. Inevitably, all my creativity winds up solving the "how to conserve space" and to a lesser degree the "how to minimize data moves" issues -- and none of my thinking in those areas is yielding a first-pass-scan benefit. This is maddeningly entertaining because I can't count the number of times in my career I could optimize the inner algorithm if only I had the privilege of a pre-processing linear scan, which was too costly to be worth implementing. Here, I have the exact opposite problem -- I am going to get a pre-processing linear scan, and I can't make use of it to optimize the inner process. Somewhere, some deity of software development is having a huge guffaw at my expense. `:-)`	[reply] [d/l]


Welcome to the Monastery
	PerlMonks