Heh. Very much not.
My current "design braindump" includes the following features:
- Create and maintain schemas for complex objects.
- Maintain bidirectional object relationships. (Think master-child relationships - from the master you should be able to find the children, and each child needs to know its master. This should be automatically maintained.)
- Ability to dump networks of related objects.
- Ability to load them elsewhere.
- A conflict resolution algorithm in case two different clients updated an object at the same time without seeing what the other was doing.
In short, I'm really tackling the sort of problems that an ORM on top of a relational database makes easy.
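To make the "automatically maintained" master-child relationship concrete, here is a minimal sketch in Python. The `Node` class and its attribute names are hypothetical, chosen only for illustration; nothing here is tied to Riak or to any existing library.

```python
class Node:
    """Hypothetical object whose master/children links stay in sync."""

    def __init__(self, name):
        self.name = name
        self._master = None
        self.children = []

    @property
    def master(self):
        return self._master

    @master.setter
    def master(self, new_master):
        # Detach from the old master first so both directions stay consistent.
        if self._master is not None:
            self._master.children.remove(self)
        self._master = new_master
        # Then register with the new master, if there is one.
        if new_master is not None:
            new_master.children.append(self)


order = Node("order")
line = Node("line-1")
line.master = order   # one assignment updates both sides of the relationship
```

The point of routing everything through the setter is that callers never touch `children` directly, so the two directions cannot drift apart.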
What order of scale are you hoping for?
Mechanisms that work well for, say, 4 to 16 nodes will often fail hopelessly if you try to scale them to 100 or 1000 nodes. Conversely, algorithms that scale to 1000 nodes will usually be relatively inefficient if used for only 4 or 8 nodes.
A conflict resolution algorithm in case two different clients updated an object at the same time without seeing what the other was doing.
In general, it is far better to avoid this possibility than to design algorithms to handle it. Synchronisation always imposes high overheads on all operations, even read-only ones.
The best approach to distributed data management--assuming your application can be made to fit--is to distribute your objects across the nodes, but only allow the owning node to manipulate the object. I.e., route all operations on an object to its owning node. (Or nodes, for failover; but go to the secondaries only if the primary fails.)
A quick browse of the Riak link provided shows that it does this for you at the physical data (disk) level, but you will still need to provide a similar mechanism, perhaps based upon the underlying 160-bit space, at the application logic level.
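As a rough illustration of routing operations to an owning node via a 160-bit space, here is a minimal Python sketch. It assumes SHA-1 (whose digest is 160 bits) as the hash; the partition count, the node names, and the `partition_for`, `owners_for` and `route` helpers are all hypothetical, and this is not a description of how Riak itself assigns partitions.

```python
import hashlib

RING_SIZE = 2 ** 160      # size of the SHA-1 output space
NUM_PARTITIONS = 64       # assumed value; a power of two divides the ring evenly


def partition_for(key):
    """Map a key to a partition index on the 160-bit ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    position = int.from_bytes(digest, "big")
    return position // (RING_SIZE // NUM_PARTITIONS)


def owners_for(key, nodes, replicas=2):
    """Return the primary node followed by its failover secondaries."""
    start = partition_for(key)
    count = min(replicas, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


def route(key, nodes, alive):
    """Use the primary if it is up, otherwise the first live secondary."""
    for node in owners_for(key, nodes):
        if node in alive:
            return node
    raise RuntimeError(f"no live owner for {key!r}")
```

Every client that knows the node list computes the same owner for a given key, so no coordination is needed to decide where an operation goes; secondaries are consulted only when the primary is not in `alive`.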
What order of scale are you hoping for?
Dozens of peer nodes, with peak performance of at most dozens of writes per second per node. (Usually it will be quieter than that. The nodes will mostly be used for other stuff; Riak should be running in the background.) Performance and throughput are not bottlenecks here - one machine can easily do that. The issue is availability, and the desire to avoid having another specialized machine per cluster.
The best approach to distributed data management...
Sorry, there is no best approach. The CAP theorem says that you can choose any two of Consistency, Availability, and Partition Tolerance. Depending on your application, it may be appropriate to wind up at any corner.
Riak is at the AP corner. That is appropriate for what I am trying to build. We expect conflicts to be very rare. Ones that cannot easily be merged should be much, much rarer still. A low remaining error rate would be acceptable. Writes will come from every node we are running on. Internal networking problems or localized hardware problems should not limit the ability of other nodes to function as best they can.
Your suggestions would be appropriate if we were trying to wind up at the CA or CP corners. We're not.
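For the rare conflicts mentioned above, one common way to tell "one update supersedes the other" apart from "two clients updated without seeing each other" is a vector clock. The sketch below is only an illustration under stated assumptions: clocks are plain dicts mapping a node name to a counter, values are sets, and a true conflict is merged by set union. The function names are hypothetical, and real merge rules would depend on the object schema.

```python
def descends(a, b):
    """True if clock `a` has seen everything recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify two vector clocks relative to each other."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a_newer"
    if descends(b, a):
        return "b_newer"
    return "conflict"   # neither side saw the other's update


def merge_clocks(a, b):
    """Element-wise maximum: the clock to attach to a merged value."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def resolve(a_val, a_clock, b_val, b_clock):
    """Keep the newer value, or union the two sets on a true conflict."""
    verdict = compare(a_clock, b_clock)
    if verdict in ("equal", "a_newer"):
        return a_val, a_clock
    if verdict == "b_newer":
        return b_val, b_clock
    # Assumption: set-valued data can be merged by union.
    return a_val | b_val, merge_clocks(a_clock, b_clock)
```

Union is the easy case; values that cannot be merged this way are the "much, much rarer" ones that would need a per-type rule or a fallback such as last-writer-wins.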
A quick browse of the Riak link provided shows that it does this for you at the physical data (disk) level, but you will still need to provide a similar mechanism, perhaps based upon the underlying 160-bit space, at the application logic level.
That is one piece of what it looks like I need to write.