A commonly used replication scheme works like this: the database ensures that each piece of data is copied to three independent disks on three different machines. The reasoning is simple: disks fail only occasionally, and rarely all at once. If one disk dies, you have time to replace it, and you can use one of the two remaining copies to restore the data onto the new disk. The probability that a second disk dies before you finish restoring is so low that losing all three copies at once is about as likely as being hit by an asteroid.
To put rough numbers on it: say the probability of one disk failing (within some time window) is about 0.1%. Then the probability of two particular disks failing is about 10^-6, and the probability of all three failing at the same time is about 10^-9, or one in a billion. This calculation assumes that one disk's failure is independent of the others. That is not entirely accurate; for example, if all your disks came off the same production line, they may share the same defect. But it is good enough for the argument here.
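Spelled out with the independence assumption (a back-of-the-envelope sketch, nothing more):

```ruby
# Rough numbers under the independence assumption.
p_disk = 0.001     # chance of one disk failing in the period considered
puts p_disk**2     # two particular disks fail: ~1e-6
puts p_disk**3     # all three copies fail:     ~1e-9, one in a billion
```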
So far this sounds reasonable, but unfortunately the argument does not hold for many data storage systems. In this post I will show you why.
If your database cluster contains only three machines, the probability of all of them failing at the same time is extremely low (leaving aside correlated failures, such as the whole data center burning down). However, once you move to larger clusters, the picture changes: the more nodes and disks in the cluster, the more likely you are to lose data.
This follows from the probabilities. You may be thinking: "Really? I have three copies of the data. How can the risk of losing it get worse as the cluster grows? What does cluster size have to do with it?" But I worked out the probabilities and plotted them in the graph below:
To be clear, this is not the probability of a single node failing. It is the probability of permanently losing all three copies of some piece of data, so that restoring from a backup is all you have left. The larger your cluster, the more likely you are to lose data. That is probably not what you had in mind when you decided to pay for three-way replication.
The y-axis of the graph is somewhat arbitrary and relies on a lot of assumptions, but the trend of the lines is alarming. Under the earlier assumption that a node has a 0.1% chance of being lost in some time period, the graph shows that in a cluster of 8,000 nodes, the probability of permanently losing all three copies of some piece of data is about 0.2%. Yes, that's right: the risk of losing all three copies of some data is twice as high as the risk of losing a single node. So what is all this replication buying us?
The intuition behind the graph is this: in a cluster of 8,000 nodes, it is almost certain that a few nodes are down at any given moment. That is normally not a problem: a certain rate of churn and node replacement is expected, and much of it is routine maintenance. However, if you are unlucky and the three nodes holding the replicas of a particular piece of data all happen to be among the dead nodes, that data is gone for good. The lost data is only a small fraction of the cluster's overall dataset, but when you lose all three replicas the reaction is usually "I really did not want to lose that data," not "oh well, it was only a small amount." The lost piece may well be an important one.
The probability that all three replicas land on dead nodes depends on the replication algorithm the system uses. The graph above assumes that the data is split into a certain number of partitions (shards), and that each partition is stored on three randomly chosen nodes (or nodes chosen by a pseudo-random hash function). That corresponds to consistent hashing, as used in Cassandra and Riak, among others (as far as I know). I am not sure how other systems assign replicas, so I would welcome input from anyone who knows the internals of various storage systems.
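To make that placement model concrete, here is a toy sketch (the function is my own illustration, not any particular database's API): each partition is assigned to r distinct nodes chosen at random.

```ruby
# Assign each partition to r distinct, randomly chosen nodes.
def assign_replicas(num_nodes, num_partitions, r: 3)
  nodes = (0...num_nodes).to_a
  (0...num_partitions).map { nodes.sample(r) }
end

p assign_replicas(10, 4)
# e.g. [[7, 2, 9], [0, 4, 1], [3, 8, 5], [2, 6, 0]]
```

Data is permanently lost exactly when all r nodes assigned to some partition fail.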
Let me show you how I calculated the graph above using a probabilistic model of a replicated database.
Assume that the probability of losing an individual node is p = P(node loss), and that node failures are independent. I will ignore time in this model and simply look at the probability of failure within some fixed period. For example, we could take p = 0.001 to be the probability of a node failing on a given day, which is reasonable if it takes about a day to replace the node and restore the lost data onto it. To keep things simple, I will not distinguish between node failure and disk failure, and I will consider only permanent failures.
Let n be the number of nodes in the cluster, and let f be the number of nodes that have failed. Assuming failures are reasonably independent of each other, f follows a binomial distribution:

\[ P(f \text{ nodes failed}) = \binom{n}{f} \, p^f \, (1-p)^{n-f} \]

The term p^f is the probability that f particular nodes have failed, and (1-p)^{n-f} is the probability that the remaining n-f nodes have not failed. The factor \binom{n}{f} is the number of different ways of picking f nodes out of n; it is pronounced "n choose f" and is defined as:

\[ \binom{n}{f} = \frac{n!}{f! \, (n-f)!} \]
I will skip the detailed derivation here. Building on the formula above, we can work out the probability of losing one or more partitions in a cluster of n nodes with a replication factor of r (the number of nodes holding a copy of each partition). If the number of failed nodes f is less than the replication factor r, we can be certain that no data was lost. When f is between r and n, a given partition is lost if all r nodes holding its replicas happen to be among the f failed nodes, which has probability \binom{f}{r} / \binom{n}{r}, so we add up the contributions of every value of f in that range:

\[ P(\text{data loss}) = \sum_{f=r}^{n} \binom{n}{f} \, p^f \, (1-p)^{n-f} \left[ 1 - \left( 1 - \frac{\binom{f}{r}}{\binom{n}{r}} \right)^{k} \right] \]

where k is the total number of partitions in the cluster.
This is somewhat unwieldy, but I believe it is accurate. If you plug in r = 3, p = 0.001 and k = 256n (256 partitions per node), and let n range from 3 to 10,000, you get the graph above. I wrote a small Ruby program to carry out this calculation.
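Here is a minimal sketch of that calculation (not the original script; the function names and numerical shortcuts are my own). It evaluates the sum above by stepping through the binomial distribution with a recurrence, which keeps everything in floating point and avoids enormous binomial coefficients:

```ruby
# P(a given partition is lost | f nodes failed) = C(f, r) / C(n, r),
# computed as a product of ratios to stay in floating point.
def partition_loss_given_failures(f, n, r)
  (0...r).reduce(1.0) { |acc, i| acc * (f - i) / (n - i) }
end

# P(at least one partition loses all r of its replicas) for n nodes,
# per-node failure probability p, and 256 partitions per node.
def prob_data_loss(n, r: 3, p: 0.001, partitions_per_node: 256)
  k = partitions_per_node * n        # total number of partitions
  pmf = (1.0 - p)**n                 # P(0 nodes failed)
  total = 0.0
  (0...n).each do |f|
    # Binomial recurrence: P(f + 1 failures) from P(f failures).
    pmf *= (n - f) * p / ((f + 1) * (1.0 - p))
    next if f + 1 < r
    q = partition_loss_given_failures(f + 1, n, r)
    total += pmf * (1.0 - (1.0 - q)**k)   # at least one partition lost
  end
  total
end

puts prob_data_loss(8_000)   # ~0.002, the roughly 0.2% visible in the graph
puts prob_data_loss(10_000)  # ~0.0025
```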
Using a union bound, we can obtain a simpler approximation:

\[ P(\text{data loss}) = P(\text{partition 1 lost} \cup \dots \cup \text{partition } k \text{ lost}) \le \sum_{i=1}^{k} P(\text{partition } i \text{ lost}) = k \, p^r \]

since a particular partition is lost exactly when all r nodes holding its replicas fail, which has probability p^r.
Even though the loss of one partition is not independent of the loss of other partitions, this approximation is still useful. And it seems to match the computed results fairly well: in the graph, the probability of data loss looks close to a straight line, proportional to the number of nodes. The approximation says the probability is proportional to k, and since we assumed a fixed 256 partitions per node, k = 256n grows linearly with the number of nodes.
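As a quick check of this approximation (a sketch under the same assumptions, r = 3 and 256 partitions per node; the function name is mine):

```ruby
# Union-bound approximation: P(data loss) <= k * p^r with k = 256 * n.
def union_bound_data_loss(n, p: 0.001, r: 3, partitions_per_node: 256)
  partitions_per_node * n * p**r
end

puts union_bound_data_loss(8_000)   # ~0.002, the roughly 0.2% read off the graph
puts union_bound_data_loss(10_000)  # ~0.0026, close to the 0.25% per day mentioned below
```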
How well this holds up in practice, I am not sure. But I think it is an interesting phenomenon that falls straight out of the calculation. I have heard of cases where companies with large database clusters have lost real data, but it is not something that comes up often in articles and reports. If you have looked into this topic, I would be glad to hear from you.
The calculation suggests that if you want to reduce the probability of data loss, you should reduce the number of partitions or increase the replication factor. Using more replicas costs more, which is already expensive on large clusters. However, the number of partitions represents a trade-off against load balancing. Cassandra originally had one partition per node, but later moved to 256 partitions per node to achieve better load distribution and more efficient rebalancing.
You need a reasonably large cluster before these effects become significant, but clusters of thousands of nodes are run by many large companies, so I would be interested to hear from people with hands-on experience at that scale. If the probability of permanently losing data in a 10,000-node cluster really is around 0.25% per day, that works out to roughly a 60% chance of losing some data over the course of a year.
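That yearly figure is just the daily probability compounded over 365 days (assuming, roughly, that days are independent):

```ruby
# Convert a daily probability of data loss into an annual one.
p_per_day  = 0.0025
p_per_year = 1 - (1 - p_per_day)**365
puts p_per_year  # ~0.60
```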
If you design distributed data systems, what do you take away from this? If my reasoning is right, then more thought should go into the design of replication schemes. I hope this post sharpens your sense of what actually happens in practice, because three replicas are really not as safe as they sound.