I am new to database system design. After reading many articles, I am really confused about the limit up to which we should keep one table, without sharding or partitioning. I know it's really difficult to provide a general answer; things depend on factors like
- Row size
- Data type (string, blob, etc.)
- Number of active queries
- Kind of queries
- Indexes
- Read-heavy vs. write-heavy
- Expected latency
But then someone asks questions like these:
- What would you do with 1 billion rows, with millions of rows being added every day? For such a large database, the latency for a query mix of 4 reads, 1 write, and 2 updates needs to be less than 5 milliseconds.
- If you have only 10 million rows, but a high update and read volume, what would you choose? The number of new rows added does not matter. High consistency and low latency are the requirements.
If the number of rows is less than a million and the row count grows only by the thousands, the choice is simple. But things get trickier when the choice involves millions or billions of rows.
Note: I didn't give a latency number in the question; please answer with whatever latency number you are comfortable with. Also, we are talking about structured data.
I'm not sure it helps, but I can add 3 specific questions:
- Suppose you choose a SQL database for Amazon or any e-commerce order management system. Orders are growing by millions every day, and there are already 1 billion records. Assume there is no data archiving. Reads are heavy, at over a thousand queries per second, and there are writes as well; the read:write ratio is 100:1.
- Now a smaller example. Suppose you choose a SQL database for abc or any e-commerce order management system. Orders are increasing by the thousands every day, and there are already 10 million records. Assume there is no data archiving. Reads are heavy, at over ten thousand queries per second, and there are writes as well; the read:write ratio is 10:1.
- Third example: freebie distribution. We have 10 million goodies to give away, 1 goody per user. High consistency and low latency are the goals. Assume there are already 20 million users waiting for the giveaway; once the clock starts, all of them will try to grab the free goodies (a sketch of what I mean follows this list).
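To make the third scenario concrete, here is a minimal sketch of the claim step I have in mind, assuming a hypothetical goodies table (none of these names come from a real system):

```sql
-- Hypothetical table: 10 million pre-created rows, one per goody.
-- The UNIQUE key on claimed_by enforces "1 goody per user", since
-- InnoDB allows many NULLs but only one copy of each non-NULL value.
CREATE TABLE goodies (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  claimed_by BIGINT UNSIGNED NULL,
  UNIQUE KEY uniq_claimer (claimed_by)
) ENGINE=InnoDB;

-- Atomic claim for user 42: a row is updated only if it is still
-- unclaimed, so each goody goes to exactly one user even with
-- 20 million concurrent requests. The affected-row count tells the
-- application whether the claim succeeded.
UPDATE goodies
SET claimed_by = 42
WHERE claimed_by IS NULL
LIMIT 1;
```

The atomic conditional UPDATE is the consistency core; under real contention you would also spread claimers across disjoint id ranges so that millions of sessions don't all fight over the same unclaimed row.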
NOTE: Throughout this question, it is assumed that we will choose a SQL solution. Also, if a use case provided doesn't make logical sense, ignore it; the aim is to acquire numerical knowledge.
Can anyone help me understand what the benchmark is? Any real numbers from a project you're currently working on, along the lines of "for a database this large with this many queries, this is the latency we observe", would help. Anything that can help me justify choosing a certain number of tables for a certain number of queries at a specific latency.
Some answers, for MySQL. Since all databases are limited by disk space, network latency, etc., other engines may be similar.
A SELECT can take hours, even days, to run. So you need to understand whether a query is pathological like that. (I think this is an example of high "latency".)

PARTITIONing (especially in MySQL) has few uses. More details: Partition

An INDEX is very important for performance.
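To illustrate how much an index matters at the billion-row scale discussed above, here is a minimal sketch; the table and column names are hypothetical, not taken from the question:

```sql
-- Hypothetical orders table for the e-commerce scenario.
CREATE TABLE orders (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_id    BIGINT UNSIGNED NOT NULL,
  status     VARCHAR(20) NOT NULL,
  created_at DATETIME NOT NULL,
  -- Composite index: "recent orders for one user" becomes a short
  -- range scan over a few index entries instead of a scan of a
  -- billion rows.
  KEY idx_user_created (user_id, created_at)
) ENGINE=InnoDB;

SELECT id, status, created_at
FROM orders
WHERE user_id = 12345
ORDER BY created_at DESC
LIMIT 10;
```

With idx_user_created the SELECT touches only the ten matching index entries and stays in the low milliseconds; without it, the same query is a full table scan, which is exactly the pathological case described above.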
Inserting a million rows per day is not a problem. (Of course, some schema designs may make it a problem.) Rule of thumb: 100 rows/sec is probably not a problem; 1000/sec may be possible; after that it gets harder. More about that: High-speed ingest

When you get into large databases, they come in a few different types; each has somewhat different characteristics.
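As an aside on the rules of thumb above, a common way to push past ~100 single-row inserts per second is batching; a sketch, with a hypothetical log table:

```sql
-- One round trip and one transaction for many rows is far cheaper
-- than one INSERT per row; batches of 100-1000 rows are typical.
INSERT INTO sensor_log (device_id, reading, recorded_at) VALUES
  (1, 20.4, '2024-01-01 00:00:00'),
  (1, 20.6, '2024-01-01 00:00:01'),
  (2, 19.9, '2024-01-01 00:00:00');
```

Each multi-row INSERT pays the network, parse, and commit cost once, which is why batching (or LOAD DATA) is the first tool to reach for in high-speed ingest.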
For example, geographic "find nearest" lookups need a special 2D index: SPATIAL or some technique here.
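As a rough sketch of the SPATIAL option (hypothetical names; MySQL 8.0 syntax, where the indexed geometry column must be NOT NULL and SRID-restricted for the optimizer to use the index):

```sql
CREATE TABLE places (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  pt POINT NOT NULL SRID 0,
  SPATIAL INDEX(pt)
) ENGINE=InnoDB;

-- Bounding-box prefilter that can use the SPATIAL index; the exact
-- nearest-neighbour ranking then runs on the few surviving rows.
SELECT id, ST_Distance(pt, POINT(10, 20)) AS d
FROM places
WHERE MBRContains(
        ST_GeomFromText('POLYGON((9 19, 11 19, 11 21, 9 21, 9 19))'),
        pt)
ORDER BY d
LIMIT 5;
```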