Limitations before a table can be sharded or partitioned
P粉190883225
P粉190883225 2024-01-16 13:32:16
0
1
456

I am new to database system design. After reading many articles, I am really confused what is the limit we should have 1 table without sharding or partitioning. I know it's really difficult to provide a general answer, things depend on factors like

  • Row size
  • Data type (string, blob, etc.)
  • Number of active queries
  • What kind of query
  • index
  • Reread/Rewrite
  • Expected delay

But when someone asks this question

  • What would you do if 1 billion pieces of data and millions of rows were added every day? For such a large database, the latency for a 4 read, 1 write, and 2 update query needs to be less than 5 milliseconds.
  • If you only have 10 million rows, but high update and read volume, what would you choose? The number of new lines added does not matter. High consistency and low latency are requirements.

If the number of rows is less than a million, and the row size increases by thousands, the choice is simple. But things get trickier when the selection involves millions or billions of rows.

Note: I didn't mention the delay number in the question. please Answer based on the number of delays you are comfortable with. Also, we are talking about structured data.

I'm not sure, but I can add 3 specific questions:

  • Suppose you choose a SQL database for Amazon or any e-commerce order management system. The number of orders is growing by millions every day. There are already 1 billion records. Now, assume there is no data archive. High read queries with over a thousand queries per second. And also written. Read:write ratio is 100:1
  • Let's take an example of a now smaller number. Suppose you choose SQL database for abc or any e-commerce order management system. The number of orders is increasing by the thousands every day. There are already 10 million records. Now, assume there is no data archive. High read queries with over ten thousand queries per second. And also written. The read-write ratio is 10:1
  • Third example: Freebie distribution. We have 10 million goodies to give away. 1 goody per user. High consistency and low latency are the goals. Assuming there are already 20 million users waiting for the free distribution, once the time starts, all of them will try to get their hands on the free goodies.

NOTE: Throughout this question, it is assumed that we will choose SQL solution. Also, if the use case provided doesn't make logical sense, ignore it. The aim is to acquire numerical knowledge.

Can anyone help me understand what the benchmark is? Any real numbers from the project you're currently working on will show that for a large database with so many queries, this is the latency that is observed. Anything that can help me justify the number of select tables for a certain number of queries for a specific latency.

P粉190883225
P粉190883225

reply all(1)
P粉401901266

Some answers for MySQL. Since all databases are subject to disk space, network latency, etc. other engines may be similar.

  • Regardless of the number of rows, a "point query" (getting a row using a suitable index) takes several milliseconds.
  • It is possible to write a SELECT that takes hours or even days to run. So you need to understand if the query is pathological like this. (I think this is an example of high "latency".)
  • "Sharding" is required when you cannot sustain the required number of writes on a single server.
  • Large reads can be scaled "infinitely" by using replication and sending reads to replicas.
  • PARTITIONing (especially in MySQL) has few uses. More details: Partition
  • INDEX is very important for performance.
  • For data warehouse applications, building and maintaining "summary tables" is critical for large-scale performance. (Some other engines have some built-in tools.)
  • Inserting a million rows per day is not a problem. (Of course, some schema designs may cause this problem.) Rule of thumb: 100/sec may not be a problem; 1000/sec may be possible; after that it gets harder. More aboutHigh-speed ingest
  • Network latency mainly depends on the distance between the client and the server. It takes more than 200 milliseconds to reach the other side of the Earth. On the other hand, if the client and server are in the same building, the latency will be less than 1 millisecond. If on the other hand you are referring to how long it takes to run a query, then here are some rules of thumb: 10ms for a simple query that needs to hit the HDD disk; 1ms for an SSD.
  • UUIDs and hashes are very detrimental to performance if the data is too large to be cached in RAM.
  • I didn't mention the read/write ratio because I prefer to judge reading and writing independently.
  • "Ten thousand reads per second" is difficult to achieve; I think few applications really need this. Or they can find a better way to achieve the same goal. How quickly can a user issue a query? Maybe one per second? How many users can be connected and active at the same time? Hundreds.
  • (My opinion) Most benchmarks are useless. Some benchmarks can show that one system is twice as fast as another. so what? Some benchmarks show that when you have more than a few hundred active connections, throughput stalls and latency tends to infinity. so what. Capturing actual queries after the application has been running for a while is probably the best baseline. But its uses are still limited.
  • Almost always a single table is better than a split table (multiple tables; partition; shard). If you have specific examples, we can discuss the pros and cons of table design.
  • Row size and data type - Large columns (TEXT/BLOB/JSON) are stored "unlogged", thus [potentially] causing additional disk hits. Disk hits are the most expensive part of any query.
  • Active Queries - After a few dozen times, queries will conflict with each other. (Imagine a grocery store with lots of shoppers pushing shopping carts - "too many" shoppers and everyone taking a long time to finish.)

When you get into large databases, they come in a few different types; each has some different characteristics.

  • Data warehouse (sensors, logs, etc.) - appended to the "end" of tables; summary tables for efficient "reporting"; huge "fact" tables (with optional chunked archive); certain "dimension tables".
  • Search (products, web pages, etc.) - EAV in question; full text is often useful.
  • Banking, Order Processing - This is very important for ACID functionality and the need to process transactions.
  • Media (Images and Videos) -- How to store huge objects while making searches (etc.) reasonably fast.
  • 'Find nearest' - requires a 2D index, SPATIAL or some technique here
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template