There is such an application scenario: the user has two consecutive operations A and B. Operation A must be completed before operation B can be performed. If operation B is triggered before operation A is completed, it is displayed that the user needs to perform operation A first, that is To execute operation B, you need to query whether operation A has been executed. The issue that arises here is to record user participation records and provide query methods for users and operations. When the amount of data is different, our storage solutions will be very different. As the data grows, the solutions continue to evolve.
1. The amount of data is small and the user operation behavior is fixed:
Storage: MySQL
Solution: We use UID as the key, one user per row, and the users included in each user are stored as columns, such as UID=100 , the fixed storage is operation A and operation B, the table structure is roughly as follows:
table_operation
uid operation_a operation_b
100 1 1
If we want to query whether the user participates in A or B, Directly using SQL: SELECT * FROM table_operation WHERE uid=100 AND action_a=1 can achieve the goal.
Problem: User operations are fixed and expansion is difficult. If you need to increase user operation behavior, you need to add fields or increase table storage. The method of adding fields is feasible if the data is below a certain level (such as 1 million) , if the behaviors are unrelated, adding a table storage solution will perform well.
2. The amount of data is small and the user operation behavior is not fixed:
Compared with scenario 1, in addition to the uid variable, the current scenario adds user operation variables, that is, we need to pay attention to both the user and the user operation. variables.
Storage: MySQL
Option 1: Add an operation table, generate an operation id, and store uid and oid in the user operation behavior table. When the user performs a new operation, a record is inserted into the operation behavior table. The table structure is roughly as follows:
table_operation_info
oid name
1 operation_a
2 operation_b
table_operation
uid oid
1 1
1 2
When you need to query whether user 1 has performed operation A, use SQL: SELECT * FROM table_operation WHERE oid=1 AND oid=1.
Problem: When the user's operation behavior is large, the user's operation behavior increases very quickly, and the amount of data gradually increases. MySQL may not be able to load a single table. The solution is explained in the subsequent scenario.
3. The amount of data is large and the user behavior is fixed
Storage: MySQL
Solution: Compared with scenario 1, the current scenario is different in that the amount of data is larger than that of scenario 1, and the amount of data is so large that MySQL alone The table cannot be loaded. This solution solves this problem. When a single table is too large, the most cost-effective method is to use separate tables. The variable of our current scene is uid, and we only need to divide the tables by levels based on uid.
4. The amount of data is large and the user behavior is not fixed
Storage: MySQL
Scheme 1: This scheme is applied to situations where the user’s operating behavior can be classified, that is, it is added on the basis of scenario 1 Two table splitting operations, table splitting by operation behavior category and table splitting by user. In the current solution we need to deal with two variables: operating behavior and users. The two table splits correspond to these two variables respectively. The table split operation is performed according to the business rules, and the data is horizontally split according to the user ID to reduce the amount of data.
Scheme 2: This scheme is a complete horizontal table splitting operation. Based on the scheme of scenario 2, it is split according to the user level.
5. The amount of data is extremely large
Storage: MySQL
Option 1: Split databases and tables. At this time, one database can no longer meet the needs. The rules are implemented based on the previous scenario and can be considered based on actual needs. Put different libraries on different machines.
Option 2: Based on the sub-database and sub-table, store it in bits, because whether an operation has been executed or not is a status, that is, 0, 1 status, so we can use one bit to store, 64 bits Can store 64 operation behavior tags.
Other storage
key-value database
Our needs actually do not require too many relational database functions. A simple k-v database can realize our functions and will also improve performance. After all, the improvement will be faster if you do less.
Regardless of whether to choose memory-based or non-memory (you can choose according to actual needs, or hot data can be in memory and silent data can be in non-memory), assuming we have enough space to store it.
Option 1:
Using uid+oid as the key, the value can store the status or only whether to participate (0 and 1), but there will be too many keys, especially when the amount of data is extremely large. The number of uids * the number of oids may be of a magnitude that you cannot imagine.
Option 2:
Generally speaking, the amount of data on user operation behavior is completely smaller than the user's magnitude, and the data on user operation behavior is controllable. If we want to reduce the number of keys, we can use oid + user partition index id as the key. The so-called user partition index here refers to dividing users into a certain number of areas, and all users are recorded in this interval, such as 10000 is an interval, then users with uid 1 to 9999 are divided into interval 0. Here, 1 and 0 can be used to store whether the user has performed this operation. The value corresponding to a key is initialized to store 10000 0s. When the user with uid=100 performs an operation, the 100th 0 is set to 1.
Option 3:
Based on option 2, convert 10,000 0s into 10,000 01 bits. Assuming that one bit stores 50 bits, only 200 are needed in total.
Option 4:
When the number of users is extremely large, most users may not participate in a certain operation. Based on option 3, we add simple sparse matrix compression to each storage bit. Add an index and store the value only when it is not 0.
Option 5:
I haven’t thought of it yet, I look forward to your sharing
Summary
• As the amount of data increases, the general idea is to divide it into When the table cannot be processed, divide the database when one library cannot be processed. When a machine cannot be processed, add a machine.
• Cost and demand need to be considered when choosing different storage media, and all choices are the result of a balance.
• Save space, store by bits.
•Don’t optimize prematurely.