Actually, this is not a locking problem but a data-distribution problem. Locking is there to prevent dirty data under high concurrency; what you really want is for data that has already been processed, or already claimed by another thread, not to be processed again, right?
How to distribute the data and improve cluster (or multi-thread) processing efficiency should be decided together with your data model.
For example, if each record has a numeric ID and you currently have 10 machines or 10 threads, each of them can read 1/10 of the data by taking the remainder of the ID (% 10): the first machine reads rows whose id % 10 == 0, the second reads rows whose id % 10 == 1, and so on up to remainder 9.
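A minimal sketch of that modulo split, assuming a table named task with a numeric id and a status column (the table, columns, and JDBC URL are all made up for illustration):

```java
import java.sql.*;

// Each worker reads only its own shard via MOD(id, shardCount).
// Table name, columns, and the JDBC URL are hypothetical.
public class ShardedReader {
    public static void main(String[] args) throws SQLException {
        int shardCount = 10;                         // total machines/threads
        int shardIndex = Integer.parseInt(args[0]);  // this worker's slot, 0..9

        String sql = "SELECT id FROM task WHERE MOD(id, ?) = ? AND status = 0";
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/demo", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, shardCount);
            ps.setInt(2, shardIndex);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    process(rs.getLong("id")); // no two shards ever see the same id
                }
            }
        }
    }

    static void process(long id) { /* handle one row */ }
}
```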
You could try a queue: scan the whole table with a single thread and push the pending rows into the queue, then consume it with multiple threads. Since dequeuing is itself atomic, duplicate reads are prevented and performance holds up (especially with Redis). And if a single-threaded table scan on the producer side isn't fast enough, you can have several producer threads each read their modulo share and push into the queue.
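Here is a rough in-JVM version of that producer/consumer setup using java.util.concurrent; with Redis you would replace the queue with LPUSH on the producer and BRPOP on the consumers, which are likewise atomic:

```java
import java.util.concurrent.*;

// Single producer scans the table and enqueues ids; many consumers take() them.
// take()/put() are atomic, so no id is ever handed to two workers.
public class QueueDemo {
    public static void main(String[] args) {
        BlockingQueue<Long> queue = new LinkedBlockingQueue<>(10_000);
        ExecutorService consumers = Executors.newFixedThreadPool(10);

        // Producer: stands in for the single-threaded table scan.
        Thread producer = new Thread(() -> {
            for (long id = 1; id <= 1_000_000; id++) {
                try {
                    queue.put(id); // blocks when the queue is full: free back-pressure
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        producer.start();

        // Consumers: each take() removes the id for everyone else.
        for (int i = 0; i < 10; i++) {
            consumers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        process(queue.take());
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }

    static void process(long id) { /* handle one row */ }
}
```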
I don't follow. You can't load tens of millions of rows into the cache in one go!
I also noticed you said every row in the table carries a flag. Just check that flag before modifying the row. When you change a row, all you need is to make the Java-side operations on its data bean (whether one step or several) atomic. Once you've changed it, any other thread that fetches the row sees what you wrote.
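One common way to make that check-then-modify step atomic is to fold the flag check into the UPDATE itself, so the database arbitrates which thread wins. A sketch, with a hypothetical status column (0 = pending, 1 = taken):

```java
import java.sql.*;

// Claim a row by flipping its flag in a single atomic UPDATE.
// Column names and status values are hypothetical.
public class RowClaimer {
    static boolean tryClaim(Connection conn, long id) throws SQLException {
        String claim = "UPDATE task SET status = 1 WHERE id = ? AND status = 0";
        try (PreparedStatement ps = conn.prepareStatement(claim)) {
            ps.setLong(1, id);
            // Exactly one thread can flip 0 -> 1; everyone else sees 0 rows updated.
            return ps.executeUpdate() == 1;
        }
    }
}
```

A worker only processes the row when tryClaim returns true; false means another thread got there first.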
For updates over large amounts of data, JDBC batch operations are faster.
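A minimal JDBC batch sketch (table and column names are again hypothetical); executing in chunks and committing once, instead of once per row, is what makes it fast:

```java
import java.sql.*;
import java.util.List;

// JDBC batch update sketch; table, columns, and the status value are hypothetical.
public class BatchUpdater {
    static void markDone(Connection conn, List<Long> ids) throws SQLException {
        String sql = "UPDATE task SET status = 2 WHERE id = ?";
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int n = 0;
            for (long id : ids) {
                ps.setLong(1, id);
                ps.addBatch();
                if (++n % 1000 == 0) {
                    ps.executeBatch(); // flush every 1000 rows to bound memory
                }
            }
            ps.executeBatch(); // flush the remainder
            conn.commit();     // one commit for the whole batch
        }
    }
}
```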
Use a queue such as RabbitMQ with the producer-consumer model.
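A bare-bones sketch with the RabbitMQ Java client (amqp-client); the host and queue name are placeholders. The broker delivers each message to exactly one consumer, which is what prevents duplicate processing:

```java
import com.rabbitmq.client.*;
import java.nio.charset.StandardCharsets;

// Producer and consumer over RabbitMQ; host and queue name are placeholders.
public class MqDemo {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            channel.queueDeclare("tasks", true, false, false, null);

            // Producer side: one message per pending row id.
            channel.basicPublish("", "tasks", null,
                    "42".getBytes(StandardCharsets.UTF_8));

            // Consumer side: prefetch 1 so a busy worker isn't sent more work.
            channel.basicQos(1);
            DeliverCallback onMessage = (tag, delivery) -> {
                long id = Long.parseLong(
                        new String(delivery.getBody(), StandardCharsets.UTF_8));
                // process(id) ...
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            };
            channel.basicConsume("tasks", false, onMessage, tag -> {});
            Thread.sleep(1_000); // let the async consumer run before closing
        }
    }
}
```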