Is it wise to create different tables for the same data for each date?

Question

I have a MYSQLInnoDB table with the following columns (table and column names changed): date (PK, date) var_a (PK, FK, INT) var_b (PK, FK, INT) rel_ab (decimal) where rel_ab is the description Column of relationship between 2 variables var_a and var_b given date. (var_a and var_b refer to different tables) Data is uploaded in batches every day, totaling about 7 million rows per day. The problem was, after just a few weeks, it started taking hours to upload each new daily batch. obviously

P粉781235689 · Answer

There are some potential solutions that can help you improve the upload speed of your MySQL tables:

Delete the indexes on var_a and var_b: Since you are not using these indexes to speed up queries, deleting them can help speed up the upload process. However, if you use foreign key constraints, it is generally recommended to keep indexes on the columns that belong to the foreign key.

Partition the table by date: Partitioning helps improve query performance because it allows the database to scan only the relevant partitions for a given query. However, it also makes maintenance and backups more complex, which may not be necessary if your queries are already performing well.

Use bulk insert method: Instead of using df.to_sql to insert individual rows, you can try using a bulk insert method such as LOAD DATA INFILE or the MySQL bulk insert API. This is faster than inserting individually, especially if you can upload the data in batches rather than one row at a time.

Use a different compression algorithm: You are currently using zlib compression, but there are other compression algorithms that may be faster or more efficient for your data. You can try trying different compression options to see if they improve upload speeds.

Increase server resources: If you have the budget and resources, upgrading server hardware or increasing the number of servers may help increase upload speeds. This may not be a viable option for everyone, but it's worth considering if you've exhausted your other options.

As far as your suggested option is concerned, removing the foreign key constraints may cause data integrity issues, so I don't recommend this approach. If your query is already experiencing performance issues, partitioning by date may be a good solution, but if your query is already running quickly, it may not be necessary.

P粉098979048 · Answer

To speed up uploads, delete them. Seriously, if the only thing you're doing is getting exactly what's in a file for a certain date, why put the data into a table? (Your comment points out that a single file is actually several files. It might be a good idea to combine them first.)

If you do need the data in the table, let's discuss these...

Before determining the index, we must look at all major queries.
The order of columns in the PK is important for both loading and querying.
Partitioning may help with loading, but is unlikely to help with querying. Exceptions: Do you delete "old" data?
Please provide Show creation table; the content you provide may miss some subtleties.
How is loading done? A huge Loading data? Hopefully not inserting one row at a time. I don't know how pandas works. (Nor do you know how the 99 other packages that "simplify" MySQL access work.) Please understand what it does behind the scenes. You may have to bypass Pandas to get better performance. Bulk loading is at least 10 times faster than row-by-row loading.
I haven't seen the need for a temporary table while loading. Maybe. If you remove the FK (as you suggested), you can execute queries to verify the existence of var_a and var_b in other tables. That's "analog FK".
If feasible, sort incoming data based on PK. (This may be the root cause of slow economic growth.)
Are there any auxiliary keys? They affect loading speed.
I think your FK implies indexes on other tables.
Are you adding new rows to other tables?
"rel_ab (DECIMAL)" - how many decimal places? What is the exact statement? If it's some kind of measurement, have you considered FLOAT?
Now there are many rows in other tables. That is, do you really need a 4-byte INT to reference them? Switch to 3 bytes MEDIUMINT [UNSIGNED] Save at least 7MB per day.
How do you handle the 7 million rows in that SELECT?
No compression. InnoDB is very inefficient. Only one of the 4 columns may be compressible. Compression requires additional buffer_pool_space. Compression uses a lot of CPU. For InnoDB, 2x shrinkage is typical.

Multiple "identical" tables are always unwise. A table is always better. However, as suggested above, a zero table is still better.