We have large tables with millions of entries. A full count is very slow; see the queries below. Is this typical for MySQL InnoDB tables? Is there a way to speed it up? Even with the query cache, it's still "slow". I'd also like to know why counting the "communication" table with 2.8 million entries is slower than counting the "transaction" table with 4.5 million entries.
I know it would be faster with a WHERE clause; I'm just wondering whether this poor performance is normal.

We use Amazon RDS MySQL 5.7 on an m4.xlarge (4 vCPUs, 16 GB RAM, 500 GB storage). I've also tried larger instances with more CPU and RAM, but the query times didn't change much.
mysql> SELECT COUNT(*) FROM transaction;
+----------+
| COUNT(*) |
+----------+
|  4569880 |
+----------+
1 row in set (1 min 37.88 sec)

mysql> SELECT COUNT(*) FROM transaction;
+----------+
| COUNT(*) |
+----------+
|  4569880 |
+----------+
1 row in set (1.44 sec)

mysql> SELECT COUNT(*) FROM communication;
+----------+
| COUNT(*) |
+----------+
|  2821486 |
+----------+
1 row in set (2 min 19.28 sec)
This behavior is a consequence of Multiversion Concurrency Control (MVCC).
InnoDB allows your queries to be isolated within transactions without blocking other concurrent clients that are reading and writing data rows. These concurrent updates do not affect your transaction's view of the data.
But then, what is "the" number of rows in the table, given that rows may be added or deleted while the count is in progress? The answer is fuzzy.

Your transaction should not be able to "see" row versions created after the transaction started. Likewise, your transaction should still count rows that another client deleted, if they did so after your transaction started.
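This can be seen in a sketch with two sessions. The `amount` column is just a placeholder, and this assumes InnoDB's default REPEATABLE READ isolation level:

```sql
-- Session A
START TRANSACTION;
SELECT COUNT(*) FROM transaction;   -- e.g. returns 4569880

-- Session B (runs while A's transaction is still open)
INSERT INTO transaction (amount) VALUES (42);
COMMIT;

-- Session A again
SELECT COUNT(*) FROM transaction;   -- still 4569880: B's new row is invisible to A
COMMIT;
SELECT COUNT(*) FROM transaction;   -- a fresh transaction now sees the new row
```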
The answer is that when you run SELECT COUNT(*), or any other query that has to examine many rows, InnoDB must visit each row, determine which version of it is visible in your transaction's view of the database, and count it if it is visible.

In storage engines that do not support transactions or concurrent updates (such as MyISAM), the engine keeps the total row count as metadata for the table. Such an engine cannot have multiple threads updating rows at the same time, so the row total is unambiguous. A SELECT COUNT(*) on a MyISAM table therefore just returns the stored row count. (That shortcut does not help if you use a WHERE clause to count some subset of rows; in that case the engine really does have to count them.)
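If only an approximate figure is needed, InnoDB's table statistics can be read without scanning any rows. Note that TABLE_ROWS is only an estimate for InnoDB and can be off considerably; it is exact only for engines like MyISAM:

```sql
SELECT TABLE_NAME, TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME IN ('transaction', 'communication');
```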
In general, most people consider InnoDB's support for concurrent updates well worth giving up a fast SELECT COUNT(*).

In addition to what Bill said...
Smallest index
InnoDB picks the "smallest" index to perform COUNT(*). It may be that every index on communication is bigger than the smallest index on transaction, hence the time difference. When judging the size of an index, remember that the PRIMARY KEY column(s) are appended to every secondary index:

The PRIMARY KEY index is big, since (due to clustering) it contains all the columns of the table.
An INDEX(flag) is tiny: roughly 5 bytes per row.
An INDEX(name) might average dozens of bytes per row.

SELECT COUNT(*) will deliberately pick the smallest index. Apparently transaction has a "small" index, but communication does not.

TEXT and BLOB columns are sometimes stored "off-record", so they do not count toward the size of the PRIMARY KEY index.
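You can check which index the optimizer picks with EXPLAIN, and if communication lacks a narrow secondary index, adding one gives COUNT(*) a much smaller structure to scan. The `flag` column name here is illustrative; any small column will do:

```sql
-- Which index does the optimizer pick for the count?
EXPLAIN SELECT COUNT(*) FROM communication;

-- Add a narrow secondary index (assuming a small column such as a 1-byte flag exists):
ALTER TABLE communication ADD INDEX idx_flag (flag);
```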
If "Query Cache" is turned on, the second run of the querymaybe much faster than the first. But this only happens if the table has not changed. QC is rarely useful in production systems because any change to a table invalidates all QC entries for that table. By "faster" I mean about 0.001 seconds; not 1.44 seconds.
The difference between 1 min 38 s and 1.44 s is probably due to what was cached in the buffer_pool (InnoDB's general-purpose cache). The first run probably found none of the "smallest" index in RAM, so it did a lot of I/O, taking 98 seconds to fetch all 4.5 million rows of that index. The second run found all that data cached in the buffer_pool, so it ran at CPU speed (no I/O) and was much faster.
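To check whether the indexes can stay cached, you can compare the buffer pool size against the data and index sizes reported by information_schema (both size figures are estimates):

```sql
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

SELECT TABLE_NAME, DATA_LENGTH, INDEX_LENGTH
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME IN ('transaction', 'communication');
```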
Good enough
In this case, I would step back and question the need for an exact COUNT(*) at all. Note how you wrote "2.8 mio entries", as if two significant figures were "good enough". If you are showing the count to users in a UI, wouldn't that be "good enough" there too? If so, one solution for the performance problem is to compute the count once a day and store it somewhere; that gives instant access to a "good enough" value.

There are other techniques as well. One is to keep a running counter up to date, in the application or in some form of summary table, as rows are added and deleted.
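A minimal sketch of the "count once a day and store it" idea, using the MySQL event scheduler. The table, column, and event names are illustrative, and the scheduler must be enabled (SET GLOBAL event_scheduler = ON):

```sql
CREATE TABLE row_counts (
  table_name VARCHAR(64) PRIMARY KEY,
  row_count  BIGINT NOT NULL,
  counted_at DATETIME NOT NULL
);

CREATE EVENT refresh_row_counts
ON SCHEDULE EVERY 1 DAY
DO
  REPLACE INTO row_counts
  SELECT 'communication', COUNT(*), NOW() FROM communication;
```

The UI then reads row_counts instead of running a live count.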
Throw hardware at it
You've found that throwing hardware at it doesn't help much. A SELECT COUNT(*) runs in a single thread, so extra CPUs don't speed it up, and once the relevant index fits in the buffer_pool, extra RAM doesn't either. (Too little RAM, on the other hand, would make even the second run slow, because the index could not stay cached.)