I have a large table with over 3B rows and my query now takes 3 days to execute.
This is my query:
insert into frequencywords (word, frequency, filename) select word, count(*) as frequency, max(filename) from allwordstemp group by word
Basically, my query is to group by word from allwordstemp table, and I want to know the filename when frequency = 1, that's why I added max(filename), Because filename needs to be included in the aggregate function, such as max. If frequency > 1, I don't need the value of filename either. Both tables have 2 indexes on word and filename.
allwordstemp table (filename is an id file):
CREATE TABLE `allwordstemp` ( `word` varchar(45) DEFAULT NULL, `filename` int(11) DEFAULT NULL, KEY `idx_allwordstemp_word` (`word`), KEY `idx_allwordstemp_filename` (`filename`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
frequencywords table:
CREATE TABLE `frequencywords` ( `word` varchar(45) DEFAULT NULL, `frequency` int(11) DEFAULT NULL, `filename` int(11) DEFAULT NULL, KEY `idx_frequencywords_word` (`word`), KEY `idx_frequencywords_frequency` (`frequency`), KEY `idx_frequencywords_filename` (`filename`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Explain selection:
---- ------------- -------------- -- ---------- ------- --------------- ------------------ ----- --------- ------ ------------ ---------- -------- ----- | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | ---- ------------- -------------- ------------ ------- --------------- ----------------------- ---------- --- ------------------------------------- | 1 | SIMPLE | allwordstemp | NULL | index | NULL | idx_allwordstemp_word | 138 | NULL | 3487864881 | 100.00 | Using index | ---- ------------- -------------- ------------ ------- --------------- ----------------------- ---------- --- --- ---------------- ---------- -------------
How can I make my query faster?
Your filter appears to be based on frequency, not words or filenames. So I would first index all the words by frequency.
Then, assuming frequency is an integer, I would add a WHERE clause like this:
The above will give you a list of all words that appear only once in various file names.
Hope this helps, and all the best!