mongodb - pymongo count is slow
PHPz
PHPz 2017-05-17 10:03:55

The collection holds thirty thousand documents, each containing a single random number: {"digit": <random number>}
Requirement: find the number that appears most often.
The collection is named table.

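import time

# `table` is the pymongo collection described above.

def main():
    # First pass: read every document and collect the distinct digits.
    digits = []
    for d in table.find():
        n = d['digit']
        digits.append(n)
    dig = set(digits)

    # Second pass: one count query per distinct digit.
    news = []
    i = 0
    for d in dig:
        c = table.find({"digit": d}).count()
        zz = (d, c)
        news.append(zz)
        print(i)
        i += 1

if __name__ == '__main__':
    start = time.time()
    main()
    print('Cost: {}'.format(time.time() - start))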

A single run takes about five or six minutes. Running it with 100 threads is not much faster, and the fan gets very loud...
What is the right way to do this?

All replies (1)
迷茫

The right way is to use aggregation.

db.table.aggregate([
    {$group: {_id: "$digit", count: {$sum: 1}}},    // count how many times each number appears
    {$sort: {count: -1}},    // sort by count, descending
    {$limit: 1}    // keep only the first record
]);

For how to use $group, refer to the documentation.
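
Since the question uses pymongo, the shell pipeline above could be run from Python roughly as follows. This is only a minimal sketch: the helper name `most_common_digit` is made up for illustration, and it assumes `collection` behaves like a pymongo collection, that is, it has an `aggregate()` method that accepts a list of stages.

```python
from typing import Any, Optional, Tuple

# Same three stages as the shell pipeline, written as Python dicts.
PIPELINE = [
    {"$group": {"_id": "$digit", "count": {"$sum": 1}}},  # count occurrences of each number
    {"$sort": {"count": -1}},                             # most frequent first
    {"$limit": 1},                                        # keep only the top record
]


def most_common_digit(collection) -> Optional[Tuple[Any, int]]:
    """Return (digit, count) for the most frequent digit, or None if the collection is empty.

    `collection` is assumed to be a pymongo Collection (hypothetical helper, not from the original post).
    """
    result = list(collection.aggregate(PIPELINE))
    if not result:
        return None
    top = result[0]
    return top["_id"], top["count"]
```

With a real connection this would be called as `most_common_digit(table)`, where `table` is the collection from the question. The counting then happens inside MongoDB in one request, instead of one query per distinct number.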
Note that a requirement like this rarely comes up in practice; it is presumably a practice exercise. Even with aggregation, MongoDB still has to traverse every document in the collection to find the most frequent number, so on a large collection this kind of full scan cannot be fast. Queries like this usually belong to OLAP scenarios, and OLAP usually does not have high speed requirements. So, in theory, the aggregation framework is the tool to use, but a real requirement would still need case-by-case analysis.
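
To illustrate the single-traversal point, here is a minimal client-side sketch that counts every number in one pass instead of issuing one query per distinct number. It is not the answer's method (the answer recommends aggregation on the server). The function name `most_common_digit_single_pass` is hypothetical, and it assumes every document has a "digit" key.

```python
from collections import Counter
from typing import Any, Iterable, Mapping, Optional, Tuple


def most_common_digit_single_pass(
    docs: Iterable[Mapping[str, Any]],
) -> Optional[Tuple[Any, int]]:
    """Count all digits in a single traversal and return (digit, count) for the most frequent one."""
    counts = Counter(doc["digit"] for doc in docs)  # one pass over the documents
    if not counts:
        return None
    return counts.most_common(1)[0]
```

With pymongo the input could be a cursor such as `table.find({}, {"digit": 1, "_id": 0})`. All documents still travel over the network, which is why the server-side aggregation is normally the better choice.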
