Question 1:
I have a dataset of more than 400,000 records and need to build a model for it with some machine learning classification algorithm. The problem I've run into is that the data is too large to read into memory at once, so how should I process it?
Question 2:
I have a question about cross-validation in sklearn. Say I have 10,000 training samples. Following the cross-validation principle, the KFold method splits them into n groups for training (with the training portion being 0.7 of the data). What I don't understand is this: I call fit() on the first group's training set, then run prediction on its test set and get an accuracy score. But what is that accuracy actually used for? Does it affect the next round of training? And is the model trained in one round reused by the next call to fit()?
I've recently been studying the data mining and analysis side of big data. For question 1, here is an idea for your reference: since the data can't be read all at once, you could build a distributed data model and read it in batches. Assign each batch an address, the datanode (which could be a variable name), and build a namenode (a table mapping names to those addresses). Then, when you need data, first look up the address in the namenode (i.e. which variable holds the data you need), and only then access that address to fetch and process the data. I'm a beginner, so this is just my personal idea; it's not the only answer, for reference only.
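A simpler, commonly used variant of the batched-reading idea above is to stream the data in chunks and update an incremental model with `partial_fit` (sklearn's `SGDClassifier` supports this). A minimal sketch, where the chunk generator stands in for reading a real file (e.g. `pandas.read_csv(..., chunksize=...)`) and the data/labels are synthetic placeholders:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def chunk_stream(n_chunks=5, chunk_size=200, n_features=10, seed=0):
    """Stand-in for reading a large file chunk by chunk.
    Labels here follow a toy rule (sign of feature 0)."""
    rng = np.random.RandomState(seed)
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, n_features)
        y = (X[:, 0] > 0).astype(int)
        yield X, y

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared on the first partial_fit
for X, y in chunk_stream():
    clf.partial_fit(X, y, classes=classes)
```

Only one chunk is ever in memory at a time, so the total dataset size is no longer a constraint.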
400,000 records isn't much; a few GB at most...
If your memory really is so small that you don't even have 8 GB, then it depends on your specific scenario. For example, if you're just computing tf-idf, use a generator so that only the final tf-idf dictionary is held in memory.
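A minimal sketch of that generator idea: stream documents one at a time so only the running counts (and, at the end, the tf-idf-style dictionary) live in memory. The in-line corpus is a hypothetical placeholder; in practice the generator would yield lines from a file.

```python
import math
from collections import Counter

def doc_stream():
    corpus = ["the cat sat", "the dog sat", "the cat ran"]
    for line in corpus:          # in practice: for line in open(path)
        yield line.split()

df = Counter()                   # document frequency per term
n_docs = 0
for tokens in doc_stream():
    n_docs += 1
    df.update(set(tokens))       # count each term once per document

# inverse document frequency; this small dict is all that remains in memory
idf = {t: math.log(n_docs / c) for t, c in df.items()}
```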
Cross-validation is just for picking the model with the smallest error. The "earlier rounds affecting later rounds" you describe is the idea behind boosting, not cross-validation.
On Q&A sites like this it's best to have one question per post; if necessary, post two separate questions and link them to show they're related, so as to avoid a double-barreled question.
(1) See "How to optimize for speed" and you'll find many ways to tune your experiments, including (a) preferring simple algorithmic tricks, (b) profiling memory usage and speed against your real workload, (c) trying to replace all nested loops with NumPy arrays, and (d) using a Cython wrapper to call more efficient C/C++ libraries when necessary. These are just basic principles and directions; in practice you still need to analyze where your problem's bottleneck is (speed or space), optimize the code first, and only then consider parallel computing and other such measures.
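Point (c) is worth a small illustration. A sketch comparing a nested-loop implementation with a NumPy broadcasting version, here for pairwise squared Euclidean distances (the function names and the example data are made up for the demo):

```python
import numpy as np

def pairwise_sq_dist_loops(X):
    """Nested Python loops: O(n^2) interpreter iterations."""
    n = len(X)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sum((X[i] - X[j]) ** 2)
    return out

def pairwise_sq_dist_vec(X):
    """Broadcasting: (n,1,d) - (1,n,d) -> (n,n,d), then sum over d."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)

X = np.random.RandomState(0).randn(50, 3)
```

Both return the same matrix, but the vectorized version pushes the loops into compiled C code, which is typically orders of magnitude faster for large n.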
(2) For this question you need to distinguish between what is required mathematically and what is required empirically. I hope you have a grasp of both the empirical and the mathematical meaning of overfitting and underfitting; the Q&A here on that topic is pretty good and worth a read.