Question 1:
I have a dataset of more than 400,000 records and need to build a model for it with some machine learning classification algorithm. The problem I've run into is that the data is too large to read into memory at once, so how should I process it?
Question 2:
I have a question about cross-validation in sklearn. Say I have 10,000 training samples. Following the cross-validation principle, the KFold method splits them into n groups for training (with the training portion being 0.7 of the data). What I don't understand is this: I call fit() on the first group's training set, then run prediction on its test set and get an accuracy score. But what is that accuracy actually used for? Does it affect the next round of training? And is the model trained in one round reused by the next call to fit()?
I've recently been studying the data mining and analysis side of big data. For question 1, here is an idea for your reference: since the data can't be read all at once, you could build a distributed data model and read it in batches. Assign each batch an address, the datanode (which could be a variable name), and build a namenode (a table mapping names to those addresses). Then, when you need data, first look up the address in the namenode (i.e. which variable holds the data you need), and only then access that address to fetch and process the data. I'm a beginner, so this is just my personal idea; it's not the only answer, for reference only.
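A simpler, commonly used variant of the batched-reading idea above is to stream the data in chunks and update an incremental model with `partial_fit` (sklearn's `SGDClassifier` supports this). A minimal sketch, where the chunk generator stands in for reading a real file (e.g. `pandas.read_csv(..., chunksize=...)`) and the data/labels are synthetic placeholders:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def chunk_stream(n_chunks=5, chunk_size=200, n_features=10, seed=0):
    """Stand-in for reading a large file chunk by chunk.
    Labels here follow a toy rule (sign of feature 0)."""
    rng = np.random.RandomState(seed)
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, n_features)
        y = (X[:, 0] > 0).astype(int)
        yield X, y

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared on the first partial_fit
for X, y in chunk_stream():
    clf.partial_fit(X, y, classes=classes)
```

Only one chunk is ever in memory at a time, so the total dataset size is no longer a constraint.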
400,000 records isn't much; a few GB at most...
If your memory really is so small that you don't even have 8 GB, then it depends on your specific scenario. For example, if you're just computing tf-idf, use a generator so that only the final tf-idf dictionary is held in memory.
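A minimal sketch of that generator idea: stream documents one at a time so only the running counts (and, at the end, the tf-idf-style dictionary) live in memory. The in-line corpus is a hypothetical placeholder; in practice the generator would yield lines from a file.

```python
import math
from collections import Counter

def doc_stream():
    corpus = ["the cat sat", "the dog sat", "the cat ran"]
    for line in corpus:          # in practice: for line in open(path)
        yield line.split()

df = Counter()                   # document frequency per term
n_docs = 0
for tokens in doc_stream():
    n_docs += 1
    df.update(set(tokens))       # count each term once per document

# inverse document frequency; this small dict is all that remains in memory
idf = {t: math.log(n_docs / c) for t, c in df.items()}
```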
Cross-validation is just for picking the model with the smallest error. The "earlier rounds affecting later rounds" you describe is the idea behind boosting, not cross-validation.
On Q&A sites like this it's best to have one question per post; if necessary, post two separate questions and link them to show they're related, so as to avoid a double-barreled question.
(1) See "How to optimize for speed" and you'll find many ways to tune your experiments, including (a) preferring simple algorithmic tricks, (b) profiling memory usage and speed against your real workload, (c) trying to replace all nested loops with NumPy arrays, and (d) using a Cython wrapper to call more efficient C/C++ libraries when necessary. These are just basic principles and directions; in practice you still need to analyze where your problem's bottleneck is (speed or space), optimize the code first, and only then consider parallel computing and other such measures.
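Point (c) is worth a small illustration. A sketch comparing a nested-loop implementation with a NumPy broadcasting version, here for pairwise squared Euclidean distances (the function names and the example data are made up for the demo):

```python
import numpy as np

def pairwise_sq_dist_loops(X):
    """Nested Python loops: O(n^2) interpreter iterations."""
    n = len(X)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sum((X[i] - X[j]) ** 2)
    return out

def pairwise_sq_dist_vec(X):
    """Broadcasting: (n,1,d) - (1,n,d) -> (n,n,d), then sum over d."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)

X = np.random.RandomState(0).randn(50, 3)
```

Both return the same matrix, but the vectorized version pushes the loops into compiled C code, which is typically orders of magnitude faster for large n.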
(2) For this question you need to distinguish between what is required mathematically and what is required empirically. I hope you have a grasp of both the empirical and the mathematical meaning of overfitting and underfitting; the Q&A here on that topic is pretty good and worth a read.