Python implements a decision tree algorithm

零到壹度
Release: 2018-04-19 16:50:33

This article walks through an implementation of the decision tree algorithm in Python, shared for your reference. The details are as follows:

from sklearn.feature_extraction import DictVectorizer
import csv
from sklearn import tree
from sklearn import preprocessing
# from sklearn.externals.six import StringIO  # unused below, and
# sklearn.externals.six has been removed from recent scikit-learn releases

# Read the csv data, storing the feature values in a list of dicts
# and the class labels in a list
allElectronicsData = open(r'AllElectronics.csv', 'rt')
reader = csv.reader(allElectronicsData)
headers = next(reader)
# The original code used:
# headers = reader.next()
# That was Python 2 syntax; in Python 3 the csv reader object no longer has
# a next() method, so the built-in next() function is used instead

# print(headers)

featureList = []
labelList = []

for row in reader:
    labelList.append(row[len(row) - 1])
    rowDict = {}
    for i in range(1, len(row) - 1):
        rowDict[headers[i]] = row[i]
    featureList.append(rowDict)

# print(featureList)
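# For reference, the loop above assumes the first column is an ID (skipped),
# the middle columns are categorical features, and the last column is the
# class label. An illustrative header (the column names are assumptions,
# following the classic AllElectronics example from data-mining textbooks):
# RID,age,income,student,credit_rating,class_buys_computer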


# Vectorize the feature values, i.e. turn the various categorical parameters
# into numeric vectors
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()

# print("dummyX: " + str(dummyX))
# print(vec.get_feature_names())
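# A minimal sketch of what DictVectorizer does (the feature values here are
# illustrative assumptions, not read from the csv): each feature/value pair
# becomes its own binary column, e.g.
#   DictVectorizer().fit_transform([{'age': 'youth'}, {'age': 'senior'}]).toarray()
# produces the columns ['age=senior', 'age=youth'] and the rows
#   [[0., 1.],
#    [1., 0.]]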

# print("labelList: " + str(labelList))

# Vectorize the class label list, which is the final result
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
# print("dummyY: " + str(dummyY))

# Classify using a decision tree
clf = tree.DecisionTreeClassifier()
# clf = tree.DecisionTreeClassifier(criterion = 'entropy')
clf = clf.fit(dummyX, dummyY)
# print("clf: " + str(clf))

# Visualize the model by exporting it in Graphviz dot format
with open("allElectrionicInformationOri.dot", 'w') as f:
    tree.export_graphviz(clf, feature_names=vec.get_feature_names(), out_file=f)
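# The .dot file can then be rendered with the Graphviz command-line tool, e.g.
#   dot -Tpdf allElectrionicInformationOri.dot -o allElectronicInformation.pdf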

oneRowX = dummyX[0, :]
# print("oneRowX: " + str(oneRowX))

# Next, change some of the data and make a prediction
newRowX = oneRowX.copy()  # copy() so the changes below do not also modify dummyX

newRowX[0] = 0
newRowX[1] = 1
print("newRowX: " + str(newRowX))

predictedY = clf.predict(newRowX.reshape(1, -1))  # the sample must be reshaped
# with reshape(1, -1) before calling predict; otherwise sklearn raises:
# ValueError: Expected 2D array, got 1D array instead:
# array=[0. 1. 1. 0. 1. 1. 0. 0. 1. 0.].
# Reshape your data either using array.reshape(-1, 1)
# if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
print("Predicted result: " + str(predictedY))
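Since dummyY came from the LabelBinarizer, the binarized prediction can be mapped back to the original string label; a minimal sketch continuing the code above:

# Convert the 0/1 prediction back to the original class label
predictedLabel = lb.inverse_transform(predictedY)
print("Predicted label: " + str(predictedLabel))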


The program classifies people by purchasing power, and the fitted model can then be used to make predictions on new samples, as the code above shows. The decision tree algorithm has some advantages and disadvantages.

Advantages of the decision tree algorithm:

  1) Simple and intuitive; the generated decision tree is easy to visualize and understand.

  2) Little preprocessing is needed: there is no need to normalize the data in advance or to handle missing values.

  3) The cost of making a prediction with a decision tree is O(log2 m), where m is the number of samples; for example, a balanced tree trained on a million samples needs only about log2(10^6) ≈ 20 comparisons per prediction.

  4) It can handle both discrete and continuous values, whereas many algorithms handle only one or the other.

  5) It can handle classification problems with multi-dimensional output.

  6) Compared with black-box classification models such as neural networks, a decision tree's logic can be explained clearly.

  7) Cross-validation pruning can be used for model selection, thereby improving generalization ability (see the sketch after this list).

  8) It tolerates outliers well and is highly robust.
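As mentioned in item 7, one simple form of this is to choose the tree's size by cross-validation. A minimal sketch with scikit-learn, reusing dummyX and dummyY from the code above (the candidate depths are illustrative assumptions):

from sklearn import tree
from sklearn.model_selection import cross_val_score

# Try several depth limits and keep the one with the best cross-validated score
for depth in (2, 3, 4, 5):
    clf = tree.DecisionTreeClassifier(max_depth=depth)
    scores = cross_val_score(clf, dummyX, dummyY.ravel(), cv=3)
    print(depth, scores.mean())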

Let’s look at the shortcomings of the decision tree algorithm:

  1) The decision tree algorithm overfits very easily, resulting in weak generalization ability. This can be improved by setting a minimum number of samples per node and limiting the depth of the tree (see the sketch after this list).

  2) A slight change in the samples can cause drastic changes in the tree structure. This can be addressed with methods such as ensemble learning.

  3) Finding the optimal decision tree is an NP-hard problem. Heuristic methods are usually used instead, and these can easily fall into local optima. This too can be improved through methods such as ensemble learning.

  4) It is difficult for decision trees to learn some more complex relationships, such as XOR. Decision trees offer no good way around this; such relationships are generally handled better by neural network classifiers.

  5) If certain features account for too large a proportion of the samples, the generated decision tree tends to be biased toward those features. This can be improved by adjusting the sample weights.
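For the overfitting problem in item 1, scikit-learn exposes these pre-pruning controls directly on the classifier. A minimal sketch (the parameter values are illustrative assumptions, not tuned):

from sklearn import tree

# Pre-pruning: cap the tree depth and require a minimum number of samples per
# leaf, so the tree cannot grow deep enough to memorize the training set
clf = tree.DecisionTreeClassifier(max_depth=4, min_samples_leaf=5)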

