The decision tree algorithm belongs to the family of supervised learning algorithms and works with both continuous and categorical output variables. It is commonly used to solve classification and regression problems.
A decision tree is a flowchart-like tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label.
The algorithm rests on a few assumptions: the entire training set is initially treated as the root; attributes are assumed to be categorical when using information gain and continuous when using the Gini index; records are distributed recursively based on attribute values; and statistical measures are used to order attributes as the root or internal nodes.

Construction then proceeds as follows:

1. Find the best attribute and place it at the root node of the tree.
2. Split the training set into subsets such that each subset contains records with the same value for an attribute.
3. Repeat steps 1 and 2 on each subset until leaf nodes are reached in all branches of the tree.
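The "find the best attribute" step above can be sketched with a small information-gain calculation. This is an illustrative toy, not the article's implementation; the function names (`entropy`, `information_gain`) and the sample data are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Reduction in entropy from splitting rows on one categorical attribute."""
    total = len(labels)
    base = entropy(labels)
    # Group the labels by each distinct value of the chosen attribute.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return base - weighted

# Toy data: attribute 0 perfectly separates the classes, attribute 1 does not.
rows = [("sunny", "hot"), ("sunny", "cool"), ("rain", "hot"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
best = max(range(2), key=lambda i: information_gain(rows, labels, i))
print(best)  # attribute 0 is chosen for the root
```

The attribute with the highest gain becomes the split at the current node, and the same computation is repeated on each subset.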
Implementing the classifier requires two phases: construction and operation.
In the construction phase, preprocess the dataset, split it into training and test sets using the Python sklearn package, and train the classifier.
In the operation phase, make predictions and calculate the accuracy.
Data import: to import and manipulate the data, we use the pandas package provided in Python.
Here, we fetch the dataset directly from the UCI site via its URL, with no separate download step. When you run this code on your system, make sure it has an active Internet connection.
Since the dataset is separated by ",", we must pass the sep parameter the value ",".
Another thing to note is that the dataset has no header row, so we pass the header parameter the value None. If we did not pass the header parameter, pandas would treat the first row of the dataset as the header.
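To sketch how sep and header behave, here is the same read_csv call applied to a few inline sample rows; the rows are a hypothetical stand-in for the UCI file (whose URL is not reproduced here), formatted in the same comma-separated, headerless layout.

```python
import io
import pandas as pd

# Sample rows in the same layout as the balance-scale file: class label
# first, then four attribute columns. In the article, a URL string is
# passed to read_csv instead of this buffer.
raw = "B,1,1,1,1\nR,1,1,1,2\nL,2,1,1,1\n"

# sep="," matches the delimiter; header=None tells pandas the file has no
# header row, so the first data row is not swallowed as column names.
balance_data = pd.read_csv(io.StringIO(raw), sep=",", header=None)
print(balance_data.shape)  # (3, 5)
```

With header=None, pandas auto-assigns integer column names 0 through 4.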
Data slicing: before training the model, we must split the dataset into training and test sets.
To split the dataset for training and testing, we use the sklearn function train_test_split.
First, we have to separate the target variable from the attributes in the dataset.
X = balance_data.values[:, 1:5]
Y = balance_data.values[:, 0]
The above are the lines of code that separate the dataset. Variable X contains the attributes, while variable Y contains the target variable of the dataset.
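On a small array, this slicing looks like the following; the sample values array is invented for illustration, standing in for balance_data.values.

```python
import numpy as np

# Tiny stand-in for balance_data.values: column 0 is the class label,
# columns 1-4 are the attributes.
values = np.array([
    ["B", 1, 1, 1, 1],
    ["R", 1, 1, 1, 2],
    ["L", 2, 1, 1, 1],
], dtype=object)

X = values[:, 1:5]  # all rows, columns 1 through 4: the attributes
Y = values[:, 0]    # all rows, column 0: the target variable

print(X.shape)   # (3, 4)
print(list(Y))   # ['B', 'R', 'L']
```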
The next step is to split the dataset for training and testing purposes.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100)
The previous line splits the dataset for training and testing. Since we are splitting the dataset at a ratio of 70:30 between training and testing, we pass the value of test_size parameter as 0.3.
The random_state variable is the pseudo-random number generator state used for random sampling.
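Putting the construction and operation phases together, here is a minimal end-to-end sketch. To keep it runnable without a network connection, it substitutes sklearn's built-in iris dataset for the balance-scale data; criterion="entropy" is one choice (splitting by information gain), with "gini" being sklearn's default.

```python
from sklearn.datasets import load_iris  # stand-in for the balance-scale data
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Construction phase: load data, split 70:30, train the classifier.
X, Y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100)

# criterion="entropy" selects splits by information gain.
clf = DecisionTreeClassifier(criterion="entropy", random_state=100)
clf.fit(X_train, y_train)

# Operation phase: make predictions and calculate accuracy.
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

The same pipeline applies to the balance-scale data by replacing the load_iris call with the X and Y slices shown earlier.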
The above is the detailed content of The principle and implementation method of implementing decision tree algorithm in Python. For more information, please follow other related articles on the PHP Chinese website!