August 25, 2020
Decision Trees
Decision Trees are supervised learning models that learn patterns in the features of data points in order to predict their labels.
The tree is built by splitting the data into smaller groups based on a feature, then repeating the process on each group until a stopping condition is reached.
Gini impurity measures how mixed the labels in a set of data points are: it equals 1 minus the sum of the squared proportion of each class, and the lower the value, the purer the set.
If a dataset had three items of class A and one item of class B, the Gini impurity of the set would be
1 - (3/4)^2 - (1/4)^2 = 0.375
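As a minimal sketch of that calculation in plain Python (the gini_impurity function name and the example labels are just illustrative):

from collections import Counter

def gini_impurity(labels):
    # Gini impurity = 1 - sum of (proportion of each class)^2
    counts = Counter(labels)
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in counts.values())

# Three items of class A and one item of class B, as in the example above
print(gini_impurity(["A", "A", "A", "B"]))  # 0.375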
Information gain measures how much a split reduces impurity: it is the Gini impurity of the data before the split minus the weighted Gini impurity of the groups after the split. Splitting on the feature with the highest information gain keeps the tree's impurity low.
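A rough sketch of weighted information gain, reusing the gini_impurity function above (the function name and the example split are illustrative):

def information_gain(parent_labels, subsets):
    # Impurity before the split minus the weighted impurity of the groups after the split
    total = len(parent_labels)
    weighted_impurity = sum(
        len(subset) / total * gini_impurity(subset) for subset in subsets
    )
    return gini_impurity(parent_labels) - weighted_impurity

# Splitting ["A", "A", "A", "B"] into a pure group and a mixed group
print(information_gain(["A", "A", "A", "B"], [["A", "A"], ["A", "B"]]))  # 0.125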
Decision Trees in scikit-learn
The sklearn.tree module contains the DecisionTreeClassifier class. To create a DecisionTreeClassifier object, call the constructor:
classifier = DecisionTreeClassifier()
Create the classifier object
Fit the model to the data with .fit(training_data, training_labels)
Predict classifications with .predict() on an array of data points (test data)
Score the model's accuracy with .score(test_data, test_labels)
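Putting those steps together, here is a rough sketch; the iris dataset and the train/test split are just stand-ins for the training_data, training_labels, test_data, and test_labels mentioned above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; any numerical feature matrix and label array would work
data, labels = load_iris(return_X_y=True)
training_data, test_data, training_labels, test_labels = train_test_split(
    data, labels, random_state=1
)

classifier = DecisionTreeClassifier()
classifier.fit(training_data, training_labels)        # fit on the training data
predictions = classifier.predict(test_data)           # predict labels for the test points
accuracy = classifier.score(test_data, test_labels)   # fraction of correct predictions
print(accuracy)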
scikit-learn's decision trees work on numerical features, so when creating a decision tree it's a good idea to map strings to numerical values first.
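For example, a categorical feature could be mapped with a plain dictionary (the feature and its values here are made up for illustration):

color_map = {"red": 0, "green": 1, "blue": 2}
colors = ["red", "blue", "green", "red"]
encoded = [color_map[color] for color in colors]  # [0, 2, 1, 0]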
In general, larger trees tend to overfit the data more. One way to mitigate this is to prune the tree, for example by setting the max_depth parameter when constructing the classifier.
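For example, the classifier below is limited to a depth of 5 (the value 5 is arbitrary and would normally be tuned); it reuses the training and test split from the sketch above.

pruned_classifier = DecisionTreeClassifier(max_depth=5)  # tree can be at most 5 levels deep
pruned_classifier.fit(training_data, training_labels)
print(pruned_classifier.score(test_data, test_labels))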
Good decision trees have pure leaves. A leaf is pure if all of the data points in it have the same label.
- Gini impurity = 0