August 25, 2020

Decision Trees

Decision Trees are supervised learning models that find patterns in the features of data points and use those patterns to predict labels.

A tree is built by splitting the data into smaller groups based on a feature, then repeating the split on each group until a stopping point is reached.

Gini Impurity measures how mixed the labels are in a set of data points at a node of the tree; the lower the number, the better.

If a dataset had three items of class A and one item of class B, the Gini Impurity of the set would be
1 - (3/4)^2 - (1/4)^2 = 0.375
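
A small sketch of that calculation in Python (the gini_impurity function name is my own):

from collections import Counter

def gini_impurity(labels):
    # 1 minus the sum of squared class proportions
    counts = Counter(labels)
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in counts.values())

# Three items of class A and one of class B, as in the example above
print(gini_impurity(["A", "A", "A", "B"]))  # 0.375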

Information Gain compares the Gini Impurity of the data before and after a split and helps decide which feature to split on: the split that produces the largest drop in impurity is preferred.
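
A rough sketch, reusing the gini_impurity function above (the information_gain helper name is my own): the gain is the impurity of the parent set minus the weighted impurity of the subsets after the split.

def information_gain(parent_labels, subsets):
    # Impurity of the parent set minus the weighted impurity of the child subsets
    total = len(parent_labels)
    weighted = sum(len(subset) / total * gini_impurity(subset) for subset in subsets)
    return gini_impurity(parent_labels) - weighted

# A split that separates [A, A, A, B] into two pure groups removes all impurity
print(information_gain(["A", "A", "A", "B"], [["A", "A", "A"], ["B"]]))  # 0.375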

Decision Trees in scikit-learn
The sklearn.tree module contains the DecisionTreeClassifier class. To create a DecisionTreeClassifier object, call the constructor:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()

  1. Create classifier object

  2. Fit the model to the data with parameters training_data and training_labels

  3. Predict classifications for an array of data points (the test data)

  4. Score the model using test_data and test_labels for accuracy (the full workflow is sketched below)
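
A minimal sketch of those four steps; the iris dataset is used here only as a stand-in for any numerical feature matrix and labels:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

features, labels = load_iris(return_X_y=True)
training_data, test_data, training_labels, test_labels = train_test_split(
    features, labels, random_state=1)

classifier = DecisionTreeClassifier()            # 1. create the classifier object
classifier.fit(training_data, training_labels)   # 2. fit on the training set
predictions = classifier.predict(test_data)      # 3. predict labels for the test data
print(classifier.score(test_data, test_labels))  # 4. accuracy on the test set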

When creating a decision tree in scikit-learn, string features need to be mapped to numerical values, since the classifier only works with numerical data.
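
For example (the feature and its values here are made up), an ordered categorical feature could be mapped to numbers before training:

safety_map = {"low": 0, "med": 1, "high": 2}
raw_feature = ["low", "high", "med", "high"]
numerical_feature = [safety_map[value] for value in raw_feature]
print(numerical_feature)  # [0, 2, 1, 2]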

In general, larger trees tend to overfit the data more. One way to reduce overfitting is to prune the tree, for example by setting the max_depth parameter when constructing the classifier.
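
A sketch of that, reusing the training and test split from the workflow above and comparing test accuracy at a few depths:

for depth in (1, 3, 5, None):  # None lets the tree grow until every leaf is pure
    pruned = DecisionTreeClassifier(max_depth=depth, random_state=1)
    pruned.fit(training_data, training_labels)
    print(depth, pruned.score(test_data, test_labels))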

Good Decision Trees have pure leaves. A leaf is pure if all of the data points in that leaf have the same label.
- Gini Impurity = 0


August 24, 2020