[ decision-trees  CART  ID3  supervised-learning  classification  machine-learning  numpy  scikit-learn  digits  python  ipynb  ]

Decision Trees for Multinomial Classification

Introduction

What is covered in this post:

Oftentimes, classification can be framed as a sort of question-answering system. Questions are asked about the input data that aid the model in determining a prediction. One example of a “question” that a model might ask is: does this input image contain this attribute value? Decision trees naturally structure this kind of if-then hierarchical decision-making by defining a series of questions that lead to a class label or value. In this article, we will explore two algorithms for constructing decision trees: the CART algorithm, which can produce either classification or regression trees, and the ID3 algorithm, which produces classification trees. We’ll start with the CART-based DecisionTreeClassifier from Scikit-learn and eventually work our way to the ID3 algorithm. While reading along, you will be able to implement your own ID3 algorithm from scratch using the code provided in this notebook. Let’s get started…

1. Built in Scikit-learn DecisionTreeClassifier

Objective #1: Use and experiment with the built-in Scikit-learn DecisionTreeClassifier (based on CART)

Scikit-learn digits dataset

1. Load the digits dataset from the datasets provided in Scikit-learn. Inspect the data. What is in there? For this classification task, we will be using the Scikit-learn digits dataset. Recall that the digits dataset consists of 1797 samples across 10 classes. Each sample is an 8x8 image of a single handwritten digit from 0 to 9. Each sample therefore has 64 features, where each of the 64 features is a brightness value of a pixel in the image.

from sklearn.datasets import load_digits
digits = load_digits()
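
Before inspecting the raw arrays, it can help to peek at what the returned Bunch object contains. A quick sketch (the exact keys vary slightly between Scikit-learn versions):

# Inspect the Bunch object returned by load_digits()
print(digits.keys())        # data, target, images, DESCR, ...
print(digits.data.shape)    # (1797, 64): one flat 64-feature vector per sample
print(digits.images.shape)  # (1797, 8, 8): the same samples as 8x8 images
print(digits.target.shape)  # (1797,): one class label (0-9) per sample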

Now that we’ve got our data loaded, let’s take a look at a sample of the images in the dataset.

digits.data
array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])

We can see from the above that each sample is stored as a flat array of 64 pixel values, each an integer between 0 and 16. To get a better look at the data visually, we can run the following code to preview a few samples of each class chosen at random.

import numpy as np
import matplotlib.pyplot as plt
def visualize_random(images, labels, examples_per_class=8):
    """
    Display a sample of randomly selected images per class
    """
    number_of_classes = len(np.unique(labels))
    
    for cls in range(number_of_classes):
        idxs = np.where(labels == cls)[0]
        idxs = np.random.choice(idxs, examples_per_class, replace=False)
        for i, idx in enumerate(idxs):
            plt.subplot(examples_per_class, number_of_classes, i * number_of_classes + cls + 1)
            plt.imshow(images[idx].astype('uint8'), cmap=plt.cm.gray_r, interpolation='nearest')
            plt.axis('off')
            if i == 0:
                plt.title(str(cls))
    plt.show()
visualize_random(digits.images, digits.target, examples_per_class=8)

Output 1. Random sample of Scikit-learn digits from each class 0-9.

2. Split your data set into 70% training data (features and labels), and 30% test data.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.3)
print("Number of training examples: ",len(X_train))
print("Number of test examples: ",len(X_test))
print("Number of total examples:", len(digits.data))
Number of training examples:  1257
Number of test examples:  540
Number of total examples: 1797

NOTE: We would typically rescale the images’ pixel values prior to running the data through our model. In the case of the digits data, the pixel values are in the range [0, 16]. In order to rescale the images, we would divide the pixel values by a factor equal to the maximum pixel value (16.0). This would transform every pixel value from the range [0, 16] to [0, 1].
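
For reference, such a rescaling would be a one-liner; we do not apply it here, for the reasons discussed below:

# Rescale pixel values from [0, 16] to [0, 1]
X_scaled = digits.data / 16.0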

Rescaling, in general cases, is important for two main reasons:

  1. Maintaining consistency across datasets whose images have differing pixel ranges, and
  2. Keeping a single learning rate effective across all features.

Without scaling, images with higher pixel ranges contribute more to the loss than those with lower ranges, and a single learning rate cannot compensate effectively when feature ranges differ widely or are large.

Now, with that out of the way: we will not be scaling the digits data for use with the ID3 algorithm. Decision trees are invariant to feature scaling, and in the case of the ID3 algorithm, scaling the image data to continuous values could actually result in overfitting (many more branch points at each split). Continuous attribute values like scaled pixel intensities are also more time-consuming to process with the ID3 algorithm (source: Wikipedia).

Scikit-learn DecisionTreeClassifier

3. Set up a DecisionTreeClassifier as it comes in Scikit-learn. Use it with default parameters to train a decision tree classifier for the digits dataset based on the training data. Follow the tutorial (or the respective documentation) and produce a plot of the tree with graphviz if you have it available. What can you learn from this about how the used algorithm handles the data?

As mentioned earlier, Scikit-learn’s DecisionTreeClassifier is based on the CART (Classification and Regression Tree) algorithm. As the name suggests, CART can produce either classification or regression trees depending on whether the dependent variable (the target or class variable) is categorical or continuous, respectively. Both the ID3 and CART algorithms are greedy, meaning that at each split they choose the locally “best” feature, i.e. the one with the greatest discriminative power. In statistics and machine learning, a discriminator is a model that divides a set of classes or categories of items, so the term discriminative power refers to how effective a discriminator is at categorising or dividing items correctly.

Where the CART and ID3 algorithms differ, however, is in the function used to measure discriminative power. Referred to as the splitting criterion, it is Gini impurity (a measure of non-homogeneity) in the CART algorithm and information gain in the ID3 algorithm. While the difference between these two formulas will not be discussed in this notebook, you may choose to read more about them in detail here.
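
To make the two criteria concrete, here is a minimal sketch of both impurity measures computed from a node’s class probabilities (our own helper functions for illustration, not how Scikit-learn computes them internally):

import numpy as np

def gini_impurity(p):
    """Gini impurity: 1 - sum(p_i^2). Zero for a pure node."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def shannon_entropy(p):
    """Shannon entropy: sum(p_i * log2(1/p_i)). Zero for a pure node."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-probability classes contribute nothing
    return float(np.sum(p * np.log2(1.0 / p)))

print(gini_impurity([0.5, 0.5]), shannon_entropy([0.5, 0.5]))  # 0.5 1.0 (maximally impure)
print(gini_impurity([1.0, 0.0]), shannon_entropy([1.0, 0.0]))  # 0.0 0.0 (pure node)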

from sklearn import tree
# Initialising with default parameters
classifier = tree.DecisionTreeClassifier()

The DecisionTreeClassifier is capable of performing multinomial (multi-class) classification on a dataset. We will initialise the decision tree learner with its default parameters for now, then attempt to optimise it by tweaking the criterion, max_depth, min_samples_split and min_samples_leaf. If you want to get a head start on what these parameters are, check out the official scikit-learn documentation here.

# Fitting the decision tree on training set
sklearnDigitsTree = classifier.fit(X_train, y_train)

Calling the fit() method on our digits training data will build a corresponding decision tree model. Without getting into the specifics of the exact implementation, our decision tree algorithm will run as follows:

  1. Select the best attribute which splits the data.
  2. Make that attribute a decision node and break the data into subsets.
  3. Recursively build branches for each child until one of the stopping criteria is met:
    • There are no remaining samples to split,
    • There are no remaining attributes to split on, or
    • All samples in the subset are of the same class.
# Visualising the tree using scikit-learn
from sklearn.tree import plot_tree, export_text

The following bit of code lets us display the resulting decision tree within our Jupyter notebook.

plt.figure(figsize=(20,20))
tree.plot_tree(sklearnDigitsTree)
plt.show()

Output 2. Scikit-learn DecisionTreeClassifier decision tree fit on digits dataset (default parameters).

There are other ways to visualise a decision tree, one of which involves the use of the Python graphviz library. Since we are writing our own model in the later part of the notebook, we will install the requirement here (now is also a good time to make sure that the Graphviz system binaries are installed).

# Visualising the tree using graphviz
!pip install graphviz
Requirement already satisfied: graphviz in /usr/local/lib/python3.7/dist-packages (0.10.1)
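
For completeness, here is a sketch of the graphviz route applied to the Scikit-learn tree we fitted above (the output filename is arbitrary):

import graphviz
from sklearn.tree import export_graphviz

# Export the fitted tree to DOT format, then render it to PDF with graphviz
dot_data = export_graphviz(sklearnDigitsTree, out_file=None, filled=True)
graphviz.Source(dot_data).render("sklearnDigitsTree")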

4. Test the classifier with the remaining test data and analyse it using the metrics packages of Scikit-learn (classification_report, confusion_matrix). What do you see?

Now, we will perform a prediction run over the test data using the decision tree that we previously fit. In essence, the test samples will be classified by sorting them down the tree from the root to a leaf/terminal node where a class label is provided. Each node in the tree acts as a test case for some attribute (i.e. an “if” statement), and each edge descending from the node corresponds to the possible answers to the test case (KDnuggets, 2020).

y_pred = classifier.predict(X_test)
from sklearn import metrics
from mlxtend.plotting import plot_confusion_matrix

Computing a confusion matrix and obtaining our model accuracy…

print('Scikit-learn Decision Tree Classifier on digits dataset... \n')
print('-'*10 + 'Classification Report' + '-'*5)
print(metrics.classification_report(y_test, y_pred))
print('-'*10 + 'Confusion Matrix' + '-'*10)
plot_confusion_matrix(metrics.confusion_matrix(y_test, y_pred))
Scikit-learn Decision Tree Classifier on digits dataset... 

----------Classification Report-----
              precision    recall  f1-score   support

           0       0.98      0.94      0.96        64
           1       0.78      0.75      0.76        52
           2       0.93      0.85      0.89        67
           3       0.84      0.87      0.85        53
           4       0.80      0.90      0.85        41
           5       0.94      0.82      0.88        57
           6       0.86      0.91      0.88        46
           7       0.91      0.84      0.88        58
           8       0.68      0.81      0.74        52
           9       0.77      0.80      0.78        50

    accuracy                           0.85       540
   macro avg       0.85      0.85      0.85       540
weighted avg       0.86      0.85      0.85       540

----------Confusion Matrix----------

Output 3. Confusion matrix for Scikit-learn DecisionTreeClassifier on digits dataset.

Great, we’ve classified our test set images using the fitted decision tree. As we saw in the confusion matrix, our model wasn’t perfect. We notice some misclassifications, especially with the 1 and 8 digit images. The following code will let us display a random sample of test images and their predicted labels.

def visualize_predictions(images, labels, examples_per_class):
    """
    Display a sample of randomly selected images and their predicted labels
    """
    number_of_classes = len(np.unique(labels))
    for cls in range(number_of_classes):
        idxs = np.where(labels == cls)[0]
        idxs = np.random.choice(idxs, examples_per_class, replace=False)
        for i, idx in enumerate(idxs):
            plt.subplot(examples_per_class, number_of_classes, i * number_of_classes + cls + 1)
            plt.imshow(images[idx].astype('uint8'), cmap=plt.cm.gray_r, interpolation='nearest')
            plt.axis('off')
            if i == 0:
                plt.title(str(cls))
    plt.show()
# Reshape test data to 3D
X_test_images = X_test.reshape((len(X_test),8,8))
visualize_predictions(X_test_images, y_pred, examples_per_class=8)

Output 4. Random sample of Scikit-learn digits from each class 0-9 with predicted labels.

In the next section, we will modify a few of the DecisionTreeClassifier parameters and see if we can optimise our model.

Modifying the Scikit-learn DecisionTreeClassifier

5. Change the parameters of the classifier, e.g., the minimum number of samples in a leaf / for a split, to see how the tree and the results are affected.

Parameter modified: criterion (from the default 'gini' to 'entropy').

We’ll first modify the Scikit-learn DecisionTreeClassifier by setting our splitting criterion (the function that measures discriminative power) to entropy. This allows us to calculate information gain for each attribute rather than the Scikit-learn default of Gini impurity. The resulting classifier’s performance can then be compared to that of our own ID3 algorithm, which we will be coding from scratch later in the notebook and which also uses information gain as its splitting criterion. For now, let’s see how our results with the DecisionTreeClassifier are impacted by this change.

# Using information gain to determine best split
classifier = tree.DecisionTreeClassifier(criterion='entropy')
# Fitting decision tree on training set
id3esqueTree = classifier.fit(X_train, y_train)
# Visualising the new tree
plt.figure(figsize=(20,20))
tree.plot_tree(id3esqueTree)
plt.show()

Output 5. Scikit-learn DecisionTreeClassifier decision tree fit on digits dataset (using information gain).

# Predicting over test set
y_pred = id3esqueTree.predict(X_test)
print('Scikit-learn Decision Tree Classifier (using information gain) on digits dataset... \n')
print('-'*10 + 'Classification Report' + '-'*5)
print(metrics.classification_report(y_test, y_pred))
print('-'*10 + 'Confusion Matrix' + '-'*10)
plot_confusion_matrix(metrics.confusion_matrix(y_test, y_pred))
Scikit-learn Decision Tree Classifier (using information gain) on digits dataset... 

----------Classification Report-----
              precision    recall  f1-score   support

           0       0.97      0.95      0.96        64
           1       0.81      0.83      0.82        52
           2       0.90      0.82      0.86        67
           3       0.87      0.87      0.87        53
           4       0.73      0.85      0.79        41
           5       0.89      0.82      0.85        57
           6       0.90      0.96      0.93        46
           7       0.82      0.72      0.77        58
           8       0.71      0.75      0.73        52
           9       0.78      0.84      0.81        50

    accuracy                           0.84       540
   macro avg       0.84      0.84      0.84       540
weighted avg       0.84      0.84      0.84       540

----------Confusion Matrix----------

Output 6. Confusion matrix for Scikit-learn DecisionTreeClassifier on digits dataset (using information gain).

Great, we now have a reasonable model to compare to later on when we’ve constructed our own ID3-algorithm. Before we jump into writing our code, let’s try one more modification to the Scikit-learn DecisionTreeClassifier. The parameters that we will modify below are rather difficult to implement in our own version, so it’s best to test them out now.

Parameters modified: max_depth, min_samples_split and min_samples_leaf.

There are a few reasons why we would want to modify the above parameters in practice. Let’s start with max_depth. The theoretical maximum depth of a decision tree is one less than the number of samples in the training set. A decision tree of this depth is, however, undesirable: the deeper the tree grows, the more complex the model becomes, and the more attribute splits there are, the higher the chance of overfitting the training data. Thus, reducing the maximum depth of the decision tree is one way to prevent overfitting.

def test_depth(X_train, X_test, y_train, y_test, max_depth=10):
    """
    Calculates the average F1 score over ten training runs for each model variation.
    Each model is tested with a max_depth from 10 to max_depth, inclusive.
    """
    f1_scores = {}
    for i in range(10, max_depth+1):
        f1s = []
        for exp in range(10):
            model = tree.DecisionTreeClassifier(max_depth=i)
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            # Multi-class F1 requires 'macro' or 'weighted' average
            # We choose 'macro' as our training set is relatively balanced
            f1s.append(metrics.f1_score(y_test, y_pred, average='macro'))
        f1_scores[i] = np.mean(f1s)
    return f1_scores
f1_scores = test_depth(X_train, X_test, y_train, y_test, max_depth=20)
f1_scores
{10: 0.8491836659477479,
 11: 0.8540618473168712,
 12: 0.8506428083925082,
 13: 0.8484501993680255,
 14: 0.8476100596481668,
 15: 0.8525831603915026,
 16: 0.8497516840174912,
 17: 0.8536172826836271,
 18: 0.8506231543841205,
 19: 0.8525277761148755,
 20: 0.8519916298862185}
plt.figure()
plt.plot(list(f1_scores.keys()), list(f1_scores.values()))
plt.title('Average F1 scores')
plt.xlabel('max_depth')
plt.ylabel('F1 score')
plt.show()

Output 7. Mean F1 scores of the Scikit-learn DecisionTreeClassifier for max_depth values between 10-20, inclusive. Scores were averaged over 10 different training-testing cycles for each parameter value.

# Get max_depth of model with highest average F1 score
best_depth = max(f1_scores, key=lambda x: f1_scores[x])
best_depth
11

To find the best performing model, we will use Scikit-learn’s GridSearchCV. This handy class allows us to test each hyperparameter combination on a given classifier and find the best combination from the parameter grid we specify. It works by exhaustively searching the parameter space using cross-validation. One limitation of this approach is that we have to manually set a range of values to try for each hyperparameter. To better determine these ranges, consult a research paper that uses a similar model and dataset. We chose our value ranges from the recommendations in this blog post on tuning decision trees.

from sklearn.model_selection import GridSearchCV
# Parameters to test (see above)
params = {'max_depth': range(10,21),
          'min_samples_split': range(1,41),
          'min_samples_leaf': range(1,21)}
# Perform cross-validated grid-search over parameter grid
grid = GridSearchCV(tree.DecisionTreeClassifier(),
                    # Parameters and their values to test in dict format 
                    param_grid=params,
                    # Cross-validation method (int=k folds for KFold)
                    cv=10,
                    # Number of jobs to run in parallel (-1 uses all processors)
                    n_jobs=-1,
                    # Print computation time for each fold and parameter candidate
                    verbose=1,
                    # Include training scores in cv_results_
                    return_train_score=True)

Forewarning: running GridSearchCV could take quite a long time (up to 15 minutes, in our case). Make sure that you have a decent computer or access to one. Google Colab provides a great (and free!) cloud service that handles this task fairly well.

# Perform the grid search and fit on training data
grid.fit(X_train, y_train)
Fitting 10 folds for each of 8800 candidates, totalling 88000 fits
  
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 230 tasks      | elapsed:    3.6s
[Parallel(n_jobs=-1)]: Done 1430 tasks      | elapsed:   17.0s
[Parallel(n_jobs=-1)]: Done 3430 tasks      | elapsed:   38.4s
[Parallel(n_jobs=-1)]: Done 6230 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 9830 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 14230 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 19430 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done 25430 tasks      | elapsed:  4.5min
[Parallel(n_jobs=-1)]: Done 32230 tasks      | elapsed:  5.6min
[Parallel(n_jobs=-1)]: Done 39830 tasks      | elapsed:  7.0min
[Parallel(n_jobs=-1)]: Done 48230 tasks      | elapsed:  8.5min
[Parallel(n_jobs=-1)]: Done 57430 tasks      | elapsed: 10.1min
[Parallel(n_jobs=-1)]: Done 67430 tasks      | elapsed: 11.9min
[Parallel(n_jobs=-1)]: Done 78230 tasks      | elapsed: 13.7min
[Parallel(n_jobs=-1)]: Done 88000 out of 88000 | elapsed: 15.4min finished

The following returns the combination of hyperparameter values that gave us the best results.

# Parameter setting that gave the best results
grid.best_params_
{'max_depth': 13, 'min_samples_leaf': 1, 'min_samples_split': 2}

It looks like our best max_depth value has changed slightly. Interestingly, the best reported values for min_samples_leaf and min_samples_split are the default ones in scikit-learn. Let’s see how the classifier performed with the hyperparameter combination…

# Mean cross-validated score of the best estimator
grid.best_score_
0.8767365079365079

Not bad! It looks like a bit of hyperparameter tuning was able to improve the stock DecisionTreeClassifier accuracy by several percentage points. Don’t get too excited, though. We still have to perform our train-test routine with this hyperparameter combination to get a better idea of the true predictive performance of the model. GridSearchCV only reports the mean cross-validated “score” during training and does not evaluate the performance on a test set.

The scores for each parameter are returned in a nested dictionary object (cv_results_). Using the following method, we can visualise the mean score for each value of each hyperparameter.

from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.io as pio
def plot_search_results_plotly(grid):
    """
    Plot the mean scores per parameter evaluated with GridSearchCV.

    Params:
        grid: trained GridSearchCV object
    """
    # GridSearchCV results
    results = grid.cv_results_
    means_test = results['mean_test_score']
    stds_test = results['std_test_score']
    means_train = results['mean_train_score']
    stds_train = results['std_train_score']

    # Indices of hyperparameters
    masks = []
    mask_names = list(grid.best_params_.keys())
    for p_k, p_v in grid.best_params_.items():
        masks.append(list(results['param_'+p_k].data==p_v))
    params = grid.param_grid

    # Best parameter values
    best_values = grid.best_params_

    # Display results with Plotly
    fig = make_subplots(rows=1, cols=len(params), subplot_titles=mask_names, shared_xaxes=False, shared_yaxes=True)
    for i, p in enumerate(mask_names):
        m = np.stack(masks[:i] + masks[i+1:])
        best_params_mask = m.all(axis=0)
        best_index = np.where(best_params_mask)[0]
        x = np.array(params[p])
        y_1 = np.array(means_test[best_index])
        e_1 = np.array(stds_test[best_index])
        y_2 = np.array(means_train[best_index])
        e_2 = np.array(stds_train[best_index])
        fig.add_trace(go.Scatter(x=x, y=y_1, error_y=dict(type='data', array=e_1), legendgroup="test_group", name='test'), row=1, col=i+1)
        fig.add_trace(go.Scatter(x=x, y=y_2, error_y=dict(type='data', array=e_2), legendgroup="train_group", name='train'), row=1, col=i+1)
    fig.update_layout(title_text='Scores per parameter for DecisionTreeClassifier', showlegend=True)
    fig.update_yaxes(title_text='Mean score', row=1, col=1)
    fig.show()
    pio.write_html(fig, file='cv_scores.html', auto_open=True)
plot_search_results_plotly(grid)
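
Output 8. Mean train and test scores per hyperparameter value for the DecisionTreeClassifier, evaluated with GridSearchCV.
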
import joblib

# Save the GridSearchCV object
joblib.dump(grid, 'grid_search_cv.pkl')
best_classifier = tree.DecisionTreeClassifier(**grid.best_params_)
best_classifier
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=13, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
best_sklearnTree = best_classifier.fit(X_train, y_train)
# Visualise the modified tree
plt.figure(figsize=(30,30))
tree.plot_tree(best_sklearnTree)
plt.show()

Output 9. Scikit-learn DecisionTreeClassifier decision tree fit on digits dataset (max_depth=13, min_samples_split=2, min_samples_leaf=1).

y_pred = best_sklearnTree.predict(X_test)
print('Scikit-learn Decision Tree Classifier (with modified parameters) on digits dataset... \n')
print('classifier: %s\n' % best_classifier)
print('-'*10 + 'Classification Report' + '-'*5)
print(metrics.classification_report(y_test, y_pred))
print('-'*10 + 'Confusion Matrix' + '-'*10)
plot_confusion_matrix(metrics.confusion_matrix(y_test, y_pred))
Scikit-learn Decision Tree Classifier (with modified parameters) on digits dataset... 

classifier: DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=13, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

----------Classification Report-----
              precision    recall  f1-score   support

           0       0.98      0.94      0.96        64
           1       0.77      0.85      0.81        52
           2       0.93      0.85      0.89        67
           3       0.81      0.87      0.84        53
           4       0.80      0.90      0.85        41
           5       0.92      0.82      0.87        57
           6       0.84      0.91      0.87        46
           7       0.96      0.78      0.86        58
           8       0.70      0.83      0.76        52
           9       0.82      0.80      0.81        50

    accuracy                           0.85       540
   macro avg       0.85      0.85      0.85       540
weighted avg       0.86      0.85      0.86       540

----------Confusion Matrix----------

Output 10. Confusion matrix for Scikit-learn DecisionTreeClassifier on digits dataset (max_depth=13, min_samples_split=2, min_samples_leaf=1).

2. Decision Tree Classifier using the ID3 algorithm

Objective #2: Implementing our own decision tree classifier based on the ID3 algorithm…

In the first part of this article, we framed classification as a kind of question-answering problem. We discussed how a decision tree naturally structures this type of hierarchical decision-making system. We then introduced the basic idea of the decision tree algorithm and lightly covered the general strategy of recursively building subtrees by splitting data into subsets based on an optimal attribute. To demonstrate this process, we applied a Scikit-learn DecisionTreeClassifier (implemented with CART) to the digits dataset. We saw what the resulting tree structure looked like, then measured its performance by classifying the test data subset. Lastly, we tuned a few hyperparameters like max_depth, which acts to limit the depth of the tree (the number of ‘questions’ asked).

In this section, we will write our own algorithm from scratch, following the same general strategy for decision trees we saw earlier. However, in our implementation we will be modifying the splitting criterion to follow the ID3 (Iterative Dichotomiser 3) algorithm. To review, the splitting criterion is a function used to measure which ‘questions’ provide the most value when it comes to separating the data. The ID3 algorithm selects the most useful attributes by introducing a metric known as information gain.

Information gain determines which questions (attributes) have the ‘most value’ by calculating their respective entropy reduction in terms of Shannon entropy. We write the Shannon entropy of a discrete random variable \(X\) as \(H(X)\). At the highest level, information gain \(IG(T,a)\) can then be thought of as a simple difference between the a priori Shannon entropy \(H(T)\) of data set \(T\) and the conditional entropy \(H(T\mid a)\) after splitting that dataset on attribute \(a\). Thus, information gain is expressed as

\[IG(T,a) = H(T) - H(T\mid a)\]

where the conditional entropy is given as

\[H(T\mid a) = \sum_{v \in vals(a)} \frac{|S_{a}(v)|}{|T|} \cdot H(S_{a}(v))\]

with \(S_{a}(v)\) denoting the subset of samples in \(T\) for which attribute \(a\) takes the value \(v\). Given a discrete random variable \(X\) with possible outcomes \(x_{1},...,x_{n}\), which occur with probability \(P(x_{1}),...,P(x_{n})\), we define the Shannon entropy \(H(X)\) to be

\[H(X) = -\sum_{i=1}^{n} P(x_{i}) \log_{2} P(x_{i})\]

where \(\sum\) denotes the sum over the variable’s possible values. In our application, we choose the logarithm to be base 2, which gives entropy in units of bits, also called “shannons”.

To build our decision tree branch-by-branch, we perform the above calculations in a four-step process at each recursive call:

  1. Calculate the prior entropy of the data set.
  2. Calculate the information gain for each attribute.
  3. Select the attribute with the maximal information gain.
  4. Partition the dataset based on the best attribute’s values.
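
To make these calculations concrete, below is a minimal standalone sketch of entropy and information gain (simplified helper functions for illustration, separate from the classifier we build later in this section):

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(samples, labels, attribute_index):
    """IG(T, a): prior entropy minus the weighted entropy after splitting on attribute a."""
    prior = entropy(labels)
    # Partition the labels by the value each sample takes for the chosen attribute
    partitions = {}
    for x, y in zip(samples, labels):
        partitions.setdefault(x[attribute_index], []).append(y)
    weighted = sum(len(subset) / len(labels) * entropy(subset) for subset in partitions.values())
    return prior - weighted

# Four samples with two attributes each: the first attribute separates the classes perfectly
X = [('y', 's'), ('y', 'l'), ('g', 's'), ('g', 'l')]
y = ['+', '+', '-', '-']
print(information_gain(X, y, 0))  # 1.0 bit gained: a perfectly informative split
print(information_gain(X, y, 1))  # 0.0 bits gained: an uninformative split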

Like the CART decision tree algorithm we covered in the first section of this article, we stop recursing when any of the following conditions is met (these are exactly the cases handled in the pseudocode below):

  • All samples in the current subset belong to the same class,
  • There are no remaining attributes to split on, or
  • The subset of samples for a branch is empty.

The ID3 algorithm in pseudocode

ID3 (Samples, Target_Attribute, Attributes)
  Create a (root) node Root for the tree

  If all samples belong to one class <class_name>
      Return the single-node tree Root, with label = <class_name>. 
  
  If Attributes is empty, then 
      Return the single node tree Root, with label = most common class value in Samples.
  else 
      Begin
          Let A be the attribute a in Attributes that generates the maximum information gain 
                when the tree is split based on a.

          Set A as the target_attribute of Root

          For each possible value, v, of A, add a new tree branch below Root, 
               corresponding to the test A == v, i.e.,
              Let Samples(v) be the subset of samples that have the value v for A.
              If Samples(v) is empty, then 
                  Below this new branch add a leaf node with label 
                        = most common class value in Samples. 
              else
                  Below this new branch add the subtree ID3 (Samples(v), A, Attributes/{A}) 
        End 
  Return Root

1. Make a decision regarding the data structure that your tree should be able to handle. In the code below, the tree is assumed to be implemented with nodes that are dictionaries.

We will be using the Handout_SkeletonDT code provided to us by E.A. Topp (EDAN95, link here) to help us get started on our ID3 decision tree algorithm. Examine the ID3DecisionTreeClassifierSkeleton class a bit to understand how we will implement nodes and visualise them in a graph using the graphviz library.

from collections import Counter
from graphviz import Digraph



class ID3DecisionTreeClassifierSkeleton:


    def __init__(self, minSamplesLeaf = 1, minSamplesSplit = 2):

        self.__nodeCounter = 0

        # the graph to visualise the tree
        self.__dot = Digraph(comment='The Decision Tree')

        # suggested attributes of the classifier to handle training parameters
        self.__minSamplesLeaf = minSamplesLeaf
        self.__minSamplesSplit = minSamplesSplit


    # Create a new node in the tree with the suggested attributes for the visualisation.
    # It can later be added to the graph with the respective function
    def new_ID3_node(self):
        node = {'id': self.__nodeCounter, 'label': None, 'attribute': None, 'entropy': None, 'samples': None,
                         'classCounts': None, 'nodes': None}

        self.__nodeCounter += 1
        return node

    # adds the node into the graph for visualisation (creates a dot-node)
    def add_node_to_graph(self, node, parentid=-1):
        nodeString = ''
        for k in node:
            if ((node[k] != None) and (k != 'nodes')):
                nodeString += "\n" + str(k) + ": " + str(node[k])

        self.__dot.node(str(node['id']), label=nodeString)
        if (parentid != -1):
            self.__dot.edge(str(parentid), str(node['id']))
            nodeString += "\n" + str(parentid) + " -> " + str(node['id'])

        print(nodeString)

        return


    # make the visualisation available
    def make_dot_data(self):
        return self.__dot


    # For you to fill in; Suggested function to find the best attribute to split with, given the set of
    # remaining attributes, the currently evaluated data and target.
    def find_split_attr(self):

        # Change this to make some more sense
        return None


    # the entry point for the recursive ID3-algorithm, you need to fill in the calls to your recursive implementation
    def fit(self, data, target, attributes, classes):

        # fill in something more sensible here... root should become the output of the recursive tree creation
        root = self.new_ID3_node()
        self.add_node_to_graph(root)

        return root



    def predict(self, data, tree):
        predicted = list()

        # fill in something more sensible here... root should become the output of the recursive tree creation
        return predicted

2. Inspect other parts of the code provided. You will find one example for how it is easily possible to construct the visualisation data (dot-data) for the graphviz-visualisation in parallel to the actual decision tree. Whenever a node is added to the tree, it can also immediately be added to the graph. Feel free to use this for your own implementation.

In the above implementation we visualise our decision tree by calling the make_dot_data() method on a classifier object whose tree has been fitted (trained). We then wish to save a PDF of the visualised tree, which we can do by calling render() and passing in a desired filename. Schematically (train_data and train_labels are placeholders):

# Initialise the decision tree model
clf = ID3DecisionTreeClassifier()
# Fit on training data (build tree)
example_tree = clf.fit(train_data, train_labels, attributes, classes)
# Construct the graphviz dot-data visualisation
plot = clf.make_dot_data()
# Save to PDF
plot.render("testTree")

In order to implement the desired functionality above, we need to install a few dependencies.

# Render PDF in image format to display tree inline
!pip install pdf2image
Collecting pdf2image
  Downloading pdf2image-1.16.0-py3-none-any.whl (10 kB)
Requirement already satisfied: pillow in /usr/local/lib/python3.7/dist-packages (from pdf2image) (7.1.2)
Installing collected packages: pdf2image
Successfully installed pdf2image-1.16.0
# Install poppler dependency
!apt-get install poppler-utils 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
# Check if poppler was installed successfully
!pdftoppm -h
pdftoppm version 0.62.0
Copyright 2005-2017 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

Below is a method to display an image from a PDF in a Jupyter notebook cell.

from pdf2image import convert_from_path
from pathlib import Path
from PIL import Image
import matplotlib.pyplot as plt

def display_tree(filename):
    path = Path(filename)
    img_name = path.with_suffix('')
    pages = convert_from_path(path, dpi=200)
    
    for idx,page in enumerate(pages):
        img_path = Path(str(img_name) + '_page' + str(idx)).with_suffix('.jpg')
        page.save(img_path, 'JPEG')
        plt.figure()
        plt.title(img_name)
        img = Image.open(img_path)
        plt.imshow(img)
        plt.axis('off')
        plt.show()

3. Simply running main in the handout will produce a tree with one node, visualised in testTree.pdf. Make sure that this works, i.e., that you have all the necessary libraries installed.

In order to test our decision tree, we will first use a rather simple toy dataset. This dataset is split into attributes, classes, target and data. As the code below shows, attributes maps each attribute name (color, size, shape) to its possible values; classes is the tuple of class labels ('+', '-'); data and target hold the training samples and their labels; and testData and testTarget hold the corresponding test split.

### From `ToyData.py` in E.A. Topp's `Handout_SkeletonDT`
from collections import OrderedDict

class ToyData:

    def __init__(self):
        self.attributes = OrderedDict(
            [("color", ["y", "g", "b"]), ("size", ["s", "l"]), ("shape", ["r", "i"])]
        )
        self.classes = ('+', '-')

        self.data = [('y', 's', 'r'),
                 ('y', 's', 'r'),
                 ('g', 's', 'i'),
                 ('g', 'l', 'i'),
                 ('y', 'l', 'r'),
                 ('y', 's', 'r'),
                 ('y', 's', 'r'),
                 ('y', 's', 'r'),
                 ('g', 's', 'r'),
                 ('y', 'l', 'r'),
                 ('y', 'l', 'r'),
                 ('y', 'l', 'r'),
                 ('y', 'l', 'r'),
                 ('y', 'l', 'r'),
                 ('y', 's', 'i'),
                 ('y', 'l', 'i')]
        self.target = ('+', '-', '+', '-', '+', '+', '+', '+', '-', '-', '+', '-', '-', '-', '+', '+')

        self.testData = [('y', 's', 'r'),
                 ('y', 's', 'r'),
                 ('g', 's', 'i'),
                 ('b', 'l', 'i'),
                 ('y', 'l', 'r')]

        self.testTarget = ('+', '-', '+', '-', '+')

    def get_data(self):
        return self.attributes, self.classes, self.data, self.target, self.testData, self.testTarget

As mentioned in Step 3, running the following code should produce a decision tree that contains a single node with id: 0. Nothing interesting can be said just yet about the resulting tree, but our goal is rather to test that our visualisation libraries are properly installed.

### From `main.py` in E.A. Topp's `Handout_SkeletonDT`
#import ToyData as td
#import ID3

import numpy as np
from sklearn import tree, metrics, datasets


def main():
    # Get data from ToyData
    attributes, classes, data, target, data2, target2 = ToyData().get_data()
    # Initialise decision tree model
    id3_test = ID3DecisionTreeClassifierSkeleton()
    # Fit the decision tree on ToyData (build tree)
    myTree = id3_test.fit(data, target, attributes, classes)
    print(myTree)
    # Visualise nodes using graphviz
    plot = id3_test.make_dot_data()
    # Save to PDF
    plot.render("testTree")
    # Run a "prediction" over the test data
    predicted = id3_test.predict(data2, myTree)
    print(predicted)


if __name__ == "__main__": main()
# Visualise the tree and verify that a single node is produced
display_tree('testTree.pdf')

Output 11. ID3 Skeleton Classifier on ToyData dataset - testTree visualised (single node, id:0).

The ID3 Decision Tree Classifier

4. The code handout contains a mere skeleton for the ID3 classifier. Implement what is needed to actually construct a decision tree classifier. Implement the ID3 algorithm, e.g., according to what is provided in the lecture or in the pseudocode above. Use information gain as the criterion for the best split attribute.

Now, the moment we have been waiting for. It’s finally time to write our very own ID3 Decision Tree Classifier. Follow the code below and you’ll be up and running in no time. The code is heavily documented with plenty of comments to help guide you through the implementation from top to bottom.

from collections import Counter, OrderedDict
from graphviz import Digraph
import math

class ID3DecisionTreeClassifier:
    def __init__(self, minSamplesLeaf = 1, minSamplesSplit = 2) :
        # The number of nodes in the tree
        self.__nodeCounter = 0
        # The graph to visualise the tree
        self.__dot = Digraph(comment='The Decision Tree')
        # Suggested attributes of the classifier to handle training parameters
        self.__minSamplesLeaf = minSamplesLeaf
        self.__minSamplesSplit = minSamplesSplit

    def new_ID3_node(self):
        """
        Create a new node in the tree with the suggested attributes for the visualisation.
        It can later be added to the graph with the respective function
        """
        # The node object implemented with a dictionary 
        node = {'id': self.__nodeCounter, 'label': None, 'attribute': None, 'entropy': None, 'samples': None,
                         'classCounts': None, 'nodes': None}

        # New key to store parent node's attribute value
        node.update({'attribute_value': None})
        # Increment the counter by one for every new node created
        self.__nodeCounter += 1
        return node

    def add_node_to_graph(self, node, parentid=-1):
        """
        Create a dot-node for visualisation and print the node 
        """
        nodeString = ''
        for k in node:
            if ((node[k] is not None) and (k != 'nodes')):
                nodeString += "\n" + str(k) + ": " + str(node[k])

        self.__dot.node(str(node['id']), label=nodeString)
        if (parentid != -1):
            self.__dot.edge(str(parentid), str(node['id']))
            nodeString += "\n" + str(parentid) + " -> " + str(node['id'])

        print(nodeString)
        return


    # make the visualisation available
    def make_dot_data(self):
        return self.__dot

    def _calc_entropy(self, node):
        """
        Return the entropy score of the data set prior to splitting.
        """
        entropy = 0.0
        # If all targets are of the same class, the entropy value is 0
        if len(node['classCounts'].values()) == 1:
            return entropy
        # Compute the weighted sum of the log of the probabilities
        for count in node['classCounts'].values():
            prob = count / node['samples']
            entropy -= prob * math.log2(prob)
        self._entropy = entropy
        return self._entropy

    def _calc_information_gain(self, target, attributes, node):
        """
        Returns the information gain of each attribute in attributes.
        Information gain is measured as the difference between the parent node's
        entropy and the weighted average entropy of the children created by the split.
        """
        # Store information gain for each remaining attribute in attributes
        information_gains = {}
        for attribute in attributes:
            # Weighted average entropy of the children if we were to split on this attribute
            weighted_entropy = 0.0
            # For every attribute value and its indexes in target
            for attribute_value, attribute_idxs in attributes[attribute].items():
                if attribute_idxs:
                    # Get the class labels of the attribute value by index
                    attribute_value_targets = {i:target[i] for i in attribute_idxs if i in target.keys()}
                    # Store the frequencies of each class label
                    classCounts = Counter(attribute_value_targets.values())
                    # Calculate the entropy of this child
                    child_entropy = 0.0
                    for count in classCounts.values():
                        prob = count / len(attribute_value_targets)
                        child_entropy -= prob * math.log2(prob)
                    # Add the child's weighted contribution to the average entropy after the split
                    weighted_entropy += (len(attribute_value_targets) / node['samples']) * child_entropy
            # Information gain: parent entropy minus the weighted average entropy of the children
            information_gains[attribute] = node['entropy'] - weighted_entropy
        self._information_gains = information_gains
        return self._information_gains

    def _get_best_split(self, target, attributes, node):
        """
        Returns the attribute that has the most predictive power (yields the most information if the data set was split based on that attribute's values). 
        This attribute has the highest information gain.
        """
        info_gains = self._calc_information_gain(target, attributes, node)
        best_split = max(info_gains, key=lambda x: info_gains[x])

        self._best_split = best_split
        return self._best_split


    def build_tree(self, data, target, attributes, classes, attr_val="", parentid=-1):
        """
        The recursive ID3-algorithm. On each iteration, the entropy and information gain of each attribute is calculated. In summary,
           1. Calculate the entropy of every attribute in attributes.
           2. Split the data into subsets using the attribute whose resulting information gain is maximised.
           3. Create a decision tree node using that attribute.
           4. Recursively perform steps 1-3 until the stopping conditions are satisfied.
        """
        # Count the frequencies of each class in target
        classCounts = Counter(target.values())
        
        # Create new node and update its values
        node = self.new_ID3_node()
        node.update({'samples': len(data)})
        node.update({'classCounts': classCounts})
        node.update({'entropy': self._calc_entropy(node)})
        node.update({'attribute_value': attr_val})

        # Create an empty list to store child nodes
        children = []
        
        # Stopping condition: no remaining attributes to split on
        if not attributes:
            # Update label of leaf node to be the name of the most frequent class
            # n=1 ensures that we return only the most common element in target
            # [0][0] ensures that we return the label rather than count
            node.update({'label': classCounts.most_common(1)[0][0]})
            # Add the node into the graph for visualisation
            self.add_node_to_graph(node, parentid)
            return node
        # Stopping condition: all samples in target are of the same class
        elif len(classCounts) == 1:
            # Update label of leaf node to be the name of the most frequent class
            node.update({'label': classCounts.most_common(1)[0][0]})
            # Add the node into the graph for visualisation
            self.add_node_to_graph(node, parentid)
            return node
        else:
            # Get the attribute to split on whose values have the highest information gain
            best_attribute = self._get_best_split(target, attributes, node)
            node.update({'attribute': best_attribute})
            # Add the node into the graph for visualisation
            self.add_node_to_graph(node, parentid)
            best_attribute_dict = attributes[best_attribute]

            # Recursively build branches by computing the best remaining attributes to split on
            for attribute_value, attribute_value_idxs in best_attribute_dict.items():
                # Remove the split attribute
                attributes_partitioned = attributes.copy()
                attributes_partitioned.pop(best_attribute, None)
                # Get the samples from data by index that contain the attribute value
                data_partitioned = {idx:data[idx] for idx in attribute_value_idxs if idx in data.keys()}
                # Get the class labels from target that map to the attribute value pairs
                target_partitioned = {idx:target[idx] for idx in attribute_value_idxs if idx in target.keys()}
                
                # If there are remaining attribute values to split
                if data_partitioned:
                    # Call recursive function to find the next best attribute(s) to split on
                    child = self.build_tree(data_partitioned, target_partitioned, attributes_partitioned, classes, attribute_value, node['id'])
                    # Append the new child node to the list of children
                    children.append(child)
                else:
                    # Create a leaf node
                    child = self.new_ID3_node()
                    # Update leaf node values
                    child.update({'label': classCounts.most_common(1)[0][0]})
                    child.update({'samples': len(attribute_value_idxs)})
                    #child.update({'samples': 0})
                    child.update({'attribute_value': attribute_value})
                    # Add the leaf node into the graph for visualisation
                    self.add_node_to_graph(child, node['id'])
                    children.append(child)
                    
            node.update({'nodes': children})
        return node

    def fit(self, data, target, attributes, classes, dataset=""):
        """
        The entry point for the recursive ID3-algorithm. Formats the input data and returns a root to the decision tree.
        """
        # Format the Scikit-learn digits dataset
        if dataset is not "ToyData":
            # Assume attributes is a list with the unique values of all attributes in data
            # Build one tuple of possible values per attribute (pixel) in data
            attributes = [tuple(attributes) for i in range(data.shape[1])]
            # Create a dict of attribute values indexed by the number of attributes in data
            attributes = OrderedDict(zip(range(data.shape[1]), attributes))
            # Convert each sample in data (a list of attribute values) to a tuple and store it in a list
            data = [tuple(x) for x in data]
            # Convert the target labels into a tuple
            target = tuple(target)

        # Index the input data
        data_indexed = {i:data[i] for i in range(len(data))}
        target_indexed = {i:target[i] for i in range(len(target))}
        attributes_indexed = {attr:i for i, attr in enumerate(attributes.keys())}
        
        # Build nested attribute dict where each attribute value is paired with the indices in data where it occurs
        for i, attribute in enumerate(attributes.keys()):
            # Get the list of attribute values
            attribute_values = attributes[attribute]
            # Build dict to store each attribute value and an empty list for its indices in the data
            attribute_value_idx = {attribute_value:[] for attribute_value in attribute_values}
            # Go through each sample in the data
            for idx, x in data_indexed.items():
                # Get the value of the ith attribute in the sample
                x_value = x[i]
                # Append the sample's index to its matching attribute value in the dict
                attribute_value_idx[x[i]].append(idx)
            # Save the attribute's value-indices mapping
            attributes[attribute] = attribute_value_idx

        # Build the decision tree using the ID3-algorithm and return its root node
        root = self.build_tree(data_indexed, target_indexed, attributes, classes)
        
        self._attributes_indexed = attributes_indexed
        self._root = root
        return self._root

    def _predict(self, x, node):
        """
        Helper function to predict the class label of a single instance in the test set by traversing the decision tree until a matching leaf node is found.
        """
        # The leaf node
        if node['label'] is not None:
            # Return its label as the predicted class for sample x
            return node['label']
        # The entry point into the tree traversal
        else:
            # Get the best split attribute at the current node
            best_attribute = node['attribute']
            # Return the value in the sample that belongs to the current split attribute
            attribute_value = x[self._attributes_indexed[best_attribute]]
            # Traverse through the branches
            for child in node['nodes']:
                # If the attribute value is found in the child node (a leaf)
                if attribute_value == child['attribute_value']:
                    # Call the method to return the leaf node label
                    return self._predict(x, child)

    def predict(self, data, tree, dataset=""):
        """
        Predicts the class labels of the test set sample-by-sample by traversing the decision tree built during training.
        """
        predicted = list()

        # Format the dataset
        if dataset is not "ToyData":
            data = [tuple(x) for x in data]
        for x in data:
            # Append the predicted label of x to the list
            predicted.append(self._predict(x, tree)) 
        
        self.y_pred = predicted
        return self.y_pred

5. Test your classifier with the toy example provided in the ToyData class given in the skeleton. In main you can also see how to make use of the dot-data to produce a visualisation with graphviz.

Ready to run our very own decision tree algorithm through some data? Let’s go!

# Get data from ToyData
attributes, classes, data, target, data2, target2 = ToyData().get_data()
# Initialise decision tree model
id3 = ID3DecisionTreeClassifier()
# Fit on training data (build tree)
toyTree = id3.fit(data, target, attributes, classes)
id: 0
attribute: size
entropy: 0.9886994082884974
samples: 16
classCounts: Counter({'+': 9, '-': 7})
attribute_value: 

id: 1
attribute: color
entropy: 0.8112781244591328
samples: 8
classCounts: Counter({'+': 6, '-': 2})
attribute_value: s
0 -> 1

id: 2
attribute: shape
entropy: 0.6500224216483541
samples: 6
classCounts: Counter({'+': 5, '-': 1})
attribute_value: y
1 -> 2

id: 3
label: +
entropy: 0.7219280948873623
samples: 5
classCounts: Counter({'+': 4, '-': 1})
attribute_value: r
2 -> 3

id: 4
label: +
entropy: 0.0
samples: 1
classCounts: Counter({'+': 1})
attribute_value: i
2 -> 4

id: 5
attribute: shape
entropy: 1.0
samples: 2
classCounts: Counter({'+': 1, '-': 1})
attribute_value: g
1 -> 5

id: 6
label: -
entropy: 0.0
samples: 1
classCounts: Counter({'-': 1})
attribute_value: r
5 -> 6

id: 7
label: +
entropy: 0.0
samples: 1
classCounts: Counter({'+': 1})
attribute_value: i
5 -> 7

id: 8
label: +
samples: 0
attribute_value: b
1 -> 8

id: 9
attribute: shape
entropy: 0.9544340029249649
samples: 8
classCounts: Counter({'-': 5, '+': 3})
attribute_value: l
0 -> 9

id: 10
attribute: color
entropy: 0.9182958340544896
samples: 6
classCounts: Counter({'-': 4, '+': 2})
attribute_value: r
9 -> 10

id: 11
label: -
entropy: 0.9182958340544896
samples: 6
classCounts: Counter({'-': 4, '+': 2})
attribute_value: y
10 -> 11

id: 12
label: -
samples: 3
attribute_value: g
10 -> 12

id: 13
label: -
samples: 0
attribute_value: b
10 -> 13

id: 14
attribute: color
entropy: 1.0
samples: 2
classCounts: Counter({'-': 1, '+': 1})
attribute_value: i
9 -> 14

id: 15
label: +
entropy: 0.0
samples: 1
classCounts: Counter({'+': 1})
attribute_value: y
14 -> 15

id: 16
label: -
entropy: 0.0
samples: 1
classCounts: Counter({'-': 1})
attribute_value: g
14 -> 16

id: 17
label: -
samples: 0
attribute_value: b
14 -> 17
# Visualise nodes using graphviz
plot = id3.make_dot_data()
# Save to PDF
plot.render('myToyTree')
'myToyTree.pdf'
# Display PDF of tree inline
display_tree('myToyTree.pdf')

Output 12. ID3 Classifier on ToyData dataset - myToyTree visualised.

We successfully fit a decision tree on the ToyData dataset using our very own ID3 algorithm! In essence, our algorithm has figured out which ‘questions’ to ask about the attribute-value pairs in the input data to identify whether a given example is positive (+) or negative (-). The leaves in our tree (nodes at the bottom of the graph) are the predicted labels, or ‘answers’, for our input data. Once an input example reaches one of these leaf nodes, we assume it to be of the same class as the respective leaf node.

Predicting with a Decision Tree

Prediction (finding the class for an example x) with a decision tree then boils down to a tree search: we follow the branch of the tree that represents the combination of attribute values given in x until a leaf is reached. The predicted class for x is then the class that the leaf is labelled with. This is, again, most easily implemented recursively:

predict_rek( node, x)
    if node is leaf
        return the class label of node
    else
        find the child c among the children of node
             representing the value that x has for
             the split_attribute of node
        return predict_rek( c, x)
# Run prediction over test data
y_pred = id3.predict(data2, toyTree)
print('ID3 Decision Tree Classifier on ToyData dataset... \n')
print('-'*10 + 'Classification Report' + '-'*5)
print(metrics.classification_report(target2, y_pred))
print('-'*10 + 'Confusion Matrix' + '-'*10)
plot_confusion_matrix(metrics.confusion_matrix(target2, y_pred))
ID3 Decision Tree Classifier on ToyData dataset... 

----------Classification Report-----
              precision    recall  f1-score   support

           +       0.67      0.67      0.67         3
           -       0.50      0.50      0.50         2

    accuracy                           0.60         5
   macro avg       0.58      0.58      0.58         5
weighted avg       0.60      0.60      0.60         5

----------Confusion Matrix----------

Output 13. Confusion matrix for ID3 DecisionTreeClassifier on ToyData dataset.

Testing ID3 with Scikit-learn digits

6. When you are sure that everything works properly, run the ID3 training on the digits training data you used in part 1. Do not constrain the training, i.e., run with default parameters. What do you see in the plot? Analyse the result (produce a confusion matrix and a classification report) and compare it with the result from part 1 (when running with default parameters).

# Split digits data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.3)
# Get attributes and classes for digits data
attributes = list(np.unique(X_train))
# Convert to tuple format for ID3 classifier
classes = tuple(np.unique(y_train))
# Initialise decision tree model
id3 = ID3DecisionTreeClassifier()
# Fit on digits training data (build tree)
digitsTree = id3.fit(X_train, y_train, attributes, classes, dataset="digits")
# Visualise tree with Graphviz
plot = id3.make_dot_data()
# Render tree in PDF
plot.render("myDigitsTree")
'myDigitsTree.pdf'

Due to the size of the tree, we’re unable to render it on a single PDF page. Run the code for yourself to visualise the results :-)
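
If you would still like a rough sense of the tree’s size without rendering it, one option (a minimal sketch, assuming plot is the graphviz.Digraph returned by make_dot_data()) is to count the edge statements in the DOT source; since the graph is a tree, the node count is always the edge count plus one:

# Count the '->' edge statements in the DOT source to estimate tree size
n_edges = sum('->' in line for line in plot.source.splitlines())
print(f'edges: {n_edges}, nodes: {n_edges + 1}')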

# Prediction over digits test data
y_pred = id3.predict(X_test, digitsTree, dataset="digits")
print('ID3 Decision Tree Classifier on Scikit-learn digits dataset... \n')
print('-'*10 + 'Classification Report' + '-'*5)
print(metrics.classification_report(y_test, y_pred))
print('-'*10 + 'Confusion Matrix' + '-'*10)
plot_confusion_matrix(metrics.confusion_matrix(y_test, y_pred))
ID3 Decision Tree Classifier on Scikit-learn digits dataset... 

----------Classification Report-----
              precision    recall  f1-score   support

           0       0.33      0.30      0.32        53
           1       0.24      0.26      0.25        53
           2       0.30      0.29      0.29        49
           3       0.31      0.35      0.33        49
           4       0.37      0.31      0.34        61
           5       0.33      0.29      0.31        63
           6       0.31      0.50      0.38        38
           7       0.64      0.73      0.68        60
           8       0.13      0.12      0.12        58
           9       0.16      0.12      0.14        56

    accuracy                           0.32       540
   macro avg       0.31      0.33      0.32       540
weighted avg       0.32      0.32      0.32       540

----------Confusion Matrix----------

Output 14. Confusion matrix for ID3 DecisionTreeClassifier on Scikit-learn digits dataset.

# Reshape data to 3D for visualisation
dim = int(np.sqrt(X_test.shape[1]))
X_test_images = X_test.reshape((len(X_test), dim, dim))
# Visualise random examples from digits dataset
visualize_random(X_test_images, y_test, examples_per_class=8)

Output 15. Random sample of Scikit-learn digits from each class 0-9 with actual labels.

# Convert predictions into numpy array
y_pred = np.array(y_pred)
# Visualise the images and their predicted labels
visualize_predictions(X_test_images, y_pred, examples_per_class=8)

Output 16. Random sample of Scikit-learn digits from each class 0-9 with predicted labels.

Testing ID3 with modified (summarised) Scikit-learn digits

7. One striking difference should be in the ratio of breadth to depth of the two trees. Why is that the case? Modify your data set to contain only three values for the attributes (instead of the potentially 17 distinct values 0-16), e.g., ‘dark’, ‘grey’, and ‘light’, with, for example, ‘dark’ representing pixel values <5.0 and ‘light’ those >10.0. Train and test the classifier again. Do your results improve? Can you match the SKLearn implementation’s accuracy? If not, why do you think this is the case?

The difference in shape arises because ID3 creates one child per distinct attribute value; with up to 17 distinct pixel intensities, the digits tree grows very broad and shallow, whereas CART’s binary splits produce narrower, deeper trees. Summarising the pixel values into three bins reduces this branching factor. We define the bins as follows:

# Bin pixel intensities into three levels:
# 0 = 'dark' (values <= 4), 1 = 'grey' (values 5-10), 2 = 'light' (values > 10)
bin_bounds = [4, 10]
digits_data_summarised = np.digitize(digits.data, bins=bin_bounds, right=True)
digits_target_summarised = digits.target
# Split the summarised data into 70% training and 30% test sets
X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(digits_data_summarised, digits_target_summarised, test_size=0.3)
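# Sanity check on the binning (hypothetical pixel values, not from the dataset):
# with right=True, values <= 4 fall in bin 0 ('dark'), values in (4, 10] in
# bin 1 ('grey'), and values > 10 in bin 2 ('light')
print(np.digitize(np.array([0., 3., 4., 5., 10., 11., 16.]), bins=bin_bounds, right=True))  # [0 0 0 1 1 2 2]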
# Get attributes and classes for digits data
attributes_bin = list(np.unique(X_train_bin))
# Convert to tuple format for ID3 classifier
classes_bin = tuple(np.unique(y_train_bin))
# Initialise decision tree model
id3 = ID3DecisionTreeClassifier()
# Fit on digits summarised data (build tree)
digitsSummarisedTree = id3.fit(X_train_bin, y_train_bin, attributes_bin, classes_bin, dataset="digits_summarised")
# Visualise tree with Graphviz
plot = id3.make_dot_data()
# Render tree in PDF
plot.render("myDigitsSummarisedTree")

Due to the size of the tree, we’re unable to render it on a single PDF page. Run the code for yourself to visualise the results :-)

# Prediction over digits summarised test data
y_pred = id3.predict(X_test_bin, digitsSummarisedTree, dataset="digits_summarised")
print('ID3 Decision Tree Classifier on Scikit-learn digits (summarised) dataset... \n')
print('-'*10 + 'Classification Report' + '-'*5)
print(metrics.classification_report(y_test_bin, y_pred))
print('-'*10 + 'Confusion Matrix' + '-'*10)
plot_confusion_matrix(metrics.confusion_matrix(y_test_bin, y_pred))
ID3 Decision Tree Classifier on Scikit-learn digits (summarised) dataset... 

----------Classification Report-----
              precision    recall  f1-score   support

           0       0.84      0.72      0.78        53
           1       0.61      0.70      0.65        56
           2       0.73      0.68      0.70        59
           3       0.64      0.69      0.66        51
           4       0.71      0.78      0.74        63
           5       0.69      0.72      0.71        50
           6       0.67      0.67      0.67        54
           7       0.88      0.86      0.87        49
           8       0.51      0.39      0.44        56
           9       0.56      0.63      0.60        49

    accuracy                           0.68       540
   macro avg       0.68      0.68      0.68       540
weighted avg       0.68      0.68      0.68       540

----------Confusion Matrix----------

Output 17. Confusion matrix for ID3 DecisionTreeClassifier on Scikit-learn digits (summarised) dataset.
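
To pinpoint which digits the summarised model still confuses most, one option is to inspect the largest off-diagonal entry of the confusion matrix. A minimal sketch, reusing the y_test_bin and y_pred from above:

# Find the most frequent misclassification in the confusion matrix
cm = metrics.confusion_matrix(y_test_bin, y_pred)
np.fill_diagonal(cm, 0)  # ignore correct predictions on the diagonal
true_cls, pred_cls = np.unravel_index(cm.argmax(), cm.shape)
print(f'Most frequent confusion: true {true_cls} predicted as {pred_cls} ({cm.max()} times)')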

# Reshape the test data to 3D for visualisation
dim = int(np.sqrt(X_test_bin.shape[1]))
X_test_bin_images = X_test_bin.reshape((len(X_test_bin),dim,dim))
# Visualise random images from digits summarised dataset
visualize_random(X_test_bin_images, y_test_bin, examples_per_class=8)

Output 18. Random sample of Scikit-learn digits from each class 0-9 with actual labels.

# Convert predictions into numpy array
y_pred = np.array(y_pred)
# Visualise the images and their predicted labels
visualize_predictions(X_test_bin_images, y_pred, examples_per_class=8)

Output 19. Random sample of Scikit-learn digits from each class 0-9 with predicted labels.

8. (Bonus: If interested, explore the effects of different parameters regulating the depth of the tree, the minimum number of samples per leaf, or the number of samples required for a split, initially on the SKLearn version, but of course you can also implement them for your own classifier.) We’ll leave a full exploration to the reader, though a minimal starting sketch is given below.
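
As a starting point for the bonus, here is a minimal sketch that sweeps max_depth and min_samples_leaf on the Scikit-learn DecisionTreeClassifier, using the X_train/X_test split from part 1 (the parameter grid below is an arbitrary choice):

from sklearn.tree import DecisionTreeClassifier

# Sweep tree depth and minimum samples per leaf; report test accuracy
for max_depth in [2, 4, 8, None]:
    for min_samples_leaf in [1, 5, 20]:
        clf = DecisionTreeClassifier(max_depth=max_depth, min_samples_leaf=min_samples_leaf, random_state=0)
        clf.fit(X_train, y_train)
        print(f'max_depth={max_depth}, min_samples_leaf={min_samples_leaf}: '
              f'test accuracy = {clf.score(X_test, y_test):.3f}')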

Congrats, you’ve reached the end of this article! By now, you should be familiar with decision trees for classification. You explored two popular decision tree algorithms: CART and ID3. You looked at the math behind the ID3 algorithm and wrote formulas to perform attribute selection using the information gain metric. You ran the CART and ID3 algorithms on data and measured their performance on multinomial classification. Lastly, you improved their results by (1) tweaking the hyperparameters and (2) modifying the data’s attribute values. While there is always more to explore, we hope this has been a fun and informative introduction.

Credits

This assignment was prepared by the EDAN95 HT2019 teaching staff at Lunds Tekniska Högskola (LTH). The skeleton code for the ID3 algorithm and the ToyData dataset was authored by E.A. Topp.

Most of the code from the ID3 algorithm was inspired by A. Sears-Collins’ Iterative Dichotomiser 3 Algorithm in Python.

Additional credits:

2019-11-15 - Written by Jonathan Logan Moran


Page design by Ankit Sultana