The CART Algorithm#

The algorithm used to build classification trees is called CART, which stands for Classification and Regression Tree.

The Gini metric#

Several metrics enter the optimization of a decision tree. The main one is the Gini impurity:

Definition 47

Gini’s diversity index is a measure of a node’s impurity.

It is defined for each tree node \(i\) as:

(50)#\[\begin{equation} G_i = 1 - \sum_{k=1}^{N_\text{classes}} \left( \frac{N_{k, i}}{ N_i} \right)^2 = 1 - \sum_{k=1}^{N_\text{classes}} \left( p_k \right)^2 \end{equation}\]

with \(N_{k, i}\) the number of data samples of class \(k\) in node \(i\) and \(N_{i}\) the total number of data samples in node \(i\). The terms \(p_k\) in the sum are equivalent to the probability of getting a sample of class \(k\) in the node \(i\).

The Gini impurity index ranges from 0 (a 100% pure node) towards 1 (a very impure node).

If a node has an equal number of data points from two classes, let’s say 500 points each, the Gini index will be:

(51)#\[\begin{equation} G_i = 1 - \left( \frac{500}{1000} \right)^2 - \left( \frac{500}{1000} \right)^2 = 0.5 \end{equation}\]

With three classes present in equal proportions, it will be \(\frac{2}{3}\); with four classes, \(\frac{3}{4}\); and so on.
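
As a quick check of Definition 47, here is a minimal sketch computing \(G_i\) from the class counts of a node; it reproduces the values above (the function name `gini_impurity` is purely illustrative):

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity of a node given the number of samples of each class."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()          # p_k = N_{k,i} / N_i
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([500, 500]))       # two balanced classes   -> 0.5
print(gini_impurity([300, 300, 300]))  # three balanced classes -> 0.666...
print(gini_impurity([1000]))           # pure node              -> 0.0
```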

The cost function in CART#

Once again, we will need a cost function to minimize. This cost function will be defined at a given node containing \(N_\text{node}\) data samples.

Definition 48

The cost function for CART is defined as:

(52)#\[\begin{equation} C(j, t_j) = \frac{N_\text{left}}{N_\text{node}} G_\text{left} + \frac{N_\text{right}}{N_\text{node}} G_\text{right} \end{equation}\]

It is a sum in which the Gini impurity of each subset is weighted by the subset’s relative size. In the literature, you may see the terms left and right; they refer to:

  • left: all data points whose feature \(x_j\) is such that \(x_j < t_j\)

  • right: all data points whose feature \(x_j\) is such that \(x_j \geq t_j\)
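
To make the notation concrete, here is a minimal sketch of this cost, assuming `x_j` and `y` are NumPy arrays holding one feature column and the class labels of the node (the function names are illustrative, not taken from any library):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a set of class labels (0 for an empty or pure set)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_cost(x_j, y, t_j):
    """CART cost C(j, t_j): Gini impurity of each subset weighted by its size."""
    left, right = y[x_j < t_j], y[x_j >= t_j]   # the "left" and "right" subsets
    n_node = len(y)
    return (len(left) / n_node) * gini_impurity(left) \
         + (len(right) / n_node) * gini_impurity(right)
```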

We will see the role of the hyperparameters later. For now, let’s go through the steps of the algorithm.

How CART works#

Algorithm 3 (Classification and Regression Tree)

Inputs

  • Training data set \(X\) of \(m\) samples, each with \(n\) input features, associated with their targets \(y\)

\[\begin{equation*} X = \begin{pmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_j^{(1)} & \cdots & x_n^{(1)} \\[2ex] x_1^{(2)} & x_2^{(2)} & \cdots & x_j^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots & & \vdots \\ x_1^{(i)} & x_2^{(i)} & \cdots & x_j^{(i)} & \cdots & x_n^{(i)} \\ \vdots & \vdots & & \vdots & \ddots & \vdots \\ x_1^{(m)} & x_2^{(m)} & \cdots & x_j^{(m)} & \cdots & x_n^{(m)} \\ \end{pmatrix} \hspace{10ex} y = \begin{pmatrix} y^{(1)} \\[2ex] y^{(2)} \\[2ex] \vdots \\ y^{(i)}\\ \vdots \\[2ex] y^{(m)}\end{pmatrix} \end{equation*}\]

Hyperparameters

  • Maximum depth

  • Minimum number of samples required to split a node

  • Minimum number of samples per leaf

  • Maximum number of leaf nodes
    (non-exhaustive list)

Outputs
A collection of decision boundaries segmenting the input feature space.

Initialization: start at the root node, which contains the entire training set

STEP 1: THRESHOLD COMPUTATION:

For each feature \(j\) from 1 to \(n\):
 
\(\;\;\;\) For each candidate threshold \(t_j\) scanning the range \(x^\text{min}_j\) to \(x^\text{max}_j\):

\(\;\;\;\;\;\; \rightarrow\) Compute the cost function \(C(j, t_j)\)

\(\rightarrow\) Once all candidate pairs have been evaluated, select the pair \((j, t_j)\) that produces the purest subsets, i.e. the one that minimizes:

(53)#\[\begin{equation} \min_{(j, \: t_j)} C(j, t_j) \end{equation}\]

STEP 2: BRANCHING:
Split the dataset along feature \(j\) at the threshold \(t_j\) into two new child nodes.

Repeat Step 1 at each new node.

Exit conditions
At least one stopping condition set by the hyperparameters is fulfilled
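
The threshold scan of Step 1 can be sketched in a few lines of Python. This is an illustrative brute-force version, not the implementation of any particular library, assuming `X` is an \(m \times n\) NumPy array and `y` the array of class labels of the current node:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels (0 for an empty or pure set)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """STEP 1: exhaustive search for the pair (j, t_j) minimizing C(j, t_j)."""
    n_samples, n_features = X.shape
    best_j, best_t, best_cost = None, None, np.inf
    for j in range(n_features):                       # scan every feature
        values = np.unique(X[:, j])
        thresholds = (values[:-1] + values[1:]) / 2   # midpoints between sorted values
        for t in thresholds:
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            cost = (len(left) / n_samples) * gini(left) \
                 + (len(right) / n_samples) * gini(right)
            if cost < best_cost:
                best_j, best_t, best_cost = j, t, cost
    return best_j, best_t, best_cost
```

Step 2 would then split the data on feature `best_j` at threshold `best_t` and repeat the same search on each child node until one of the stopping conditions is met. In practice, scikit-learn’s `DecisionTreeClassifier` exposes the hyperparameters listed above as the `max_depth`, `min_samples_split`, `min_samples_leaf` and `max_leaf_nodes` arguments.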

One of the greatest strengths of decision trees is that they make very few assumptions about the training data. This differs from linear or polynomial models, where we need to know beforehand that the data can be fit with a linear or a polynomial function.

Another advantage: feature scaling is not necessary. No need to standardize the dataset beforehand.

Decision trees are also easy to interpret: it is possible to check the calculations and even apply the classification rules manually. Such transparent algorithms, often called white box models, contrast with black box models such as neural networks.

Despite these advantages, decision trees have several limitations.