Decision Tree
The decision tree is one of the most widely used classification techniques. It is simple, can be trained with a limited number of examples, is easy to interpret, and works well with categorical attributes.
The model is built from a set of yes/no questions that form the tree. Since the number of possible decision trees is exponential, the goal is to find the best one (the one that gives the most accurate representation of the data).
All classification algorithms work in a multidimensional space and try to find regions that contain objects of the same type (i.e. belonging to the same class).

Learning the Model
Many algorithms are available, but we will use C4.5.
Hunt's Algorithm
It is a recursive approach that progressively subdivides a set of records Dt into increasingly pure subsets.
Procedure to follow:
- If Dt contains records belonging only to class yj, then t is a leaf node with label yj
- If Dt is an empty set, then t is a leaf node labeled with the default class of its parent node
- If Dt contains records belonging to several classes, you choose an attribute and a split policy to partition the records into multiple subsets.
- Apply the same procedure recursively to each subset
TreeGrowth(E, F)
  if StoppingCond(E, F) = TRUE then
    leaf = CreateNode()
    leaf.label = Classify(E)
    return leaf
  else
    root = CreateNode()
    root.test_cond = FindBestSplit(E, F)
    let V = {v | v is a possible outcome of root.test_cond}
    for each v ∈ V do
      E_v = {e | root.test_cond(e) = v and e ∈ E}
      child = TreeGrowth(E_v, F)
      add child as descendant of root and label the edge (root → child) as v
    end for
  end if
  return root
end
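A minimal runnable sketch of the same recursion in Python, assuming categorical attributes, a Gini-based FindBestSplit and an invented toy dataset (attribute names and labels are made up for illustration):

from collections import Counter

def gini(records, target="class"):
    # Gini impurity of a set of records (each record is a dict)
    counts = Counter(r[target] for r in records)
    n = len(records)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def find_best_split(records, attributes, target="class"):
    # Pick the attribute whose partition has the lowest weighted Gini
    best_attr, best_score = None, float("inf")
    for attr in attributes:
        score = 0.0
        for v in set(r[attr] for r in records):
            subset = [r for r in records if r[attr] == v]
            score += len(subset) / len(records) * gini(subset, target)
        if score < best_score:
            best_attr, best_score = attr, score
    return best_attr

def tree_growth(records, attributes, target="class"):
    # Hunt-style recursion: returns a nested dict, or a class label for a leaf
    labels = set(r[target] for r in records)
    if len(labels) == 1 or not attributes:  # stopping conditions
        return Counter(r[target] for r in records).most_common(1)[0][0]
    attr = find_best_split(records, attributes, target)
    node = {attr: {}}
    for v in set(r[attr] for r in records):
        subset = [r for r in records if r[attr] == v]
        node[attr][v] = tree_growth(subset, [a for a in attributes if a != attr], target)
    return node

data = [
    {"outlook": "sunny", "windy": "no", "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "stay"},
    {"outlook": "rain", "windy": "no", "class": "play"},
    {"outlook": "rain", "windy": "yes", "class": "stay"},
]
print(tree_growth(data, ["outlook", "windy"]))
# {'windy': {'no': 'play', 'yes': 'stay'}}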
Characteristic Features
Starting from this basic logic, to completely define an algorithm for building decision trees it is necessary to specify:
- The split condition (depends on the type of attribute and on the number of splits)
  - Nominal (N-ary split vs binary split)
  - Ordinal (the partitioning must not violate the ordering of the values)
  - Continuous (with a binary split the condition is a comparison test, e.g. A < v; with an N-ary split the values are discretized into disjoint ranges)
    - Static (discretization takes place only once, before applying the algorithm)
    - Dynamic (discretization takes place at each recursion)
- The criterion defining the best split (it must favor splits that produce the purest child nodes, according to a purity measure such as Gini or entropy; see the sketch after this list)
- The criterion for stopping the splitting (if any one of the following conditions holds, splitting stops)
  - When all its records belong to the same class
  - When all its records have similar values on all attributes
  - When the number of records in the node is below a certain threshold
  - When the selected criterion would not be statistically relevant
- Methods for evaluating the goodness of a decision tree
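As an illustration of the purity measures mentioned in the split criterion above, a short sketch (with invented class counts) computing the Gini index and entropy of a parent node and of a candidate split:

import math

def entropy(counts):
    # Entropy of a node given its class counts
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gini(counts):
    # Gini index of a node given its class counts
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Hypothetical parent node with 10 positive and 10 negative records,
# split into two children with counts [7, 3] and [3, 7]
parent = [10, 10]
children = [[7, 3], [3, 7]]
n = sum(parent)
weighted = sum(sum(c) / n * gini(c) for c in children)
print("parent Gini   :", gini(parent))            # 0.5
print("split Gini    :", weighted)                # 0.42
print("Gini gain     :", gini(parent) - weighted) # 0.08
print("parent entropy:", entropy(parent))         # 1.0

The lower the weighted impurity of the child nodes, the better the split.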
Metrics for Model Evaluation
Confusion Matrix evaluates the ability of a classifier based on the following indicators:
- TP (true positive)
- FN (false negative)
- FP (false positive)
- TN (true negative)
Accuracy is the most widely used metric to summarize the information of a confusion matrix

Accuracy Limitations
Accuracy is not an appropriate metric if the classes contain a very different number of records.
Precision and Recall are two metrics used in applications where the correct classification of positive-class records is more important
- Precision measures the fraction of records that are actually positive among all those classified as positive
- Recall measures the fraction of positive records that are correctly classified

F-measure is a metric that summarizes precision and recall in a single value (their harmonic mean)
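A small sketch showing how these metrics are computed from the confusion matrix counts (the numbers are invented):

# Hypothetical confusion matrix counts
TP, FN, FP, TN = 40, 10, 5, 945

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f_measure = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F={f_measure:.3f}")
# With 950 negatives and only 50 positives, a classifier that always predicts
# "negative" would reach 0.95 accuracy: this is the class-imbalance limitation
# of accuracy mentioned above, which precision and recall expose.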
Cost-Based Evaluation
Accuracy, precision, recall and F-measure classify an instance i as positive if P(+|i) > P(-|i). They assume that FN and FP have the same weight, i.e. they are cost-insensitive, but in many domains this is not true.
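A sketch of the cost-sensitive alternative: given an explicit cost matrix (the costs below are invented), the prediction becomes the class with the lowest expected cost rather than the most probable one:

# Hypothetical cost matrix: cost[predicted][actual]
cost = {"+": {"+": 0,   "-": 1},    # a false positive costs 1
        "-": {"+": 100, "-": 0}}    # a false negative costs 100

def cheapest_class(p_pos):
    # Choose the prediction with the lowest expected cost
    p = {"+": p_pos, "-": 1 - p_pos}
    expected = {pred: sum(cost[pred][act] * p[act] for act in p)
                for pred in cost}
    return min(expected, key=expected.get)

# Even with only a 20% probability of the positive class, the high FN cost
# makes "+" the cheaper prediction
print(cheapest_class(0.20))  # '+'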

ROC Space (Receiver Operating Characteristic)
ROC graphs are two-dimensional graphs that depict the relative tradeoff between benefits (true positives) and costs (false positives) induced by a classifier. We distinguish between:
- Probabilistic classifiers return a score that is not necessarily a probability in the strict sense, but represents the degree to which an object is a member of one particular class rather than another (see the sketch after this list)
- Discrete classifiers predict only the class to which a test object belongs
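A minimal sketch of how the points of a ROC graph can be obtained from a probabilistic classifier by sweeping a threshold over its scores (scores and labels are invented); each threshold yields one (FPR, TPR) point:

def roc_points(scores, labels):
    # One (FPR, TPR) point for each distinct score used as threshold
    P = sum(labels)
    N = len(labels) - P
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / N, tp / P))
    return points

scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3]  # hypothetical classifier scores
labels = [1,   1,   0,   1,    0,   0]    # 1 = positive, 0 = negative
print(roc_points(scores, labels))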

Classification Errors
- Training error: mistakes that are made on the training set
- Generalization error: errors made on the test set
- Underfitting: the model is too simple and does not allow a good classification of either the training set or the test set
- Overfitting: the model is too complex, it allows a good classification of the training set, but a poor classification of the test set
  - Due to noise (the boundaries of the areas are distorted)
  - Due to the reduced size of the training set
How to handle overfitting
- Pre-pruning: stop splitting before the tree grows too deep. A node cannot be split further if:
  - The node does not contain instances
  - All instances belong to the same class
  - All attributes have the same values
- Post-pruning: grow the complete tree, then prune its nodes to reduce the generalization error
Post-pruning is more effective but involves a higher computational cost, since it relies on the evidence provided by the complete tree.
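For comparison, a short sketch using scikit-learn (whose trees are CART-style, not C4.5): pre-pruning corresponds to constraints such as max_depth and min_samples_leaf, while post-pruning corresponds to cost-complexity pruning via ccp_alpha. The dataset and parameter values are only illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

# Pre-pruning: stop splitting early via depth / leaf-size constraints
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_tr, y_tr)

# Post-pruning: grow the full tree, then prune it back with cost-complexity pruning
post = DecisionTreeClassifier(ccp_alpha=0.02).fit(X_tr, y_tr)

print("pre-pruned  test accuracy:", pre.score(X_te, y_te))
print("post-pruned test accuracy:", post.score(X_te, y_te))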
Estimate Generalization Error
A decision tree should minimize the error on the real data; unfortunately, during construction only the training set is available.
The methods for estimating the generalization error are:
- Optimistic approach (assume the generalization error equals the error on the training set)
- Pessimistic approach (add to the training error a fixed penalty for each leaf node; see the sketch after this list)
- Minimum Description Length (choose the model that minimizes the cost to describe a classification)
- Using the test set
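A tiny sketch of the optimistic and pessimistic estimates, following the usual textbook formulation in which the pessimistic estimate adds a fixed penalty (typically 0.5) for each leaf; the counts are invented:

def optimistic_error(train_errors, n_records):
    # Optimistic approach: generalization error = error on the training set
    return train_errors / n_records

def pessimistic_error(train_errors, n_leaves, n_records, penalty=0.5):
    # Pessimistic approach: add a fixed penalty for each leaf node
    return (train_errors + penalty * n_leaves) / n_records

# Hypothetical tree: 1000 training records, 40 errors, 30 leaves
print(optimistic_error(40, 1000))        # 0.040
print(pessimistic_error(40, 30, 1000))   # 0.055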
Building the Test Set
- Holdout: use 2/3 of the records for training and 1/3 for validation
- Random subsampling: repeated execution of the holdout method, each time selecting the training records at random
- Cross validation: partition the records into k separate folds, train on k-1 folds and test on the remaining one; repeat k times and compute the average accuracy (see the sketch after this list)
- Bootstrap: the training records are sampled with replacement, and the records that are never extracted form the validation set. This method does not create a new dataset with more information, but it can stabilize the results obtained on the available dataset.
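A short sketch of cross validation and bootstrap, assuming scikit-learn is available (the dataset is only an example):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)

# k-fold cross validation: one accuracy per held-out fold, then the average
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
print("fold accuracies:", scores, "average:", scores.mean())

# Bootstrap: sample indices with replacement; the never-extracted
# (out-of-bag) records form the validation set
idx = resample(np.arange(len(X)), replace=True, random_state=0)
oob = np.setdiff1d(np.arange(len(X)), idx)
clf = DecisionTreeClassifier().fit(X[idx], y[idx])
print("out-of-bag accuracy:", clf.score(X[oob], y[oob]))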
C4.5 (J48 on Weka)
This algorithm exploits the GainRatio approach. It manages continuous attributes by determining a split point that divides the range of values into two. It handles data with missing values and runs post-pruning on the created tree.
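A small sketch of the GainRatio criterion used by C4.5 (information gain normalized by the split information of the partition); the class counts of the candidate split are invented:

import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, children_counts):
    # GainRatio = InformationGain / SplitInfo
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    info_gain = entropy(parent_counts) - weighted
    split_info = entropy([sum(c) for c in children_counts])
    return info_gain / split_info

# Hypothetical split of a node with 10 positive and 10 negative records
print(gain_ratio([10, 10], [[7, 3], [3, 7]]))  # about 0.119

Normalizing by the split information penalizes attributes that produce many small partitions, which plain information gain tends to favor.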
