Overfitting in induction classification trees

  • boundaries created to separate different classes in data
    • like lines drawn parallel to the axes of the features
  • goal → make these boundaries so that each class is completely separated
    • every point in one class is on one side of the boundary, and every point in the other class is on the other side.
  • Induction process continues until all subsets associated with leaf nodes have an entropy of 0, i.e. are perfectly homogeneous
  • can lead to overfitting (see the sketch after this list)
  • Outliers and noise end up in their own leaves with very small data subsets
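A minimal sketch of this effect, assuming scikit-learn is available: a tree grown until every leaf is pure memorises noisy training points, while a depth-limited tree generalises better. The dataset, noise level, and depth below are illustrative assumptions, not taken from these notes.

```python
# A minimal sketch: a fully grown tree memorises label noise, a shallower tree generalises better.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y injects label noise so that zero-entropy leaves require memorising noisy points
X, y = make_classification(n_samples=600, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)            # grown until all leaves are pure
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in [("fully grown", full), ("depth-3", shallow)]:
    print(f"{name}: train={tree.score(X_tr, y_tr):.2f}  "
          f"test={tree.score(X_te, y_te):.2f}  leaves={tree.get_n_leaves()}")
```

The fully grown tree typically reaches a near-perfect training score with many leaves while scoring worse than the depth-limited tree on the held-out split.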

Pruning: a solution to overfitting

  • see also: pruning in neural networks
  • Tests at the deepest levels are replaced with a leaf node
  • The subsets associated with the test's leaf nodes are combined
  • The class of the new leaf node is the class that occurs most often in the combined subset
  • If generalization performance improves, the pruning is kept (effected)
  • If generalization performance does not improve, the test is re-instated
  • Makes decision trees robust to outliers and noise (a minimal sketch of this procedure follows below)
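A minimal sketch of this bottom-up procedure on a hand-rolled tree. The `Node` class, its fields, and the `prune` helper are hypothetical names introduced for illustration; `label` is assumed to already hold the majority class of the training subset that reached each node.

```python
# Reduced-error pruning sketch: collapse the deepest tests into leaves,
# keep the change only if validation accuracy improves.

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature      # feature index tested at this node (None for a leaf)
        self.threshold = threshold  # split point of the test
        self.left = left            # branch for x[feature] <= threshold
        self.right = right          # branch for x[feature] >  threshold
        self.label = label          # majority class of the training subset that reached this node

    def is_leaf(self):
        return self.feature is None

def predict(tree, x):
    node = tree
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

def accuracy(tree, X, y):
    return sum(predict(tree, xi) == yi for xi, yi in zip(X, y)) / len(y)

def prune(node, tree, X_val, y_val):
    """Bottom-up: replace the deepest tests with a leaf, keep the change only if it helps."""
    if node.is_leaf():
        return
    prune(node.left, tree, X_val, y_val)
    prune(node.right, tree, X_val, y_val)
    if node.left.is_leaf() and node.right.is_leaf():                      # a test at the deepest level
        before = accuracy(tree, X_val, y_val)
        saved = (node.feature, node.threshold, node.left, node.right)
        node.feature = node.threshold = node.left = node.right = None    # collapse into a leaf
        if accuracy(tree, X_val, y_val) <= before:                        # no improvement: re-instate the test
            node.feature, node.threshold, node.left, node.right = saved
```

Many implementations also keep a prune when validation accuracy is merely unchanged, preferring the simpler tree; the sketch follows the stricter rule stated in the list above.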

Overfitting from information gain

  • Favours input variables with many outcomes
  • results in many branches
    • ∴ many decision rules
    • ∴ a complex classifier
  • For input variables with unique values (e.g. an ID attribute): one branch per value

Solution to this bias:

  • Normalize the information gain by the entropy of the test outcomes (the split information)
  • Gain ratio is computed as $\text{gainRatio}(x)=\frac{\text{gain}(x)}{\text{splitInfo}(x)}$, where $\text{splitInfo}(x)=-\sum_{o=1}^{O} p_o \log_2 p_o$ and $p_o$ is the fraction of examples with outcome $o$
  • Objective: select the $x$ that maximizes the gain ratio (a small sketch follows below)
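A small sketch of these formulas, with hypothetical helper names and toy data: an ID-like attribute with one unique value per example achieves maximal information gain, but its large split information pushes the gain ratio well below that of a genuinely informative test.

```python
# Illustrative computation of information gain, split information, and gain ratio.
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain(outcomes, labels):
    """Information gain of a test whose outcome for each example is given in `outcomes`."""
    n = len(labels)
    remainder = 0.0
    for o in set(outcomes):
        subset = [lab for out, lab in zip(outcomes, labels) if out == o]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def split_info(outcomes):
    return entropy(outcomes)            # -sum_o p_o log2 p_o over the test's outcomes

def gain_ratio(outcomes, labels):
    si = split_info(outcomes)
    return gain(outcomes, labels) / si if si > 0 else 0.0

labels  = ["yes", "yes", "no", "no", "yes", "no"]
weather = ["sun", "sun", "rain", "rain", "sun", "rain"]   # few outcomes, genuinely predictive
row_id  = ["a", "b", "c", "d", "e", "f"]                  # unique per example: one branch per value
print(gain(weather, labels), gain_ratio(weather, labels))
print(gain(row_id, labels), gain_ratio(row_id, labels))   # equally high gain, much lower gain ratio
```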
