Learning when data sets are imbalanced and when costs are unequal and unknown

Marcus A. Maloof

The problem of learning from imbalanced data sets, while not the same as the problem of learning when misclassification costs are unequal and unknown, can be handled in a similar manner. In both contexts, we can use techniques from ROC analysis to help with classifier design. We present results from two studies in which we dealt with skewed data sets and unequal but unknown costs of error. For one domain, we also compare these results with those obtained by over-sampling and under-sampling the data set. The operations of sampling, moving the decision threshold, and adjusting the cost matrix produced sets of classifiers that fell on the same ROC curve.
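
The relationship between moving the decision threshold and adjusting costs can be illustrated with a small sketch. This is not the paper's code or data: the scores and labels below are invented, the function names are hypothetical, and the cost rule assumes that the scores are calibrated probability estimates and that correct predictions cost nothing. Under those assumptions, choosing a cost ratio amounts to choosing a threshold, and every threshold yields an operating point on the same ROC curve.

```python
def operating_point(scores, labels, threshold):
    """Return (FPR, TPR) when examples with score >= threshold are called positive."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return (fp / neg, tp / pos)


def roc_points(scores, labels):
    """Sweep the threshold over every distinct score, from highest to lowest."""
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called positive
    for t in sorted(set(scores), reverse=True):
        points.append(operating_point(scores, labels, t))
    return points


def cost_threshold(cost_fp, cost_fn):
    """Cost-minimizing threshold on P(positive | x).

    Assumes calibrated probabilities and zero cost for correct predictions.
    """
    return cost_fp / (cost_fp + cost_fn)


# Invented, imbalanced toy data: 3 positives and 7 negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

curve = roc_points(scores, labels)

# Equal costs versus a false negative costing four times a false positive.
equal_cost_point = operating_point(scores, labels, cost_threshold(1, 1))
skewed_cost_point = operating_point(scores, labels, cost_threshold(1, 4))

# Both cost settings select different points, but both lie on the one curve.
print(equal_cost_point in curve, skewed_cost_point in curve)
```

Because the costs are unknown in the setting described above, no single threshold can be fixed in advance; the sketch only shows why the whole curve, rather than one operating point, is the natural object to compare.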

Paper available in PostScript (gzipped) and PDF.

@inproceedings{maloof.icmlw.03,
  author = "Maloof, M.A.",
  title = "Learning when data sets are imbalanced and when costs are
    unequal and unknown",
  booktitle = "{ICML-2003 Workshop on Learning from
    Imbalanced Data Sets II}",
  year = 2003,
  url = "http://www.site.uottawa.ca/~nat/Workshop2003/workshop2003.html",
}