Jeremy Z. Kolter and Marcus A. Maloof

In this paper, we describe the development of a fielded application for detecting malicious executables in the wild. We gathered 1971 benign and 1651 malicious executables and encoded each as a training example using n-grams of byte codes as features. Such processing resulted in more than 255 million distinct n-grams. After selecting the most relevant n-grams for prediction, we evaluated a variety of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting. Ultimately, boosted decision trees outperformed other methods with an area under the ROC curve of 0.996. Results also suggest that our methodology will scale to larger collections of executables. To the best of our knowledge, ours is the only fielded application for this task developed using techniques from machine learning and data mining.
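
The encoding step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `byte_ngrams` and `encode`, the n-gram length of 4, and the two-entry vocabulary are assumptions made for illustration only; the abstract states neither the n-gram length nor how the most relevant n-grams were selected.

```python
from typing import List, Set


def byte_ngrams(data: bytes, n: int = 4) -> Set[bytes]:
    """Return the distinct overlapping n-byte sequences found in data.

    n = 4 is an illustrative default, not a value taken from the abstract.
    """
    # Slide a window of width n over the raw bytes, one byte at a time.
    # If data is shorter than n, the range is empty and so is the result.
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def encode(data: bytes, vocabulary: List[bytes], n: int = 4) -> List[bool]:
    """Encode one executable as presence/absence flags over a fixed vocabulary.

    The vocabulary stands in for the selected n-grams; how it is chosen
    is outside the scope of this sketch.
    """
    present = byte_ngrams(data, n)
    return [gram in present for gram in vocabulary]


# Toy input: eight arbitrary bytes standing in for an executable's contents.
sample = bytes.fromhex("4d5a900003000000")
grams = byte_ngrams(sample, n=4)

# Hypothetical vocabulary: one n-gram that occurs in the sample, one that does not.
vocab = [bytes.fromhex("4d5a9000"), bytes.fromhex("deadbeef")]
features = encode(sample, vocab)
```

Each executable becomes a fixed-length Boolean vector, which any of the inductive methods listed in the abstract could take as input. Keeping only distinct n-grams per file, rather than counts, is a design choice of this sketch.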

© ACM, 2004. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2004) http://doi.acm.org/10.1145/1014052.1014105

Paper available in PDF.

@inproceedings{kolter.kdd.04,
  author = "Kolter, J. Z. and Maloof, M. A.",
  title = "Learning to detect malicious executables in the wild",
  booktitle = "{Proceedings of the Tenth ACM SIGKDD International Conference
    on Knowledge Discovery and Data Mining}",
  pages = "470--478",
  year = 2004,
  publisher = "ACM Press",
  address = "New York, NY",
  note = "Best Application Paper",
}