Informatik, TU Wien

On the Evaluation of Unsupervised Outlier Detection

Measures, Datasets, and an Empirical Study

Abstract

The evaluation of unsupervised outlier detection algorithms is a constant challenge in data mining research. Little is known regarding the strengths and weaknesses of different standard outlier detection models, or the impact of parameter choices for these algorithms. The scarcity of appropriate benchmark datasets with ground truth annotation is a significant impediment to the evaluation of outlier methods. Even when labeled datasets are available, their suitability for the outlier detection task is typically unknown. Furthermore, the biases of commonly used evaluation measures are not fully understood. It is thus difficult to ascertain the extent to which newly proposed outlier detection methods improve over established methods.

In this talk, we discuss our recent research results: we performed an extensive experimental study on the performance of a representative set of standard k-nearest-neighbor-based methods for unsupervised outlier detection, across a wide variety of datasets prepared for this purpose. Based on the overall performance of the outlier detection methods, we provide a characterization of the datasets themselves, and discuss their suitability as outlier detection benchmark sets. We also examine the most commonly used measures for comparing the performance of different methods, and suggest adaptations that are more suitable for the evaluation of outlier detection results.
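To make the kind of evaluation discussed here concrete, the following is a minimal sketch, not the method presented in the talk. It assumes a simple k-nearest-neighbor distance as the outlier score and computes two measures that are widely used for outlier rankings, precision at n and ROC AUC. The abstract does not say which adaptations are proposed; the chance-adjusted precision shown here is only one possible adaptation, included as an assumption for illustration. All function names and the toy data are hypothetical.

```python
from math import dist

def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def precision_at_n(scores, labels, n):
    """Fraction of true outliers (label 1) among the n highest-scored points."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in ranked[:n]) / n

def adjusted_precision_at_n(scores, labels, n):
    """Chance-adjusted P@n (an assumed adaptation, for illustration only).

    Returns 0 at the value expected for a random ranking. A perfect ranking
    yields 1 provided n does not exceed the number of outliers.
    """
    expected = sum(labels) / len(labels)
    return (precision_at_n(scores, labels, n) - expected) / (1 - expected)

def roc_auc(scores, labels):
    """Probability that a random outlier outscores a random inlier (ties count half)."""
    outliers = [s for s, y in zip(scores, labels) if y == 1]
    inliers = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((o > i) + 0.5 * (o == i) for o in outliers for i in inliers)
    return wins / (len(outliers) * len(inliers))

if __name__ == "__main__":
    # Toy data: five clustered inliers and two isolated outliers (hypothetical).
    points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5), (6, 0)]
    labels = [0, 0, 0, 0, 0, 1, 1]
    scores = knn_outlier_scores(points, k=2)
    print("P@2:", precision_at_n(scores, labels, 2))
    print("Adjusted P@2:", adjusted_precision_at_n(scores, labels, 2))
    print("ROC AUC:", roc_auc(scores, labels))
```

The adjustment matters because plain precision at n depends on the proportion of outliers in the dataset, which makes values hard to compare across datasets with different outlier rates.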

Biography

Dr. Arthur Zimek is a Privatdozent in the database systems and data mining group at the Ludwig-Maximilians-Universität München (LMU), Germany. From 2012 to 2013, he was a postdoctoral fellow in the Department of Computing Science at the University of Alberta, Edmonton, Canada. In 2014, he was a visiting professor at Technical University Vienna, Austria. He holds degrees in bioinformatics, philosophy, and theology, involving studies at universities in Munich, Mainz (Germany), and Innsbruck (Austria), and finished his Ph.D. thesis in informatics on "Correlation Clustering" at LMU in summer 2008. For this work, Zimek received the "SIGKDD Doctoral Dissertation Award (runner-up)" in 2009. His research interests include ensemble techniques, clustering, outlier detection (methods as well as evaluation), and high-dimensional data. Zimek has published more than 60 papers at peer-reviewed conferences and in international journals. Together with his co-authors, he received the "Best Paper Honorable Mention Award" at SDM 2008 and the "Best Demonstration Paper Award" at SSTD 2011. Zimek has been a member or senior member of program committees of the leading data mining conferences (e.g., SIGKDD, ECML PKDD, CIKM, SDM) and serves as a reviewer for journals such as ACM TKDD, IEEE TKDE, Data Mining and Knowledge Discovery (Springer), and Machine Learning (Springer).

Note

This talk is part of the lecture series of research talks by the visiting professors of the Vienna PhD School of Informatics.