Foundations of Exploratory Data Mining

Research labs, enterprises and governments accumulate data at a high and rapidly increasing rate. Often, these data are collected for no specific purpose, or they turn out to be useful for unanticipated purposes. For example: data generated by the Internet of Things and data available through increasingly numerous open data initiatives create vast opportunities for improving our understanding of our environment; companies constantly look for new ways to monetize their customer databases; governments mine various databases to detect tax fraud; and security agencies mine and cross-associate numerous heterogeneous information streams from publicly accessible and classified databases to understand and detect security threats.

Most of these tasks are examples of Exploratory Data Mining (EDM) tasks, where the objective cannot be clearly defined beforehand: it is unclear how to formalize the interestingness of the patterns extracted from the data. Unfortunately, this means that EDM is often a slow process of trial and error.

In this research topic we aim to remedy this by developing the mathematical principles of what makes a pattern interesting in a truly subjective sense. Crucial to this endeavour is research into automatic mechanisms that model and duly take into account the prior beliefs and expectations of the user for whom the EDM patterns are intended, thus relieving users of the complex task of formalizing for themselves what makes a pattern interesting to them.

This approach contrasts with the prevailing manner in which EDM research is done: researchers typically imagine a specific purpose for the patterns, try to formalize the interestingness of such patterns given that purpose, and design an algorithm to extract them. However, given the variety of users and purposes, this strategy has led to a multitude of algorithms. As a result, users need to be data mining experts to understand which algorithm applies to their situation. To resolve this, we aim to develop a theoretically solid framework for the design of EDM systems that model the user's beliefs and expectations as much as the data itself, so as to maximize the amount of useful information transmitted to the user. This will ultimately bring the power of EDM within reach of the non-expert.

Finally, we aim to incorporate our theoretical results into highly usable algorithms, to be subsequently deployed in real-life applications from web and social media mining, bioinformatics, and more.

Staff

Tijl De Bie, Jefrey Lijffijt

Researchers

Florian Adriaens, Xi Chen, Junning Deng, Bo Kang, Yalavarthi Vijaya Krishna, Alexandru Cristian Mara, Ahmad Mel, Robin Vandaele.

Projects

ERC Consolidator Grant FORSIED: “Formalising Subjective Interestingness in Exploratory Data Mining”.

Odysseus Grant “Exploring Data: Theoretical Foundations and Applications to Web, Multimedia, and Omics Data”.

FWO grant: “Data mining without spilling the beans: preserving more than privacy alone”.

Marie Skłodowska-Curie Fellowship (Jefrey Lijffijt): “Personalised, interactive, and visual exploratory mining of patterns in complex data”.

Key publications

van Leeuwen, Matthijs, Tijl De Bie, Eirini Spyropoulou, and Cedric Mesnage. 2016. “Subjective Interestingness of Subgraph Patterns.” Machine Learning 105:41.

Puolamäki, Kai, Bo Kang, Jefrey Lijffijt, and Tijl De Bie. 2016. “Interactive Visual Data Exploration with Subjective Feedback.” In European Conference, ECML PKDD 2016, Riva Del Garda, Italy, September 19-23, 2016, Proceedings, Part II, Lecture Notes in Computer Science, 9852:214–229. Springer International Publishing.

Kontonasios, Kleanthis-Nikolaos, and Tijl De Bie. 2015. “Subjectively Interesting Alternative Clusterings.” Machine Learning 98 (1-2): 31–56.

Lijffijt, Jefrey, Eirini Spyropoulou, Bo Kang, and Tijl De Bie. 2016. “P-N-RMiner: A Generic Framework for Mining Interesting Structured Relational Patterns.” International Journal of Data Science and Analytics 1: 61.

Spyropoulou, Eirini, Tijl De Bie, and Mario Boley. 2014. “Interesting Pattern Mining in Multi-relational Data.” Data Mining and Knowledge Discovery 28 (3): 808–849.

De Bie, Tijl. 2011. “An Information Theoretic Framework for Data Mining.” In 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Proceedings, 564–572. ACM.

De Bie, Tijl. 2011. “Maximum Entropy Models and Subjective Interestingness: An Application to Tiles in Binary Databases.” Data Mining and Knowledge Discovery 23 (3): 407–446.

Our dense subgraph mining algorithm finds interesting communities in a social network of music bands, each of which corresponds to a particular music genre.

Our RMiner algorithm for relational pattern mining identifies a pattern involving 12 James Bond movies as the most interesting association in the database. The algorithm was not made aware of the fact that these movies are part of the same franchise.