Online density estimates: a probabilistic condensed representation of data for knowledge discovery

Dissertation, Open Access

Abstract

The Internet of Things (IoT) and the data generated by its sensors are placing new demands on data mining methods. These demands stem from the desire to benefit from the knowledge contained in this data and from the increasing number of devices equipped with such sensors. According to companies such as Intel or HP, the number of sensors worldwide is likely to exceed one trillion by 2022. All of them will produce streams of measurements, and leveraging knowledge from these streams requires infrastructure to analyze them in real time. From a data mining perspective, this involves challenging tasks such as cleaning the data, handling large volumes of data, and preserving privacy, to name a few. The state of the art in data mining has already addressed some of these challenges, but the proposed methods are typically designed for a specific task (e.g., predicting a certain variable or finding frequent patterns) and perform this task while scanning the data stream. However, at the time the data is collected, it is often not known what kind of analysis needs to be performed, or there are several, possibly even dependent, analysis tasks. This means that whenever storing the original data is either infeasible due to its sheer volume or impossible due to privacy concerns, the user has to wait for more data before initiating another analysis task, which impedes the use of conventional data mining algorithms. Therefore, in this thesis we present a framework called MiDEO (Mining Density Estimates inferred Online), which decouples the process of collecting the data from the actual analysis. It uses density estimates to maintain a compact representation of the data stream and provides inference capabilities to perform queries on them. The queries can be combined into complex data mining tasks and allow the estimates to be adapted to the current needs of the user or the algorithm.
Compared to current methods, which typically focus on one task at a time, this enables a more interactive analysis of the data stream in which task selection is itself part of the analysis. In the course of designing such a framework, we develop several methods that improve the state of the art. These include online density estimators for conditional joint densities with mixed types of variables, an online density estimator for high-dimensional data, algorithms to perform pattern mining on online density estimates, an online density estimator that is able to represent recurrences in the data stream, and algorithms that enforce well-known privacy-preserving properties to protect the entities described by the data. To show the effectiveness of these methods, we prove some of their theoretical properties and perform an extensive set of experiments.
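
The idea of decoupling data collection from analysis can be illustrated with a minimal sketch. It is not the MiDEO implementation: it assumes a single numeric variable with a known value range and uses a plain equal-width histogram as the density estimate. The class name `OnlineHistogram` and its methods are hypothetical and chosen only for this illustration. Each value is seen once and then discarded, and interval queries are later answered from the estimate alone.

```python
class OnlineHistogram:
    """Equal-width histogram over [low, high), updated one value at a time.

    Illustrative stand-in for an online density estimate; the range and the
    number of bins are assumed to be known in advance.
    """

    def __init__(self, low, high, bins):
        self.low = low
        self.high = high
        self.bins = bins
        self.width = (high - low) / bins
        self.counts = [0] * bins
        self.total = 0

    def update(self, x):
        # Map the value to a bin; out-of-range values are clamped
        # into the first or last bin.
        idx = int((x - self.low) / self.width)
        idx = min(max(idx, 0), self.bins - 1)
        self.counts[idx] += 1
        self.total += 1

    def probability(self, a, b):
        """Estimate P(a <= X < b), assuming values are uniform within a bin."""
        if self.total == 0:
            return 0.0
        mass = 0.0
        for i, count in enumerate(self.counts):
            left = self.low + i * self.width
            right = left + self.width
            # Length of the part of this bin that lies inside [a, b).
            overlap = max(0.0, min(b, right) - max(a, left))
            mass += count * overlap / self.width
        return mass / self.total


def summarize(stream, low, high, bins):
    """Consume a stream once and keep only the compact estimate."""
    estimate = OnlineHistogram(low, high, bins)
    for x in stream:
        estimate.update(x)
    return estimate
```

After `summarize` has consumed the stream, different queries such as `estimate.probability(1.0, 2.0)` can be issued without access to the original measurements, which is the property the abstract refers to as decoupling collection from analysis.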
