Abstract:
Nowadays, machine learning (ML) plays a vital role in many aspects of our daily life. In essence, building well-performing ML applications requires high-quality data throughout the entire life-cycle of such applications. Nevertheless, most real-world structured data suffer from different types of discrepancies, such as missing values, outliers, duplicates, pattern violations, and inconsistencies. Such discrepancies typically emerge while collecting, transferring, storing, and/or integrating the data. To deal with them, numerous data cleaning methods have been introduced by the data management and ML communities. However, the majority of these methods broadly overlook the requirements imposed by downstream ML models. As a result, the potential of utilizing these data cleaning methods in ML pipelines remains largely unexplored. In this work, we introduce a comprehensive benchmark to thoroughly investigate the impact of data cleaning methods on various ML models. Through the benchmark, we answer important research questions, e.g., whether and where data cleaning is a necessary step in the ML pipeline. To this end, the benchmark examines several simple and advanced error detection and repair methods. To evaluate these methods, we utilize publicly available datasets covering different domains together with a wide collection of ML models.
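To make the benchmark's core idea concrete, here is a minimal sketch of the kind of comparison it performs: train the same model on dirty versus repaired data and compare downstream accuracy. This is illustrative only and not the benchmark's actual code; the error type (injected missing values), the repair method (mean imputation as a stand-in for the evaluated cleaners), and all names are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Build a small synthetic dataset (purely illustrative).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 5)), columns=[f"f{i}" for i in range(5)])
y = (X["f0"] + X["f1"] > 0).astype(int)

# Simulate one discrepancy type from the abstract: missing values (20% of cells).
X_dirty = X.mask(rng.random(X.shape) < 0.2)

# A simple repair method: mean imputation, standing in for the benchmark's cleaners.
X_clean = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(X_dirty), columns=X.columns
)

# Compare downstream model quality on dirty vs. cleaned data.
for name, features in [("dirty (rows with NaNs dropped)", X_dirty.dropna()),
                       ("cleaned (mean-imputed)", X_clean)]:
    labels = y.loc[features.index]
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print(f"{name}: accuracy = {accuracy_score(y_te, model.predict(X_te)):.3f}")
```

The benchmark described in the abstract generalizes this pattern across many error types, detection/repair methods, datasets, and model families.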
About the presenter:
Mohamed Abdelaal is a Research Scientist at Software AG in Stuttgart, specializing in the intersection of IoT and machine learning. He holds a Ph.D. in Computer Science from Carl von Ossietzky Universität Oldenburg. His academic career has been recognized with multiple awards, including the Mobinil Excellence Award and honors from Port Said and Suez Canal Universities. Beyond his extensive research experience, Mohamed has a notable publication record in energy efficiency and sensor networks.