We connect researchers and practitioners to present original research and provide an overview of established methods. For the benefit of all, we invite both regional and international speakers to join us. To learn more about the project, please visit kompaki.de.

The presentations, including discussion, are typically 20-60 minutes long and are recorded. Below, you will find an overview of the talks and links to recordings, where available.

ReStore: Neural Data Completion for Tabular Data

Benjamin Hilprecht (Technische Universität Darmstadt)

Watch recording

Abstract: Classical approaches for OLAP assume that the data of all tables is complete. However, in the case of incomplete tables with missing tuples, classical approaches fail, since the result of a SQL aggregate query might differ significantly from the result computed on the full dataset. Today, the only way to deal with missing data is to complete the dataset manually, which not only requires high effort but also good statistical skills to determine when a dataset is actually complete. In this talk, we present an automated approach for relational data completion called ReStore, using a new class of (neural) schema-structured completion models that are able to synthesize data resembling the missing tuples. As we show in our evaluation, this efficiently helps to reduce the relative error of aggregate queries by up to 390% on real-world data compared to using the incomplete data directly for query answering.
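
To see why missing tuples distort aggregates, consider a minimal sketch (our own illustration of the problem, not ReStore's method; table and numbers are hypothetical):

```python
import pandas as pd

# Full table of orders vs. an incomplete copy with two tuples missing.
full = pd.DataFrame({"region": ["EU", "EU", "US", "US"],
                     "revenue": [100, 300, 200, 400]})
incomplete = full.drop(index=[1, 3])  # tuples lost during integration

true_avg = full["revenue"].mean()            # 250.0
observed_avg = incomplete["revenue"].mean()  # 150.0
rel_error = abs(observed_avg - true_avg) / true_avg
print(f"relative error of AVG(revenue): {rel_error:.0%}")  # 40%
```

ReStore's completion models aim to synthesize tuples that resemble the missing ones, so that such aggregates move back towards the values on the full dataset.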


About the presenter: Benjamin Hilprecht is interested in bringing together machine learning and database systems. Having completed his Doctor of Science in Computer Science at Technische Universität Darmstadt, he has worked extensively in data management, which led to his recognition as runner-up for the SIGMOD Jim Gray Dissertation Award in 2023.

An Ontology-Based Concept for Meta AutoML

Alexander Zender (Hochschule Darmstadt)

Watch recording

Abstract: Automated machine learning (AutoML) supports ML engineers and data scientists by automating tasks like model selection and hyperparameter optimization. A number of AutoML solutions have been developed, both open-source and commercial. We propose a concept called OMA-ML (Ontology-based Meta AutoML) that combines the strengths of existing AutoML solutions by integrating them (meta AutoML). OMA-ML is based on an ML ontology that guides the meta AutoML process. It supports multiple user groups, with and without programming skills. By combining the strengths of AutoML solutions, it supports any number of ML tasks and ML libraries.
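
The "meta" part of the concept can be illustrated with a small sketch: one task is dispatched to several AutoML backends behind a uniform adapter interface, and the best result is kept (interface and names are our illustration, not OMA-ML's actual, ontology-driven design):

```python
def meta_automl(X_train, y_train, X_val, y_val, adapters):
    """Dispatch one ML task to several AutoML backends and keep the best model."""
    best_model, best_score = None, float("-inf")
    for adapter in adapters:   # uniform wrappers around existing AutoML solutions
        model = adapter.fit(X_train, y_train)
        score = adapter.score(model, X_val, y_val)
        if score > best_score:
            best_model, best_score = model, score
    return best_model
```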


About the presenter: Alexander Zender holds a Master’s degree in Computer Science. Since 2021, he has pursued a doctorate at the Darmstadt University of Applied Sciences (h_da). His main research interest is bringing together AutoML and ontologies in unified systems. Having published on this topic, he also investigates applied topics in which user interaction and natural language processing play a significant role.

Whittle Networks: A Deep Likelihood Model for Time Series

Fabrizio Ventola (Technische Universität Darmstadt)

Watch recording

Abstract: While probabilistic circuits have been extensively explored for tabular data, less attention has been paid to time series. In this scenario, the goal is to estimate joint densities over entire time series and, in turn, to determine, for instance, conditional independence relations between them. To this end, we propose the first probabilistic circuit (PC) approach for modeling the joint distribution of multivariate time series, called Whittle sum-product networks (WSPNs). WSPNs leverage the Whittle approximation, casting the likelihood in the frequency domain, and place a complex-valued sum-product network, the most prominent type of PC, over the frequencies. The conditional independence relations among the time series can then be determined efficiently in the spectral domain. Moreover, WSPNs can naturally be placed into the deep neural learning stack for time series, resulting in Whittle Networks, opening the likelihood toolbox for training deep neural models and inspecting their behaviour. Our experiments show that Whittle Networks can indeed capture complex dependencies between time series and provide a useful measure of uncertainty for neural networks.
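
For readers unfamiliar with it, the Whittle approximation replaces the exact Gaussian likelihood of a stationary series with a sum over Fourier frequencies λ_k, comparing the periodogram I(λ_k) to the model's spectral density f_θ(λ_k); its standard form, stated here for context, is:

```latex
\ell_W(\theta) \;\approx\; -\sum_{k} \left( \log f_\theta(\lambda_k) + \frac{I(\lambda_k)}{f_\theta(\lambda_k)} \right)
```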


About the presenter: Fabrizio Ventola is a Ph.D. student in Machine Learning at the Computer Science Department of TU Darmstadt, Germany, as part of the KompAKI project. Previously, he was part of AIPHES (TU Darmstadt, HITS, Heidelberg University), a research training group mainly focused on natural language processing over large-scale text sources. His research focuses on deep models for tractable probabilistic inference, which can help humans model complex phenomena while dealing with uncertainty. He aims to enable these models to efficiently provide insights into extensive (unstructured) data collections, make accurate and trustworthy predictions, and generate valuable new data.

Do We Really Need Data Cleaning for ML? A Comprehensive Benchmark (Work in Progress)

Mohamed Abdelaal (Software AG)

Watch recording

Abstract: Nowadays, machine learning (ML) plays a vital role in many aspects of our daily life. In essence, building well-performing ML applications requires high-quality data throughout the entire life-cycle of such applications. Nevertheless, most real-world structured data suffers from different types of discrepancies, such as missing values, outliers, duplicates, pattern violations, and inconsistencies. Such discrepancies typically emerge while collecting, transferring, storing, and/or integrating the data. To deal with them, numerous data cleaning methods have been introduced by the data management and ML communities. However, the majority of these methods broadly overlook the requirements imposed by downstream ML models. As a result, the potential of utilizing these data cleaning methods in ML pipelines remains largely untapped. In this work, we introduce a comprehensive benchmark to thoroughly investigate the impact of data cleaning methods on various ML models. Through the benchmark, we provide answers to important research questions, e.g., whether and where data cleaning is a necessary step in the ML pipeline. To this end, the benchmark examines several simple and advanced error detection and repair methods. To evaluate these methods, we utilize publicly available datasets covering different domains, together with a wide collection of ML models.
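
The core loop of such a benchmark can be sketched in a few lines (our simplification; the cleaner interface and names are placeholders, not the benchmark's actual API):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def benchmark(datasets, cleaners, make_model=LogisticRegression):
    """Measure downstream model quality for each dataset/cleaning-method pair."""
    results = {}
    for name, (X_train, y_train, X_test, y_test) in datasets.items():
        for cleaner in cleaners:  # e.g. mean imputation, outlier removal, dedup
            X_clean, y_clean = cleaner(X_train, y_train)
            model = make_model().fit(X_clean, y_clean)
            accuracy = accuracy_score(y_test, model.predict(X_test))
            results[(name, cleaner.__name__)] = accuracy
    return results
```

Comparing the scores against a no-cleaning baseline then answers, per dataset and model, whether cleaning actually helped the downstream task.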


About the presenter: Mohamed Abdelaal is a Research Scientist at Software AG in Stuttgart, specializing in the intersection of IoT and machine learning. He holds a Ph.D. in Computer Science from Carl von Ossietzky Universität Oldenburg. His academic journey has been distinguished with multiple awards, including the Mobinil Excellence Award and recognitions from Port Said and Suez Canal Universities. Beyond his extensive research experience, Mohamed has a notable record of publications in energy efficiency and sensor networks.

Flexible and Extensible Competency Management with Knowledge Graphs

Peter Haase (metaphacts)

Watch recording

Abstract: Especially in the diverse and fast-paced field of Artificial Intelligence, it is imperative to have a clear picture of relevant competencies and how they are distributed within or across organisations. For this purpose, we have developed a generic competency ontology that can be used to describe the competencies of people and organisational structures in the Artificial Intelligence domain. The ontology is embedded in an application to create, manage, and utilize a Competency Knowledge Graph. In our presentation, we show concrete application scenarios, advantages, and challenges.
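
As a toy illustration of what a competency statement in such a knowledge graph might look like (class and property names are invented for this example, not the actual ontology):

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/competency/")
g = Graph()
g.add((EX.alice, RDF.type, EX.Person))
g.add((EX.alice, EX.hasCompetency, EX.MachineLearning))
g.add((EX.alice, EX.competencyLevel, Literal("expert")))
g.add((EX.MachineLearning, EX.subFieldOf, EX.ArtificialIntelligence))
print(g.serialize(format="turtle"))
```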


About the presenter: Peter Haase studied Computer Science in Rostock and completed his doctorate at the Karlsruhe Institute of Technology (KIT) in 2006. Having worked as a software engineer and later as a researcher at IBM and fluid Operations, he founded metaphacts in 2014. As its Chief Scientific Officer, his research focuses on managing and using Knowledge Graphs.

Towards Learned Metadata Extraction for Data Lakes

Sven Langenecker (Technische Universität Darmstadt & Duale Hochschule Baden-Württemberg Mosbach)

Watch recording

Abstract: An important task for enabling the efficient exploration of available data in a data lake is to annotate the available data sources with semantic type information. In order to reduce the manual overhead of annotation, learned approaches for automatic metadata extraction on structured data sources have been proposed recently. While initial results of these learned approaches seem promising, it is still not clear how well they generalize to new, unseen data in real-world data lakes. In this talk, we aim to tackle this question and, as a first contribution, present the results of a study applying Sato — a recent approach based on deep learning — to a real-world data set. Our study shows that Sato without re-training is only able to extract semantic data types for about 10% of the columns of the real-world data set. These results show a general limitation of deep learning approaches, which often provide near-perfect performance on available training and testing data but fail in real settings, since training data and real data often vary strongly. Hence, as a second contribution, we propose a new direction of using weak supervision and present results of an initial prototype we built to generate labeled training data with low manual effort, to improve the performance of learned semantic type extraction approaches on new, unseen data sets.
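
The weak supervision idea can be illustrated with simple labeling functions that cast noisy votes on a column's semantic type (a minimal sketch of the general technique; the actual prototype is more elaborate):

```python
import re
from collections import Counter

def lf_date(values):
    hits = sum(bool(re.match(r"\d{4}-\d{2}-\d{2}", str(v))) for v in values)
    return "date" if hits / len(values) > 0.8 else None

def lf_year(values):
    hits = sum(str(v).isdigit() and 1900 <= int(v) <= 2100 for v in values)
    return "year" if hits / len(values) > 0.8 else None

def weak_label(column, labeling_functions):
    """Aggregate the noisy votes of all labeling functions by majority vote."""
    votes = [lf(column) for lf in labeling_functions]
    votes = [v for v in votes if v is not None]
    return Counter(votes).most_common(1)[0][0] if votes else "unknown"

print(weak_label(["2021", "1999", "2005"], [lf_date, lf_year]))  # year
```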


About the presenter: Sven Langenecker joined the Data Management Lab in May 2020. His primary research focuses on metadata management for data lakes and the associated automatic extraction of metadata information. He holds a Master’s degree in Computer Science and a Bachelor’s degree in Mechatronics, both from DHBW and in cooperation with Läpple AG.

AutoML - Goals, Trends and Future Research

Jonas Seng (Technische Universität Darmstadt)

Watch recording

Abstract: AutoML is one of the hottest research branches in the AI community. Since ML is being applied to problems across many domains, from business to research, the job of the data scientist has evolved in recent years. Data scientists gather data, understand business or research objectives, and try to extract new knowledge from data, often using ML techniques. However, data scientists are frequently a bottleneck in knowledge generation, because much manual work is still involved in data science tasks. AutoML tackles this problem by automating the data science process, with the goal of removing the hurdles of using ML. Systems already exist that automate parts of the data science process, but more research is needed to arrive at a system capable of fully automating the data science cycle, from data extraction to model evaluation.


About the presenter: Jonas Seng received a Master’s degree in Computer Science from TU Darmstadt in 2019. Since 2021, he has been a doctoral researcher at the Artificial Intelligence and Machine Learning Lab (AIML) at TU Darmstadt. His research interests center on AutoML, which aims to automate the building of Machine Learning (ML) models and pipelines. Currently, his focus is on Neural Architecture Search (NAS), hyperparameter optimization, and connecting AutoML with Federated Learning (FL).

Bayesian Classifier Fusion with an Explicit Model of Correlation

Susanne Trick (Technische Universität Darmstadt, Psychologie der Informationsverarbeitung)

Watch recording

Abstract: Combining the outputs of multiple classifiers or experts into a single probabilistic classification is a fundamental task in machine learning, with broad applications from classifier fusion to expert opinion pooling. Here, we present a hierarchical Bayesian model of probabilistic classifier fusion based on a new correlated Dirichlet distribution. This distribution explicitly models positive correlations between marginally Dirichlet-distributed random vectors, thereby allowing explicit modeling of correlations between base classifiers or experts. The proposed model naturally accommodates the classic Independent Opinion Pool and other independent fusion algorithms as special cases. It is evaluated in terms of uncertainty reduction and correctness of fusion on synthetic and real-world data sets. We show that a change in performance of the fused classifier due to uncertainty reduction can be Bayes optimal even for highly correlated base classifiers.
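
For context, the classic Independent Opinion Pool mentioned above fuses the base classifiers' posteriors under a conditional-independence assumption; in its standard Bayesian form:

```latex
p(c \mid x_1, \ldots, x_n) \;\propto\; p(c)^{1-n} \prod_{i=1}^{n} p(c \mid x_i)
```

The correlated Dirichlet model generalizes such independent fusion by explicitly modeling the correlations that this assumption ignores.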


About the presenter: Susanne Trick joined the lab as a PhD student in January 2019. She is part of the IKIDA research group, which investigates interactive AI algorithms and human-robot interaction. Besides the IKIDA project, from 2019 to 2021 she also worked on the Kobo34 project, which aims to help maintain the independence of elderly people with a humanoid robot that supports everyday activities. Her research focuses on understanding and predicting human behavior, particularly in the interaction between humans and robots. Her main interest is integrating data from multiple modalities (e.g., gaze, gestures, speech). Consequently, she also works on probabilistic modeling of optimal combinations of multimodal data, particularly considering their uncertainty and correlation.

Towards “Large Table Models” for Enterprise Data Management

Madelon Hulsebos (INDE Lab of the University of Amsterdam & Sigma Computing)

Watch recording

Abstract: We have developed models that “understand” images, code, and natural language, and put them to use for driving cars, correcting code, and explaining jokes. However, we have long ignored the relational data that dominates the enterprise data landscape. The top three database systems, for example, are all relational, and the dominant file formats (37%) indexed by Google Dataset Search are tabular. To make the “modern data stack” intelligent towards relational data, we need to shift our attention from Large Language Models to Large Table Models. In this talk, I will discuss progress in this direction, present Sherlock, a model for surfacing the semantics of table columns, and GitTables, a large-scale corpus of tables extracted from CSV files on GitHub, and close with ongoing efforts.
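
To give a flavor of column-level semantic type detection, here is a deliberately simplified featurization step (not Sherlock's actual feature set, which is far richer):

```python
import statistics

def column_features(values):
    """Turn a table column into a small numeric feature vector."""
    strings = [str(v) for v in values]
    lengths = [len(s) for s in strings]
    return {
        "frac_numeric": sum(s.replace(".", "", 1).isdigit() for s in strings) / len(strings),
        "mean_length": statistics.mean(lengths),
        "distinct_ratio": len(set(strings)) / len(strings),
    }

print(column_features(["Berlin", "Paris", "Rome"]))
```

A trained classifier then maps such per-column features to semantic types like city, year, or price.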


About the presenter: Madelon Hulsebos is a PhD candidate at the INDE Lab of the University of Amsterdam and PhD researcher at Sigma Computing. She is interested in making (relational) data systems intelligent. Prior to starting her PhD, she did research at the MIT Media Lab and worked as a data scientist in industry.

Dynamic Algorithm Configuration

Theresa Eimer (Leibniz University Hannover)

Watch recording

Abstract: Hyperparameters play a key role in determining many algorithms’ performance. We have seen, however, that choosing a single static value for each hyperparameter may not be optimal. Schedules and heuristics that adapt, for example, learning rates for neural networks or step sizes in evolutionary algorithms show that different points in an algorithm’s runtime have different ideal hyperparameter settings. Dynamic Algorithm Configuration aims to improve upon simple heuristics by learning to control hyperparameters dynamically across different algorithm instances. This talk will cover recent advances in dynamic configuration in Machine Learning as well as open challenges for future research.
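
Conceptually, Dynamic Algorithm Configuration treats hyperparameter control as a sequential decision problem; a minimal sketch of the control loop (interfaces are hypothetical, for illustration only):

```python
def run_with_dynamic_config(algorithm, policy, steps):
    """Let a learned policy set a hyperparameter at every step of the run."""
    state = algorithm.reset()
    for _ in range(steps):
        hyperparameter = policy.act(state)     # e.g. a learning rate or step size
        state, reward = algorithm.step(hyperparameter)
        policy.observe(state, reward)          # the policy learns across runs/instances
    return algorithm.result()
```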


About the presenter: Theresa Eimer is a research scientist at the Institute of AI at the Leibniz University Hannover. Her main research interest is in AutoML for and with Reinforcement Learning, through Dynamic Algorithm Configuration and AutoRL. She focuses on benchmarking and improving generalization as well as making training more reliable and efficient.

Scalable Data Visualization

Dominik Moritz (Carnegie Mellon University & Apple)

Watch recording

Abstract: In this talk, I will cover the challenges of and solutions for scaling interactive visualization systems to large data. The challenges for scalable visualization systems can be grouped into two areas: perceptual and interactive scalability. A visual representation that shows every record as its own mark quickly overwhelms our visual system; we call this the problem of perceptual scalability. For very large data, the necessary transformations can overwhelm the data processing system; we call this the problem of interactive scalability. In this talk, I will elaborate on these problems and present solutions that enable interactive analysis of billions of records without latency degrading the experience.
This talk is an extended version of a section of my job talk recorded at Microsoft Research.
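
A common answer to perceptual scalability is to aggregate records into bins before rendering, e.g. a density heatmap instead of one mark per record (a small sketch of the general idea, not one of the talk's systems):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=10_000_000), rng.normal(size=10_000_000)

# Aggregate 10M points into a 100x100 grid of counts; the renderer then
# draws the grid, not the individual records.
counts, xedges, yedges = np.histogram2d(x, y, bins=100)
print(counts.shape, counts.sum())  # (100, 100) 10000000.0
```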


About the presenter: Dominik Moritz is on the faculty at Carnegie Mellon University, where he co-leads the Data Interaction Group at the Human-Computer Interaction Institute, and manages the visualization team in Apple’s machine learning organization. At CMU, his group develops interactive systems that empower everyone to effectively analyze and communicate data. Dominik’s systems (Vega-Lite, Falcon, Draco, Voyager, and others) have won awards at academic venues (e.g., IEEE VIS and CHI) and are widely used in industry and by the Python and JavaScript data science communities. Dominik did his PhD at the Paul G. Allen School at the University of Washington, where he was advised by Jeff Heer and Bill Howe.

DiffML: End-to-end Differentiable ML Pipelines

Eduardo S. dos Reis (Technische Universität Darmstadt)

Watch recording

Abstract: In this paper, we present our vision of differentiable ML pipelines, called DiffML, which truly allows automating the construction of ML pipelines in an end-to-end fashion. Recently, there has been a lot of work on automating individual steps of ML pipelines, such as data cleaning. However, while these techniques help reduce manual effort, it is not guaranteed that they also improve the performance of the downstream ML model. To this end, with DiffML we propose to jointly train the ML model with the data engineering steps in an automated manner. Our core novel idea is to formulate all steps and their alternatives in a differentiable way, such that the entire ML pipeline can be trained end-to-end using backpropagation. However, this is a non-trivial problem and opens up many new research questions. To show the feasibility of this direction, we demonstrate initial ideas and a general principle of how typical data engineering steps can be formulated as differentiable programs and jointly learned with the ML model. Moreover, we discuss a research roadmap and core challenges that have to be systematically tackled to enable fully differentiable ML pipelines.
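
A toy sketch of the core idea (our own, heavily simplified, not DiffML's actual formulation): make the choice between two imputation alternatives differentiable via softmax weights, so the pipeline step is trained jointly with the model by backpropagation:

```python
import torch

x = torch.tensor([[1.0, float("nan")], [3.0, 4.0]])
mask = torch.isnan(x)
candidates = torch.stack([
    torch.where(mask, torch.tensor(4.0), x),   # impute the column mean (here: 4.0)
    torch.where(mask, torch.tensor(0.0), x),   # impute a constant zero
])

alpha = torch.nn.Parameter(torch.zeros(2))     # learnable selection logits
x_imputed = (torch.softmax(alpha, 0)[:, None, None] * candidates).sum(0)

model = torch.nn.Linear(2, 1)
loss = model(x_imputed).pow(2).mean()          # stand-in for the task loss
loss.backward()                                # gradients reach alpha as well
print(alpha.grad)                              # the imputation choice is learnable
```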


About the presenter: Eduardo S. dos Reis is a Brazilian researcher with a Master’s degree in Applied Computing, focused on deep learning methods for human pose estimation, and five years of research in partnership with Siemens Healthineers (workflow analysis for operating rooms) and Dell (financial applications). Currently, as a PhD candidate at TU Darmstadt’s Data Management Lab, his research interest is end-to-end trainable data engineering pipelines (e.g., missing value imputation, data cleaning, data augmentation).

Green AutoML: A Paradigm Shift Towards a More Mindful Resource Consumption

Tanja Tornede (Leibniz University Hannover)

No recording available (yet).

Abstract: AutoML is a well-established research field with an international community that, as of this year, even has its own dedicated conference. The tedious day-to-day work of machine learning scientists has inspired researchers to automate various aspects of the search for the best-performing model. For a long time, this challenge has been tackled with immense computational power, resulting in an immense environmental footprint. Due to ongoing climate change, the consequences of emitted CO2e are becoming more and more perceptible. A shift in the AutoML community towards considering its environmental footprint is therefore both important and inevitable. In this talk, options for quantifying such a footprint are summarized, and existing work on strategies for designing and benchmarking AutoML tools with their footprint in mind is revisited. Furthermore, the talk elaborates on how to be transparent about the environmental footprint and which research incentives could direct the community towards more sustainable AutoML research.
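
One common way to quantify such a footprint is energy consumed times the carbon intensity of the local grid (a back-of-the-envelope sketch; the numbers are purely illustrative):

```python
# Illustrative numbers only: a 300 W GPU running for 48 hours of HPO,
# on a grid with a carbon intensity of 400 gCO2e per kWh.
power_kw = 0.3
hours = 48
grid_gco2e_per_kwh = 400

energy_kwh = power_kw * hours                   # 14.4 kWh
footprint_kg = energy_kwh * grid_gco2e_per_kwh / 1000
print(f"{footprint_kg:.1f} kg CO2e")            # 5.8 kg CO2e
```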

About the presenter: Tanja Tornede is a researcher at the Institute of Artificial Intelligence at Leibniz University Hannover (LUH). She is doing her Ph.D. at Paderborn University, supervised by Eyke Hüllermeier, in the field of AutoML for predictive maintenance, especially the estimation of the remaining useful lifetime of machines. Over time, she has thought more and more about the environmental impact of AutoML in general, which is why she started her research in that direction and coined the term Green AutoML. Since 2021, she has been on a mission to raise awareness of the environmental impact of (AutoML) research, motivating researchers to rethink their own habits and workflows.

Efficient Hyperparameter Optimization Through Transfer Learning and Early Stopping

Huibin Shen (Amazon)

Watch recording

Abstract: Hyperparameter Optimization (HPO) is widely known and applied in both academia and industry. But the process can be costly, especially when tuning hyperparameters for deep learning models, hindering its further adoption. In this talk, we will start with the basics of HPO, move on to improving its efficiency through transfer learning, and review some notable works from the past few years. Finally, we will introduce automatic termination, which speeds up the process from an orthogonal direction.
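
One widely used form of early stopping in HPO is successive halving: train many configurations on a small budget, keep the better half, and double the budget (a generic sketch of that scheme, not necessarily the specific methods covered in the talk):

```python
import random

def successive_halving(sample_config, train_and_eval, n=16, budget=1):
    """Keep the better half of configurations at each rung, doubling the budget."""
    configs = [sample_config() for _ in range(n)]
    while len(configs) > 1:
        scores = [(train_and_eval(c, budget), c) for c in configs]
        scores.sort(key=lambda s: s[0], reverse=True)      # higher is better
        configs = [c for _, c in scores[: len(scores) // 2]]
        budget *= 2
    return configs[0]

# Toy usage: "configs" are numbers, and the objective prefers values near 0.7.
best = successive_halving(lambda: random.random(),
                          lambda c, b: -abs(c - 0.7) * (1 + 1 / b))
print(best)
```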


About the presenter: Huibin Shen is a Senior Applied Scientist at AWS, where he has worked on hyperparameter optimization, AutoML, and more recently AIOps and forecasting. He is mainly interested in inventing simple and scientifically sound solutions to practical problems. Before joining Amazon, he completed his PhD in Machine Learning and Computational Biology at Aalto University, Finland, in 2017.

Accurate and Efficient Multi-Task Learning for Vision Tasks

Lijun Zhang (University of Massachusetts Amherst)

Watch recording

Abstract: AI-powered applications increasingly adopt Deep Neural Networks (DNNs) for solving many prediction tasks, leading to more than one DNN running on resource-constrained devices. Supporting many models simultaneously on a device is challenging due to the linearly increasing computation, energy, and storage costs. An effective approach to address this problem is multi-task learning (MTL), where a set of tasks is learned jointly, allowing some parameter sharing among tasks. MTL creates multi-task models based on common DNN architectures and has shown significantly reduced inference costs and improved generalization performance in many vision applications. In this talk, I will introduce our recent efforts on leveraging MTL to improve accuracy and efficiency in vision tasks, including multi-task architecture design systems that can automatically identify resource-efficient multi-task models with low computation costs and high task accuracy.
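
The parameter-sharing idea at the heart of MTL can be sketched as a shared backbone with per-task heads (a minimal PyTorch sketch, not one of the talk's searched architectures):

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """One shared feature extractor ('backbone'), one lightweight head per task."""
    def __init__(self, tasks):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.heads = nn.ModuleDict(
            {name: nn.Conv2d(32, channels, 1) for name, channels in tasks.items()})

    def forward(self, x):
        features = self.backbone(x)   # computed once, shared across all tasks
        return {name: head(features) for name, head in self.heads.items()}

model = MultiTaskModel({"segmentation": 21, "depth": 1})
outputs = model(torch.randn(1, 3, 64, 64))   # two task outputs, one backbone pass
```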


About the presenter: Lijun Zhang is a Ph.D. candidate in the College of Information and Computer Sciences (CICS) at the University of Massachusetts Amherst, the flagship campus of the UMass system, where she is advised by Professor Hui Guan. She received her Master’s and Bachelor’s degrees in Software Engineering from Tongji University. Her research focuses on machine learning and computer vision, with a particular emphasis on developing and optimizing multi-task learning algorithms and systems for vision tasks. Her current work involves developing multi-task models that achieve high task accuracy, low computation cost, and strong robustness.

Querying Large Language Models with SQL

Paolo Papotti (EURECOM)

No recording available (yet).

Abstract: With the rise of pre-trained Large Language Models (LLMs), there is now an effective solution to store and use information extracted from massive corpora of text documents. Thus, we envision the use of SQL queries to cover a broad range of data that is not captured by traditional databases (DBs) by tapping the information in LLMs. This ability would enable the hybrid querying of both LLMs and DBs with the SQL interface, which is more expressive and precise than NL prompts. To ground this vision, we present a prototype based on a traditional DB architecture with new physical operators for querying the underlying LLM. For a large class of SQL queries, querying LLMs returns well-structured relations, with encouraging qualitative results. We pinpoint several research challenges that must be addressed to build a DBMS that jointly exploits LLMs and DBs. While some challenges call for new contributions from the NLP field, others offer novel research avenues for the DB community.
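
In spirit, such a physical operator evaluates a templated natural-language prompt per input row and returns the answers as a new column (a deliberately stubbed sketch of the idea, not the prototype's implementation):

```python
def llm(prompt: str) -> str:
    """Stub standing in for a call to a real LLM."""
    return {"What is the capital of France?": "Paris",
            "What is the capital of Italy?": "Rome"}.get(prompt, "?")

def llm_scan(rows, template, out_col):
    """A toy 'physical operator': one LLM call per input row."""
    for row in rows:
        yield {**row, out_col: llm(template.format(**row))}

countries = [{"name": "France"}, {"name": "Italy"}]
for row in llm_scan(countries, "What is the capital of {name}?", "capital"):
    print(row)  # {'name': 'France', 'capital': 'Paris'} ...
```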


About the presenter: Paolo Papotti has been an Associate Professor at EURECOM, France, since 2017. He received his PhD from Roma Tre University (Italy) in 2007 and held research positions at the Qatar Computing Research Institute (Qatar) and Arizona State University (USA). His research focuses on data management and information quality. He has authored more than 140 publications, and his work has been recognized with two “Best of the Conference” citations (SIGMOD 2009, VLDB 2016), three best-demo awards (SIGMOD 2015, DBA 2020, SIGMOD 2022), and two Google Faculty Research Awards (2016, 2020).

Leveraging LLMs for Human Mobility Forecasting

Hao Xue (University of New South Wales, Sydney)

Watch recording

Abstract: The release of ChatGPT and the rapid development of GPTs have resulted in a transformative trend in our daily lives: the use of prompts has become increasingly popular. In this talk, I will present a new, versatile paradigm that leverages Large Language Models (LLMs) for time-series forecasting. This paradigm opens up exciting possibilities in domains such as traffic forecasting and energy demand forecasting, where the integration of natural language prompts not only demonstrates excellent forecasting performance but also shows the flexibility of predictive models. Furthermore, I will present our recent vision paper, Artificial General Intelligence for Human Mobility, which outlines a visionary approach to harnessing artificial general intelligence to revolutionize how AI models for human mobility-related tasks are developed. I will also provide insights into the conceptual framework and potential applications of this concept.
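
The prompt-based paradigm boils down to serializing a numeric history into text and asking the model for the next value; a minimal sketch of the kind of template involved (the wording is our own illustration):

```python
def mobility_prompt(place, counts):
    """Turn a numeric visit history into a natural-language forecasting prompt."""
    history = ", ".join(str(c) for c in counts)
    return (f"From Monday to Saturday, {history} people visited {place}. "
            f"How many people will visit {place} on Sunday?")

print(mobility_prompt("the museum", [120, 135, 128, 160, 210, 240]))
```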


About the presenter: Dr Hao Xue holds a Lecturer position at the School of Computer Science and Engineering at the University of New South Wales (UNSW Sydney), Australia. He obtained his PhD from The University of Western Australia in 2020. After completing his doctorate, he worked as a Research Fellow at the School of Computing Technologies at RMIT University and at UNSW Sydney. He is an Associate Investigator at the UNSW node of the ARC Centre of Excellence for Automated Decision-Making and Society (ADM+S). He was awarded the DAAD AInet Fellowship in 2022 and is a member of the Research Infrastructure Committee (Transport/Mobility Focus Area) at ADM+S. His research interests include spatiotemporal data modelling, time-series forecasting, language-generation-based forecasting, and data-efficient time-series representation learning. He has years of experience developing AI models for analyzing human mobility behaviours and has received funding for multiple research projects.

CAESURA: Language Models as Multi-Modal Query Planners

Matthias Urban (Technische Universität Darmstadt)

No recording available (yet).

Abstract: Traditional query planners translate SQL queries into query plans to be executed over relational data. However, these planners cannot query other data modalities, such as images, text, or video, stored in modern data systems such as data lakes. In this paper, we propose Language-Model-Driven Query Planning, a new paradigm of query planning that uses Language Models to translate natural language queries into executable query plans. Unlike relational query plans, the resulting plans can contain complex operators that are able to process arbitrary modalities. As part of this paper, we present a first GPT-4-based prototype called CAESURA and show the general feasibility of this idea on two datasets. Finally, we discuss several ideas for improving the query planning capabilities of today’s Language Models.
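
To make the idea concrete, a multi-modal query plan can be thought of as a sequence of typed operators emitted by the language model (an illustrative structure of our own, not CAESURA's internal representation):

```python
from dataclasses import dataclass

@dataclass
class PlanStep:
    operator: str      # relational ("filter", "join") or multi-modal ("visual_qa")
    arguments: dict

# A plan a language model might produce for:
# "What is the average price of paintings that show a dog?"
plan = [
    PlanStep("visual_qa", {"table": "paintings", "column": "image",
                           "question": "Does this painting show a dog?"}),
    PlanStep("filter",    {"predicate": "visual_qa_result == 'yes'"}),
    PlanStep("aggregate", {"function": "avg", "column": "price"}),
]
```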


About the presenter: Matthias Urban is a PhD candidate in the Systems Group at the Technical University of Darmstadt. He works at the intersection of Large Language Models and databases.

Towards Foundation Models For Relational Databases

Liane Vogel (Technische Universität Darmstadt)

No recording available (yet).

Abstract: Data engineering tasks, such as data transformation and data cleaning, still require a large amount of manual effort. The field of table representation learning develops methods to automate these tedious tasks. However, existing tabular representation models only learn representations of individual tables. Despite the fact that most organizations use relational databases, representation learning for databases with multiple tables remains an under-explored area. In this talk, we present our vision of relational representation learning, which not only learns from the full structure of relational databases but also scales to the larger database sizes that are common in the real world.


About the presenter: Liane Vogel is a PhD Candidate at the Data and AI Systems Lab at the Technical University of Darmstadt, with a background in Machine Learning, Natural Language Processing and Computer Science.