The missing piece in ML-based query optimization
In the new era of ML-based systems, many works have proposed ML as a remedy to the problems of cost and cardinality estimation in query optimization. These works usually either build an ML model to estimate the cost or cardinality of a query plan or learn the entire optimizer. Although these solutions have demonstrated promising results, they come with a new challenge that our community has largely overlooked: the low availability of training data. In ML-based query optimization, training data consists of large and diverse plan workloads together with their execution times or output cardinalities. Collecting such training data is very costly in terms of time and money, as it requires developing and executing thousands of realistic query plans. In this talk, I will walk you through our journey of building the ML-based optimizer of Apache Wayang (Incubating) and discuss how we can overcome the challenge of collecting training data for ML-based query optimization using innovative data-driven methods.
Zoi Kaoudi is a Senior Researcher in the DIMA group at the Technical University of Berlin. She previously worked as a Scientist at the Qatar Computing Research Institute (QCRI) of Hamad Bin Khalifa University in Qatar, as a research associate at the IMIS-Athena Research Center, and as a postdoctoral researcher at Inria. She received her PhD from the National and Kapodistrian University of Athens in 2011. Her research interests lie at the intersection of machine learning systems, data management, and knowledge graphs. She is currently an Associate Editor of SIGMOD 2022 and has been the proceedings chair of EDBT 2019, co-chair of the TKDE poster track co-located with ICDE 2018, and co-organizer of MLDAS 2019 held in Qatar. She has co-authored articles in both the database and ML communities and has served as a program committee member for several international database conferences. She recently received the best demonstration award at ICDE 2022 for her work on "Training data generation for ML-based query optimization".
Learned DBMS Components 2.0: From Workload-Driven to Zero-Shot Learning
Database management systems (DBMSs) are the backbone for managing large volumes of data efficiently and thus play a central role in business and science today. To provide high performance, many of the most complex DBMS components, such as query optimizers or schedulers, involve solving non-trivial problems. To tackle such problems, very recent work has outlined a new direction of so-called learned DBMS components, where core parts of DBMSs are replaced by machine learning (ML) models, which has been shown to provide significant performance benefits. However, a major drawback of the current workload-driven learning approaches for enabling learned DBMS components is not only that they cause a very high overhead for training an ML model to replace a DBMS component, but also that this overhead occurs repeatedly, which renders these approaches far from practical.
Hence, in this talk, we present Learned DBMS Components 2.0, our vision for tackling the high costs and inflexibility of workload-driven learning. First, we introduce data-driven learning, where the idea is to learn the data distribution over a complex relational schema. In contrast to workload-driven learning, no large workload has to be executed on the database to gather training data. While data-driven learning has many applications, such as cardinality estimation or approximate query processing, it cannot support many DBMS tasks, such as physical cost estimation. We thus propose a second technique called zero-shot learning, a general paradigm for learned DBMS components. Here, the idea is to train a model that has observed a variety of workloads on different data sets and can thus generalize to unseen data sets out of the box. Initial results on the task of physical cost estimation suggest the feasibility of this approach. Finally, we discuss further opportunities enabled by zero-shot learning.
Carsten Binnig is a Full Professor in the Computer Science department at TU Darmstadt and an Adjunct Associate Professor in the Computer Science department at Brown University. He received his Ph.D. from the University of Heidelberg in 2008. Afterwards, he spent time as a postdoctoral researcher in the Systems Group at ETH Zurich and at SAP, working on in-memory databases. Currently, his research focuses on the design of scalable data management systems, databases and modern hardware, as well as machine learning for scalable systems. His work has been awarded a Google Faculty Award as well as multiple best paper and best demo awards for his research.
Data Science in the wild – From Online Games to Bees
Digitization and the Internet have now reached almost all areas of our
lives, and more and more data is becoming available. This happens not
only in online shopping or social media posts, but also in many other
areas, such as medicine, traffic, production, education, and even science.
All this data needs to be analyzed to better understand users, support
doctors, detect diseases, optimize production, improve education, and
support science, thereby making data a valuable resource. Applying data
science methods is challenging, but it is also changing the way we work
with data and paving the way to more insights.
In this talk, we will show how such data can be used to support various
application domains using deep learning models. The first part focuses on
multiplayer online games. More and more young people are not only active
gamers themselves, who need support, e.g., in the form of purchase
recommendations, but also watch games on streaming platforms like
twitch.tv, which results in exciting applications for NLP research. For
both areas, we have been able to collect large amounts of data and will
present new data science approaches and the insights gained. In the second
part of the talk, we will present new deep learning methods for analyzing
data from the smart beehives of our We4Bee project. Besides anomaly
detection, we will also focus on the systematic analysis of the collected
bee data using semi-supervised learning methods.
Catarina Pinto Moreira
Towards Human-Centred AI: What Can Machines Learn from Eye-Tracking Data
Artificial Intelligence (AI) and Deep Learning (DL) technologies have made great strides in equalling and even surpassing human performance in many tasks, particularly in healthcare. Although DL cannot replace clinicians in medical diagnosis, it can support expert radiologists in performing time-consuming tasks, such as examining chest X-rays for signs of pneumonia or COVID-19. Despite this success, the internal mechanisms of these technologies remain an enigma, because humans cannot scrutinize how these systems do what they do. This opacity poses a significant concern for adopting AI-based technologies in healthcare, because such systems are highly susceptible to biases arising from spurious correlations computed during prediction, which can put human lives in danger. This talk will present an ongoing project and its preliminary results. The project aims to make DL models understandable to radiologists by investigating how eye-tracking data can be used to teach a machine how radiologists read and classify chest X-ray images. Using multimodal data containing chest X-ray images, radiologists' eye patterns, and their respective audio recordings, this project aims to devise new methods for extracting radiologists' cognitive maps. We will pioneer the construction of new multimodal DL architectures that learn to identify abnormalities and regions of interest from the radiologists' cognitive maps and X-ray images, thus teaching machines how radiologists diagnose X-ray images. We will use that knowledge to generate explanations and promote trust in the adoption of AI-based systems, leading to a more accurate, augmented, and enhanced decision-making process in clinical practice. One important byproduct of this project will be novel computer-assisted training for young radiologists and medical imaging students, based on an Explainable User Interface designed to facilitate their learning and training.