This page gives a quick overview of my work in the following areas.

Query Optimization

[Figure: Dissertation on query optimization - overview]

Declarative query languages such as SQL allow users to simply describe the data they need instead of specifying how to generate it. Such languages rely on query optimizer components that translate declarative queries into executable query plans. The goal of query optimization is generally to find an optimal query plan for a given query (e.g., a plan with minimal execution time). The query optimization problem was introduced in the 1970s and is perhaps the most extensively studied optimization problem in the database community. However, the context of query optimization is continuously changing, which leads to novel challenges. For instance, modern execution engines have many more tuning parameters than before and motivate us to compare alternative query plans according to multiple cost metrics instead of only one. Both developments make optimization more challenging. On the other hand, novel optimization platforms are nowadays available that can be exploited for query optimization.
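
To make the multi-metric comparison concrete, here is a minimal sketch (the plan names, metrics, and cost values are invented for illustration and are not taken from my publications): a plan is only discarded if another plan dominates it, i.e., is at least as good on every metric and strictly better on at least one.

    # Minimal sketch: comparing query plans under multiple cost metrics.
    # All plan names and cost values below are illustrative, not real measurements.

    def dominates(costs_a, costs_b):
        # Plan A dominates plan B if A is at least as good on every metric
        # and strictly better on at least one.
        return all(a <= b for a, b in zip(costs_a, costs_b)) and \
               any(a < b for a, b in zip(costs_a, costs_b))

    def pareto_frontier(plans):
        # Keep only plans that no other plan dominates.
        return {name: costs for name, costs in plans.items()
                if not any(dominates(other, costs)
                           for other_name, other in plans.items()
                           if other_name != name)}

    # Cost vectors: (estimated execution time in seconds, memory footprint in MB).
    plans = {
        "hash_join":          (2.0, 800.0),
        "index_nested_loops": (5.0, 100.0),
        "sort_merge":         (6.0, 900.0),  # dominated by hash_join
    }

    print(pareto_frontier(plans))  # keeps hash_join and index_nested_loops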

In my dissertation, I address several extended variants of query optimization for which no practical algorithms were previously available. I also show how to optimize queries in the traditional query optimization setting that are far too large for prior methods. I propose a variety of approaches for making query optimization more efficient; they can be categorized into three high-level ideas, as illustrated in the overview figure above. First, if queries correspond to query templates that are known in advance, we can make optimization a pre-processing step. Even though optimization may take a long time, it does not add any run time overhead. Second, we can speed up optimization by relaxing optimality guarantees. We have published approximation schemes, incremental algorithms, and randomized algorithms for novel query optimization variants. Finally, we can decrease optimization time by leveraging novel software and hardware platforms for optimization. We have recently shown how to leverage integer programming solvers, massively parallel computation clusters, and quantum annealers for query optimization variants. In many scenarios, our approaches reduce optimization times from hours or even years to just minutes or even seconds.
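
As one generic illustration of the second idea (trading optimality guarantees for speed), the sketch below applies randomized hill climbing to join ordering; the cost model, cardinalities, and moves are simplified placeholders and do not correspond to any specific algorithm from the dissertation.

    import random

    # Illustrative relation cardinalities (made up).
    CARDINALITIES = {"R": 1_000, "S": 50_000, "T": 200, "U": 10_000}

    def plan_cost(join_order):
        # Toy cost model: sum of intermediate result sizes of a left-deep plan,
        # assuming a fixed join selectivity. Real optimizers use far richer models.
        selectivity = 0.001
        cost, intermediate = 0.0, CARDINALITIES[join_order[0]]
        for relation in join_order[1:]:
            intermediate = intermediate * CARDINALITIES[relation] * selectivity
            cost += intermediate
        return cost

    def random_local_search(relations, iterations=1000, seed=42):
        # Randomized hill climbing: swap two relations in the join order and
        # keep the new order whenever it lowers the estimated cost.
        rng = random.Random(seed)
        order = list(relations)
        rng.shuffle(order)
        best_cost = plan_cost(order)
        for _ in range(iterations):
            i, j = rng.sample(range(len(order)), 2)
            order[i], order[j] = order[j], order[i]
            cost = plan_cost(order)
            if cost < best_cost:
                best_cost = cost
            else:
                order[i], order[j] = order[j], order[i]  # undo the swap
        return order, best_cost

    print(random_local_search(["R", "S", "T", "U"]))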

Data Vocalization

Driven by advances in speech recognition and synthesis, the interaction between users and computers is currently shifting towards voice-based interfaces. Several major IT companies have recently presented devices and tools that use voice as the primary communication medium. Examples include, but are not limited to, Google Home, Amazon Echo, and Apple's Siri. Those and other devices often need to communicate relational data to users (e.g., Google Home returns structured search results via voice output). There is a large body of research on how to optimally represent relational data to users. However, prior work focuses nearly exclusively on visual data representations, which have been dominant for a long time.

In this project, we study the question of how to vocalize data, i.e., how to translate data into optimal voice output. Voice output has to be extremely efficient. While users can quickly identify relevant parts in a plot or written text themselves (via skimming), they have to trust the computer to select only the most relevant information for voice output. Voice output transmits information rather slowly (compared to reading text); still, it has to be short (so as not to exceed the user's attention span) and simply structured (to limit the cognitive load on the listener). We treat voice output generation as a global optimization problem in which various metrics (e.g., speaking time, precision of transmitted information, cognitive load on the listener) come into play.
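
The toy sketch below illustrates this multi-metric view; the candidate descriptions, metric estimates, and weights are invented for illustration and are not part of CiceroDB.

    # Toy illustration: pick the voice description that minimizes a weighted
    # combination of speaking time, information loss, and cognitive load.
    # All candidates, metric estimates, and weights below are invented.

    CANDIDATES = [
        # (description, speaking time in seconds, information loss 0-1, cognitive load 0-1)
        ("The average price is 23 dollars.",                     2.0, 0.40, 0.10),
        ("Prices range from 5 to 80 dollars, averaging 23.",     4.0, 0.20, 0.30),
        ("Prices are 5, 7, 12, 23, 31, 44, 58, and 80 dollars.", 8.0, 0.00, 0.90),
    ]

    WEIGHTS = (0.3, 5.0, 1.0)  # relative importance of the three metrics

    def output_cost(candidate):
        _, time_s, info_loss, cognitive_load = candidate
        w_time, w_loss, w_load = WEIGHTS
        return w_time * time_s + w_loss * info_loss + w_load * cognitive_load

    best = min(CANDIDATES, key=output_cost)
    print(best[0])  # with these weights: the medium-detail description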

We are currently studying data vocalization in different scenarios, characterized by data type, size, and user context, among other criteria. We are also studying how to optimally support voice output via data processing, i.e., how to avoid generating results at a level of detail that cannot be transmitted via speech. The results of this research project are integrated into CiceroDB, a prototypical database system that is designed from the ground up for efficient and effective voice output.
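
As a toy illustration of trimming result detail for speech (not an actual CiceroDB component), the sketch below collapses a long list of per-category counts into the few largest categories plus one aggregated remainder before handing it to voice output.

    # Toy illustration: coarsen a query result so it stays speakable.
    # Categories and counts are invented.
    result = {"sedan": 420, "suv": 310, "hatchback": 120,
              "coupe": 45, "wagon": 30, "convertible": 15}

    def coarsen(counts, max_items=3):
        # Keep the largest categories and merge the rest into one "other" bucket.
        ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
        head, tail = ranked[:max_items], ranked[max_items:]
        if tail:
            head.append(("other", sum(v for _, v in tail)))
        return head

    speakable = coarsen(result)
    print("; ".join(f"{name}: {count}" for name, count in speakable))
    # prints "sedan: 420; suv: 310; hatchback: 120; other: 90"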

Fact Checking

Data sets are often summarized via natural language text documents. Examples include newspaper articles by data journalists, scientific papers summarizing experimental results, and business reports summarizing quarterly sales. A majority of the population never accesses raw relational data but relies on text summaries alone. In that context, the following question arises: how can we verify that such summaries are consistent with the data?

We are developing approaches for automated and semi-automated fact checking of data summaries to answer that question. A text document, together with an associated data set, forms the input for fact checking. Our goal is to identify erroneous claims about the data in the input text. More precisely, we focus on text passages that can be translated into a pair consisting of an SQL query and a claimed query result. A claim is erroneous if evaluating the query yields a result that cannot be rounded to the one claimed in the text.
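
A minimal sketch of this correctness criterion follows; the rounding rule below (infer the claim's precision from its decimal places and round half up) is a plausible simplification for illustration, not necessarily the exact rule used in our system.

    from decimal import Decimal, ROUND_HALF_UP

    def claim_consistent(actual, claimed_text):
        # Accept the claim if the actual query result, rounded to the precision
        # of the claimed value, matches that value. Simplified for illustration.
        claimed = Decimal(claimed_text)
        exponent = claimed.as_tuple().exponent   # e.g. -1 for "3.5", 0 for "42"
        quantum = Decimal(1).scaleb(exponent)    # rounding unit matching the claim
        rounded = Decimal(str(actual)).quantize(quantum, rounding=ROUND_HALF_UP)
        return rounded == claimed

    print(claim_consistent(3.456, "3.5"))  # True: 3.456 rounds to 3.5
    print(claim_consistent(3.456, "3.4"))  # False
    print(claim_consistent(41.7, "42"))    # True: rounds to the claimed integer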

In our first project in this space, we have developed a "fact checker" tool that supports authors in producing accurate data summaries. The tool is similar in spirit to a spell checker: where a spell checker helps users avoid spelling and grammatical mistakes, the fact checker helps them avoid erroneous claims. We focus on a restricted class of claims that are at the same time common and error-prone. The fact checker translates text passages into equivalent SQL queries, evaluates them on a database, and marks up potentially erroneous claims. Users obtain a natural language explanation summarizing the system's interpretation of specific text passages and can easily take corrective actions if necessary. We have recently used this tool to identify erroneous claims, some of which had gone unnoticed for years, in articles from several major newspapers.
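
The following self-contained sketch mimics that workflow on a toy SQLite database; the table, the claims, and their query translations are invented, and the real system's translation from text to SQL is considerably more involved.

    import sqlite3

    # Toy database standing in for the data set behind a summary article.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("north", 120.0), ("south", 80.0), ("west", 95.0)])

    # Each claim: a text passage, the SQL query it was translated to,
    # and the value claimed in the text (all invented for illustration).
    claims = [
        ("Average revenue per region was about 98.",
         "SELECT AVG(revenue) FROM sales", 98.0),
        ("The northern region brought in 150 in revenue.",
         "SELECT revenue FROM sales WHERE region = 'north'", 150.0),
    ]

    for passage, query, claimed in claims:
        (actual,) = conn.execute(query).fetchone()
        # Simplified consistency test: accept if the result rounds to the claim.
        marker = "OK" if round(actual) == round(claimed) else "CHECK"
        print(f"[{marker}] {passage} (query result: {actual:.1f})")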

Quantum Computing

Quantum computing is currently reaching a tipping point where a purely theoretical research area is turning into one that allows empirical evaluation. Several companies have recently presented the first hardware devices that are claimed to realize at least a limited form of quantum computing. We currently have access to a D-Wave 2X adiabatic quantum annealer with over 1000 qubits. Such machines solve NP-hard optimization problems. Using them in practice is, however, often challenging and requires problem-specific transformation and mapping methods. Our goal is to find out if and how such machines can be used to solve optimization problems that arise in the context of data analysis and data management.
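
For context, the annealer accepts problems in QUBO form (quadratic unconstrained binary optimization), i.e., it minimizes a quadratic function over binary variables. The sketch below shows that input format together with a brute-force reference solver; the coefficients are made up, and a real annealer additionally imposes hardware connectivity constraints.

    from itertools import product

    # A QUBO instance: minimize sum_i a_i * x_i + sum_{i<j} b_ij * x_i * x_j
    # over binary variables x_i. The coefficients below are made up.
    linear = {0: -1.0, 1: -1.0, 2: 2.0}
    quadratic = {(0, 1): 3.0, (1, 2): -2.0}

    def energy(assignment):
        # Energy of one binary assignment under the QUBO coefficients.
        value = sum(coeff * assignment[i] for i, coeff in linear.items())
        value += sum(coeff * assignment[i] * assignment[j]
                     for (i, j), coeff in quadratic.items())
        return value

    # Brute force over all assignments (feasible only for tiny instances;
    # the annealer searches this space physically).
    best = min(product((0, 1), repeat=len(linear)), key=energy)
    print(best, energy(best))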

I recently presented the first paper on quantum computing at the VLDB conference (see the talk video below). In this paper, we focus on a classical optimization problem from the database domain, the problem of multiple query optimization. We show how to map instances of that problem to the restrictive input format supported by the quantum annealer. Doing so is challenging, mainly due to the limited number of qubits and the sparse connections between qubits. We describe problem representations with low asymptotic qubit growth that map well to the sparse connection structure between qubits. Also, we introduce an algorithm that calculates such mappings efficiently at run time. Finally, we present experimental results obtained on the quantum annealer, thereby evaluating the practicality and performance of the proposed method.
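
To give a flavor of such a mapping (with made-up plan costs and sharing savings; the actual encoding in the paper is more refined and also addresses the hardware's sparse qubit connectivity), the sketch below turns a tiny multiple query optimization instance into QUBO coefficients: binary variables indicate plan choices, plan costs and sharing savings become linear and quadratic terms, and a penalty enforces that exactly one plan is selected per query.

    # Toy mapping of multiple query optimization (MQO) to QUBO coefficients.
    # One binary variable per (query, candidate plan) pair; all numbers are made up.

    plan_costs = {           # cost of executing each candidate plan
        ("q1", "p1"): 5.0, ("q1", "p2"): 7.0,
        ("q2", "p3"): 6.0, ("q2", "p4"): 4.0,
    }
    sharing_savings = {      # cost saved when two plans share intermediate results
        (("q1", "p2"), ("q2", "p3")): 5.0,
    }
    PENALTY = 100.0          # weight enforcing "exactly one plan per query"

    variables = list(plan_costs)
    linear, quadratic = {}, {}

    # Plan costs become linear terms. The per-query constraint
    # (sum of that query's variables equals 1) is added as the quadratic penalty
    # PENALTY * (sum - 1)^2, expanded into linear and pairwise terms
    # (using x^2 = x for binary x; the constant offset is dropped).
    for var, cost in plan_costs.items():
        linear[var] = cost - PENALTY
    for a in variables:
        for b in variables:
            if a < b and a[0] == b[0]:       # two plans of the same query
                quadratic[(a, b)] = 2 * PENALTY
    # Sharing savings become negative quadratic terms between plans of different queries.
    for pair, saving in sharing_savings.items():
        quadratic[pair] = quadratic.get(pair, 0.0) - saving

    print(linear)
    print(quadratic)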


Text Mining and Machine Learning

Search engine providers such as Google regularly receive queries that contain subjective predicates. Users query, for instance, for "big cities", "cute animals", or "safe cities". In order to answer those queries from structured data, search engine providers need to understand which subjective properties the average user associates with which entities. This was the motivation for a project that I recently concluded at Google Mountain View. During this project, we developed the Surveyor system, which mines the entire Web to find billions of subjective associations.

We use natural language analysis to identify statements in Web text that express an opinion about whether or not a specific property applies to a specific entity. As we consider subjective properties, user opinions diverge, and we need to resolve conflicting opinions in order to correctly infer the majority opinion. This is surprisingly difficult: simple resolution strategies (e.g., taking the majority vote among conflicting opinions) lead to poor results, i.e., a poor match with opinions collected from test users. The reason is various types of skew that influence the probability that users express a certain opinion on the Web. For instance, users who think that a certain city is big are more likely to write about it on the Web. Hence, they are always overrepresented in a sample of opinions collected from the Web. We overcome those challenges by learning property- and entity-type-specific user behavior models via unsupervised machine learning, using an expectation-maximization approach. We show in a large user study that those models let us infer the opinions of the average user quite reliably. The system is described in more detail in the corresponding SIGMOD 2015 talk:
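
To give a flavor of the expectation-maximization idea, here is a generic sketch with made-up statement counts (it is not the actual Surveyor model): a two-component mixture jointly estimates, per entity, the probability that the property truly applies and, at the same time, how often positive statements appear even for entities where it does not, which captures reporting skew.

    import math

    # Observed (positive, negative) statement counts per entity (made up).
    observations = {
        "tokyo":     (90, 10),
        "berlin":    (40, 30),
        "reykjavik": (15, 60),
        "zurich":    (20, 55),
    }

    def em(observations, iterations=50):
        # Two-component binomial mixture fitted with expectation-maximization.
        # Component "true": property applies, positive statements are likely.
        # Component "false": property does not apply, but skew still yields positives.
        prior, p_pos_true, p_pos_false = 0.5, 0.8, 0.3  # initial guesses
        for _ in range(iterations):
            # E-step: posterior probability that the property applies to each entity.
            resp = {}
            for entity, (pos, neg) in observations.items():
                log_t = math.log(prior) + pos * math.log(p_pos_true) \
                        + neg * math.log(1 - p_pos_true)
                log_f = math.log(1 - prior) + pos * math.log(p_pos_false) \
                        + neg * math.log(1 - p_pos_false)
                m = max(log_t, log_f)
                resp[entity] = math.exp(log_t - m) / (math.exp(log_t - m)
                                                      + math.exp(log_f - m))
            # M-step: re-estimate the prior and the two positive-statement rates.
            prior = sum(resp.values()) / len(resp)
            num_t = den_t = num_f = den_f = 0.0
            for entity, (pos, neg) in observations.items():
                r = resp[entity]
                num_t += r * pos
                den_t += r * (pos + neg)
                num_f += (1 - r) * pos
                den_f += (1 - r) * (pos + neg)
            p_pos_true, p_pos_false = num_t / den_t, num_f / den_f
        return resp, p_pos_true, p_pos_false

    resp, p_true, p_false = em(observations)
    print({entity: round(r, 2) for entity, r in resp.items()})
    print(round(p_true, 2), round(p_false, 2))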