Publications (Hide summaries) (Show abstracts)

2024

Relational Diagrams project page
On the Reasonable Effectiveness of Relational Diagrams: Explaining Relational Query Patterns and the Pattern Expressiveness of Relational Languages
Wolfgang Gatterbauer, Cody Dunne
SIGMOD 2024 best paper honorable mention (announcement)
Proposes a semantic definition of relational query patterns, which allows us to analyze the relative pattern expressiveness of relational query languages. Also proposes "relational diagrams", a natural diagrammatic representation of tuple relational calculus.
Comparing relational languages by their logical expressiveness is well understood. Less well understood is how to compare relational languages by their ability to represent relational query patterns. Indeed, what are query patterns other than “a certain way of writing a query”? And how can query patterns be defined across procedural and declarative languages, irrespective of their syntax? To the best of our knowledge, we provide the first semantic definition of relational query patterns by using a variant of structure-preserving mappings between the relational tables of queries. This formalism allows us to analyze the relative pattern expressiveness of relational language fragments and create a hierarchy of languages with equal logical expressiveness yet different pattern expressiveness. Notably, for the non-disjunctive language fragment, we show that relational calculus can express a larger class of patterns than the basic operators of relational algebra.
Our language-independent definition of query patterns opens novel paths for assisting database users. For example, these patterns could be leveraged to create visual query representations that faithfully represent query patterns, speed up interpretation, and provide visual feedback during query editing. As a concrete example, we propose Relational Diagrams, a complete and sound diagrammatic representation of safe relational calculus that is provably (𝑖) unambiguous, (𝑖𝑖) relationally complete, and (𝑖𝑖𝑖) able to represent all query patterns for unions of non-disjunctive queries. Among all diagrammatic representations for relational queries that we are aware of, ours is the only one with these three properties. Furthermore, our anonymously preregistered user study shows that Relational Diagrams allow users to recognize patterns meaningfully faster and more accurately than
Relational Diagrams project page
A Comprehensive Tutorial on over 100 years of Diagrammatic Representations of Logical Statements and Relational Queries
Wolfgang Gatterbauer
ICDE 2024.
Surveys the key visual metaphors developed for visual representations of relational expressions.
A Unified Approach for Resilience and Causal Responsibility with Integer Linear Programming (ILP) and LP Relaxations
Neha Makhija, Wolfgang Gatterbauer
Proposes a unified approach for solving reverse data management problems (including various interventions for fairness): all cases (including self-joins, bags, PTIME or NP-complete cases) are solved with the same algorithm. The algorithm just "happens" to terminate in PTIME for all known PTIME cases. Also suggests a non-trivial Disjunctive Logic Program that can automatically find "hardness certificates" (think gadgets) for conjunctive queries.
Resilience is one of the key algorithmic problems underlying various forms of reverse data management (such as view maintenance, deletion propagation, and various interventions for fairness): What is the minimal number of tuples to delete from a database in order to remove all answers from a query? A long-open question is determining those conjunctive queries (CQs) for which this problem can be solved in guaranteed PTIME. We shed new light on this and the related problem of causal responsibility by proposing a unified Integer Linear Programming (ILP) formulation. It is unified in that it can solve both prior studied restrictions (e.g., self-join-free CQs under set semantics that allow a PTIME solution) and new cases (e.g., all CQs under set or bag semantics It is also unified in that all queries and all instances are treated with the same approach, and the algorithm is guaranteed to terminate in PTIME for the easy cases. We prove that, for all easy self-join-free CQs, the Linear Programming (LP) relaxation of our encoding is identical to the ILP solution and thus standard ILP solvers are guaranteed to return the solution in PTIME. Our approach opens up the door to new variants and new fine-grained analysis: 1) It also works under bag semantics and we give the first dichotomy result for bags semantics in the problem space. 2) We give a more fine-grained analysis of the complexity of causal responsibility. 3) We recover easy instances for generally hard queries, such as instances with read-once provenance and instances that become easy because of Functional Dependencies in the data. 4) We solve an open conjecture from PODS 2020. 5) Experiments confirm that our results indeed predict the asymptotic running times, and that our universal ILP encoding is at times even faster to solve for the PTIME cases than a prior proposed dedicated flow algorithm.
Factorization
Minimally Factorizing the Provenance of Self-Join Free Conjunctive Queries
Neha Makhija, Wolfgang Gatterbauer
Proposes the problem of finding the minimal-size factorization of the provenance of queries. Shows it to be a natural generalization and to recover all known PTIME cases of the problem of exact probabilistic inference. Makes first steps towards a dichotomy of the data complexity of the problem.
We consider the problem of finding the minimal-size factorization of the provenance of self-join-free conjunctive queries, i.e., we want to find an equivalent propositional formula that minimizes the number of variable occurrences. Our work is partly motivated from probabilistic inference where read-once formulas are known to allow exact PTIME solutions and non-read-once formulas allow approximate solutions with an error that depends on the number of repetitions of variables. We embark on the challenge of characterizing the data complexity of this problem and show its connection to the query resilience problem. While the problem is NP-complete in general, we develop an encoding as max-flow problem that is guaranteed to give the exact solution for several queries (and otherwise approximate minimizations). We show that our encoding is guaranteed to return a read-once factorization if it exists. Our problem and approach is a complete solution that naturally recovers exact solutions for all known PTIME cases, as well as identifying additional queries for which the problem can be solved in PTIME.
HITSNDIFFS: From Truth Discovery to Ability Discovery by Recovering Matrices with the Consecutive Ones Property
Zixuan Chen, Subhodeep Mitra, R Ravi, Wolfgang Gatterbauer
ICDE 2024.
Unifies work on matrices with the Consecutive Ones Property (C1P) with Item Response Theory (IRT) by proposing a new variant of HITS which is guaranteed to return a C1P permutation of a matrix if it exists. We show the algorithm's potential for "ability discovery", the dual problem of "truth discovery" in the context of crowd sourcing.
We analyze a general problem in a crowd-sourced setting where one user asks a question (also called item) and other users return answers (also called labels) for this question. Different from existing crowd sourcing work which focuses on finding the most appropriate label for the question (the “truth”), our problem is to determine a ranking of the users based on their ability to answer questions. We call this problem “ability discovery” to emphasize the connection to and duality with the more well-studied problem of “truth discovery”.
To model items and their labels in a principled way, we draw upon Item Response Theory (IRT) which is the widely accepted theory behind standardized tests such as SAT and GRE. We start from an idealized setting where the relative performance of users is consistent across items and better users choose better fitting labels for each item. We posit that a principled algorithmic solution to our more general problem should solve this ideal setting correctly and observe that the response matrices in this setting obey the Consecutive Ones Property (C1P). While C1P is well understood algorithmically with various discrete algorithms, we devise a novel variant of the HITS algorithm which we call “HITSnDIFFs” (or HnD), and prove that it can recover the ideal C1P-permutation in case it exists. Unlike fast combinatorial algorithms for finding the consecutive ones permutation (if it exists), H N D also returns an ordering when such a permutation does not exist. Thus it provides a principled heuristic for our problem that is guaranteed to return the correct answer in the ideal setting. Our experiments show that H N D produces user rankings with robustly high accuracy compared to state-of-the-art truth discovery methods. We also show that our novel variant of HITS scales better in the number of users than ABH, the only prior spectral C1P reconstruction algorithm.

2023

Relational Diagrams project page
A Tutorial on Visual Representations of Relational Queries
Wolfgang Gatterbauer
Surveys the key visual metaphors developed for visual representations of relational expressions.
Query formulation is increasingly performed by systems that need to guess a user’s intent (e.g. via spoken word interfaces). But how can a user know that the computational agent is returning answers to the “right” query? More generally, given that relational queries can become pretty complicated, how can we help users understand existing relational queries, whether human-generated or automatically generated? Now seems the right moment to revisit a topic that predates the birth of the relational model: developing visual metaphors that help users understand relational queries.
This lecture-style tutorial surveys the key visual metaphors developed for visual representations of relational expressions. We will survey the history and state-of-the art of relationally-complete diagrammatic representations of relational queries, discuss the key visual metaphors developed in over a century of investigating diagrammatic languages, and organize the landscape by mapping their used visual alphabets to the syntax and semantics of Relational Algebra (RA) and Relational Calculus (RC).
Towards Unbiased Exploration in Partial Label Learning
Zsolt Zombori, Agapi Rissaki, Kristóf Szabó, Wolfgang Gatterbauer, Michael Benedikt
Shows that standard cross-entropy as loss function can lead to a winner-takes-all phenomenon for supervised learning under random initialization. Proposes a variant of entropy regularization that fixes the problem for disjunctive supervision.
We consider learning a probabilistic classifier from partially-labelled supervision (inputs denoted with multiple possibilities) using standard neural architectures with a softmax as the final layer. We identify a bias phenomenon that can arise from the softmax layer in even simple architectures that prevents proper exploration of alternative options, making the dynamics of gradient descent overly sensitive to initialisation. We introduce a novel loss function that allows for unbiased exploration within the space of alternative outputs. We give a theoretical justification for our loss function, and provide an extensive evaluation of its impact on synthetic data, on standard partially labelled benchmarks and on a contributed novel benchmark related to an existing rule learning challenge.
Table-as-query
DomainNet: Homograph Detection and Understanding in Data Lake Disambiguation
Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald
Invited version from EDBT 2021 best paper award
Proposes an approach to detect homographs (i.e. data values with multiple meanings) in table repositories. Our idea is simple and surprisingly effective: we apply ideas from community detection in networks to a network representation of tables.
Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values.
DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs has a precision and a recall of 38% versus 69% with our method on a synthetic benchmark. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake our top200 precision is 89%.
Table-as-query
SANTOS: Relationship-based Semantic Table Union Search
Aamod Khatiwada, Grace Fan, Zixuan Chen, Roee Shraga, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald
Proposes the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Interestingly, the necessary synthetic knowledge base can be created from the very data lake we are trying to integrate.
Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover semantic relationship between pairs of columns: The first uses an existing knowledge base (KB), the second (which we call a “synthesized KB”) uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm called SANTOS outperforms a state- of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically in all benchmarks that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating a synthesized KB from the data lakes having limited KB coverage and perform union search over them.
Ranked Enumeration
Efficient Computation of Quantiles over Joins
Nikolaos Tziavelis, Nofar Carmeli, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald
Introduces the concept of quantile queries (%JQs) and gives new complexity results: %JQs ask for the answer at some specified relative position (e.g., 50% for the median) under some ordering over the answers to a Join Query (JQ).
We present efficient algorithms for Quantile Join Queries, abbreviated as %JQ. A %JQ asks for the answer at some specified relative position (e.g., 50% for the median) under some ordering over the answers to a Join Query (JQ). Our goal is to avoid materializing the set of all query answers, and to achieve quasilinear time in the size of the database, regardless of the total number of answers. A recent dichotomy result rules out the existence of such an algorithm for a general family of queries and orders. Specifically, for acyclic JQs without self-joins, the problem becomes intractable for ordering by sum whenever we are essentially joining more than two relations. Moreover, even for basic ranking functions beyond sum, such as min or max over different attributes, it has not been known so far whether there is any nontrivial tractable %JQ.
In this work, we develop a new approach to solving %JQ and show how this approach allows not just to recover known results, but also generalize them and resolve open cases. Our solution uses two subroutines: The first one needs to select what we call a ``pivot answer.'' The second subroutine partitions the space of query answers according to this pivot, and continues searching in one partition that is represented as new %JQ over a new database. For pivot selection, we develop an algorithm that, under mild assumptions of monotonicity, works for every ranking function. The second subroutine requires a customized construction for the specific ranking function at hand.
We show the benefit and generality of our approach by using it to establish several new complexity results. First, we prove the tractability of min and max for all acyclic JQs, thereby resolving the above question. Second, we extend the previous dichotomy for sum to all partial sums (over all subsets of the attributes). Third, we handle the intractable cases of sum by devising a deterministic approximation scheme that applies to every acyclic JQ.
Ranked Enumeration
Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries
Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald
Invited version from best of PODS 2021
Builds upon our PODS'21 work on the complexity of direct access to a ranked list of answers to a database query. Establishes a complete dichotomy of the selection problem with lexicographic orders.
We study the question of when we can answer a Conjunctive Query (CQ) with an ordering over the answers by constructing a structure for direct (random) access to the sorted list of answers, without actually materializing this list, so that the construction time is linear (or quasilinear) in the size of the database. In the absence of answer ordering, such a construction has been devised for the task of enumerating query answers of free-connex acyclic CQs, so that the access time is logarithmic. Moreover, it follows from past results that within the class of CQs without self-joins, being free-connex acyclic is necessary for the existence of such a construction (under conventional assumptions in fine-grained complexity).
In this work, we embark on the challenge of identifying the answer orderings that allow for ranked direct access with the above complexity guarantees. We begin with the class of lexicographic orderings and give a decidable characterization of the class of feasible such orderings for every CQ without self-joins. We then continue to the more general case of orderings by the sum of attribute scores. As it turns out, in this case ranked direct access is feasible only in trivial cases. Hence, to better understand the computational challenge at hand, we consider the more modest task of providing access to only one single answer (i.e., finding the answer at a given position). We indeed achieve a quasilinear-time algorithm for a subset of the class of full CQs without self-joins, by adopting a solution of Frederickson and Johnson to the classic problem of selection over sorted matrices. We further prove that none of the other queries in this class admit such an algorithm.

2022

Table-as-query
Integrating Data Lake Tables
Aamod Khatiwada, Roee Shraga, Wolfgang Gatterbauer, Renée J. Miller
Proposes a heuristic algorithm to compute Full Disjunction (an associative variant of outer joins) by using complementation and subsumption operators.
We have made tremendous strides in providing tools for data scientists to discover new tables useful for their analyses. But despite these advances, the proper integration of discovered tables has been under-explored. An interesting semantics for integration, called Full Disjunction, was proposed in the 1980’s, but there has been little progress in using it for data science to integrate tables culled from data lakes. We provide ALITE, a new proposal for integration of tables that may have been discovered using join, union or related table search. We empirically show that ALITE can outperform previous algorithms for computing the Full Disjunction. ALITE relaxes previous assumptions that tables share common attribute names (which completely determine the join columns), are complete (without null values), and have acyclic join patterns. To evaluate ALITE, we develop and share three new benchmarks for integration that use real data lake tables.
QueryVis project page
Principles of Query Visualization
Wolfgang Gatterbauer, Cody Dunne, H.V. Jagadish, Mirek Riedewald
Discusses the principles of relational query visualization and its potential for simplifying user interactions with relational data
Query Visualization (QV) is the problem of transforming a given query into a graphical representation that helps humans understand its meaning. This task is notably different from designing a Visual Query Language (VQL) that helps a user compose a query. This article discusses the principles of relational query visualization and its potential for simplifying user interactions with relational data.
QueryVis project page
Interpreting and understanding relational database queries using diagrams
Wolfgang Gatterbauer
Ranked Enumeration
Any-k Algorithms for Enumerating Ranked Answers to Conjunctive Queries
Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald
Proposes the currently fastest ranked enumeration approach called Anyk-Part++. This approach asymptotically dominates all prior approaches both for TTF (Time-To-First) and TTL (Time-To-Last).
We study ranked enumeration for Conjunctive Queries (CQs) where the answers are ordered by a given ranking function (e.g., an ORDER BY clause in SQL). We develop "any-k" algorithms which, without knowing the number k of desired answers, push the ranking into joins and avoid materializing the join output earlier than necessary. For this to be possible, the ranking function needs to obey a certain kind of monotonicity; the supported ranking functions include the common sum-of-weights case where query answers are compared by sums of input weights, as well as any commutative selective dioid. One core insight of our work is that the problem is closely related to the fundamental task of path enumeration in a weighted DAG. We generalize and improve upon classic research on finding the k'th shortest path and unify into the same framework several solutions from different areas that had been studied in isolation. For the time to the k'th ranked CQ answer (for every value of k), our approach is optimal in data complexity precisely for the same class of queries where unranked enumeration is optimal -- and only slower by a logarithmic factor. In a more careful analysis of combined complexity, we uncover a previously unknown tradeoff between two different any-k algorithms: one has lower complexity when the number of returned results is small, the other when the number is very large. This tradeoff is eliminated under a stricter monotonicity property that we exploit to design a novel algorithm that asymptotically dominates all previously known alternatives, including the well-known algorithm of Eppstein for sum-of-weights path enumeration. We empirically demonstrate the findings of our theoretical analysis in an experimental study that highlights the superiority of our approach over the join-then-rank approach that existing database systems typically follow.
Ranked Enumeration
Toward Responsive DBMS: Optimal Join Algorithms, Enumeration, Factorization, Ranking, and Dynamic Programming
Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald
This tutorial shows interesting connections between a wide range of fundamental problems in computer science, including dynamic programming, factorized representations, join algorithms, and answer enumeration.
When processing join queries over big data, a DBMS can become unresponsive, i.e., it takes very long until any output tuples appear. Ranked enumeration addresses this problem by attempting to return the most important answers as quickly as possible, ideally in time that is linear (or quasilinear) in input size, even if the complete output is much larger. Aside from its practical usefulness, ranked enumeration is closely related to, and in a way unifies, several other problems involving joins. The common goal is the design of optimal algorithms that are guaranteed to avoid large intermediate results and thus achieve time or space complexity close to a lower bound. Arguably, avoiding query plans that produce huge intermediate results has been an overarching goal of database optimizers, which is part of the reason why optimal join algorithms, enumeration, and factorized representations have generated a lot of excitement. In this tutorial, we embark on an exploration of these topics, showing how they are intimately connected with a wide range of fundamental problems in computer science.

2021

Ranked Enumeration
Beyond Equi-joins: Ranking, Enumeration and Factorization
Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald
Develops methods for ranked enumeration for answers to queries with general join predicates involving conjunctions and disjunctions of inequalities. Surprisingly, the suggested solution comes to within a polylogarithmic factor of the best-known guarantee for equi-joins.
We study full acyclic join queries with general join predicates that involve conjunctions and disjunctions of inequalities, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach offers strong time and space complexity guarantees in the standard RAM model of computation, getting surprisingly close to those of equi-joins. With n denoting the number of tuples in the database, we guarantee that for every value of k, the k top-ranked answers are returned in O(n polylog n + k log k) time and O(n polylog n + k) space. This is within a polylogarithmic factor of the best-known guarantee for equi-joins and even O(n+k), the time it takes to look at the input and output k answers. The key ingredient is an O(n polylog n)-size factorized representation of the query output, which is constructed on-the-fly for a given query and database. As a side benefit, our techniques are also applicable to unranked enumeration (where answers can be returned in any order) for joins with inequalities, returning k answers in O(n polylog n + k). This guarantee improves over the state of the art for large values of k. In an experimental study, we show that our ranked-enumeration approach is not only theoretically interesting, but also fast and memory-efficient in practice.
Ranked Enumeration
Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries
Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald
Selected among best of PODS 2021
Poses a new question and gives partial answers: When can we allow for direct access to a ranked list of answers to a database query without (and by being considerably faster than) materializing all answers?
We study the question of when we can answer a Conjunctive Query (CQ) with an ordering over the answers by constructing a structure for direct (random) access to the sorted list of answers, without actually materializing this list, so that the construction time is linear (or quasilinear) in the size of the database. In the absence of answer ordering, such a construction has been devised for the task of enumerating query answers of free-connex acyclic CQs, so that the access time is logarithmic. Moreover, it follows from past results that within the class of CQs without self-joins, being free-connex acyclic is necessary for the existence of such a construction (under conventional assumptions in fine-grained complexity).
In this work, we embark on the challenge of identifying the answer orderings that allow for ranked direct access with the above complexity guarantees. We begin with the class of lexicographic orderings and give a decidable characterization of the class of feasible such orderings for every CQ without self-joins. We then continue to the more general case of orderings by the sum of attribute scores. As it turns out, in this case ranked direct access is feasible only in trivial cases. Hence, to better understand the computational challenge at hand, we consider the more modest task of providing access to only one single answer (i.e., finding the answer at a given position). We indeed achieve a quasilinear-time algorithm for a subset of the class of full CQs without self-joins, by adopting a solution of Frederickson and Johnson to the classic problem of selection over sorted matrices. We further prove that none of the other queries in this class admit such an algorithm.
Table-as-query
DomainNet: Homograph Detection for Data Lake Disambiguation
Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald
EDBT 2021 best paper award (announcement)
Proposes an approach to detect homographs (i.e. data values with multiple meanings) in table repositories. Our idea is simple and surprisingly effective: we apply ideas from community detection in networks to a network representation of tables.
Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values.
DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs has a precision and a recall of 38% versus 69% with our method on a synthetic benchmark. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake our top200 precision is 89%.
QueryVis project page
Stratisfimal Layout: A modular optimization model for laying out layered node-link network visualizations
Sara Di Bartolomeo, Mirek Riedewald, Wolfgang Gatterbauer, and Cody Dunne
Proposes a comprehensive approach to make complicated networks (such as those with layers and grouped nodes) easier to read. The idea is to formulate various readability criteria (notably crossing minimization and bendiness reduction) as a modular and customizable constraint optimization problem.
Node-link visualizations are a familiar and powerful tool for displaying the relationships in a network. The readability of these visualizations highly depends on the spatial layout used for the nodes. In this paper, we focus on computing layered layouts, in which nodes are aligned on a set of parallel axes to better expose hierarchical or sequential relationships. Heuristic-based layouts are widely used as they scale well to larger networks and usually create readable, albeit sub-optimal, visualizations. We instead use a layout optimization model that prioritizes optimality --- as compared to scalability --- because an optimal solution not only represents the best attainable result, but can also serve as a baseline to evaluate the effectiveness of layout heuristics. We take an important step towards powerful and flexible network visualization by proposing STRATIFIMAL LAYOUT, a modular integer-linear-programming formulation that can consider several important readability criteria simultaneously --- crossing reduction, edge bendiness, and nested and multi-layer groups. The layout can be adapted to diverse use cases through its modularity. Individual features can be enabled and customized depending on the application. We provide open-source and documented implementations of the layout, both for web-based and desktop visualizations. As a proof-of-concept, we apply it to the problem of visualizing complicated SQL queries, which have features that we believe cannot be addressed by existing layout optimization models. We also include a benchmark network generator and the results of an empirical evaluation to assess the performance trade-offs of our design choices.

2020

Ranked Enumeration
Optimal Join Algorithms meet Top-k
Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald
Presents a unified view for studying optimal join algorithms and top-k query evaluation.
Top-k queries have been studied intensively in the database community and they are an important means to reduce query cost when only the “best” or “most interesting” results are needed instead of the full output. While some optimality results exist, e.g., the famous Threshold Algorithm, they hold only in a fairly limited model of computation that does not account for the cost incurred by large intermediate results and hence is not aligned with typical database-optimizer cost models. On the other hand, the idea of avoiding large intermediate results is arguably the main goal of recent work on optimal join algorithms, which uses the standard RAM model of computation to determine algorithm complexity. This research has created a lot of excitement due to its promise of reducing the time complexity of join queries with cycles, but it has mostly focused on full-output computation. We argue that the two areas can and should be studied from a unified point of view in order to achieve optimality in the common model of computation for a very general class of top-k-style join queries. This tutorial has two main objectives. First, we will explore and contrast the main assumptions, concepts, and algorithmic achievements of the two research areas. Second, we will cover recent, as well as some older, approaches that emerged at the intersection to support efficient ranked enumeration of join-query results. These are related to classic work on k-shortest path algorithms and more general optimization problems, some of which dates back to the 1950s. We demonstrate that this line of research warrants renewed attention in the challenging context of ranked enumeration for general join queries.
Ranked Enumeration
Optimal Algorithms for Ranked Enumeration of Answers to Full Conjunctive Queries
Nikolaos Tziavelis, Deepak Ajwani, Wolfgang Gatterbauer, Mirek Riedewald, Xiaofeng Yang
First paper that generalizes dynamic programming to optimal ranked enumeration of answers. The theory applies to very general algebraic structures called "selective dioids" (semirings with an ordering property). This paper also applies the theory to the special case of ranked enumeration over full conjunctive queries. We call it "any-k" to emphasize the marriage of "anytime algorithms" with top-k. The full version also treats projections.
We study ranked enumeration of the results to a join query in order of decreasing importance, as imposed by an appropriate ranking function. Our main contribution is a framework for ranked enumeration over a class of Dynamic Programming problems that generalizes seemingly different problems that to date had been studied in isolation. To this end, we study and extend classic algorithms that find the k-shortest paths in a weighted graph. For full conjunctive queries, including cyclic ones, our approach is asymptotically optimal in terms of (1) time to return the top result, (2) delay between results, and (3) time until all results are returned in order. These optimality properties were derived for a complexity measure that has been widely used in the context of worst-case optimal join algorithms, but which abstracts away factors that only depend on query size and a polylogarithmic cost factor in input size. By performing a more detailed cost analysis, we are able to uncover a previously unknown tradeoff between two incomparable enumeration approaches: one has lower complexity when the number of returned results is small, the other when it is large. We demonstrate theoretically and empirically the superiority of our techniques to batch algorithms that produce the full result and then sort it. Interestingly, our technique is not only faster for returning the first few results, but even when all results are produced.
Algebraic Amplification
Factorized Graph Representations for Semi-Supervised Learning from Sparse Data
Krishna Kumar P., Paul Langton, Wolfgang Gatterbauer
SIGMOD reproducibility award (announcement)
Introduces the concept of "algebraic amplification" with a method for parameter learning that can amplify weak signals in extremely sparsely labeled data. The resulting end-to-end accuracy is quasi identical to using gold standard parameters, while time for learning is negligible compared to inference time.
Node classification is an important problem in graph data management. It is commonly solved by various label propagation methods that iteratively pass messages along edges, starting from a few labeled seed nodes. For graphs with arbitrary compatibilities between classes, these methods crucially depend on knowing the compatibility matrix, which must thus be provided by either domain experts or heuristics.
We instead suggest a principled and scalable method for directly estimating the compatibilities from a sparsely labeled graph. This method, which we call distant compatibility estimation, works even on extremely sparsely labeled graphs (e.g., 1 in 10,000 nodes is labeled) in a fraction of the time it later takes to label the remaining nodes. Our approach first creates multiple factorized graph representations (with size independent of the graph) and then performs estimation on these smaller graph sketches. We refer to algebraic amplification as the underlying idea of leveraging algebraic properties of an algorithm’s update equations to amplify sparse signals in data. We show that our estimator is by orders of magnitude faster than alternative approaches and that the end-to-end classification accuracy is comparable to using gold standard compatibilities. This makes it a cheap pre-processing step for any existing label propagation method and removes the current dependence on heuristics.
QueryVis project page
QueryVis: Logic-based Diagrams help Users Understand Complicated SQL Queries Faster
Aristotelis Leventidis, Jiahui Zhang, Cody Dunne, Wolfgang Gatterbauer, HV Jagadish, Mirek Riedewald
SIGMOD reproducibility award (announcement)
Shows that logical diagrams automatically created from SQL queries help users understand the queries faster and with fewer errors than SQL itself. Our ultimate goal is to allow users of SQL to reason about queries in terms of "diagrammatic SQL patterns" based on first-order logic.
Understanding the meaning of existing SQL queries is critical for maintaining and reusing them. Yet SQL can be hard to read, even for expert users, and even for the original creator of a query. We conjecture that it is possible to capture the logical intent of queries in automatically-generated visual diagrams that can help users understand the meaning of queries faster and more accurately than reading SQL alone.
We present initial steps in that direction with visual diagrams that are based on the first-order logic foundation of SQL and can capture a substantial fragment of SQL (in particular, deeply nested SQL queries). Our diagrams build upon a rich history of diagrammatic reasoning systems in logics and were designed using a large body of human-computer interaction best practices: they are minimal in that no visual element is superfluous; they are unambiguous in that no two queries with different semantics map to the same visualization; and they extend visual representations of relational schemata and conjunctive queries in a natural way. An experimental evaluation of our visual diagrams on Amazon Turk (n = 42) shows that with only a 2–3 minute static tutorial, participants could interpret queries meaningfully faster with our automatically-generated diagrams than when reading SQL alone. Moreover, we have some evidence that our visual diagrams result in participants making fewer errors than with SQL.
We believe that more regular exposure to diagrammatic representations of SQL can give rise to a pattern-based and thus more intuitive use and re-use of SQL.
Optimal bandjoins
Near-Optimal Distributed Band-Joins through Recursive Partitioning
Rundong Li, Wolfgang Gatterbauer, Mirek Riedewald
Proposes a method for performing band-joins that significantly improves over previous work and leads to near-optimal data partitions in a distributed environment. Extensive experiments show the approach to be within 10% of the theoretic lower bound for both maximum load per worker and input duplication for a broad range of settings.
We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per worker and input duplication for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings that were too restricted (resulting in suboptimal join performance).
Our main insight is that recursive partitioning of the join-attribute space with the appropriate split scoring measure can achieve both low optimization cost and low join cost. It is the first approach that is not only effective for one-dimensional band-joins but also for joins on multiple attributes. Experiments indicate that our method is able to find partitionings that are within 10% of the lower bound for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work.
Causality
New Results for the Complexity of Resilience for Binary Conjunctive Queries with Self-Joins
Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, Alexandra Meliou
Introduces "independent join paths" as a unifying hardness criterion for hardness of resilience for conjunctive queries with or without self-joins.
The resilience of a Boolean query is the minimum number of tuples that need to be deleted from the input tables in order to make the query false. A solution to this problem immediately translates into a solution for the more widely known problem of deletion propagation with source-side effects.
In this paper, we give several novel results on the hardness of the resilience problem for binary conjunctive queries with self-joins (i.e. conjunctive queries with relations of maximal arity 2) with one repeated relation. Unlike in the self-join free case, the concept of triad is not enough to fully characterize the complexity of resilience. We identify new structural properties, namely chains, confluences and permutations, which lead to various NP-hardness results. We also give novel involved reductions to network flow to show certain cases are in P. Overall, we give a dichotomy result for the restricted setting when one relation is repeated at most 2 times, and we cover many of the cases for 3.

2019

Anytime Approximation
Anytime Approximation in Probabilistic Databases via Scaled Dissociations
Maarten Van den Heuvel, Peter Ivanov, Wolfgang Gatterbauer, Floris Geerts, Martin Theobald
Develops a general anytime approximation schema for approximate probabilistic inference. Uses the TODS 2015 dissociation-based bounds that are provably tighter than model-based bounds. Experiments show speed up over prior model-based approximations by up to 3 orders of magnitude.
Speeding up probabilistic inference remains a key challenge in probabilistic databases (PDBs) and the related area of statistical relational learning (SRL). Since computing probabilities for query answers is #P-hard, even for fairly simple conjunctive queries, both the PDB and SRL communities have proposed a number of approximation techniques over the years. The two prevalent techniques are either (i) MCMC style sampling or (ii) branch-and-bound (B&B) algorithms that iteratively improve model-based bounds using a combination of variable substitution and elimination.
We propose a new anytime B&B approximation scheme that encompasses all prior model-based approximation schemes proposed in the PDB and SRL literature. Our approach relies on the novel idea of “scaled dissociation” which can improve both the upper and lower bounds of existing model-based algorithms. We apply our approach to the well-studied problem of evaluating self-join-free conjunctive queries over tuple-independent PDBs, and show a consistent reduction in approximation error in our experiments on TPC-H, Yago3, and a synthetic benchmark setting.
Algorithms for Automatic Ranking of Participants and Tasks in an Anonymized Contest
Yang Jiao, R. Ravi, Wolfgang Gatterbauer
Special issue on selected papers from WALCOM 2017 (announcement)
Motivated by the problem of automatic grading of students, introduces several variants of the chain editing problem and determines their computational complexities
We introduce a new set of problems based on the Chain Editing problem. In our version of Chain Editing, we are given a set of participants and a set of tasks that every participant attempts. For each participant-task pair, we know whether the participant has succeeded at the task or not. We assume that participants vary in their ability to solve tasks, and that tasks vary in their difficulty to be solved. In an ideal world, stronger participants should succeed at a superset of tasks that weaker participants succeed at. Similarly, easier tasks should be completed successfully by a superset of participants who succeed at harder tasks. In reality, it can happen that a stronger participant fails at a task that a weaker participants succeeds at. Our goal is to find a perfect nesting of the participant-task relations by flipping a minimum number of participant-task relations, implying such a "nearest perfect ordering" to be the one that is closest to the truth of participant strengths and task difficulties. Many variants of the problem are known to be NP-hard.
Algebraic Approximations of the Probability of Boolean Functions
Wolfgang Gatterbauer

2018

Optimal Oblvious bounds
Dissociation-based Oblivious Bounds for Weighted Model Counting
Li Chou, Wolfgang Gatterbauer, Vibhav Gogate
Generalizes the oblivious bounds from TODS 2014 from monotone to non-monotone Boolean functions.
We consider the weighted model counting task which includes important tasks in graphical models, such as computing the partition function and probability of evidence as special cases. We propose a novel partition-based bounding algorithm that exploits logical structure and gives rise to a set of inequalities from which upper (or lower) bounds can be derived efficiently. The bounds come with optimality guarantees under certain conditions and are oblivious in that they require only limited observations of the structure and parameters of the problem. We experimentally compare our bounds with the mini-bucket scheme (which is also oblivious) and show that our new bounds are often superior and never worse on a wide variety of benchmark networks.
Ranked Enumeration
Any-k: Anytime Top-k Tree Pattern Retrieval in Labeled Graphs
Xiaofeng Yang, Deepak Ajwani, Wolfgang Gatterbauer, Patrick K. Nicholson, Mirek Riedewald, Alessandra Sala
Gives a ranked enumeration algorithm for tree patterns (or acyclic graph queries). Introduces the idea of any-k: a marriage between anytime algorithms and top-k evaluation.
Many problems in areas as diverse as recommendation systems, social network analysis, semantic search, and distributed root cause analysis can be modeled as pattern search on labeled graphs (also called "heterogeneous information networks" or HINs). Given a large graph and a query pattern with node and edge label constraints, a fundamental challenge is to find the top-k matches according to a ranking function over edge and node weights. For users, it is difficult to select value k. We therefore propose the novel notion of an any-k ranking algorithm: for a given time budget, return as many of the top-ranked results as possible. Then, given additional time, produce the next lower-ranked results quickly as well. It can be stopped anytime, but may have to continue until all results are returned. This paper focuses on acyclic patterns over arbitrary labeled graphs. We are interested in practical algorithms that effectively exploit (1) properties of heterogeneous networks, in particular selective constraints on labels, and (2) that the users often explore only a fraction of the top-ranked results. Our solution, KARPET, carefully integrates aggressive pruning that leverages the acyclic nature of the query, and incremental guided search. It enables us to prove strong non-trivial time and space guarantees, which is generally considered very hard for this type of graph search problem. Through experimental studies we show that KARPET achieves running times in the order of milliseconds for tree patterns on large networks with millions of nodes and edges.
Ranked Enumeration
Any-k Algorithms for Exploratory Analysis with Conjunctive Queries
Xiaofeng Yang, Mirek Riedewald, Rundong Li, Wolfgang Gatterbauer
Introduces the idea of ranked enumeration of answers to cyclic CQs based on a union over multiple tree decompositions.
We recently proposed the notion of any-k queries, together with the KARPET algorithm, for tree-pattern search in labeled graphs. Anyk extends top-k by not requiring a pre-specified value of k. Instead, an any-k algorithm returns as many of the top-ranked results as possible, for a given time budget. Given additional time, it produces the next-highest ranked results quickly as well. It can be stopped anytime, but may have to continue until all results are returned. In the latter case, any-k takes times similar to an algorithm that first produces all results and then sorts them. We summarize KARPET and argue that it can be extended to support any-k exploratory search for arbitrary conjunctive queries.
Learning From Query-Answers: A Scalable Approach to Belief Updating and Parameter Learning
Niccolò Meneghetti, Oliver Kennedy, Wolfgang Gatterbauer
Special Issue on best of SIGMOD 2017 papers (announcement)
Invited and extended version of SIGMOD 2017.
Tuple-independent and disjoint-independent probabilistic databases (TI- and DI-PDBs) represent uncertain data in a factorized form as a product of independent random variables that represent either tuples (TI-PDBs) or sets of tuples (DI-PDBs). When the user submits a query, the database derives the marginal probabilities of each output-tuple, exploiting the underlying assumptions of statistical independence. While query processing in TI- and DI-PDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results. Addressing this problem is the main focus of this article. We first introduce Beta Probabilistic Databases (B-PDBs), a generalization of TI-PDBs designed to support both (i) belief updating and (ii) parameter learning in a principled and scalable way. The key idea of B-PDBs is to treat each parameter as a latent, Beta-distributed random variable. We show how this simple expedient enables both belief updating and parameter learning in a principled way, without imposing any burden on regular query processing. Building on B-PDBs, we then introduce Dirichlet Probabilistic Databases (D-PDBs), a generalization of DI-PDBs with similar properties. We provide the following key contributions for both B- and D-PDBs: (i) We study the complexity of performing Bayesian belief updates and devise efficient algorithms for certain tractable classes of queries; (ii) we propose a soft-EM algorithm for computing maximum-likelihood estimates of the parameters; (iii) we present an algorithm for efficiently computing conditional probabilities, allowing us to efficiently implement B- and D-PDBs via a standard relational engine; and (iv) we support our conclusions with extensive experimental results.
A General Framework for Anytime Approximation in Probabilistic Databases
Maarten Van den Heuvel, Floris Geerts, Wolfgang Gatterbauer, Martin Theobald
Anytime approximation algorithms that compute the probabilities of queries over probabilistic databases can be of great use to statistical learning tasks. Those approaches have been based so far on either (i) sampling or (ii) branch-and-bound with model-based bounds. We present here a more general branch-and-bound framework that extends the possible bounds by using "dissociation," which yields tighter bounds.
PISTIS: A Conflict of Interest Declaration and Detection System for Peer Review Management
Siyuan Wu, Leong Hou U, Sourav S. Bhowmick, Wolfgang Gatterbauer
Detecting conflicts of interest (COIs) is key for guaranteeing the fairness of a peer-review process. In many conference management systems, the COIs of authors and reviewers are self-declared, and the declaration process is time consuming and potentially incomplete. To address this problem, we demonstrate a novel interactive system called PISTIS that assists the declaration process in a semi-automatic manner. Apart from keyword search and simple filtering, our system provides an interactive graphical interface that helps users explore potential COIs based on the heterogenous data sources. To simply the process of declaration, we also recommend latent COIs using a supervised ranking model that can be iteratively refined from the data collected from past declarations. We believe that PISTIS can be useful as an assistant tool in many real world conference management systems
Web Data Extraction System
Robert Baumgartner, Wolfgang Gatterbauer, Georg Gottlob

2017

Approximate lifted inference
Dissociation and propagation for approximate lifted inference with standard relational database management systems
Wolfgang Gatterbauer, Dan Suciu
Special issue on best papers of VLDB 2015 (announcement)
Extends VLDB 2015 with all proofs and shows how our approach generalizes the idea of propagation algorithms from graphs to propagation on hypergraphs. We also note that the techniques developed in this paper apply immediately to lifted inference from statistical relational models since lifted inference corresponds to PTIME plans in probabilistic databases.
This paper proposes an approach to uncertain query evaluation by which every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known PTIME self-join-free conjunctive queries: A query is in PTIME if and only if our algorithm returns one single plan. Furthermore, our approach is a generalization of a family of efficient ranking functions from graphs to hypergraphs. We also note that the techniques developed in this paper apply immediately to lifted inference from statistical relational models since lifted inference corresponds to PTIME plans in probabilistic databases.
Beta probabilistic databases: A scalable approach to belief updating and parameter learning
Niccolo Meneghetti, Oliver Kennedy, Wolfgang Gatterbauer
Selected among best of SIGMOD 2017 (announcement)
Introduces Beta Probabilistic Databases (B-PDBs), a generalization of tuple-independent PDBs that support both (i) belief updating and (ii) parameter learning in a principled and scalable way.
Tuple-independent probabilistic databases (TI-PDBs) handle uncertainty by annotating each tuple with a probability parameter; when the user submits a query, the database derives the marginal probabilities of each output-tuple, assuming input-tuples are statistically independent. While query processing in TI-PDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results. Addressing this problem is the main focus of this paper. We introduce Beta Probabilistic Databases (B-PDBs), a generalization of TI-PDBs designed to support both (i) belief updating and (ii) parameter learning in a principled and scalable way. The key idea of B-PDBs is to treat each parameter as a latent, Beta-distributed random variable. We show how this simple expedient enables both belief updating and parameter learning in a principled way, without imposing any burden on regular query processing. We use this model to provide the following key contributions: (i) we show how to scalably compute the posterior densities of the parameters given new evidence; (ii) we study the complexity of performing Bayesian belief updates, devising efficient algorithms for tractable classes of queries; (iii) we propose a soft-EM algorithm for computing maximum-likelihood estimates of the parameters; (iv) we show how to embed the proposed algorithms into a standard relational engine; (v) we support our conclusions with extensive experimental results.
Algebraic Amplification
The linearization of belief propagation on pairwise Markov Random Fields
Wolfgang Gatterbauer
Generalizes the linearization from VLDB 2015 to arbitrary pairwise MRFs.
Belief Propagation (BP) is a widely used approximation for exact probabilistic inference in graphical models, such as Markov Random Fields (MRFs). In graphs with cycles, however, no exact convergence guarantees for BP are known, in general. For the case when all edges in the MRF carry the same symmetric, doubly stochastic potential, recent works have proposed to approximate BP by linearizing the update equations around default values, which was shown to work well for the problem of node classification. The present paper generalizes all prior work and derives an approach that approximates loopy BP on any pairwise MRF with the problem of solving a linear equation system. This approach combines exact convergence guarantees and a fast matrix implementation with the ability to model heterogenous networks. Experiments on synthetic graphs with planted edge potentials show that the linearization has comparable labeling accuracy as BP for graphs with weak potentials, while speeding-up inference by orders of magnitude.
Conflict of interest declaration and detection system in heterogeneous networks
Siyuan Wu, Leong Hou U, Sourav S Bhowmick, Wolfgang Gatterbauer
Peer review is the most critical process in evaluating an article to be accepted for publication in an academic venue. When assigning a reviewer to evaluate an article, the assignment should be aware of conflicts of interest (COIs) such that the reviews are fair to everyone. However, existing conference management systems simply ask reviewers and authors to declare their explicit COIs through a plain search user interface guided by some simple conflict rules. We argue that such declaration system is not enough to discover all latent COI cases. In this work, we study a graphical declaration system that visualizes the relationships of authors and reviewers based on a heterogeneous co-authorship network. With the help of the declarations, we attempt to detect the latent COIs automatically based on the meta-paths of a heterogeneous network.
Algorithms for automatic ranking of participants and tasks in an anonymized contest
Yang Jiao, R. Ravi, Wolfgang Gatterbauer
Selected among best of WALCOM 2017 (announcement)
Motivated by the problem of automatic grading of students, introduces several variants of the chain editing problem and determines their computational complexities
We consider the weighted model counting task which includes important tasks in graphical models, such as computing the partition function and probability of evidence as special cases. We propose a novel partition-based bounding algorithm that exploits logical structure and gives rise to a set of inequalities from which upper (or lower) bounds can be derived efficiently. The bounds come with optimality guarantees under certain conditions and are oblivious in that they require only limited observations of the structure and parameters of the problem. We experimentally compare our bounds with the mini-bucket scheme (which is also oblivious) and show that our new bounds are often superior and never worse on a wide variety of benchmark networks.

2015

Causality
The complexity of resilience and responsibility for self-join-free conjunctive queries
Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, Alexandra Meliou
Several research thrusts in the area of data management have focused on understanding how changes in the data affect the output of a view or standing query. Example applications are explaining query results, propagating updates through views, and anonymizing datasets. These applications usually rely on understanding how interventions in a database impact the output of a query. An important aspect of this analysis is the problem of deleting a minimum number of tuples from the input tables to make a given Boolean query false. We refer to this problem as "the resilience of a query" and show its connections to the well-studied problems of deletion propagation and causal responsibility. In this paper, we study the complexity of resilience for self-join-free conjunctive queries, and also make several contributions to previous known results for the problems of deletion propagation with source side-effects and causal responsibility: (1) We define the notion of resilience and provide a complete dichotomy for the class of self-join-free conjunctive queries with arbitrary functional dependencies; this dichotomy also extends and generalizes previous tractability results on deletion propagation with source side-effects. (2) We formalize the connection between resilience and causal responsibility, and show that resilience has a larger class of tractable queries than responsibility. (3) We identify a mistake in a previous dichotomy for the problem of causal responsibility and offer a revised characterization based on new, simpler, and more intuitive notions. (4) Finally, we extend the dichotomy for causal responsibility in two ways: (a) we treat cases where the input tables contain functional dependencies, and (b) we compute responsibility for a set of tuples specified via wildcards.
Linearized Belief Propagation
Linearized and single-pass belief propagation
Wolfgang Gatterbauer, Stephan Günnemann, Danai Koutra, Christos Faloutsos
Introduces a linearization of Belief Propagation (called Linearized Belief Propagation = LinBP) that makes Loopy BP practical for node labeling: It comes with exact convergence guarantees, has a closed-form solution via intuitive matrix equations, and allows a two-line implementation in Python via sparse matrix multiplication.
How can we tell when accounts are fake or real in a social network? And how can we tell which accounts belong to liberal, conservative or centrist users? Often, we can answer such questions and label nodes in a network based on the labels of their neighbors and appropriate assumptions of homophily ("birds of a feather flock together") or heterophily ("opposites attract"). One of the most widely used methods for this kind of inference is Belief Propagation (BP) which iteratively propagates the information from a few nodes with explicit labels throughout a network until convergence. A well-known problem with BP, however, is that there are no known exact guarantees of convergence in graphs with loops. This paper introduces Linearized Belief Propagation (LinBP), a linearization of BP that allows a closed-form solution via intuitive matrix equations and, thus, comes with exact convergence guarantees. It handles homophily, heterophily, and more general cases that arise in multi-class settings. Plus, it allows a compact implementation in SQL. The paper also introduces Single-pass Belief Propagation (SBP), a localized (or "myopic") version of LinBP that propagates information across every edge at most once and for which the final class assignments depend only on the nearest labeled neighbors. In addition, SBP allows fast incremental updates in dynamic networks. Our runtime experiments show that LinBP and SBP are orders of magnitude faster than standard BP, while leading to almost identical node labels.
Approximate lifted inference
Approximate lifted inference with probabilistic databases
Wolfgang Gatterbauer, Dan Suciu
Selected among best of VLDB 2015 (announcement)
This paper proposes a new approach for approximate evaluation of #P-hard queries with probabilistic databases. In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. Importantly, this algorithm is a strict generalization of all known results of PTIME self-join-free conjunctive queries: A query is safe if and only if our algorithm returns one single plan. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers. (Invited to the Special Issue of VLDB Journal on VLDB 2015)
This paper proposes a new approach for approximate evaluation of #P-hard queries with probabilistic databases. In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known results of PTIME self-join-free conjunctive queries: A query is safe if and only if our algorithm returns one single plan. We also apply three relational query optimization techniques to evaluate all minimal safe plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers.
Fault-Tolerant Entity Resolution with the Crowd
Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, Donald Kossmann

2014

Optimal Oblvious bounds
Oblivious bounds on the probability of Boolean functions
Wolfgang Gatterbauer, Dan Suciu
This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time.
This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e. when the new probabilities are chosen independent of the probabilities of all other variables. Our motivation comes from the weighted model counting problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #P-hard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute, into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound on the probability of the original formula by choosing appropriate probabilities for the dissociated variables. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time.
Algebraic Amplification
Semi-supervised learning with heterophily
Wolfgang Gatterbauer
We propose a novel linear semi-supervised learning formulation that is derived from a solid probabilistic framework: belief propagation. We show that our formulation generalizes a number of label propagation algorithms described in the literature by allowing them to propagate generalized assumptions about influences between classes of neighboring nodes. We call this formulation Semi-Supervised Learning with Heterophily (SSL-H). We also show how the modularization matrix can be learned from observed data with a simple convex optimization framework that is inspired by locally linear embedding. We call this approach Linear Heterophily Estimation (LHE). Experiments on synthetic data show that both approaches combined can learn heterophily of agraph with 1M nodes and 10M edges in under 1min.

2013

Deregulated Electricity Markets
Counterexamples to commonly held assumptions on unit commitment and market power assessment
Wolfgang Gatterbauer, Marija Ilic
Chapter 10 of: Engineering IT-Enabled Sustainable Electricity Services, Marija Ilic, Le Xie, Qizing Liu (Eds.), Springer, 2013.
Within the context of the ongoing deregulation of the electricity industry, we disprove in the first part the commonly stated assumption that, in theory and under the condition of perfect information, decentralized and centralized unit commitment would lead to the same power quantities traded and, hence, to the same optimal social welfare. We show that, even in the absence of any uncertainties, independent optimization of the individual performance objectives by the decentralized market participants can actually lead to lower efficiency than centralized minimization of total operating cost. This result concerns short-term supply optimization for a given demand, and does not consider long-term investment issues. In the second part, we take the position of an individual market participant in a deregulated electricity market. We investigate the task of optimally self-scheduling generators to maximize profits by using stochastic dynamic programming. With the help of actual price forecast data from the ISO New England electricity wholesale market, we demonstrate the improvements that result from modeling the forecast price errors assuming a Cauchy error distribution instead of the commonly used Normal distribution.

2012

Querying provenance for ranking and recommending
Zachary Ives, Andreas Haeberlen, Tao Feng, Wolfgang Gatterbauer

2011

Default-all is dangerous!
Wolfgang Gatterbauer, Alexandra Meliou, Dan Suciu
Knowledge Aquisition
Rules of thumb for information acquisition from large and redundant data
Wolfgang Gatterbauer
This paper shows that the "default-all propagation" scheme for database annotations can have some problematic semantic consequences. We propose an alternative "minimum-propagation" provenance that fixes the issue and that comes with several desirable properties.
QueryVis project page
QueryViz: Helping users understand SQL queries and their patterns
Jonathan Danaparamita, Wolfgang Gatterbauer
Proposes QueryViz, a light-weight add-on to existing databases. Copy your SQL query into the interactive interface at http://queryviz.com and look at the visualization of the query's relational pattern.
We present QueryViz, a novel visualization tool for SQL queries that reduces the time needed to read and understand existing queries. It targets two principal audiences: (i) users who often issue the same or similar queries and who need to quickly browse through a repository of existing queries; and (ii) novices that try to familiarize themselves with the logic behind alternative patterns of SQL queries. QueryViz uses as input only two strings: the database schema and the SQL query. It can thus serve as light-weight add-on to existing database systems and is also available via an online interface at http://queryviz.com. In this demonstration, we explain our visual alphabet, walk through the visualization algorithm, and let users experience the difference in understanding SQL queries from text or our graphical representation while browsing through repositories of well-known textbook SQL queries.
Tracing data errors with view-conditioned causality
Alexandra Meliou, Wolfgang Gatterbauer, Suman Nath, Dan Suciu
This paper shows how causal reasoning can be used for post-factum data cleaning, where errors are detected after data has been transformed (e.g. by a query) and which need to be corrected in the original input data. We achieve this with a novel way for translating the problem into a SAT and a weighted MAX-SAT problem, and then using very effective existing tools.
QueryVis project page
Databases will visualize queries too
Wolfgang Gatterbauer
Describes a new human-query interaction in which users re-use existing queries as templates to compose their own queries. This interaction is made possible with new automatic query visualization tools (such as QueryViz, or now QueryVis) which help users understand SQL patterns quickly.
Visual Query Languages study ways to help users compose queries with visual metaphors. Information Visualization studies automatic visualization techniques to help users understand and analyze data. Query Management focuses on ways to help users manage and re-use existing queries. We observe that there is a related research question across those three topics which has not received much attention, namely that of Query Visualization: How to visually represent a query to help users quickly understand its intent? Here we argue that the involved challenges are still markedly different from those of the other three, that a solution can considerably improve the usability of DBMSs, and that the topic is thus worthy of attention. We envision, that in a few years, there will be free, modular, and lightweight tools available that allow users to visualize and interpret their queries.
Managing structured collections of community data
Wolfgang Gatterbauer, Dan Suciu
Causality
Reverse data management
Alexandra Meliou, Wolfgang Gatterbauer, Dan Suciu
Causality
Bringing provenance to its full potential using causal reasoning
Alexandra Meliou, Wolfgang Gatterbauer, Dan Suciu
Session-based browsing for better query reuse
Nodira Khoussainova, Yongchul Kwon, Wei-Ting Liao, Magdalena Balazinska, Wolfgang Gatterbauer, Dan Suciu

2010

Conflict resolution
Data conflict resolution using trust mappings
Wolfgang Gatterbauer, Dan Suciu
Causality
The complexity of causality and responsibility for query answers and non-answers
Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, Dan Suciu
Dissociation and propagation for efficient query evaluation over probabilistic databases
Wolfgang Gatterbauer, Abhay K. Jha, Dan Suciu
MUD 2010, pp. 83-97
Causality
Causality in databases
Alexandra Meliou, Wolfgang Gatterbauer, Joseph Halpern, Christoph Koch, Katherine F. Moore, Dan Suciu
Causality
Why so? or Why no? Functional causality for explaining query answers
Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, Dan Suciu
MUD 2010, pp. 3-17
Visual structure-based web page clustering and retrieval
Paul Bohunsky, Wolfgang Gatterbauer

2009

Conflict resolution
Believe it or not: Adding belief annotations to databases
Wolfgang Gatterbauer, Magda Balazinska, Nodira Khoussainova, Dan Suciu
Integrating and ranking uncertain scientific data
Landon Detwiler, Wolfgang Gatterbauer, Brent Louie, Dan Suciu, Peter Tarczy-Hornoch
Collaborative Query Management
A Case for a collaborative query management system
Nodira Khoussainova, Magdalena Balazinska, Wolfgang Gatterbauer, YongChul Kwon, Dan Suciu
Web Data Extraction System
Robert Baumgartner, Wolfgang Gatterbauer, Georg Gottlob

2007

Web tables
Towards Domain-Independent Information Extraction from Web Tables
Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krüpl, Bernhard Pollak
Creating Permanent Test Collections of Web Pages for Information Extraction Research
Bernhard Pollak, Wolfgang Gatterbauer
SOFSEM 2007, Volume II, pp. 103-115
WebPageDump is a Firefox extension which allows you to save local copies of pages from the Web. It sounds simple, but it's not. The standard "Save page as" function of web browsers fails with most dynamic web pages and this shortcoming was a serious problem for our research. Comes the birth of WebPageDump. We hope you find it useful too.

2006

Web tables
Table Extraction Using Spatial Reasoning on the CSS2 Visual Box Model
Wolfgang Gatterbauer, Paul Bohunsky
Knowledge Aquisition
Estimating Required Recall for Successful Knowledge Acquisition from the Web
Wolfgang Gatterbauer

2005

Using visual cues for extraction of tabular data from arbitrary HTML documents
Bernhard Krüpl, Marcus Herzog, Wolfgang Gatterbauer
Web information extraction using eupeptic data in Web tables
Wolfgang Gatterbauer, Bernhard Krüpl, Wolfgang Holzinger, Marcus Herzog

2002 (Electrical Engineering)

Deregulated Electricity Markets
Counterexamples to commonly held assumptions on unit commitment and market power assessment
Wolfgang Gatterbauer, Marija Ilic
Deregulated Electricity Markets
Interdependencies of electricity market characteristics and bidding strategies of power producers
Wolfgang Gatterbauer
Master's thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, May 2002
Within the context of the ongoing deregulation of the electricity industry, we disprove in the first part the commonly stated assumption that, in theory and under the condition of perfect information, decentralized and centralized unit commitment would lead to the same power quantities traded and, hence, to the same optimal social welfare. We show that, even in the absence of any uncertainties, independent optimization of the individual performance objectives by the decentralized market participants can actually lead to lower efficiency than centralized minimization of total operating cost. This result concerns short-term supply optimization for a given demand, and does not consider long-term investment issues.
In the second part, we take the position of an individual market participant in a deregulated electricity market. We investigate the task of optimally self-scheduling generators to maximize profits by using stochastic dynamic programming. With the help of actual price forecast data from the ISO New England electricity wholesale market, we demonstrate the improvements that result from modeling the forecast price errors assuming a Cauchy error distribution instead of the commonly used Normal distribution.
Finally, by taking statistic uncertainties and inter-temporal interdependencies into account, we show in the third part that a generator owner's optimum bid sequence for a centralized auction market is generally above marginal cost. In contrast to current literature, this is true even where absolutely no abuse of market power is involved. We conclude that marginal production costs cannot be used by themselves as baseline for the assessment of market power in electricity markets.

2000 (Mechanical Engineering)

Der Graz Cycle für Industriekraftwerke gefeuert mit Brenngasen aus Kohle- und Schwerölvergasung
(Combining the Graz Cycle with coal and heavy oil gasification for industrial power stations)
Herbert Jericha, Armin Lukasser, Wolfgang Gatterbauer
VDI-Berichte Nr. 1566, Gas Turbines for Combined Cycles, pp. 177-185, Essen, Germany, September 2000

Other links