Wolfgang Gatterbauer, Northeastern University, Publications

On the Reasonable Effectiveness of Relational Diagrams: Explaining Relational Query Patterns and the Pattern Expressiveness of Relational Languages

Wolfgang Gatterbauer, Cody Dunne

PACMMOD (SIGMOD'24), 2(1), Article 61, February 2024.

Project page: Relational Diagrams

SIGMOD 2024 best paper honorable mention (1/3) (announcement)

Proposes a semantic definition of relational query patterns, which allows us to analyze the relative pattern expressiveness of relational query languages. Also proposes "relational diagrams", a natural diagrammatic representation of tuple relational calculus.

Comparing relational languages by their logical expressiveness is well understood. Less well understood is how to compare relational languages by their ability to represent relational query patterns. Indeed, what are query patterns other than “a certain way of writing a query”? And how can query patterns be defined across procedural and declarative languages, irrespective of their syntax? To the best of our knowledge, we provide the first semantic definition of relational query patterns by using a variant of structure-preserving mappings between the relational tables of queries. This formalism allows us to analyze the relative pattern expressiveness of relational language fragments and create a hierarchy of languages with equal logical expressiveness yet different pattern expressiveness. Notably, for the non-disjunctive language fragment, we show that relational calculus can express a larger class of patterns than the basic operators of relational algebra.
Our language-independent definition of query patterns opens novel paths for assisting database users. For example, these patterns could be leveraged to create visual query representations that faithfully represent query patterns, speed up interpretation, and provide visual feedback during query editing. As a concrete example, we propose Relational Diagrams, a complete and sound diagrammatic representation of safe relational calculus that is provably (𝑖) unambiguous, (𝑖𝑖) relationally complete, and (𝑖𝑖𝑖) able to represent all query patterns for unions of non-disjunctive queries. Among all diagrammatic representations for relational queries that we are aware of, ours is the only one with these three properties. Furthermore, our anonymously preregistered user study shows that Relational Diagrams allow users to recognize patterns meaningfully faster and more accurately than SQL.

A Comprehensive Tutorial on over 100 years of Diagrammatic Representations of Logical Statements and Relational Queries

Wolfgang Gatterbauer

ICDE 2024.

preprint | arXiv:2404.00007 | Tutorial slides (515 pages, 50 MB) | gs

Tutorial page: Diagrammatic representation tutorial, Project page: Relational Diagrams

Surveys the key visual metaphors developed for visual representations of relational expressions.

Query formulation is increasingly performed by systems that need to guess a user’s intent (e.g. via spoken word interfaces). But how can a user know that the computational agent is returning answers to the “right” query? More generally, given that relational queries can become pretty complicated, how can we help users understand relational queries, whether human-generated or automatically generated? Now seems the right moment to revisit a topic that predates the birth of the relational model: developing visual metaphors that help users understand relational queries.
This lecture-style tutorial surveys the key visual metaphors developed for diagrammatic representations of logical statements and relational expressions, across both the relational database and the much older diagrammatic reasoning communities. We survey the history and state-of-the-art of relationally-complete diagrammatic representations of relational queries, discuss the key visual metaphors developed in over a century of investigations into diagrammatic languages, and organize the landscape by mapping the visual alphabets of diagrammatic representation systems to the syntax and semantics of Relational Algebra (RA) and Relational Calculus (RC). Tutorial website: https://northeastern-datalab.github.io/diagrammatic-representation-tutorial/

A Unified Approach for Resilience and Causal Responsibility with Integer Linear Programming (ILP) and LP Relaxations

Neha Makhija, Wolfgang Gatterbauer

PACMMOD (SIGMOD'24), 1(4), Article 228, December 2023.

Project page: Unified Reverse Data Management

Proposes a unified approach for solving reverse data management problems (including various interventions for fairness): all cases (including self-joins, bags, PTIME or NP-complete cases) are solved with the same algorithm. The algorithm just "happens" to terminate in PTIME for all known PTIME cases. Also suggests a non-trivial Disjunctive Logic Program that can automatically find "hardness certificates" (think gadgets) for conjunctive queries.

Resilience is one of the key algorithmic problems underlying various forms of reverse data management (such as view maintenance, deletion propagation, and various interventions for fairness): What is the minimal number of tuples to delete from a database in order to remove all answers from a query? A long-open question is determining those conjunctive queries (CQs) for which this problem can be solved in guaranteed PTIME. We shed new light on this and the related problem of causal responsibility by proposing a unified Integer Linear Programming (ILP) formulation. It is unified in that it can solve both prior studied restrictions (e.g., self-join-free CQs under set semantics that allow a PTIME solution) and new cases (e.g., all CQs under set or bag semantics It is also unified in that all queries and all instances are treated with the same approach, and the algorithm is guaranteed to terminate in PTIME for the easy cases. We prove that, for all easy self-join-free CQs, the Linear Programming (LP) relaxation of our encoding is identical to the ILP solution and thus standard ILP solvers are guaranteed to return the solution in PTIME. Our approach opens up the door to new variants and new fine-grained analysis: 1) It also works under bag semantics and we give the first dichotomy result for bags semantics in the problem space. 2) We give a more fine-grained analysis of the complexity of causal responsibility. 3) We recover easy instances for generally hard queries, such as instances with read-once provenance and instances that become easy because of Functional Dependencies in the data. 4) We solve an open conjecture from PODS 2020. 5) Experiments confirm that our results indeed predict the asymptotic running times, and that our universal ILP encoding is at times even faster to solve for the PTIME cases than a prior proposed dedicated flow algorithm.

Minimally Factorizing the Provenance of Self-Join Free Conjunctive Queries

Neha Makhija, Wolfgang Gatterbauer

PACMMOD (PODS'24), 2(2), Article 104, May 2024.

highlight ACM | preprint | arXiv:2105.14307 | bib

Project page: Unified Reverse Data Management

Proposes the problem of finding the minimal-size factorization of the provenance of queries. Shows it to be a natural generalization and to recover all known PTIME cases of the problem of exact probabilistic inference. Makes first steps towards a dichotomy of the data complexity of the problem.

We consider the problem of finding the minimal-size factorization of the provenance of self-join-free conjunctive queries, i.e., we want to find an equivalent propositional formula that minimizes the number of variable occurrences. Our work is partly motivated from probabilistic inference where read-once formulas are known to allow exact PTIME solutions and non-read-once formulas allow approximate solutions with an error that depends on the number of repetitions of variables. We embark on the challenge of characterizing the data complexity of this problem and show its connection to the query resilience problem. While the problem is NP-complete in general, we develop an encoding as max-flow problem that is guaranteed to give the exact solution for several queries (and otherwise approximate minimizations). We show that our encoding is guaranteed to return a read-once factorization if it exists. Our problem and approach is a complete solution that naturally recovers exact solutions for all known PTIME cases, as well as identifying additional queries for which the problem can be solved in PTIME.

HITSNDIFFS: From Truth Discovery to Ability Discovery by Recovering Matrices with the Consecutive Ones Property

Zixuan Chen, Subhodeep Mitra, R Ravi, Wolfgang Gatterbauer

ICDE 2024.

preprint | arXiv:2401.00013 (long version) | Code on Github | gs

Unifies work on matrices with the Consecutive Ones Property (C1P) with Item Response Theory (IRT) by proposing a new variant of HITS which is guaranteed to return a C1P permutation of a matrix if it exists. We show the algorithm's potential for "ability discovery", the dual problem of "truth discovery" in the context of crowd sourcing.

We analyze a general problem in a crowd-sourced setting where one user asks a question (also called item) and other users return answers (also called labels) for this question. Different from existing crowd sourcing work which focuses on finding the most appropriate label for the question (the “truth”), our problem is to determine a ranking of the users based on their ability to answer questions. We call this problem “ability discovery” to emphasize the connection to and duality with the more well-studied problem of “truth discovery”.
To model items and their labels in a principled way, we draw upon Item Response Theory (IRT) which is the widely accepted theory behind standardized tests such as SAT and GRE. We start from an idealized setting where the relative performance of users is consistent across items and better users choose better fitting labels for each item. We posit that a principled algorithmic solution to our more general problem should solve this ideal setting correctly and observe that the response matrices in this setting obey the Consecutive Ones Property (C1P). While C1P is well understood algorithmically with various discrete algorithms, we devise a novel variant of the HITS algorithm which we call “HITSnDIFFs” (or HnD), and prove that it can recover the ideal C1P-permutation in case it exists. Unlike fast combinatorial algorithms for finding the consecutive ones permutation (if it exists), H N D also returns an ordering when such a permutation does not exist. Thus it provides a principled heuristic for our problem that is guaranteed to return the correct answer in the ideal setting. Our experiments show that H N D produces user rankings with robustly high accuracy compared to state-of-the-art truth discovery methods. We also show that our novel variant of HITS scales better in the number of users than ABH, the only prior spectral C1P reconstruction algorithm.

A Tutorial on Visual Representations of Relational Queries

Wolfgang Gatterbauer

PVLDB, 16(12), pp. 3890-3893, 2023.

ACM | PVLDB | arXiv:2308.10319 | Tutorial slides (306 pages, 35 MB) | gs | bib

Tutorial page: Visual query representations, Project page: Relational Diagrams

Surveys the key visual metaphors developed for visual representations of relational expressions.

Query formulation is increasingly performed by systems that need to guess a user’s intent (e.g. via spoken word interfaces). But how can a user know that the computational agent is returning answers to the “right” query? More generally, given that relational queries can become pretty complicated, how can we help users understand existing relational queries, whether human-generated or automatically generated? Now seems the right moment to revisit a topic that predates the birth of the relational model: developing visual metaphors that help users understand relational queries.
This lecture-style tutorial surveys the key visual metaphors developed for visual representations of relational expressions. We will survey the history and state-of-the art of relationally-complete diagrammatic representations of relational queries, discuss the key visual metaphors developed in over a century of investigating diagrammatic languages, and organize the landscape by mapping their used visual alphabets to the syntax and semantics of Relational Algebra (RA) and Relational Calculus (RC).

Towards Unbiased Exploration in Partial Label Learning

Zsolt Zombori, Agapi Rissaki, Kristóf Szabó, Wolfgang Gatterbauer, Michael Benedikt

Arxiv 2307.00465, 2023.

arXiv:2307.00465

Shows that standard cross-entropy as loss function can lead to a winner-takes-all phenomenon for supervised learning under random initialization. Proposes a variant of entropy regularization that fixes the problem for disjunctive supervision.

We consider learning a probabilistic classifier from partially-labelled supervision (inputs denoted with multiple possibilities) using standard neural architectures with a softmax as the final layer. We identify a bias phenomenon that can arise from the softmax layer in even simple architectures that prevents proper exploration of alternative options, making the dynamics of gradient descent overly sensitive to initialisation. We introduce a novel loss function that allows for unbiased exploration within the space of alternative outputs. We give a theoretical justification for our loss function, and provide an extensive evaluation of its impact on synthetic data, on standard partially labelled benchmarks and on a contributed novel benchmark related to an existing rule learning challenge.

DomainNet: Homograph Detection and Understanding in Data Lake Disambiguation

Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

TODS, 48(3), Article 9, pp 1-40, 2023.

ACM | bib

Project page: Table-as-query

Invited version from EDBT 2021 best paper award

Proposes an approach to detect homographs (i.e. data values with multiple meanings) in table repositories. Our idea is simple and surprisingly effective: we apply ideas from community detection in networks to a network representation of tables.

Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values.
DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs has a precision and a recall of 38% versus 69% with our method on a synthetic benchmark. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake our top200 precision is 89%.

SANTOS: Relationship-based Semantic Table Union Search

Aamod Khatiwada, Grace Fan, Zixuan Chen, Roee Shraga, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

PACMMOD (SIGMOD'23), 1(1), Article 9, May 2023.

ACM | arXiv:2209.13589 | video (12min) | gs | bib

Project page: Table-as-query

Proposes the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Interestingly, the necessary synthetic knowledge base can be created from the very data lake we are trying to integrate.

Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover semantic relationship between pairs of columns: The first uses an existing knowledge base (KB), the second (which we call a “synthesized KB”) uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm called SANTOS outperforms a state- of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically in all benchmarks that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating a synthesized KB from the data lakes having limited KB coverage and perform union search over them.

Efficient Computation of Quantiles over Joins

Nikolaos Tziavelis, Nofar Carmeli, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald

PODS, pp. 303–315, 2023.

ACM | arXiv:2305.16525 | HAL:lirmm-04277973 | video (18min) | gs | bib

Project page: Any-k

Introduces the concept of quantile queries (%JQs) and gives new complexity results: %JQs ask for the answer at some specified relative position (e.g., 50% for the median) under some ordering over the answers to a Join Query (JQ).

We present efficient algorithms for Quantile Join Queries, abbreviated as %JQ. A %JQ asks for the answer at some specified relative position (e.g., 50% for the median) under some ordering over the answers to a Join Query (JQ). Our goal is to avoid materializing the set of all query answers, and to achieve quasilinear time in the size of the database, regardless of the total number of answers. A recent dichotomy result rules out the existence of such an algorithm for a general family of queries and orders. Specifically, for acyclic JQs without self-joins, the problem becomes intractable for ordering by sum whenever we are essentially joining more than two relations. Moreover, even for basic ranking functions beyond sum, such as min or max over different attributes, it has not been known so far whether there is any nontrivial tractable %JQ.
In this work, we develop a new approach to solving %JQ and show how this approach allows not just to recover known results, but also generalize them and resolve open cases. Our solution uses two subroutines: The first one needs to select what we call a ``pivot answer.'' The second subroutine partitions the space of query answers according to this pivot, and continues searching in one partition that is represented as new %JQ over a new database. For pivot selection, we develop an algorithm that, under mild assumptions of monotonicity, works for every ranking function. The second subroutine requires a customized construction for the specific ranking function at hand.
We show the benefit and generality of our approach by using it to establish several new complexity results. First, we prove the tractability of min and max for all acyclic JQs, thereby resolving the above question. Second, we extend the previous dichotomy for sum to all partial sums (over all subsets of the attributes). Third, we handle the intractable cases of sum by devising a deterministic approximation scheme that applies to every acyclic JQ.

Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries

Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald

TODS, 48(1), Article 1, pp 1-45, 2023.

ACM | arXiv:2012.11965 | HAL:hal-04278100 | gs | bib

Project page: Any-k

Invited version from best of PODS 2021

Builds upon our PODS'21 work on the complexity of direct access to a ranked list of answers to a database query. Establishes a complete dichotomy of the selection problem with lexicographic orders.

We study the question of when we can answer a Conjunctive Query (CQ) with an ordering over the answers by constructing a structure for direct (random) access to the sorted list of answers, without actually materializing this list, so that the construction time is linear (or quasilinear) in the size of the database. In the absence of answer ordering, such a construction has been devised for the task of enumerating query answers of free-connex acyclic CQs, so that the access time is logarithmic. Moreover, it follows from past results that within the class of CQs without self-joins, being free-connex acyclic is necessary for the existence of such a construction (under conventional assumptions in fine-grained complexity).
In this work, we embark on the challenge of identifying the answer orderings that allow for ranked direct access with the above complexity guarantees. We begin with the class of lexicographic orderings and give a decidable characterization of the class of feasible such orderings for every CQ without self-joins. We then continue to the more general case of orderings by the sum of attribute scores. As it turns out, in this case ranked direct access is feasible only in trivial cases. Hence, to better understand the computational challenge at hand, we consider the more modest task of providing access to only one single answer (i.e., finding the answer at a given position). We indeed achieve a quasilinear-time algorithm for a subset of the class of full CQs without self-joins, by adopting a solution of Frederickson and Johnson to the classic problem of selection over sorted matrices. We further prove that none of the other queries in this class admit such an algorithm.

Integrating Data Lake Tables

Aamod Khatiwada, Roee Shraga, Wolfgang Gatterbauer, Renée J. Miller

PVLDB 16(4):932-945, 2022.

ACM | VLDB | gs | bib

Project page: Table-as-query

Proposes a heuristic algorithm to compute Full Disjunction (an associative variant of outer joins) by using complementation and subsumption operators.

We have made tremendous strides in providing tools for data scientists to discover new tables useful for their analyses. But despite these advances, the proper integration of discovered tables has been under-explored. An interesting semantics for integration, called Full Disjunction, was proposed in the 1980’s, but there has been little progress in using it for data science to integrate tables culled from data lakes. We provide ALITE, a new proposal for integration of tables that may have been discovered using join, union or related table search. We empirically show that ALITE can outperform previous algorithms for computing the Full Disjunction. ALITE relaxes previous assumptions that tables share common attribute names (which completely determine the join columns), are complete (without null values), and have acyclic join patterns. To evaluate ALITE, we develop and share three new benchmarks for integration that use real data lake tables.

Principles of Query Visualization

Wolfgang Gatterbauer, Cody Dunne, H.V. Jagadish, Mirek Riedewald

IEEE Data Engineering Bulletin, Special Issue on Widening the Impact of Data Engineering through Innovations in Education, Interfaces, and Features, 45(3):47-67, 2022.

IEEE | arXiv:2208.01613 | gs | bib

Project page: QueryVis

Discusses the principles of relational query visualization and its potential for simplifying user interactions with relational data

Query Visualization (QV) is the problem of transforming a given query into a graphical representation that helps humans understand its meaning. This task is notably different from designing a Visual Query Language (VQL) that helps a user compose a query. This article discusses the principles of relational query visualization and its potential for simplifying user interactions with relational data.

Interpreting and understanding relational database queries using diagrams

Wolfgang Gatterbauer

Tutorial @ DIAGRAMS, 2022.

Project page: QueryVis

Any-k Algorithms for Enumerating Ranked Answers to Conjunctive Queries

Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald

arXiv:2205.05649, 2022.

highlight arXiv:2205.05649 | talk by Nikos @ Simons Institute (32min)

Project page: Any-k

Proposes the currently fastest ranked enumeration approach called Anyk-Part++. This approach asymptotically dominates all prior approaches both for TTF (Time-To-First) and TTL (Time-To-Last).

We study ranked enumeration for Conjunctive Queries (CQs) where the answers are ordered by a given ranking function (e.g., an ORDER BY clause in SQL). We develop "any-k" algorithms which, without knowing the number k of desired answers, push the ranking into joins and avoid materializing the join output earlier than necessary. For this to be possible, the ranking function needs to obey a certain kind of monotonicity; the supported ranking functions include the common sum-of-weights case where query answers are compared by sums of input weights, as well as any commutative selective dioid. One core insight of our work is that the problem is closely related to the fundamental task of path enumeration in a weighted DAG. We generalize and improve upon classic research on finding the k'th shortest path and unify into the same framework several solutions from different areas that had been studied in isolation. For the time to the k'th ranked CQ answer (for every value of k), our approach is optimal in data complexity precisely for the same class of queries where unranked enumeration is optimal -- and only slower by a logarithmic factor. In a more careful analysis of combined complexity, we uncover a previously unknown tradeoff between two different any-k algorithms: one has lower complexity when the number of returned results is small, the other when the number is very large. This tradeoff is eliminated under a stricter monotonicity property that we exploit to design a novel algorithm that asymptotically dominates all previously known alternatives, including the well-known algorithm of Eppstein for sum-of-weights path enumeration. We empirically demonstrate the findings of our theoretical analysis in an experimental study that highlights the superiority of our approach over the join-then-rank approach that existing database systems typically follow.

Toward Responsive DBMS: Optimal Join Algorithms, Enumeration, Factorization, Ranking, and Dynamic Programming

Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald

ICDE tutorials, 2022, pp. 3205-3208, 2022.

Tutorial page: Towards Responsive DBMS, Project page: Any-k

This tutorial shows interesting connections between a wide range of fundamental problems in computer science, including dynamic programming, factorized representations, join algorithms, and answer enumeration.

When processing join queries over big data, a DBMS can become unresponsive, i.e., it takes very long until any output tuples appear. Ranked enumeration addresses this problem by attempting to return the most important answers as quickly as possible, ideally in time that is linear (or quasilinear) in input size, even if the complete output is much larger. Aside from its practical usefulness, ranked enumeration is closely related to, and in a way unifies, several other problems involving joins. The common goal is the design of optimal algorithms that are guaranteed to avoid large intermediate results and thus achieve time or space complexity close to a lower bound. Arguably, avoiding query plans that produce huge intermediate results has been an overarching goal of database optimizers, which is part of the reason why optimal join algorithms, enumeration, and factorized representations have generated a lot of excitement. In this tutorial, we embark on an exploration of these topics, showing how they are intimately connected with a wide range of fundamental problems in computer science.

Beyond Equi-joins: Ranking, Enumeration and Factorization

Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald

PVLDB, 14(11): 2599-2612, 2021.

Project page: Any-k

Develops methods for ranked enumeration for answers to queries with general join predicates involving conjunctions and disjunctions of inequalities. Surprisingly, the suggested solution comes to within a polylogarithmic factor of the best-known guarantee for equi-joins.

We study full acyclic join queries with general join predicates that involve conjunctions and disjunctions of inequalities, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach offers strong time and space complexity guarantees in the standard RAM model of computation, getting surprisingly close to those of equi-joins. With n denoting the number of tuples in the database, we guarantee that for every value of k, the k top-ranked answers are returned in O(n polylog n + k log k) time and O(n polylog n + k) space. This is within a polylogarithmic factor of the best-known guarantee for equi-joins and even O(n+k), the time it takes to look at the input and output k answers. The key ingredient is an O(n polylog n)-size factorized representation of the query output, which is constructed on-the-fly for a given query and database. As a side benefit, our techniques are also applicable to unranked enumeration (where answers can be returned in any order) for joins with inequalities, returning k answers in O(n polylog n + k). This guarantee improves over the state of the art for large values of k. In an experimental study, we show that our ranked-enumeration approach is not only theoretically interesting, but also fast and memory-efficient in practice.

Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries

Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald

PODS, pp. 325-341, 2021.

Project page: Any-k

Selected among best of PODS 2021 (1/4)

Poses a new question and gives partial answers: When can we allow for direct access to a ranked list of answers to a database query without (and by being considerably faster than) materializing all answers?

We study the question of when we can answer a Conjunctive Query (CQ) with an ordering over the answers by constructing a structure for direct (random) access to the sorted list of answers, without actually materializing this list, so that the construction time is linear (or quasilinear) in the size of the database. In the absence of answer ordering, such a construction has been devised for the task of enumerating query answers of free-connex acyclic CQs, so that the access time is logarithmic. Moreover, it follows from past results that within the class of CQs without self-joins, being free-connex acyclic is necessary for the existence of such a construction (under conventional assumptions in fine-grained complexity).
In this work, we embark on the challenge of identifying the answer orderings that allow for ranked direct access with the above complexity guarantees. We begin with the class of lexicographic orderings and give a decidable characterization of the class of feasible such orderings for every CQ without self-joins. We then continue to the more general case of orderings by the sum of attribute scores. As it turns out, in this case ranked direct access is feasible only in trivial cases. Hence, to better understand the computational challenge at hand, we consider the more modest task of providing access to only one single answer (i.e., finding the answer at a given position). We indeed achieve a quasilinear-time algorithm for a subset of the class of full CQs without self-joins, by adopting a solution of Frederickson and Johnson to the classic problem of selection over sorted matrices. We further prove that none of the other queries in this class admit such an algorithm.

DomainNet: Homograph Detection for Data Lake Disambiguation

Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

EDBT, pp. 13-24, 2021.

DOI | OP | preprint | arXiv:2103.09940 | video (10min) | Code on Github | datasets | gs | bib

Project page: Table-as-query

EDBT 2021 best paper award (1/1) (announcement)

Proposes an approach to detect homographs (i.e. data values with multiple meanings) in table repositories. Our idea is simple and surprisingly effective: we apply ideas from community detection in networks to a network representation of tables.

Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values.
DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs has a precision and a recall of 38% versus 69% with our method on a synthetic benchmark. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake our top200 precision is 89%.

Stratisfimal Layout: A modular optimization model for laying out layered node-link network visualizations

Sara Di Bartolomeo, Mirek Riedewald, Wolfgang Gatterbauer, and Cody Dunne

IEEE VIS, Vol 28, Issue 1, pp. 324 - 334, 2021.

IEEE | OSF preprint (long version) | OSF supplements | gs | bib

Project pages: Stratisfimal Layout, QueryVis

Proposes a comprehensive approach to make complicated networks (such as those with layers and grouped nodes) easier to read. The idea is to formulate various readability criteria (notably crossing minimization and bendiness reduction) as a modular and customizable constraint optimization problem.

Node-link visualizations are a familiar and powerful tool for displaying the relationships in a network. The readability of these visualizations highly depends on the spatial layout used for the nodes. In this paper, we focus on computing layered layouts, in which nodes are aligned on a set of parallel axes to better expose hierarchical or sequential relationships. Heuristic-based layouts are widely used as they scale well to larger networks and usually create readable, albeit sub-optimal, visualizations. We instead use a layout optimization model that prioritizes optimality --- as compared to scalability --- because an optimal solution not only represents the best attainable result, but can also serve as a baseline to evaluate the effectiveness of layout heuristics. We take an important step towards powerful and flexible network visualization by proposing STRATIFIMAL LAYOUT, a modular integer-linear-programming formulation that can consider several important readability criteria simultaneously --- crossing reduction, edge bendiness, and nested and multi-layer groups. The layout can be adapted to diverse use cases through its modularity. Individual features can be enabled and customized depending on the application. We provide open-source and documented implementations of the layout, both for web-based and desktop visualizations. As a proof-of-concept, we apply it to the problem of visualizing complicated SQL queries, which have features that we believe cannot be addressed by existing layout optimization models. We also include a benchmark network generator and the results of an empirical evaluation to assess the performance trade-offs of our design choices.

Optimal Join Algorithms meet Top-k

Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald

SIGMOD tutorials, pp. 2659–2665, 2020.

Tutorial page: Optimal Join Algorithms meet Top-k, Project page: Any-k

Presents a unified view for studying optimal join algorithms and top-k query evaluation.

Top-k queries have been studied intensively in the database community and they are an important means to reduce query cost when only the “best” or “most interesting” results are needed instead of the full output. While some optimality results exist, e.g., the famous Threshold Algorithm, they hold only in a fairly limited model of computation that does not account for the cost incurred by large intermediate results and hence is not aligned with typical database-optimizer cost models. On the other hand, the idea of avoiding large intermediate results is arguably the main goal of recent work on optimal join algorithms, which uses the standard RAM model of computation to determine algorithm complexity. This research has created a lot of excitement due to its promise of reducing the time complexity of join queries with cycles, but it has mostly focused on full-output computation. We argue that the two areas can and should be studied from a unified point of view in order to achieve optimality in the common model of computation for a very general class of top-k-style join queries. This tutorial has two main objectives. First, we will explore and contrast the main assumptions, concepts, and algorithmic achievements of the two research areas. Second, we will cover recent, as well as some older, approaches that emerged at the intersection to support efficient ranked enumeration of join-query results. These are related to classic work on k-shortest path algorithms and more general optimization problems, some of which dates back to the 1950s. We demonstrate that this line of research warrants renewed attention in the challenging context of ranked enumeration for general join queries.

Optimal Algorithms for Ranked Enumeration of Answers to Full Conjunctive Queries

Nikolaos Tziavelis, Deepak Ajwani, Wolfgang Gatterbauer, Mirek Riedewald, Xiaofeng Yang

PVLDB, 13(9):1582–1597, 2020.

Project page: Any-k

First paper that generalizes dynamic programming to optimal ranked enumeration of answers. The theory applies to very general algebraic structures called "selective dioids" (semirings with an ordering property). This paper also applies the theory to the special case of ranked enumeration over full conjunctive queries. We call it "any-k" to emphasize the marriage of "anytime algorithms" with top-k. The full version also treats projections.

We study ranked enumeration of the results to a join query in order of decreasing importance, as imposed by an appropriate ranking function. Our main contribution is a framework for ranked enumeration over a class of Dynamic Programming problems that generalizes seemingly different problems that to date had been studied in isolation. To this end, we study and extend classic algorithms that find the k-shortest paths in a weighted graph. For full conjunctive queries, including cyclic ones, our approach is asymptotically optimal in terms of (1) time to return the top result, (2) delay between results, and (3) time until all results are returned in order. These optimality properties were derived for a complexity measure that has been widely used in the context of worst-case optimal join algorithms, but which abstracts away factors that only depend on query size and a polylogarithmic cost factor in input size. By performing a more detailed cost analysis, we are able to uncover a previously unknown tradeoff between two incomparable enumeration approaches: one has lower complexity when the number of returned results is small, the other when it is large. We demonstrate theoretically and empirically the superiority of our techniques to batch algorithms that produce the full result and then sort it. Interestingly, our technique is not only faster for returning the first few results, but even when all results are produced.

Factorized Graph Representations for Semi-Supervised Learning from Sparse Data

Krishna Kumar P., Paul Langton, Wolfgang Gatterbauer

SIGMOD, pp. 1383–1398, 2020.

Earlier working paper arXiv:1412.3100: Semi-supervised learning with Heterophily (SSLH)

Project page: Semi-supervised learning with heterophily (SSLH)

SIGMOD reproducibility award (1/3) (announcement)

Introduces the concept of "algebraic amplification" with a method for parameter learning that can amplify weak signals in extremely sparsely labeled data. The resulting end-to-end accuracy is quasi identical to using gold standard parameters, while time for learning is negligible compared to inference time.

Node classification is an important problem in graph data management. It is commonly solved by various label propagation methods that iteratively pass messages along edges, starting from a few labeled seed nodes. For graphs with arbitrary compatibilities between classes, these methods crucially depend on knowing the compatibility matrix, which must thus be provided by either domain experts or heuristics.
We instead suggest a principled and scalable method for directly estimating the compatibilities from a sparsely labeled graph. This method, which we call distant compatibility estimation, works even on extremely sparsely labeled graphs (e.g., 1 in 10,000 nodes is labeled) in a fraction of the time it later takes to label the remaining nodes. Our approach first creates multiple factorized graph representations (with size independent of the graph) and then performs estimation on these smaller graph sketches. We refer to algebraic amplification as the underlying idea of leveraging algebraic properties of an algorithm’s update equations to amplify sparse signals in data. We show that our estimator is by orders of magnitude faster than alternative approaches and that the end-to-end classification accuracy is comparable to using gold standard compatibilities. This makes it a cheap pre-processing step for any existing label propagation method and removes the current dependence on heuristics.

QueryVis: Logic-based Diagrams help Users Understand Complicated SQL Queries Faster

Aristotelis Leventidis, Jiahui Zhang, Cody Dunne, Wolfgang Gatterbauer, HV Jagadish, Mirek Riedewald

SIGMOD, pp. 2303–2318, 2020.

Earlier presentation: VLDB 2011 vision paper: video (19min) | PPTX | narrated PPTX slides (16MB) | slides

Project page: QueryVis, Online demo: Online QueryVis (from EDBT 2011)

SIGMOD reproducibility award (1/3) (announcement)

Shows that logical diagrams automatically created from SQL queries help users understand the queries faster and with fewer errors than SQL itself. Our ultimate goal is to allow users of SQL to reason about queries in terms of "diagrammatic SQL patterns" based on first-order logic.

Understanding the meaning of existing SQL queries is critical for maintaining and reusing them. Yet SQL can be hard to read, even for expert users, and even for the original creator of a query. We conjecture that it is possible to capture the logical intent of queries in automatically-generated visual diagrams that can help users understand the meaning of queries faster and more accurately than reading SQL alone.
We present initial steps in that direction with visual diagrams that are based on the first-order logic foundation of SQL and can capture a substantial fragment of SQL (in particular, deeply nested SQL queries). Our diagrams build upon a rich history of diagrammatic reasoning systems in logics and were designed using a large body of human-computer interaction best practices: they are minimal in that no visual element is superfluous; they are unambiguous in that no two queries with different semantics map to the same visualization; and they extend visual representations of relational schemata and conjunctive queries in a natural way. An experimental evaluation of our visual diagrams on Amazon Turk (n = 42) shows that with only a 2–3 minute static tutorial, participants could interpret queries meaningfully faster with our automatically-generated diagrams than when reading SQL alone. Moreover, we have some evidence that our visual diagrams result in participants making fewer errors than with SQL.
We believe that more regular exposure to diagrammatic representations of SQL can give rise to a pattern-based and thus more intuitive use and re-use of SQL.

Near-Optimal Distributed Band-Joins through Recursive Partitioning

Rundong Li, Wolfgang Gatterbauer, Mirek Riedewald

SIGMOD, pp. 2375–2390, 2020.

ACM | preprint | arXiv:2004.06101 (long version) | video (12min) | gs | bib

Project page: Distributed computing

Proposes a method for performing band-joins that significantly improves over previous work and leads to near-optimal data partitions in a distributed environment. Extensive experiments show the approach to be within 10% of the theoretic lower bound for both maximum load per worker and input duplication for a broad range of settings.

We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per worker and input duplication for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings that were too restricted (resulting in suboptimal join performance).
Our main insight is that recursive partitioning of the join-attribute space with the appropriate split scoring measure can achieve both low optimization cost and low join cost. It is the first approach that is not only effective for one-dimensional band-joins but also for joins on multiple attributes. Experiments indicate that our method is able to find partitionings that are within 10% of the lower bound for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work.

New Results for the Complexity of Resilience for Binary Conjunctive Queries with Self-Joins

Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, Alexandra Meliou

PODS, pp. 271–284, 2020.

Project page: Causality

Introduces "independent join paths" as a unifying hardness criterion for hardness of resilience for conjunctive queries with or without self-joins.

The resilience of a Boolean query is the minimum number of tuples that need to be deleted from the input tables in order to make the query false. A solution to this problem immediately translates into a solution for the more widely known problem of deletion propagation with source-side effects.
In this paper, we give several novel results on the hardness of the resilience problem for binary conjunctive queries with self-joins (i.e. conjunctive queries with relations of maximal arity 2) with one repeated relation. Unlike in the self-join free case, the concept of triad is not enough to fully characterize the complexity of resilience. We identify new structural properties, namely chains, confluences and permutations, which lead to various NP-hardness results. We also give novel involved reductions to network flow to show certain cases are in P. Overall, we give a dichotomy result for the restricted setting when one relation is repeated at most 2 times, and we cover many of the cases for 3.

Anytime Approximation in Probabilistic Databases via Scaled Dissociations

Maarten Van den Heuvel, Peter Ivanov, Wolfgang Gatterbauer, Floris Geerts, Martin Theobald

SIGMOD, pp. 1295-1312, 2019.

highlight ACM | preprint | slides | gs | bib

Project page: Propagation

Develops a general anytime approximation schema for approximate probabilistic inference. Uses the TODS 2015 dissociation-based bounds that are provably tighter than model-based bounds. Experiments show speed up over prior model-based approximations by up to 3 orders of magnitude.

Speeding up probabilistic inference remains a key challenge in probabilistic databases (PDBs) and the related area of statistical relational learning (SRL). Since computing probabilities for query answers is #P-hard, even for fairly simple conjunctive queries, both the PDB and SRL communities have proposed a number of approximation techniques over the years. The two prevalent techniques are either (i) MCMC style sampling or (ii) branch-and-bound (B&B) algorithms that iteratively improve model-based bounds using a combination of variable substitution and elimination.
We propose a new anytime B&B approximation scheme that encompasses all prior model-based approximation schemes proposed in the PDB and SRL literature. Our approach relies on the novel idea of “scaled dissociation” which can improve both the upper and lower bounds of existing model-based algorithms. We apply our approach to the well-studied problem of evaluating self-join-free conjunctive queries over tuple-independent PDBs, and show a consistent reduction in approximation error in our experiments on TPC-H, Yago3, and a synthetic benchmark setting.

Algorithms for Automatic Ranking of Participants and Tasks in an Anonymized Contest

Yang Jiao, R. Ravi, Wolfgang Gatterbauer

Theoretical Computer Science, Vol 789, 15, pp. 64-76, Elsevier, 2019.

Elsevier | preprint | arXiv:1612.04794 (conference version) | gs | bib

Special issue on selected papers from WALCOM 2017 (announcement)

Motivated by the problem of automatic grading of students, introduces several variants of the chain editing problem and determines their computational complexities

We introduce a new set of problems based on the Chain Editing problem. In our version of Chain Editing, we are given a set of participants and a set of tasks that every participant attempts. For each participant-task pair, we know whether the participant has succeeded at the task or not. We assume that participants vary in their ability to solve tasks, and that tasks vary in their difficulty to be solved. In an ideal world, stronger participants should succeed at a superset of tasks that weaker participants succeed at. Similarly, easier tasks should be completed successfully by a superset of participants who succeed at harder tasks. In reality, it can happen that a stronger participant fails at a task that a weaker participants succeeds at. Our goal is to find a perfect nesting of the participant-task relations by flipping a minimum number of participant-task relations, implying such a "nearest perfect ordering" to be the one that is closest to the truth of participant strengths and task difficulties. Many variants of the problem are known to be NP-hard.

Algebraic Approximations of the Probability of Boolean Functions

Wolfgang Gatterbauer

SUM (Invited Keynote), pp. 449-450, 2019.

Dissociation-based Oblivious Bounds for Weighted Model Counting

Li Chou, Wolfgang Gatterbauer, Vibhav Gogate

UAI, pp. 866-875, 2018.

highlight AUAI | preprint | gs | bib

Project page: Propagation

Generalizes the oblivious bounds from TODS 2014 from monotone to non-monotone Boolean functions.

We consider the weighted model counting task which includes important tasks in graphical models, such as computing the partition function and probability of evidence as special cases. We propose a novel partition-based bounding algorithm that exploits logical structure and gives rise to a set of inequalities from which upper (or lower) bounds can be derived efficiently. The bounds come with optimality guarantees under certain conditions and are oblivious in that they require only limited observations of the structure and parameters of the problem. We experimentally compare our bounds with the mini-bucket scheme (which is also oblivious) and show that our new bounds are often superior and never worse on a wide variety of benchmark networks.

Any-k: Anytime Top-k Tree Pattern Retrieval in Labeled Graphs

Xiaofeng Yang, Deepak Ajwani, Wolfgang Gatterbauer, Patrick K. Nicholson, Mirek Riedewald, Alessandra Sala

WWW, pp. 489-498, 2018.

ACM | preprint | arXiv:1802.06060 (long version) | gs | bib

Project page: Any-k

Gives a ranked enumeration algorithm for tree patterns (or acyclic graph queries). Introduces the idea of any-k: a marriage between anytime algorithms and top-k evaluation.

Many problems in areas as diverse as recommendation systems, social network analysis, semantic search, and distributed root cause analysis can be modeled as pattern search on labeled graphs (also called "heterogeneous information networks" or HINs). Given a large graph and a query pattern with node and edge label constraints, a fundamental challenge is to find the top-k matches according to a ranking function over edge and node weights. For users, it is difficult to select value k. We therefore propose the novel notion of an any-k ranking algorithm: for a given time budget, return as many of the top-ranked results as possible. Then, given additional time, produce the next lower-ranked results quickly as well. It can be stopped anytime, but may have to continue until all results are returned. This paper focuses on acyclic patterns over arbitrary labeled graphs. We are interested in practical algorithms that effectively exploit (1) properties of heterogeneous networks, in particular selective constraints on labels, and (2) that the users often explore only a fraction of the top-ranked results. Our solution, KARPET, carefully integrates aggressive pruning that leverages the acyclic nature of the query, and incremental guided search. It enables us to prove strong non-trivial time and space guarantees, which is generally considered very hard for this type of graph search problem. Through experimental studies we show that KARPET achieves running times in the order of milliseconds for tree patterns on large networks with millions of nodes and edges.

Any-k Algorithms for Exploratory Analysis with Conjunctive Queries

Xiaofeng Yang, Mirek Riedewald, Rundong Li, Wolfgang Gatterbauer

ExploreDB, pp. 1-3, 2018.

ACM | preprint | gs | bib

Project page: Any-k

Introduces the idea of ranked enumeration of answers to cyclic CQs based on a union over multiple tree decompositions.

We recently proposed the notion of any-k queries, together with the KARPET algorithm, for tree-pattern search in labeled graphs. Anyk extends top-k by not requiring a pre-specified value of k. Instead, an any-k algorithm returns as many of the top-ranked results as possible, for a given time budget. Given additional time, it produces the next-highest ranked results quickly as well. It can be stopped anytime, but may have to continue until all results are returned. In the latter case, any-k takes times similar to an algorithm that first produces all results and then sorts them. We summarize KARPET and argue that it can be extended to support any-k exploratory search for arbitrary conjunctive queries.

Learning From Query-Answers: A Scalable Approach to Belief Updating and Parameter Learning

Niccolò Meneghetti, Oliver Kennedy, Wolfgang Gatterbauer

TODS 43(4): 17:1-17:41, 2018.

ACM | preprint | gs | bib

Special Issue on best of SIGMOD 2017 papers (announcement)

Invited and extended version of SIGMOD 2017.

Tuple-independent and disjoint-independent probabilistic databases (TI- and DI-PDBs) represent uncertain data in a factorized form as a product of independent random variables that represent either tuples (TI-PDBs) or sets of tuples (DI-PDBs). When the user submits a query, the database derives the marginal probabilities of each output-tuple, exploiting the underlying assumptions of statistical independence. While query processing in TI- and DI-PDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results. Addressing this problem is the main focus of this article. We first introduce Beta Probabilistic Databases (B-PDBs), a generalization of TI-PDBs designed to support both (i) belief updating and (ii) parameter learning in a principled and scalable way. The key idea of B-PDBs is to treat each parameter as a latent, Beta-distributed random variable. We show how this simple expedient enables both belief updating and parameter learning in a principled way, without imposing any burden on regular query processing. Building on B-PDBs, we then introduce Dirichlet Probabilistic Databases (D-PDBs), a generalization of DI-PDBs with similar properties. We provide the following key contributions for both B- and D-PDBs: (i) We study the complexity of performing Bayesian belief updates and devise efficient algorithms for certain tractable classes of queries; (ii) we propose a soft-EM algorithm for computing maximum-likelihood estimates of the parameters; (iii) we present an algorithm for efficiently computing conditional probabilities, allowing us to efficiently implement B- and D-PDBs via a standard relational engine; and (iv) we support our conclusions with extensive experimental results.

A General Framework for Anytime Approximation in Probabilistic Databases

Maarten Van den Heuvel, Floris Geerts, Wolfgang Gatterbauer, Martin Theobald

StarAI (IJCAI workshop), 2018 (Short position paper).

arXiv:1806.10078 | preprint | gs

Anytime approximation algorithms that compute the probabilities of queries over probabilistic databases can be of great use to statistical learning tasks. Those approaches have been based so far on either (i) sampling or (ii) branch-and-bound with model-based bounds. We present here a more general branch-and-bound framework that extends the possible bounds by using "dissociation," which yields tighter bounds.

PISTIS: A Conflict of Interest Declaration and Detection System for Peer Review Management

Siyuan Wu, Leong Hou U, Sourav S. Bhowmick, Wolfgang Gatterbauer

SIGMOD, pp. 1713-1716, 2018 (System demonstration).

ACM | preprint | bib

Detecting conflicts of interest (COIs) is key for guaranteeing the fairness of a peer-review process. In many conference management systems, the COIs of authors and reviewers are self-declared, and the declaration process is time consuming and potentially incomplete. To address this problem, we demonstrate a novel interactive system called PISTIS that assists the declaration process in a semi-automatic manner. Apart from keyword search and simple filtering, our system provides an interactive graphical interface that helps users explore potential COIs based on the heterogenous data sources. To simply the process of declaration, we also recommend latent COIs using a supervised ranking model that can be iteratively refined from the data collected from past declarations. We believe that PISTIS can be useful as an assistant tool in many real world conference management systems

Web Data Extraction System

Robert Baumgartner, Wolfgang Gatterbauer, Georg Gottlob

Encyclopedia of Database Systems, Second Edition, Springer, 2018.

preprint | bib

Web Harvesting

Wolfgang Gatterbauer

Encyclopedia of Database Systems, Second Edition, Springer, 2018.

preprint | bib

Dissociation and propagation for approximate lifted inference with standard relational database management systems

Wolfgang Gatterbauer, Dan Suciu

VLDBJ 26(5): pp. 5-30, 2017.

highlight Springer | arXiv:1310.6257 (long version) | gs | bib

Project page: Propagation

Special issue on best papers of VLDB 2015 (announcement)

Extends VLDB 2015 with all proofs and shows how our approach generalizes the idea of propagation algorithms from graphs to propagation on hypergraphs. We also note that the techniques developed in this paper apply immediately to lifted inference from statistical relational models since lifted inference corresponds to PTIME plans in probabilistic databases.

This paper proposes an approach to uncertain query evaluation by which every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known PTIME self-join-free conjunctive queries: A query is in PTIME if and only if our algorithm returns one single plan. Furthermore, our approach is a generalization of a family of efficient ranking functions from graphs to hypergraphs. We also note that the techniques developed in this paper apply immediately to lifted inference from statistical relational models since lifted inference corresponds to PTIME plans in probabilistic databases.

Beta probabilistic databases: A scalable approach to belief updating and parameter learning

Niccolo Meneghetti, Oliver Kennedy, Wolfgang Gatterbauer

SIGMOD, pp. 573-586, 2017.

ACM | preprint | gs | bib

Project page: Propagation

Selected among best of SIGMOD 2017 (announcement)

Introduces Beta Probabilistic Databases (B-PDBs), a generalization of tuple-independent PDBs that support both (i) belief updating and (ii) parameter learning in a principled and scalable way.

Tuple-independent probabilistic databases (TI-PDBs) handle uncertainty by annotating each tuple with a probability parameter; when the user submits a query, the database derives the marginal probabilities of each output-tuple, assuming input-tuples are statistically independent. While query processing in TI-PDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results. Addressing this problem is the main focus of this paper. We introduce Beta Probabilistic Databases (B-PDBs), a generalization of TI-PDBs designed to support both (i) belief updating and (ii) parameter learning in a principled and scalable way. The key idea of B-PDBs is to treat each parameter as a latent, Beta-distributed random variable. We show how this simple expedient enables both belief updating and parameter learning in a principled way, without imposing any burden on regular query processing. We use this model to provide the following key contributions: (i) we show how to scalably compute the posterior densities of the parameters given new evidence; (ii) we study the complexity of performing Bayesian belief updates, devising efficient algorithms for tractable classes of queries; (iii) we propose a soft-EM algorithm for computing maximum-likelihood estimates of the parameters; (iv) we show how to embed the proposed algorithms into a standard relational engine; (v) we support our conclusions with extensive experimental results.

The linearization of belief propagation on pairwise Markov Random Fields

Wolfgang Gatterbauer

AAAI, pp. 3747-3753, 2017.

AAAI | arXiv:1502.04956 (long version) | gs | bib

Generalizes the linearization from VLDB 2015 to arbitrary pairwise MRFs.

Belief Propagation (BP) is a widely used approximation for exact probabilistic inference in graphical models, such as Markov Random Fields (MRFs). In graphs with cycles, however, no exact convergence guarantees for BP are known, in general. For the case when all edges in the MRF carry the same symmetric, doubly stochastic potential, recent works have proposed to approximate BP by linearizing the update equations around default values, which was shown to work well for the problem of node classification. The present paper generalizes all prior work and derives an approach that approximates loopy BP on any pairwise MRF with the problem of solving a linear equation system. This approach combines exact convergence guarantees and a fast matrix implementation with the ability to model heterogenous networks. Experiments on synthetic graphs with planted edge potentials show that the linearization has comparable labeling accuracy as BP for graphs with weak potentials, while speeding-up inference by orders of magnitude.

Conflict of interest declaration and detection system in heterogeneous networks

Siyuan Wu, Leong Hou U, Sourav S Bhowmick, Wolfgang Gatterbauer

CIKM, pp. 2383-2386, 2017 (Short paper).

ACM | preprint | gs | bib

Peer review is the most critical process in evaluating an article to be accepted for publication in an academic venue. When assigning a reviewer to evaluate an article, the assignment should be aware of conflicts of interest (COIs) such that the reviews are fair to everyone. However, existing conference management systems simply ask reviewers and authors to declare their explicit COIs through a plain search user interface guided by some simple conflict rules. We argue that such declaration system is not enough to discover all latent COI cases. In this work, we study a graphical declaration system that visualizes the relationships of authors and reviewers based on a heterogeneous co-authorship network. With the help of the declarations, we attempt to detect the latent COIs automatically based on the meta-paths of a heterogeneous network.

Algorithms for automatic ranking of participants and tasks in an anonymized contest

Yang Jiao, R. Ravi, Wolfgang Gatterbauer

WALCOM, pp 335-346, 2017.

Springer | arXiv:1612.04794 | gs | bib

Selected among best of WALCOM 2017 (announcement)

Motivated by the problem of automatic grading of students, introduces several variants of the chain editing problem and determines their computational complexities

We consider the weighted model counting task which includes important tasks in graphical models, such as computing the partition function and probability of evidence as special cases. We propose a novel partition-based bounding algorithm that exploits logical structure and gives rise to a set of inequalities from which upper (or lower) bounds can be derived efficiently. The bounds come with optimality guarantees under certain conditions and are oblivious in that they require only limited observations of the structure and parameters of the problem. We experimentally compare our bounds with the mini-bucket scheme (which is also oblivious) and show that our new bounds are often superior and never worse on a wide variety of benchmark networks.

The complexity of resilience and responsibility for self-join-free conjunctive queries

Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, Alexandra Meliou

PVLDB, 9(3):180-191, 2015.

highlight ACM | VLDB | arXiv:1507.00674 (long version) | bib

Project page: Causality

Several research thrusts in the area of data management have focused on understanding how changes in the data affect the output of a view or standing query. Example applications are explaining query results, propagating updates through views, and anonymizing datasets. These applications usually rely on understanding how interventions in a database impact the output of a query. An important aspect of this analysis is the problem of deleting a minimum number of tuples from the input tables to make a given Boolean query false. We refer to this problem as "the resilience of a query" and show its connections to the well-studied problems of deletion propagation and causal responsibility. In this paper, we study the complexity of resilience for self-join-free conjunctive queries, and also make several contributions to previous known results for the problems of deletion propagation with source side-effects and causal responsibility: (1) We define the notion of resilience and provide a complete dichotomy for the class of self-join-free conjunctive queries with arbitrary functional dependencies; this dichotomy also extends and generalizes previous tractability results on deletion propagation with source side-effects. (2) We formalize the connection between resilience and causal responsibility, and show that resilience has a larger class of tractable queries than responsibility. (3) We identify a mistake in a previous dichotomy for the problem of causal responsibility and offer a revised characterization based on new, simpler, and more intuitive notions. (4) Finally, we extend the dichotomy for causal responsibility in two ways: (a) we treat cases where the input tables contain functional dependencies, and (b) we compute responsibility for a set of tuples specified via wildcards.

Linearized and single-pass belief propagation

Wolfgang Gatterbauer, Stephan Günnemann, Danai Koutra, Christos Faloutsos

PVLDB, 8(5):581-592, 2015.

Project page: SSL-H

Introduces a linearization of Belief Propagation (called Linearized Belief Propagation = LinBP) that makes Loopy BP practical for node labeling: It comes with exact convergence guarantees, has a closed-form solution via intuitive matrix equations, and allows a two-line implementation in Python via sparse matrix multiplication.

How can we tell when accounts are fake or real in a social network? And how can we tell which accounts belong to liberal, conservative or centrist users? Often, we can answer such questions and label nodes in a network based on the labels of their neighbors and appropriate assumptions of homophily ("birds of a feather flock together") or heterophily ("opposites attract"). One of the most widely used methods for this kind of inference is Belief Propagation (BP) which iteratively propagates the information from a few nodes with explicit labels throughout a network until convergence. A well-known problem with BP, however, is that there are no known exact guarantees of convergence in graphs with loops. This paper introduces Linearized Belief Propagation (LinBP), a linearization of BP that allows a closed-form solution via intuitive matrix equations and, thus, comes with exact convergence guarantees. It handles homophily, heterophily, and more general cases that arise in multi-class settings. Plus, it allows a compact implementation in SQL. The paper also introduces Single-pass Belief Propagation (SBP), a localized (or "myopic") version of LinBP that propagates information across every edge at most once and for which the final class assignments depend only on the nearest labeled neighbors. In addition, SBP allows fast incremental updates in dynamic networks. Our runtime experiments show that LinBP and SBP are orders of magnitude faster than standard BP, while leading to almost identical node labels.

Approximate lifted inference with probabilistic databases

Wolfgang Gatterbauer, Dan Suciu

PVLDB, 8(5):629-640, 2015.

Project page: Propagation

Selected among best of VLDB 2015 (announcement)

This paper proposes a new approach for approximate evaluation of #P-hard queries with probabilistic databases. In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. Importantly, this algorithm is a strict generalization of all known results of PTIME self-join-free conjunctive queries: A query is safe if and only if our algorithm returns one single plan. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers. (Invited to the Special Issue of VLDB Journal on VLDB 2015)

This paper proposes a new approach for approximate evaluation of #P-hard queries with probabilistic databases. In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known results of PTIME self-join-free conjunctive queries: A query is safe if and only if our algorithm returns one single plan. We also apply three relational query optimization techniques to evaluate all minimal safe plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers.

Fault-Tolerant Entity Resolution with the Crowd

Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, Donald Kossmann

arXiv:1512.00537

arXiv

Oblivious bounds on the probability of Boolean functions

Wolfgang Gatterbauer, Dan Suciu

TODS 39(1): 5:1-5:34, 2014.

Prior superseded working paper "Optimal Upper and Lower Bounds for Boolean Expressions by Dissociation": arXiv:1105.2813

Project page: Propagation

This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time.

This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e. when the new probabilities are chosen independent of the probabilities of all other variables. Our motivation comes from the weighted model counting problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #P-hard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute, into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound on the probability of the original formula by choosing appropriate probabilities for the dissociated variables. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time.

Semi-supervised learning with heterophily

Wolfgang Gatterbauer

arXiv:1412.3100

Project page: SSL-H

We propose a novel linear semi-supervised learning formulation that is derived from a solid probabilistic framework: belief propagation. We show that our formulation generalizes a number of label propagation algorithms described in the literature by allowing them to propagate generalized assumptions about influences between classes of neighboring nodes. We call this formulation Semi-Supervised Learning with Heterophily (SSL-H). We also show how the modularization matrix can be learned from observed data with a simple convex optimization framework that is inspired by locally linear embedding. We call this approach Linear Heterophily Estimation (LHE). Experiments on synthetic data show that both approaches combined can learn heterophily of agraph with 1M nodes and 10M edges in under 1min.

Counterexamples to commonly held assumptions on unit commitment and market power assessment

Wolfgang Gatterbauer, Marija Ilic

Chapter 10 of: Engineering IT-Enabled Sustainable Electricity Services, Marija Ilic, Le Xie, Qizing Liu (Eds.), Springer, 2013.

chapter

Within the context of the ongoing deregulation of the electricity industry, we disprove in the first part the commonly stated assumption that, in theory and under the condition of perfect information, decentralized and centralized unit commitment would lead to the same power quantities traded and, hence, to the same optimal social welfare. We show that, even in the absence of any uncertainties, independent optimization of the individual performance objectives by the decentralized market participants can actually lead to lower efficiency than centralized minimization of total operating cost. This result concerns short-term supply optimization for a given demand, and does not consider long-term investment issues. In the second part, we take the position of an individual market participant in a deregulated electricity market. We investigate the task of optimally self-scheduling generators to maximize profits by using stochastic dynamic programming. With the help of actual price forecast data from the ISO New England electricity wholesale market, we demonstrate the improvements that result from modeling the forecast price errors assuming a Cauchy error distribution instead of the commonly used Normal distribution.

Querying provenance for ranking and recommending

Zachary Ives, Andreas Haeberlen, Tao Feng, Wolfgang Gatterbauer

TaPP 2012

paper

Default-all is dangerous!

Wolfgang Gatterbauer, Alexandra Meliou, Dan Suciu

TaPP 2011

highlight paper | paper (arXiv:1105.4395) | slides | slides | bib

Project pages: BeliefDB, Causality

Rules of thumb for information acquisition from large and redundant data

Wolfgang Gatterbauer

ECIR, pp. 479-490, 2011

Project page: Unique recall

This paper shows that the "default-all propagation" scheme for database annotations can have some problematic semantic consequences. We propose an alternative "minimum-propagation" provenance that fixes the issue and that comes with several desirable properties.

QueryViz: Helping users understand SQL queries and their patterns

Jonathan Danaparamita, Wolfgang Gatterbauer

EDBT 2011, pp. 558-561 (System demonstration)

highlight ACM | Open Proceedings | preprint | gs | bib

Project page: QueryVis

Proposes QueryViz, a light-weight add-on to existing databases. Copy your SQL query into the interactive interface at http://queryviz.com and look at the visualization of the query's relational pattern.

We present QueryViz, a novel visualization tool for SQL queries that reduces the time needed to read and understand existing queries. It targets two principal audiences: (i) users who often issue the same or similar queries and who need to quickly browse through a repository of existing queries; and (ii) novices that try to familiarize themselves with the logic behind alternative patterns of SQL queries. QueryViz uses as input only two strings: the database schema and the SQL query. It can thus serve as light-weight add-on to existing database systems and is also available via an online interface at http://queryviz.com. In this demonstration, we explain our visual alphabet, walk through the visualization algorithm, and let users experience the difference in understanding SQL queries from text or our graphical representation while browsing through repositories of well-known textbook SQL queries.

Tracing data errors with view-conditioned causality

Alexandra Meliou, Wolfgang Gatterbauer, Suman Nath, Dan Suciu

SIGMOD 2011, pp. 505-516

paper | paper (ACM version) | bib

Project page: Causality

This paper shows how causal reasoning can be used for post-factum data cleaning, where errors are detected after data has been transformed (e.g. by a query) and which need to be corrected in the original input data. We achieve this with a novel way for translating the problem into a SAT and a weighted MAX-SAT problem, and then using very effective existing tools.

Databases will visualize queries too

Wolfgang Gatterbauer

PVLDB, 4(12):1498-1501, 2011 (VLDB challenges and visions track)

Project page: QueryVis

Describes a new human-query interaction in which users re-use existing queries as templates to compose their own queries. This interaction is made possible with new automatic query visualization tools (such as QueryViz, or now QueryVis) which help users understand SQL patterns quickly.

Visual Query Languages study ways to help users compose queries with visual metaphors. Information Visualization studies automatic visualization techniques to help users understand and analyze data. Query Management focuses on ways to help users manage and re-use existing queries. We observe that there is a related research question across those three topics which has not received much attention, namely that of Query Visualization: How to visually represent a query to help users quickly understand its intent? Here we argue that the involved challenges are still markedly different from those of the other three, that a solution can considerably improve the usability of DBMSs, and that the topic is thus worthy of attention. We envision, that in a few years, there will be free, modular, and lightweight tools available that allow users to visualize and interpret their queries.

Managing structured collections of community data

Wolfgang Gatterbauer, Dan Suciu

CIDR 2011, pp. 207-210. (Outrageous ideas and vision track)

vision paper | slides | slides | bib

Project page: BeliefDB

Reverse data management

Alexandra Meliou, Wolfgang Gatterbauer, Dan Suciu

PVLDB, 4(12):1490-1493, 2011 (VLDB challenges and visions track)

ACM | PVLDB | bib

Project page: Causality

Bringing provenance to its full potential using causal reasoning

Alexandra Meliou, Wolfgang Gatterbauer, Dan Suciu

TaPP 2011

paper | bib

Project page: Causality

Session-based browsing for better query reuse

Nodira Khoussainova, Yongchul Kwon, Wei-Ting Liao, Magdalena Balazinska, Wolfgang Gatterbauer, Dan Suciu

SSDBM, pp. 583-585, 2011 (Poster)

paper | UW CSE TR 11-04-02 (full 10 page version)

Project page: CQMS

Data conflict resolution using trust mappings

Wolfgang Gatterbauer, Dan Suciu

SIGMOD 2010, pp. 219-230.

Project page: BeliefDB

The complexity of causality and responsibility for query answers and non-answers

Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, Dan Suciu

PVLDB, 4(1):34-45, 2010

highlight paper | arXiv:1009.2021 (full version) | bib

Project page: Causality

Dissociation and propagation for efficient query evaluation over probabilistic databases

Wolfgang Gatterbauer, Abhay K. Jha, Dan Suciu

MUD 2010, pp. 83-97

paper | bib

Project page: Propagation

Causality in databases

Alexandra Meliou, Wolfgang Gatterbauer, Joseph Halpern, Christoph Koch, Katherine F. Moore, Dan Suciu

IEEE Data Engineering Bulletin, Special issue on Data Provenance, 33(3):59-67, Sept. 2010

IEEE | paper | bib

Project page: Causality

Why so? or Why no? Functional causality for explaining query answers

Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, Dan Suciu

MUD 2010, pp. 3-17

paper | arXiv:0912.5340 (full 18 page version) | bib

Project page: Causality

Visual structure-based web page clustering and retrieval

Paul Bohunsky, Wolfgang Gatterbauer

WWW 2010, pp. 1067-1068. (Poster)

poster paper | poster paper (ACM version) | poster | bib

Former project page (not maintained anymore): VENTex

Example DIV table: TABLE vs DIV table

Believe it or not: Adding belief annotations to databases

Wolfgang Gatterbauer, Magda Balazinska, Nodira Khoussainova, Dan Suciu

PVLDB, 2(1):1-12, 2009

highlight paper | arXiv:0912.5241 (full 17 page version) | slides | slides | bib

Project page: BeliefDB

Integrating and ranking uncertain scientific data

Landon Detwiler, Wolfgang Gatterbauer, Brent Louie, Dan Suciu, Peter Tarczy-Hornoch

Project pages: BioRank/U2/Biomediator (not maintained anymore), Propagation

A Case for a collaborative query management system

Nodira Khoussainova, Magdalena Balazinska, Wolfgang Gatterbauer, YongChul Kwon, Dan Suciu

CIDR 2009 (Perspectives Track)

paper | paper (arXiv:0909.1778) | bib

Project web page: CQMS

Web Data Extraction System

Robert Baumgartner, Wolfgang Gatterbauer, Georg Gottlob

Encyclopedia of Database Systems, Part 23, pp. 3465-3471, Springer, 2009

paper | bib

Web Harvesting

Wolfgang Gatterbauer

Encyclopedia of Database Systems, Part 23, pp. 3472-3473, Springer, 2009

paper | bib

Towards Domain-Independent Information Extraction from Web Tables

Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krüpl, Bernhard Pollak

WWW 2007, pp. 71-80

highlight paper | paper (ACM version) | bib | Citations

Patent 8,719,291

Copy of former Project page: VENTex

Creating Permanent Test Collections of Web Pages for Information Extraction Research

Bernhard Pollak, Wolfgang Gatterbauer

SOFSEM 2007, Volume II, pp. 103-115

paper | slides | poster | bib

Project pages: WebPageDump, VENTex

WebPageDump is a Firefox extension which allows you to save local copies of pages from the Web. It sounds simple, but it's not. The standard "Save page as" function of web browsers fails with most dynamic web pages and this shortcoming was a serious problem for our research. Comes the birth of WebPageDump. We hope you find it useful too.

Table Extraction Using Spatial Reasoning on the CSS2 Visual Box Model

Wolfgang Gatterbauer, Paul Bohunsky

AAAI, pp. 1313-1318, 2006

highlight paper | paper (AAAI version) | slides | bib

Copy of former Project page: VENTrec

Estimating Required Recall for Successful Knowledge Acquisition from the Web

Wolfgang Gatterbauer

WWW, pp. 969-970, 2006 (Poster)

poster paper | poster | poster | bib

Project page: Unique recall

Using visual cues for extraction of tabular data from arbitrary HTML documents

Bernhard Krüpl, Marcus Herzog, Wolfgang Gatterbauer

WWW, pp. 1000-1001, 2005 (Poster)

ACM | paper | bib

Web information extraction using eupeptic data in Web tables

Wolfgang Gatterbauer, Bernhard Krüpl, Wolfgang Holzinger, Marcus Herzog

1st International Workshop on Representation and Analysis of Web Space (RAWS), pp. 41-48, 2005

paper | bib

Counterexamples to commonly held assumptions on unit commitment and market power assessment

Wolfgang Gatterbauer, Marija Ilic

34th Annual North American Power Symposium (NAPS), pp. 219-225, 2002

paper | slides | bib

Interdependencies of electricity market characteristics and bidding strategies of power producers

Wolfgang Gatterbauer

Master's thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, May 2002

thesis | bib

Within the context of the ongoing deregulation of the electricity industry, we disprove in the first part the commonly stated assumption that, in theory and under the condition of perfect information, decentralized and centralized unit commitment would lead to the same power quantities traded and, hence, to the same optimal social welfare. We show that, even in the absence of any uncertainties, independent optimization of the individual performance objectives by the decentralized market participants can actually lead to lower efficiency than centralized minimization of total operating cost. This result concerns short-term supply optimization for a given demand, and does not consider long-term investment issues.
In the second part, we take the position of an individual market participant in a deregulated electricity market. We investigate the task of optimally self-scheduling generators to maximize profits by using stochastic dynamic programming. With the help of actual price forecast data from the ISO New England electricity wholesale market, we demonstrate the improvements that result from modeling the forecast price errors assuming a Cauchy error distribution instead of the commonly used Normal distribution.
Finally, by taking statistic uncertainties and inter-temporal interdependencies into account, we show in the third part that a generator owner's optimum bid sequence for a centralized auction market is generally above marginal cost. In contrast to current literature, this is true even where absolutely no abuse of market power is involved. We conclude that marginal production costs cannot be used by themselves as baseline for the assessment of market power in electricity markets.

Der Graz Cycle für Industriekraftwerke gefeuert mit Brenngasen aus Kohle- und Schwerölvergasung

(Combining the Graz Cycle with coal and heavy oil gasification for industrial power stations)

Herbert Jericha, Armin Lukasser, Wolfgang Gatterbauer

VDI-Berichte Nr. 1566, Gas Turbines for Combined Cycles, pp. 177-185, Essen, Germany, September 2000

paper | bib

Project page: Graz Cycle

DBLP | ORCID | ArXiv | Google Scholar | ACM profile

Publications (Hide summaries) (Show abstracts)

2024

2023

2022

2021

2020

2019

2018

2017

2015

2014

2013

2012

2011

2010

2009

2007

2006

2005

2002 (Electrical Engineering)

2000 (Mechanical Engineering)

Other links