
Anyk: Anytime topk tree pattern retrieval in labeled graphs
Xiaofeng Yang, Deepak Ajwani, Wolfgang Gatterbauer, Patrick K. Nicholson, Mirek Riedewald, Alessandra Sala
WWW, pp. 489498, 2018.
[ACM],
[Full version],
[Full arXiv version],
[gs],
[bib]
Many problems in areas as diverse as recommendation systems, social network analysis, semantic search, and distributed root cause analysis can be modeled as pattern search on labeled graphs (also called "heterogeneous information networks" or HINs). Given a large graph and a query pattern with node and edge label constraints, a fundamental challenge is to find the topk matches according to a ranking function over edge and node weights. For users, it is difficult to select value k. We therefore propose the novel notion of an anyk ranking algorithm: for a given time budget, return as many of the topranked results as possible. Then, given additional time, produce the next lowerranked results quickly as well. It can be stopped anytime, but may have to continue until all results are returned. This paper focuses on acyclic patterns over arbitrary labeled graphs. We are interested in practical algorithms that effectively exploit (1) properties of heterogeneous networks, in particular selective constraints on labels, and (2) that the users often explore only a fraction of the topranked results. Our solution, KARPET, carefully integrates aggressive pruning that leverages the acyclic nature of the query, and incremental guided search. It enables us to prove strong nontrivial time and space guarantees, which is generally considered very hard for this type of graph search problem. Through experimental studies we show that KARPET achieves running times in the order of milliseconds for tree patterns on large networks with millions of nodes and edges.

Dissociationbased oblivious bounds for weighted model counting
Li Chou, Wolfgang Gatterbauer, Vibhav Gogate
UAI, 2018.
[UAI],
[preprint],
[gs],
[bib]
Generalizes the oblivious bounds from TODS 2014 from monotone to nonmonotone Boolean functions
We consider the weighted model counting task which includes important tasks in graphical models, such as computing the partition function and probability of evidence as special cases. We propose a novel partitionbased bounding algorithm that exploits logical structure and gives rise to a set of inequalities from which upper (or lower) bounds can be derived efficiently. The bounds come with optimality guarantees under certain conditions and are oblivious in that they require only limited observations of the structure and parameters of the problem. We experimentally compare our bounds with the minibucket scheme (which is also oblivious) and show that our new bounds are often superior and never worse on a wide variety of benchmark networks.

A General framework for anytime approximation in probabilistic databases.
Maarten Van den Heuvel, Floris Geerts, Wolfgang Gatterbauer, Martin Theobald
StarAI (IJCAI workshop), 2018 (Short position paper).
[arXiv],
[preprint],
[gs]
Anytime approximation algorithms that compute the probabilities of queries over probabilistic databases can be of great use to statistical learning tasks. Those approaches have been based so far on either (i) sampling or (ii) branchandbound with modelbased bounds. We present here a more general branchandbound framework that extends the possible bounds by using "dissociation," which yields tighter bounds.

PISTIS: A conflict of interest declaration and detection system for peer review management
Siyuan Wu, Leong Hou U, Sourav S. Bhowmick, Wolfgang Gatterbauer
SIGMOD, pp. 17131716, 2018 (System demonstration).
[ACM],
[preprint],
[bib]
Detecting conflicts of interest (COIs) is key for guaranteeing the fairness of a peerreview process. In many conference management systems, the COIs of authors and reviewers are selfdeclared, and the declaration process is time consuming and potentially incomplete. To address this problem, we demonstrate a novel interactive system called PISTIS that assists the declaration process in a semiautomatic manner. Apart from keyword search and simple filtering, our system provides an interactive graphical interface that helps users explore potential COIs based on the heterogenous data sources. To simply the process of declaration, we also recommend latent COIs using a supervised ranking model that can be iteratively refined from the data collected from past declarations. We believe that PISTIS can be useful as an assistant tool in many real world conference management systems

Algorithms for automatic ranking of participants and tasks in an anonymized contest
Yang Jiao, R. Ravi, Wolfgang Gatterbauer
Theoretical Computer Science, Elsevier, 2018 (Special Issue on WALCOM 2017).
[Elsevier],
[preprint],
[arXiv old]
Invited and extended version of WALCOM 2017
We introduce a new set of problems based on the Chain Editing problem. In our version of Chain Editing, we are given a set of participants and a set of tasks that every participant attempts. For each participanttask pair, we know whether the participant has succeeded at the task or not. We assume that participants vary in their ability to solve tasks, and that tasks vary in their difficulty to be solved. In an ideal world, stronger participants should succeed at a superset of tasks that weaker participants succeed at. Similarly, easier tasks should be completed successfully by a superset of participants who succeed at harder tasks. In reality, it can happen that a stronger participant fails at a task that a weaker participants succeeds at. Our goal is to find a perfect nesting of the participanttask relations by flipping a minimum number of participanttask relations, implying such a "nearest perfect ordering" to be the one that is closest to the truth of participant strengths and task difficulties. Many variants of the problem are known to be NPhard.

Learning From QueryAnswers: A Scalable Approach to Belief Updating and Parameter Learning
Niccolò Meneghetti, Oliver Kennedy, Wolfgang Gatterbauer
ACM TODS, vol. 43, no. 4, 2018. (Special Issue on SIGMOD 2017).
[preprint]
Invited and extended version of SIGMOD 2017

Beta probabilistic databases: A scalable approach to belief updating and parameter learning
Niccolo Meneghetti, Oliver Kennedy, Wolfgang Gatterbauer
SIGMOD, pp. 573586, 2017.
[ACM],
[preprint],
[gs],
[bib]
Invited to the Special Issue of TODS on "best of SIGMOD 2017"
Tupleindependent probabilistic databases (TIPDBs) handle uncertainty by annotating each tuple with a probability parameter; when the user submits a query, the database derives the marginal probabilities of each outputtuple, assuming inputtuples are statistically independent. While query processing in TIPDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results. Addressing this problem is the main focus of this paper. We introduce Beta Probabilistic Databases (BPDBs), a generalization of TIPDBs designed to support both (i) belief updating and (ii) parameter learning in a principled and scalable way. The key idea of BPDBs is to treat each parameter as a latent, Betadistributed random variable. We show how this simple expedient enables both belief updating and parameter learning in a principled way, without imposing any burden on regular query processing. We use this model to provide the following key contributions: (i) we show how to scalably compute the posterior densities of the parameters given new evidence; (ii) we study the complexity of performing Bayesian belief updates, devising efficient algorithms for tractable classes of queries; (iii) we propose a softEM algorithm for computing maximumlikelihood estimates of the parameters; (iv) we show how to embed the proposed algorithms into a standard relational engine; (v) we support our conclusions with extensive experimental results.

The linearization of belief propagation on pairwise Markov Random Fields
Wolfgang Gatterbauer
AAAI, pp. 37473753, 2017.
[paper],
[Full arXiv version],
[bib]
Generalizes the linearization from VLDB 2015 to arbitrary pairwise MRFs
Belief Propagation (BP) is a widely used approximation for exact probabilistic inference in graphical models, such as Markov Random Fields (MRFs). In graphs with cycles, however, no exact convergence guarantees for BP are known, in general. For the case when all edges in the MRF carry the same symmetric, doubly stochastic potential, recent works have proposed to approximate BP by linearizing the update equations around default values, which was shown to work well for the problem of node classification. The present paper generalizes all prior work and derives an approach that approximates loopy BP on any pairwise MRF with the problem of solving a linear equation system. This approach combines exact convergence guarantees and a fast matrix implementation with the ability to model heterogenous networks. Experiments on synthetic graphs with planted edge potentials show that the linearization has comparable labeling accuracy as BP for graphs with weak potentials, while speedingup inference by orders of magnitude.

Conflict of interest declaration and detection system in heterogeneous networks
Siyuan Wu, Leong Hou U, Sourav S Bhowmick, Wolfgang Gatterbauer.
CIKM, pp. 23832386, 2017 (Short paper).
[ACM],
[preprint],
[gs]
[bib]
Peer review is the most critical process in evaluating an article to be accepted for publication in an academic venue. When assigning a reviewer to evaluate an article, the assignment should be aware of conflicts of interest (COIs) such that the reviews are fair to everyone. However, existing conference management systems simply ask reviewers and authors to declare their explicit COIs through a plain search user interface guided by some simple conflict rules. We argue that such declaration system is not enough to discover all latent COI cases. In this work, we study a graphical declaration system that visualizes the relationships of authors and reviewers based on a heterogeneous coauthorship network. With the help of the declarations, we attempt to detect the latent COIs automatically based on the metapaths of a heterogeneous network.

Algorithms for automatic ranking of participants and tasks in an anonymized contest
Yang Jiao, R. Ravi, Wolfgang Gatterbauer
WALCOM, pp 335346, 2017.
[Springer],
[arXiv],
[bib]
Invited to the Special Issue of Elsevier TCS on WALCOM 2017
We consider the weighted model counting task which includes important tasks in graphical models, such as computing the partition function and probability of evidence as special cases. We propose a novel partitionbased bounding algorithm that exploits logical structure and gives rise to a set of inequalities from which upper (or lower) bounds can be derived efficiently. The bounds come with optimality guarantees under certain conditions and are oblivious in that they require only limited observations of the structure and parameters of the problem. We experimentally compare our bounds with the minibucket scheme (which is also oblivious) and show that our new bounds are often superior and never worse on a wide variety of benchmark networks.

Dissociation and propagation for approximate lifted inference with standard relational database management systems
Wolfgang Gatterbauer, Dan Suciu
VLDBJ (Special Issue of VLDB Journal on VLDB 2015). pp. 530, 2016.
selection
[Springer],
[Full arXiv version]
Project page: Propagation
Extends VLDB 2015 with all proofs and shows how the approch generalizes the idea of "graph propagation" to propagation on hypergraphs
This paper proposes an approach to uncertain query evaluation by which every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known PTIME selfjoinfree conjunctive queries: A query is in PTIME if and only if our algorithm returns one single plan. Furthermore, our approach is a generalization of a family of efficient ranking functions from graphs to hypergraphs.
We also note that the techniques developed in this paper apply immediately to lifted inference from statistical relational models since lifted inference corresponds to PTIME plans in probabilistic databases.

The complexity of resilience and responsibility for selfjoinfree conjunctive queries
Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, Alexandra Meliou
PVLDB 9(3):180191, 2015.
selection
[VLDB],
[full arXiv version],
[bib]
Project page: Causality
Several research thrusts in the area of data management have focused on understanding how changes in the data affect the output of a view or standing query. Example applications are explaining query results, propagating updates through views, and anonymizing datasets. These applications usually rely on understanding how interventions in a database impact the output of a query. An important aspect of this analysis is the problem of deleting a minimum number of tuples from the input tables to make a given Boolean query false. We refer to this problem as "the resilience of a query" and show its connections to the wellstudied problems of deletion propagation and causal responsibility. In this paper, we study the complexity of resilience for selfjoinfree conjunctive queries, and also make several contributions to previous known results for the problems of deletion propagation with source sideeffects and causal responsibility: (1) We define the notion of resilience and provide a complete dichotomy for the class of selfjoinfree conjunctive queries with arbitrary functional dependencies; this dichotomy also extends and generalizes previous tractability results on deletion propagation with source sideeffects. (2) We formalize the connection between resilience and causal responsibility, and show that resilience has a larger class of tractable queries than responsibility. (3) We identify a mistake in a previous dichotomy for the problem of causal responsibility and offer a revised characterization based on new, simpler, and more intuitive notions. (4) Finally, we extend the dichotomy for causal responsibility in two ways: (a) we treat cases where the input tables contain functional dependencies, and (b) we compute responsibility for a set of tuples specified via wildcards.

Linearized and singlepass belief propagation
Wolfgang Gatterbauer, Stephan Günnemann, Danai Koutra, Christos Faloutsos
PVLDB 8(5):581592, 2015.
selection
[VLDB],
[Full arXiv version],
[slides (2MB)],
[narrated slides (32MB)],
[video (21min)],
[Python code],
[SQL code],
[bib]
Project page: SSLH
How can we tell when accounts are fake or real in a social network? And how can we tell which accounts belong to liberal, conservative or centrist users? Often, we can answer such questions and label nodes in a network based on the labels of their neighbors and appropriate assumptions of homophily ("birds of a feather flock together") or heterophily ("opposites attract"). One of the most widely used methods for this kind of inference is Belief Propagation (BP) which iteratively propagates the information from a few nodes with explicit labels throughout a network until convergence. A wellknown problem with BP, however, is that there are no known exact guarantees of convergence in graphs with loops.
This paper introduces Linearized Belief Propagation (LinBP), a linearization of BP that allows a closedform solution via intuitive matrix equations and, thus, comes with exact convergence guarantees. It handles homophily, heterophily, and more general cases that arise in multiclass settings. Plus, it allows a compact implementation in SQL. The paper also introduces Singlepass Belief Propagation (SBP), a localized (or "myopic") version of LinBP that propagates information across every edge at most once and for which the final class assignments depend only on the nearest labeled neighbors. In addition, SBP allows fast incremental updates in dynamic networks. Our runtime experiments show that LinBP and SBP are orders of magnitude faster than standard BP, while leading to almost identical node labels.

Approximate lifted inference with probabilistic databases
Wolfgang Gatterbauer, Dan Suciu
PVLDB 8(5):629640, 2015.
selection
[VLDB],
[arXiv version],
[slides (4MB)],
[bib]
Project page: Propagation
Invited to the Special Issue of VLDB Journal on VLDB 2015
This paper proposes a new approach for approximate evaluation
of #Phard queries with probabilistic databases. In our approach,
every query is evaluated entirely in the database engine by evaluating
a fixed number of query plans, each providing an upper bound
on the true probability, then taking their minimum. We provide
an algorithm that takes into account important schema information
to enumerate only the minimal necessary plans among all possible
plans. Importantly, this algorithm is a strict generalization of
all known results of PTIME selfjoinfree conjunctive queries: A
query is safe if and only if our algorithm returns one single plan.
We also apply three relational query optimization techniques to
evaluate all minimal safe plans very fast. We give a detailed experimental
evaluation of our approach and, in the process, provide
a new way of thinking about the value of probabilistic methods over
nonprobabilistic methods for ranking query answers.

FaultTolerant Entity Resolution with the Crowd
Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, Donald Kossmann
arXiv:1512.00537.
[arXiv]