Publications

2020

Ranked Enumeration
Optimal Algorithms for Ranked Enumeration of Answers to Full Conjunctive Queries
Nikolaos Tziavelis, Deepak Ajwani, Wolfgang Gatterbauer, Mirek Riedewald, Xiaofeng Yang
Generalizes dynamic programming over certain algebraic structures called "selective dioids" to optimal ranked enumeration of answers. Applies the theory to the special case of ranked enumeration over conjunctive queries. We call it "any-k" to emphasize the marriage of "anytime algorithms" with top-k.
We study ranked enumeration of the results to a join query in order of decreasing importance, as imposed by an appropriate ranking function. Our main contribution is a framework for ranked enumeration over a class of Dynamic Programming problems that generalizes seemingly different problems that to date had been studied in isolation. To this end, we study and extend classic algorithms that find the k-shortest paths in a weighted graph. For full conjunctive queries, including cyclic ones, our approach is asymptotically optimal in terms of (1) time to return the top result, (2) delay between results, and (3) time until all results are returned in order. These optimality properties were derived for a complexity measure that has been widely used in the context of worst-case optimal join algorithms, but which abstracts away factors that only depend on query size and a polylogarithmic cost factor in input size. By performing a more detailed cost analysis, we are able to uncover a previously unknown tradeoff between two incomparable enumeration approaches: one has lower complexity when the number of returned results is small, the other when it is large. We demonstrate theoretically and empirically the superiority of our techniques to batch algorithms that produce the full result and then sort it. Interestingly, our technique is not only faster for returning the first few results, but even when all results are produced.
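As a toy illustration of the heap-based enumeration idea behind such algorithms (a sketch only, not the paper's asymptotically optimal any-k algorithm; all weights below are made up), here is ranked enumeration of pairs from two weight-sorted lists in order of increasing combined weight, i.e., the simplest two-relation join:

```python
import heapq

def ranked_pairs(ws1, ws2):
    """Enumerate index pairs (i, j) by increasing ws1[i] + ws2[j].

    Assumes both weight lists are sorted ascending. This is the classic
    heap-based 'k smallest pairs' pattern: a toy stand-in for ranked
    enumeration of a two-relation join, not the paper's any-k algorithm.
    """
    if not ws1 or not ws2:
        return
    heap = [(ws1[0] + ws2[0], 0, 0)]
    seen = {(0, 0)}
    while heap:
        w, i, j = heapq.heappop(heap)
        yield w, i, j
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(ws1) and nj < len(ws2) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (ws1[ni] + ws2[nj], ni, nj))

# Results stream out in rank order without sorting the full output:
for w, i, j in ranked_pairs([1, 2, 5], [1, 3]):
    print(w, i, j)
```

The point of the pattern is that the top-ranked results stream out without ever materializing and sorting the full result.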
Ranked Enumeration
Optimal Join Algorithms meet Top-k
Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald
Presents a unified point of view for studying optimal join algorithms and top-k query evaluation.
Top-k queries have been studied intensively in the database community and they are an important means to reduce query cost when only the “best” or “most interesting” results are needed instead of the full output. While some optimality results exist, e.g., the famous Threshold Algorithm, they hold only in a fairly limited model of computation that does not account for the cost incurred by large intermediate results and hence is not aligned with typical database-optimizer cost models. On the other hand, the idea of avoiding large intermediate results is arguably the main goal of recent work on optimal join algorithms, which uses the standard RAM model of computation to determine algorithm complexity. This research has created a lot of excitement due to its promise of reducing the time complexity of join queries with cycles, but it has mostly focused on full-output computation. We argue that the two areas can and should be studied from a unified point of view in order to achieve optimality in the common model of computation for a very general class of top-k-style join queries. This tutorial has two main objectives. First, we will explore and contrast the main assumptions, concepts, and algorithmic achievements of the two research areas. Second, we will cover recent, as well as some older, approaches that emerged at the intersection to support efficient ranked enumeration of join-query results. These are related to classic work on k-shortest path algorithms and more general optimization problems, some of which dates back to the 1950s. We demonstrate that this line of research warrants renewed attention in the challenging context of ranked enumeration for general join queries.
Algebraic Amplification
Factorized Graph Representations for Semi-Supervised Learning from Sparse Data
Krishna Kumar P., Paul Langton, Wolfgang Gatterbauer
Introduces the concept of "algebraic amplification" with a method for parameter learning that can amplify weak signals in extremely sparsely labeled data. The resulting end-to-end accuracy is quasi-identical to using gold standard parameters, while the time for learning is negligible compared to inference time.
Node classification is an important problem in graph data management. It is commonly solved by various label propagation methods that iteratively pass messages along edges, starting from a few labeled seed nodes. For graphs with arbitrary compatibilities between classes, these methods crucially depend on knowing the compatibility matrix, which must thus be provided by either domain experts or heuristics.
We instead suggest a principled and scalable method for directly estimating the compatibilities from a sparsely labeled graph. This method, which we call distant compatibility estimation, works even on extremely sparsely labeled graphs (e.g., 1 in 10,000 nodes is labeled) in a fraction of the time it later takes to label the remaining nodes. Our approach first creates multiple factorized graph representations (with size independent of the graph) and then performs estimation on these smaller graph sketches. We refer to algebraic amplification as the underlying idea of leveraging algebraic properties of an algorithm’s update equations to amplify sparse signals in data. We show that our estimator is by orders of magnitude faster than alternative approaches and that the end-to-end classification accuracy is comparable to using gold standard compatibilities. This makes it a cheap pre-processing step for any existing label propagation method and removes the current dependence on heuristics.
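To make the role of the compatibility matrix concrete, here is a deliberately naive one-hop estimator (a hypothetical stand-in, not the paper's distant compatibility estimation, which uses multi-hop statistics over small factorized graph sketches precisely because labeled endpoints rarely share an edge in sparsely labeled graphs):

```python
def naive_compatibility(edges, labels, k):
    """Estimate a k-by-k class-compatibility matrix from labeled endpoints.

    Deliberately naive one-hop estimator: count how often classes a and b sit
    on opposite ends of an edge, then normalize. The paper's distant
    compatibility estimation instead uses multi-hop statistics on factorized
    graph sketches, which still works when almost no edge has two labeled
    endpoints.
    """
    counts = [[0.0] * k for _ in range(k)]
    for u, v in edges:
        if u in labels and v in labels:
            a, b = labels[u], labels[v]
            counts[a][b] += 1
            counts[b][a] += 1  # undirected graph: symmetrize
    total = sum(sum(row) for row in counts)
    return [[c / total if total else 0.0 for c in row] for row in counts]

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = {0: 0, 1: 1, 2: 0, 3: 1}   # a perfectly heterophilic toy cycle
H = naive_compatibility(edges, labels, 2)
print(H)  # off-diagonal mass dominates, signaling heterophily
```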
QueryVis project page
QueryVis: Logic-based Diagrams help Users Understand Complicated SQL Queries Faster
Aristotelis Leventidis, Jiahui Zhang, Cody Dunne, Wolfgang Gatterbauer, HV Jagadish, Mirek Riedewald
Shows that logical diagrams automatically created from SQL queries help users understand the queries faster and with fewer errors than SQL itself. Our ultimate goal is to allow users of SQL to reason about queries in terms of "diagrammatic SQL patterns" based on first-order logic.
Understanding the meaning of existing SQL queries is critical for maintaining and reusing them. Yet SQL can be hard to read, even for expert users, and even for the original creator of a query. We conjecture that it is possible to capture the logical intent of queries in automatically-generated visual diagrams that can help users understand the meaning of queries faster and more accurately than reading SQL alone.
We present initial steps in that direction with visual diagrams that are based on the first-order logic foundation of SQL and can capture a substantial fragment of SQL (in particular, deeply nested SQL queries). Our diagrams build upon a rich history of diagrammatic reasoning systems in logics and were designed using a large body of human-computer interaction best practices: they are minimal in that no visual element is superfluous; they are unambiguous in that no two queries with different semantics map to the same visualization; and they extend visual representations of relational schemata and conjunctive queries in a natural way. An experimental evaluation of our visual diagrams on Amazon Turk (n = 42) shows that with only a 2–3 minute static tutorial, participants could interpret queries meaningfully faster with our automatically-generated diagrams than when reading SQL alone. Moreover, we have some evidence that our visual diagrams result in participants making fewer errors than with SQL.
We believe that more regular exposure to diagrammatic representations of SQL can give rise to a pattern-based and thus more intuitive use and re-use of SQL.
Optimal band-joins
Near-Optimal Distributed Band-Joins through Recursive Partitioning
Rundong Li, Wolfgang Gatterbauer, Mirek Riedewald
Proposes a method for performing band-joins that leads to near-optimal data partitions in a distributed environment. Extensive experiments show the approach to be within 10% of the theoretic lower bound for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work.
We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per worker and input duplication for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings that were too restricted (resulting in suboptimal join performance).
Our main insight is that recursive partitioning of the join-attribute space with the appropriate split scoring measure can achieve both low optimization cost and low join cost. It is the first approach that is not only effective for one-dimensional band-joins but also for joins on multiple attributes. Experiments indicate that our method is able to find partitionings that are within 10% of the lower bound for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work.
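A minimal sketch of the recursive-partitioning idea for a one-dimensional band-join (the split rule and stopping condition here are simplified placeholders for the paper's split scoring, which trades off maximum load per worker against input duplication):

```python
def partition_band_join(R, S, eps, max_load):
    """Recursively partition the join-attribute domain for the 1-D band-join
    |r - s| <= eps. Each region keeps the R-values inside it plus the
    S-values within eps of it (so S-values near a boundary are duplicated).
    The stopping rule below is a simplified placeholder for the paper's
    split scoring measure.
    """
    def split(lo, hi):
        r = [x for x in R if lo <= x < hi]
        s = [y for y in S if lo - eps <= y < hi + eps]
        if len(r) + len(s) <= max_load or hi - lo <= 2 * eps:
            return [((lo, hi), r, s)]
        mid = (lo + hi) / 2.0
        return split(lo, mid) + split(mid, hi)

    lo, hi = min(R + S), max(R + S) + 1
    return split(lo, hi)

regions = partition_band_join([1, 2, 3, 10, 11, 12], [1.5, 10.5], 1, 4)
for interval, r, s in regions:
    print(interval, r, s)
```

Every joining pair ends up co-located in some region, at the cost of duplicating border S-values.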
Causality
New Results for the Complexity of Resilience for Binary Conjunctive Queries with Self-Joins
Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, Alexandra Meliou
Introduces "independent join paths" as a unifying hardness criterion for hardness of resilience for conjunctive queries with or without self-joins.
The resilience of a Boolean query is the minimum number of tuples that need to be deleted from the input tables in order to make the query false. A solution to this problem immediately translates into a solution for the more widely known problem of deletion propagation with source-side effects.
In this paper, we give several novel results on the hardness of the resilience problem for binary conjunctive queries with self-joins (i.e., conjunctive queries with relations of maximal arity 2) with one repeated relation. Unlike in the self-join-free case, the concept of triad is not enough to fully characterize the complexity of resilience. We identify new structural properties, namely chains, confluences, and permutations, which lead to various NP-hardness results. We also give novel involved reductions to network flow to show that certain cases are in P. Overall, we give a dichotomy result for the restricted setting in which one relation is repeated at most 2 times, and we cover many of the cases in which it is repeated 3 times.
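To illustrate the definition of resilience on a tiny example (brute force over all deletion sets; the query below is self-join-free, so this only illustrates the problem statement, not the paper's self-join results):

```python
from itertools import combinations

def resilience(A, B, C):
    """Brute-force resilience of the Boolean query q :- A(x), B(x, y), C(y):
    the minimum number of tuples to delete across A, B, C so that no x, y
    satisfies all three atoms. Exponential search, for toy instances only.
    """
    def holds(a, b, c):
        return any(x in a and y in c for (x, y) in b)

    tuples = ([('A', t) for t in A] + [('B', t) for t in B]
              + [('C', t) for t in C])
    for k in range(len(tuples) + 1):
        for dels in combinations(tuples, k):
            a = {t for t in A if ('A', t) not in dels}
            b = {t for t in B if ('B', t) not in dels}
            c = {t for t in C if ('C', t) not in dels}
            if not holds(a, b, c):
                return k

print(resilience({1, 2}, {(1, 3), (2, 3)}, {3}))  # -> 1: deleting C(3) suffices
```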

2019

Anytime Approximation
Anytime Approximation in Probabilistic Databases via Scaled Dissociations
Maarten Van den Heuvel, Peter Ivanov, Wolfgang Gatterbauer, Floris Geerts, Martin Theobald
Develops a general anytime approximation schema for approximate probabilistic inference. Experiments show speedups over prior model-based approximations by up to a factor of 1000. Uses our TODS 2014 dissociation-based bounds that are provably tighter than model-based bounds.
Speeding up probabilistic inference remains a key challenge in probabilistic databases (PDBs) and the related area of statistical relational learning (SRL). Since computing probabilities for query answers is #P-hard, even for fairly simple conjunctive queries, both the PDB and SRL communities have proposed a number of approximation techniques over the years. The two prevalent techniques are either (i) MCMC style sampling or (ii) branch-and-bound (B&B) algorithms that iteratively improve model-based bounds using a combination of variable substitution and elimination.
We propose a new anytime B&B approximation scheme that encompasses all prior model-based approximation schemes proposed in the PDB and SRL literature. Our approach relies on the novel idea of “scaled dissociation” which can improve both the upper and lower bounds of existing model-based algorithms. We apply our approach to the well-studied problem of evaluating self-join-free conjunctive queries over tuple-independent PDBs, and show a consistent reduction in approximation error in our experiments on TPC-H, Yago3, and a synthetic benchmark setting.
Algorithms for Automatic Ranking of Participants and Tasks in an Anonymized Contest
Yang Jiao, R. Ravi, Wolfgang Gatterbauer
Invited and extended version of WALCOM 2017
We introduce a new set of problems based on the Chain Editing problem. In our version of Chain Editing, we are given a set of participants and a set of tasks that every participant attempts. For each participant-task pair, we know whether the participant has succeeded at the task or not. We assume that participants vary in their ability to solve tasks, and that tasks vary in their difficulty to be solved. In an ideal world, stronger participants should succeed at a superset of tasks that weaker participants succeed at. Similarly, easier tasks should be completed successfully by a superset of participants who succeed at harder tasks. In reality, it can happen that a stronger participant fails at a task that a weaker participant succeeds at. Our goal is to find a perfect nesting of the participant-task relations by flipping a minimum number of participant-task relations, taking such a "nearest perfect ordering" to be the one that is closest to the truth of participant strengths and task difficulties. Many variants of the problem are known to be NP-hard.
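A brute-force sketch of the Chain Editing objective on a tiny participant-by-task matrix (exponential search, so illustrative only; as the abstract notes, the problem is NP-hard in general):

```python
from itertools import combinations, product

def min_flips_to_chain(M):
    """Brute-force Chain Editing on a tiny participant-by-task 0/1 matrix M:
    the minimum number of entries to flip so that the participants' success
    sets are totally ordered by inclusion ('perfect nesting'). Exponential
    search, so only meant for tiny instances.
    """
    n, m = len(M), len(M[0])
    cells = list(product(range(n), range(m)))

    def is_chain(A):
        rows = sorted((frozenset(j for j in range(m) if A[i][j])
                       for i in range(n)), key=len)
        return all(rows[i] <= rows[i + 1] for i in range(n - 1))

    for k in range(n * m + 1):
        for flips in combinations(cells, k):
            A = [row[:] for row in M]
            for i, j in flips:
                A[i][j] ^= 1
            if is_chain(A):
                return k

print(min_flips_to_chain([[1, 1, 0],
                          [0, 1, 1]]))  # -> 1: one flip makes the rows nested
```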
Algebraic Approximations of the Probability of Boolean Functions
Wolfgang Gatterbauer
SUM (Invited Keynote), pp. 449-450, 2019

2018

Optimal Oblivious bounds
Dissociation-based Oblivious Bounds for Weighted Model Counting
Li Chou, Wolfgang Gatterbauer, Vibhav Gogate
Generalizes the oblivious bounds from TODS 2014 from monotone to non-monotone Boolean functions.
We consider the weighted model counting task which includes important tasks in graphical models, such as computing the partition function and probability of evidence as special cases. We propose a novel partition-based bounding algorithm that exploits logical structure and gives rise to a set of inequalities from which upper (or lower) bounds can be derived efficiently. The bounds come with optimality guarantees under certain conditions and are oblivious in that they require only limited observations of the structure and parameters of the problem. We experimentally compare our bounds with the mini-bucket scheme (which is also oblivious) and show that our new bounds are often superior and never worse on a wide variety of benchmark networks.
Ranked Enumeration
Any-k: Anytime Top-k Tree Pattern Retrieval in Labeled Graphs
Xiaofeng Yang, Deepak Ajwani, Wolfgang Gatterbauer, Patrick K. Nicholson, Mirek Riedewald, Alessandra Sala
Gives a ranked enumeration algorithm for tree patterns (or acyclic graph queries). Introduces the idea of any-k: a marriage between anytime algorithms and top-k evaluation.
Many problems in areas as diverse as recommendation systems, social network analysis, semantic search, and distributed root cause analysis can be modeled as pattern search on labeled graphs (also called "heterogeneous information networks" or HINs). Given a large graph and a query pattern with node and edge label constraints, a fundamental challenge is to find the top-k matches according to a ranking function over edge and node weights. For users, it is difficult to select value k. We therefore propose the novel notion of an any-k ranking algorithm: for a given time budget, return as many of the top-ranked results as possible. Then, given additional time, produce the next lower-ranked results quickly as well. It can be stopped anytime, but may have to continue until all results are returned. This paper focuses on acyclic patterns over arbitrary labeled graphs. We are interested in practical algorithms that effectively exploit (1) properties of heterogeneous networks, in particular selective constraints on labels, and (2) that the users often explore only a fraction of the top-ranked results. Our solution, KARPET, carefully integrates aggressive pruning that leverages the acyclic nature of the query, and incremental guided search. It enables us to prove strong non-trivial time and space guarantees, which is generally considered very hard for this type of graph search problem. Through experimental studies we show that KARPET achieves running times in the order of milliseconds for tree patterns on large networks with millions of nodes and edges.
Ranked Enumeration
Any-k Algorithms for Exploratory Analysis with Conjunctive Queries
Xiaofeng Yang, Mirek Riedewald, Rundong Li, Wolfgang Gatterbauer
Introduces the idea of ranked enumeration of answers to cyclic CQs based on a union over multiple tree decompositions.
Learning From Query-Answers: A Scalable Approach to Belief Updating and Parameter Learning
Niccolò Meneghetti, Oliver Kennedy, Wolfgang Gatterbauer
ACM TODS, vol. 43, no. 4, 17:1--17:41, 2018. (Special Issue on SIGMOD 2017).
Invited and extended version of SIGMOD 2017
A General Framework for Anytime Approximation in Probabilistic Databases
Maarten Van den Heuvel, Floris Geerts, Wolfgang Gatterbauer, Martin Theobald
Anytime approximation algorithms that compute the probabilities of queries over probabilistic databases can be of great use to statistical learning tasks. Those approaches have been based so far on either (i) sampling or (ii) branch-and-bound with model-based bounds. We present here a more general branch-and-bound framework that extends the possible bounds by using "dissociation," which yields tighter bounds.
PISTIS: A Conflict of Interest Declaration and Detection System for Peer Review Management
Siyuan Wu, Leong Hou U, Sourav S. Bhowmick, Wolfgang Gatterbauer
Detecting conflicts of interest (COIs) is key for guaranteeing the fairness of a peer-review process. In many conference management systems, the COIs of authors and reviewers are self-declared, and the declaration process is time-consuming and potentially incomplete. To address this problem, we demonstrate a novel interactive system called PISTIS that assists the declaration process in a semi-automatic manner. Apart from keyword search and simple filtering, our system provides an interactive graphical interface that helps users explore potential COIs based on heterogeneous data sources. To simplify the process of declaration, we also recommend latent COIs using a supervised ranking model that can be iteratively refined from the data collected from past declarations. We believe that PISTIS can be useful as an assistant tool in many real-world conference management systems.
Web Data Extraction System
Robert Baumgartner, Wolfgang Gatterbauer, Georg Gottlob

2017

Approximate lifted inference
Dissociation and propagation for approximate lifted inference with standard relational database management systems
Wolfgang Gatterbauer, Dan Suciu
Extends VLDB 2015 with all proofs and shows how the approach generalizes the idea of "graph propagation" to propagation on hypergraphs.
This paper proposes an approach to uncertain query evaluation by which every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known PTIME self-join-free conjunctive queries: A query is in PTIME if and only if our algorithm returns one single plan. Furthermore, our approach is a generalization of a family of efficient ranking functions from graphs to hypergraphs. We also note that the techniques developed in this paper apply immediately to lifted inference from statistical relational models since lifted inference corresponds to PTIME plans in probabilistic databases.
Beta probabilistic databases: A scalable approach to belief updating and parameter learning
Niccolo Meneghetti, Oliver Kennedy, Wolfgang Gatterbauer
Invited to the Special Issue of TODS on "best of SIGMOD 2017"
Tuple-independent probabilistic databases (TI-PDBs) handle uncertainty by annotating each tuple with a probability parameter; when the user submits a query, the database derives the marginal probabilities of each output-tuple, assuming input-tuples are statistically independent. While query processing in TI-PDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results. Addressing this problem is the main focus of this paper. We introduce Beta Probabilistic Databases (B-PDBs), a generalization of TI-PDBs designed to support both (i) belief updating and (ii) parameter learning in a principled and scalable way. The key idea of B-PDBs is to treat each parameter as a latent, Beta-distributed random variable. We show how this simple expedient enables both belief updating and parameter learning in a principled way, without imposing any burden on regular query processing. We use this model to provide the following key contributions: (i) we show how to scalably compute the posterior densities of the parameters given new evidence; (ii) we study the complexity of performing Bayesian belief updates, devising efficient algorithms for tractable classes of queries; (iii) we propose a soft-EM algorithm for computing maximum-likelihood estimates of the parameters; (iv) we show how to embed the proposed algorithms into a standard relational engine; (v) we support our conclusions with extensive experimental results.
Algebraic Amplification
The linearization of belief propagation on pairwise Markov Random Fields
Wolfgang Gatterbauer
Generalizes the linearization from VLDB 2015 to arbitrary pairwise MRFs.
Belief Propagation (BP) is a widely used approximation for exact probabilistic inference in graphical models, such as Markov Random Fields (MRFs). In graphs with cycles, however, no exact convergence guarantees for BP are known, in general. For the case when all edges in the MRF carry the same symmetric, doubly stochastic potential, recent works have proposed to approximate BP by linearizing the update equations around default values, which was shown to work well for the problem of node classification. The present paper generalizes all prior work and derives an approach that approximates loopy BP on any pairwise MRF with the problem of solving a linear equation system. This approach combines exact convergence guarantees and a fast matrix implementation with the ability to model heterogeneous networks. Experiments on synthetic graphs with planted edge potentials show that the linearization has comparable labeling accuracy as BP for graphs with weak potentials, while speeding up inference by orders of magnitude.
Conflict of interest declaration and detection system in heterogeneous networks
Siyuan Wu, Leong Hou U, Sourav S Bhowmick, Wolfgang Gatterbauer
Peer review is the most critical process in evaluating an article to be accepted for publication in an academic venue. When assigning a reviewer to evaluate an article, the assignment should be aware of conflicts of interest (COIs) such that the reviews are fair to everyone. However, existing conference management systems simply ask reviewers and authors to declare their explicit COIs through a plain search user interface guided by some simple conflict rules. We argue that such a declaration system is not enough to discover all latent COI cases. In this work, we study a graphical declaration system that visualizes the relationships of authors and reviewers based on a heterogeneous co-authorship network. With the help of the declarations, we attempt to detect the latent COIs automatically based on the meta-paths of a heterogeneous network.
Algorithms for automatic ranking of participants and tasks in an anonymized contest
Yang Jiao, R. Ravi, Wolfgang Gatterbauer
Invited to the Special Issue of Elsevier TCS on WALCOM 2017

2015

Causality
The complexity of resilience and responsibility for self-join-free conjunctive queries
Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, Alexandra Meliou
Several research thrusts in the area of data management have focused on understanding how changes in the data affect the output of a view or standing query. Example applications are explaining query results, propagating updates through views, and anonymizing datasets. These applications usually rely on understanding how interventions in a database impact the output of a query. An important aspect of this analysis is the problem of deleting a minimum number of tuples from the input tables to make a given Boolean query false. We refer to this problem as "the resilience of a query" and show its connections to the well-studied problems of deletion propagation and causal responsibility. In this paper, we study the complexity of resilience for self-join-free conjunctive queries, and also make several contributions to previously known results for the problems of deletion propagation with source side-effects and causal responsibility: (1) We define the notion of resilience and provide a complete dichotomy for the class of self-join-free conjunctive queries with arbitrary functional dependencies; this dichotomy also extends and generalizes previous tractability results on deletion propagation with source side-effects. (2) We formalize the connection between resilience and causal responsibility, and show that resilience has a larger class of tractable queries than responsibility. (3) We identify a mistake in a previous dichotomy for the problem of causal responsibility and offer a revised characterization based on new, simpler, and more intuitive notions. (4) Finally, we extend the dichotomy for causal responsibility in two ways: (a) we treat cases where the input tables contain functional dependencies, and (b) we compute responsibility for a set of tuples specified via wildcards.
Linearized Belief Propagation
Linearized and single-pass belief propagation
Wolfgang Gatterbauer, Stephan Günnemann, Danai Koutra, Christos Faloutsos
How can we tell when accounts are fake or real in a social network? And how can we tell which accounts belong to liberal, conservative or centrist users? Often, we can answer such questions and label nodes in a network based on the labels of their neighbors and appropriate assumptions of homophily ("birds of a feather flock together") or heterophily ("opposites attract"). One of the most widely used methods for this kind of inference is Belief Propagation (BP) which iteratively propagates the information from a few nodes with explicit labels throughout a network until convergence. A well-known problem with BP, however, is that there are no known exact guarantees of convergence in graphs with loops. This paper introduces Linearized Belief Propagation (LinBP), a linearization of BP that allows a closed-form solution via intuitive matrix equations and, thus, comes with exact convergence guarantees. It handles homophily, heterophily, and more general cases that arise in multi-class settings. Plus, it allows a compact implementation in SQL. The paper also introduces Single-pass Belief Propagation (SBP), a localized (or "myopic") version of LinBP that propagates information across every edge at most once and for which the final class assignments depend only on the nearest labeled neighbors. In addition, SBP allows fast incremental updates in dynamic networks. Our runtime experiments show that LinBP and SBP are orders of magnitude faster than standard BP, while leading to almost identical node labels.
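A schematic version of the linearized update (a sketch under simplifying assumptions: it omits LinBP's echo-cancellation term, uses pure-Python matrix helpers, and assumes the compatibility matrix is scaled small enough for convergence; all numbers are made up):

```python
def linbp(adj, E, H, iters=50):
    """Sketch of linearized belief propagation: iterate B <- E + A B H on
    centered beliefs, where E holds the explicit (seed) beliefs, adj is the
    adjacency matrix, and H is a centered compatibility matrix scaled small
    enough that the iteration converges. Omits LinBP's echo-cancellation
    term; toy dimensions only.
    """
    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]

    def matadd(X, Y):
        return [[X[i][j] + Y[i][j] for j in range(len(X[0]))]
                for i in range(len(X))]

    B = [row[:] for row in E]
    for _ in range(iters):
        B = matadd(E, matmul(matmul(adj, B), H))
    return B

# 3-node path graph; node 0 is seeded toward class 0; homophilic H.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
E = [[0.1, -0.1], [0.0, 0.0], [0.0, 0.0]]
H = [[0.1, -0.1], [-0.1, 0.1]]
B = linbp(adj, E, H)
print(B[1])  # the unlabeled neighbor picks up a weak preference for class 0
```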
Approximate lifted inference
Approximate lifted inference with probabilistic databases
Wolfgang Gatterbauer, Dan Suciu
Invited to the Special Issue of VLDB Journal on VLDB 2015
This paper proposes a new approach for approximate evaluation of #P-hard queries with probabilistic databases. In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known results of PTIME self-join-free conjunctive queries: A query is safe if and only if our algorithm returns one single plan. We also apply three relational query optimization techniques to evaluate all minimal safe plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers.
Fault-Tolerant Entity Resolution with the Crowd
Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, Donald Kossmann

2014

Optimal Oblivious bounds
Oblivious bounds on the probability of Boolean functions
Wolfgang Gatterbauer, Dan Suciu
This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e. when the new probabilities are chosen independent of the probabilities of all other variables. Our motivation comes from the weighted model counting problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #P-hard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute, into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound on the probability of the original formula by choosing appropriate probabilities for the dissociated variables. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time.
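A concrete numeric instance of disjunctive dissociation (toy probabilities; the choice of dissociated probabilities follows the paper's oblivious-bound characterization as I understand it, and is sanity-checked here by exhaustive enumeration rather than asserted):

```python
from itertools import product
from math import sqrt

def prob(formula, probs):
    """Exact probability that a Boolean function of independent variables is
    true, computed by exhaustive enumeration over all truth assignments."""
    names = sorted(probs)
    total = 0.0
    for bits in product([0, 1], repeat=len(names)):
        v = dict(zip(names, bits))
        if formula(v):
            w = 1.0
            for name in names:
                w *= probs[name] if v[name] else 1.0 - probs[name]
            total += w
    return total

p, q, r = 0.5, 0.6, 0.7  # made-up probabilities
# phi = (x and y) or (x and z): variable x occurs twice.
exact = prob(lambda v: (v['x'] and v['y']) or (v['x'] and v['z']),
             {'x': p, 'y': q, 'z': r})
# Dissociate x into independent copies x1, x2.
phi_d = lambda v: (v['x1'] and v['y']) or (v['x2'] and v['z'])
upper = prob(phi_d, {'x1': p, 'x2': p, 'y': q, 'z': r})  # copies keep p
pl = 1.0 - sqrt(1.0 - p)  # chosen so (1 - p1)(1 - p2) = 1 - p
lower = prob(phi_d, {'x1': pl, 'x2': pl, 'y': q, 'z': r})
print(lower, exact, upper)  # lower <= exact <= upper
```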
Algebraic Amplification
Semi-supervised learning with heterophily
Wolfgang Gatterbauer
We propose a novel linear semi-supervised learning formulation that is derived from a solid probabilistic framework: belief propagation. We show that our formulation generalizes a number of label propagation algorithms described in the literature by allowing them to propagate generalized assumptions about influences between classes of neighboring nodes. We call this formulation Semi-Supervised Learning with Heterophily (SSL-H). We also show how the modularization matrix can be learned from observed data with a simple convex optimization framework that is inspired by locally linear embedding. We call this approach Linear Heterophily Estimation (LHE). Experiments on synthetic data show that both approaches combined can learn the heterophily of a graph with 1M nodes and 10M edges in under 1 minute.

2013

Deregulated Electricity Markets
Counterexamples to commonly held assumptions on unit commitment and market power assessment
Wolfgang Gatterbauer, Marija Ilic
Chapter 10 of: Engineering IT-Enabled Sustainable Electricity Services, Marija Ilic, Le Xie, Qizing Liu (Eds.), Springer, 2013.
Within the context of the ongoing deregulation of the electricity industry, we disprove in the first part the commonly stated assumption that, in theory and under the condition of perfect information, decentralized and centralized unit commitment would lead to the same power quantities traded and, hence, to the same optimal social welfare. We show that, even in the absence of any uncertainties, independent optimization of the individual performance objectives by the decentralized market participants can actually lead to lower efficiency than centralized minimization of total operating cost. This result concerns short-term supply optimization for a given demand, and does not consider long-term investment issues. In the second part, we take the position of an individual market participant in a deregulated electricity market. We investigate the task of optimally self-scheduling generators to maximize profits by using stochastic dynamic programming. With the help of actual price forecast data from the ISO New England electricity wholesale market, we demonstrate the improvements that result from modeling the forecast price errors assuming a Cauchy error distribution instead of the commonly used Normal distribution.

2012

Querying provenance for ranking and recommending
Zachary Ives, Andreas Haeberlen, Tao Feng, Wolfgang Gatterbauer

2011

Default-all is dangerous!
Wolfgang Gatterbauer, Alexandra Meliou, Dan Suciu
This paper shows that the "default-all propagation" scheme for database annotations can have problematic semantic consequences. We propose an alternative "minimum-propagation" provenance that fixes the issue and comes with several desirable properties.
Knowledge Acquisition
Rules of thumb for information acquisition from large and redundant data
Wolfgang Gatterbauer
QueryVis project page
QueryViz: Helping users understand SQL queries and their patterns
Jonathan Danaparamita, Wolfgang Gatterbauer
Tracing data errors with view-conditioned causality
Alexandra Meliou, Wolfgang Gatterbauer, Suman Nath, Dan Suciu
This paper shows how causal reasoning can be used for post-factum data cleaning, where errors are detected after the data has been transformed (e.g., by a query) and must be corrected in the original input data. We achieve this with a novel translation of the problem into a SAT and a weighted MAX-SAT problem, which lets us apply highly effective existing solvers.
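To give a flavor of the reduction (a brute-force toy stand-in, not the paper's actual encoding), the sketch below treats a wrong Boolean query answer as a formula over tuple-presence variables and searches for a minimum-weight set of input tuples whose removal flips the answer; the query, tuple names, and weights are all hypothetical:

```python
from itertools import chain, combinations

# Toy "post-factum cleaning": a Boolean query Q over input tuples r1..r4
# returned True, but ground truth says it should be False. Which input
# tuples are erroneous? Encode Q as a Boolean function of tuple presence
# and search for a minimum-weight deletion set that flips the answer --
# exactly what a weighted MAX-SAT solver would do far more efficiently.

def Q(present):
    # Q is true if (r1 joins r2) or (r3 joins r4) exists.
    return (present["r1"] and present["r2"]) or (present["r3"] and present["r4"])

weights = {"r1": 3, "r2": 1, "r3": 2, "r4": 2}  # cost of blaming each tuple

def min_repair():
    tuples = list(weights)
    best, best_cost = None, float("inf")
    subsets = chain.from_iterable(
        combinations(tuples, k) for k in range(len(tuples) + 1))
    for subset in subsets:
        present = {t: t not in subset for t in tuples}
        cost = sum(weights[t] for t in subset)
        if not Q(present) and cost < best_cost:
            best, best_cost = set(subset), cost
    return sorted(best), best_cost

print(min_repair())  # (['r2', 'r3'], 3)
```

Falsifying Q requires deleting one tuple from each join pair; the cheapest such choice blames r2 (cost 1) plus one of r3/r4 (cost 2), total cost 3.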
QueryVis project page
Databases will visualize queries too
Wolfgang Gatterbauer
This paper describes a human-query interaction pattern in which users reuse existing queries as templates to compose their own queries. This interaction mode becomes possible only with new visualization tools that help users quickly understand SQL patterns and the intent of existing SQL queries. QueryViz (now QueryVis) is our new visualization approach.
Managing structured collections of community data
Wolfgang Gatterbauer, Dan Suciu
Causality
Reverse data management
Alexandra Meliou, Wolfgang Gatterbauer, Dan Suciu
Causality
Bringing provenance to its full potential using causal reasoning
Alexandra Meliou, Wolfgang Gatterbauer, Dan Suciu
Session-based browsing for better query reuse
Nodira Khoussainova, Yongchul Kwon, Wei-Ting Liao, Magdalena Balazinska, Wolfgang Gatterbauer, Dan Suciu

2010

Conflict resolution
Data conflict resolution using trust mappings
Wolfgang Gatterbauer, Dan Suciu
Causality
The complexity of causality and responsibility for query answers and non-answers
Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, Dan Suciu
Dissociation and propagation for efficient query evaluation over probabilistic databases
Wolfgang Gatterbauer, Abhay K. Jha, Dan Suciu
MUD 2010, pp. 83-97
Causality
Causality in databases
Alexandra Meliou, Wolfgang Gatterbauer, Joseph Halpern, Christoph Koch, Katherine F. Moore, Dan Suciu
Causality
Why so? or Why no? Functional causality for explaining query answers
Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, Dan Suciu
MUD 2010, pp. 3-17
Visual structure-based web page clustering and retrieval
Paul Bohunsky, Wolfgang Gatterbauer

2009

Conflict resolution
Believe it or not: Adding belief annotations to databases
Wolfgang Gatterbauer, Magda Balazinska, Nodira Khoussainova, Dan Suciu
Integrating and ranking uncertain scientific data
Landon Detwiler, Wolfgang Gatterbauer, Brent Louie, Dan Suciu, Peter Tarczy-Hornoch
Collaborative Query Management
A Case for a collaborative query management system
Nodira Khoussainova, Magdalena Balazinska, Wolfgang Gatterbauer, YongChul Kwon, Dan Suciu
Web Data Extraction System
Robert Baumgartner, Wolfgang Gatterbauer, Georg Gottlob

2007

Web tables
Towards Domain-Independent Information Extraction from Web Tables
Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krüpl, Bernhard Pollak
Creating Permanent Test Collections of Web Pages for Information Extraction Research
Bernhard Pollak, Wolfgang Gatterbauer
SOFSEM 2007, Volume II, pp. 103-115
WebPageDump is a Firefox extension which allows you to save local copies of pages from the Web. It sounds simple, but it's not: the standard "Save page as" function of web browsers fails with most dynamic web pages, and this shortcoming was a serious problem for our research. Hence the birth of WebPageDump. We hope you find it useful too.

2006

Web tables
Table Extraction Using Spatial Reasoning on the CSS2 Visual Box Model
Wolfgang Gatterbauer, Paul Bohunsky
Knowledge Acquisition
Estimating Required Recall for Successful Knowledge Acquisition from the Web
Wolfgang Gatterbauer

2005

Using visual cues for extraction of tabular data from arbitrary HTML documents
Bernhard Krüpl, Marcus Herzog, Wolfgang Gatterbauer
Web information extraction using eupeptic data in Web tables
Wolfgang Gatterbauer, Bernhard Krüpl, Wolfgang Holzinger, Marcus Herzog

2002 (Electrical Engineering)

Deregulated Electricity Markets
Counterexamples to commonly held assumptions on unit commitment and market power assessment
Wolfgang Gatterbauer, Marija Ilic
Deregulated Electricity Markets
Interdependencies of electricity market characteristics and bidding strategies of power producers
Wolfgang Gatterbauer
Master's thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, May 2002
Within the context of the ongoing deregulation of the electricity industry, we disprove in the first part the commonly stated assumption that, in theory and under the condition of perfect information, decentralized and centralized unit commitment would lead to the same power quantities traded and, hence, to the same optimal social welfare. We show that, even in the absence of any uncertainties, independent optimization of the individual performance objectives by the decentralized market participants can actually lead to lower efficiency than centralized minimization of total operating cost. This result concerns short-term supply optimization for a given demand, and does not consider long-term investment issues.
In the second part, we take the position of an individual market participant in a deregulated electricity market. We investigate the task of optimally self-scheduling generators to maximize profits by using stochastic dynamic programming. With the help of actual price forecast data from the ISO New England electricity wholesale market, we demonstrate the improvements that result from modeling the forecast price errors assuming a Cauchy error distribution instead of the commonly used Normal distribution.
Finally, by taking statistic uncertainties and inter-temporal interdependencies into account, we show in the third part that a generator owner's optimum bid sequence for a centralized auction market is generally above marginal cost. In contrast to current literature, this is true even where absolutely no abuse of market power is involved. We conclude that marginal production costs cannot be used by themselves as baseline for the assessment of market power in electricity markets.
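Why the error distribution matters can be seen from tail probabilities alone. The snippet below compares, for standardized unit-scale distributions (an illustration, not the thesis's fitted parameters), how much more probable large forecast errors are under a Cauchy model than under a Normal one:

```python
import math

# Tail probability of a standardized forecast error exceeding k scale
# units, under a Normal vs. a Cauchy error model (stdlib only).

def normal_tail(k):
    """P(|Z| > k) for Z ~ Normal(0, 1)."""
    return math.erfc(k / math.sqrt(2))

def cauchy_tail(k):
    """P(|X| > k) for X ~ Cauchy(0, 1), via its CDF 1/2 + atan(x)/pi."""
    return 1.0 - 2.0 * math.atan(k) / math.pi

for k in (1, 2, 3):
    print(f"k={k}: normal={normal_tail(k):.4f}  cauchy={cauchy_tail(k):.4f}")
```

At three scale units the Normal model puts roughly 0.3% of its mass in the tails while the Cauchy model puts over 20% there, so a self-scheduling policy optimized under Normal errors badly underestimates the frequency of extreme price deviations.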

2000 (Mechanical Engineering)

Der Graz Cycle für Industriekraftwerke gefeuert mit Brenngasen aus Kohle- und Schwerölvergasung
(Combining the Graz Cycle with coal and heavy oil gasification for industrial power stations)
Herbert Jericha, Armin Lukasser, Wolfgang Gatterbauer
VDI-Berichte Nr. 1566, Gas Turbines for Combined Cycles, pp. 177-185, Essen, Germany, September 2000

Other links