Any-k: Optimal ranked enumeration for Dynamic Programs
Dynamic programming is a very general algorithmic technique for solving an optimization problem by recursively breaking it down into simpler subproblems. But what if you want to enumerate all solutions in decreasing order of optimality in of just getting retrieving the optimal solution? We study the algebraic limits of distributing the ORDER BY operator over the JOIN operator and thereby generalize dynamic programming over certain algebraic structures called "selective dioids" (semirings with an ordering property) to optimal ranked enumeration of answers. The resulting framework is as general as dynamic programming itself and generalizes seemingly different problems that had been studied in isolation in the past. We call it "any-k" to emphasize the marriage of "anytime algorithms" with "top-k." As direct application, we apply our theory to the special case of ranked enumeration over conjunctive queries with equi-joins and theta-joins and show that our proof-of-concept implementations beat current database systems by orders of magnitude. |
Algebraic Amplification
How can we perform semi-supervised node classification super fast even we have only very few labeled data points? To answer this question, we have been investigating algebraic properties that allow algorithms that can compress the data before handing it over to the learning module. Our approach first creates multiple factorized graph representations (with size independent of the graph) and then performs estimation on these smaller graph summaries. We refer to "algebraic amplification" as the underlying idea of leveraging algebraic properties of an algorithm's update equations to amplify sparse signals in data. We show that our estimator is by orders of magnitude faster than alternative approaches and that the end-to-end classification accuracy is comparable to using gold standard compatibilities. This makes it a cheap pre-processing step for any existing label propagation method and removes the current dependence on heuristics. |
Factorized Knowledge Representation
Given propositional knowledge in explicit form. What computational resources do we need to find a representation that is logically equivalent but minimal in size? As concrete problem instance we are studying is finding the minimal-size factorization of the provenance of queries. Our problem and approach is a complete solution that naturally recovers exact solutions for all known PTIME cases, as well as identifying additional queries for which the problem can be solved in PTIME. We makes first steps towards a dichotomy of the data complexity of the problem. |
QueryVis: helping users understand and re-use existing SQL queries
Re-use of existing SQL queries is difficult since queries are complex structures and not easily understood. QueryViz is a method for translating SQL into a visual formalism that helps users intuitively understand their meaning. It is inspired by the First-Order logic representation of a query and combines succinctness features of both tuple and domain relational calculus, thereby providing a minimal yet expressive visual vocabulary. The project page has a link to the yet incomplete online demo. |
Dissociation and Propagation for Approximate Lifted Inference
Evaluating queries over probabilistic databases needs to calculate the reliability of a query and is #P-hard in general. We propose an alternative efficient approach for approximate lifted inference which (i) is always in PTIME for every query and data instance (and even expressible in relational algebra), (ii) always results in a unique and well-defined score in [0,1], (iii) is an upper bound to query reliability and both are identical for safe queries, and (iv) is inspired by widely deployed propagation approaches for ranking in networks, and generalizes those from graphs to hypergraphs. |
Bootstrapping Virtuous Active Learning cYcles (BOVALY)
We are developing a novel Quiz Item Management System (QIMS) that engages students throughout the semester in a sequence of learning activities. Each of these learning activities fosters critical thinking, and together they lead to knowledge artifacts in the form of quiz items. These artifacts go through cycles of creation, improvements, selections and deduplications and result in re-usable learning artifacts with known provenance and item response characteristics. Combined, the approach thus addresses 5 important challenges for technology-enhanced learning (e.g., MOOCs): 1. Continuous practice and assessment 2. Practice of "generative" skills, 3. "Auto-calibrated" peer evaluation, 4. Re-usable learning artifacts, and 5. Optimal use of instructors' time. |
Table-As-Query as Semi-Supervised Table Alignment
Given a table provided by a user and a large repository of tables of unknown accuracy and unknown relevance. How can we best complete the seed table with information from the repository? Our focus is both table discovery (finding a set of alignable tables), as well as the best way to integrate (or align) the new data with the query table. In this new paradigm called "table-as-query", the user does not need to know a priori on which attributes various tables in a repository are best aligned. |
Semi-supervised Learning with Heterophily (SSLH)
SSLH is a family of linear inference algorithms that generalize existing graph-based label propagation algorithms by allowing them to propagate generalized assumptions about ``attraction'' or ``compatibility'' between classes of neighboring nodes (in particular those that involve heterophily between nodes where ``opposites attract''). Importantly, this framework allows us to reduce the problem of estimating the relative compatibility between nodes from partially labeled graph to a simple optimization problem. The result is a very fast algorithm that -- despite its simplicity -- surprisingly effective algorithm. |
Causality in databases: helping users understand unexpected query results
When queries return unexpected results, users are interested in explanations for what they see. Recently, the database community has proposed various notions of lineage of query results, such as why or where provenance, and very recently, explanations for non-answers. We propose to unify and extend existing approaches for provenance and missing query result explanations ("positive and negative provenance") into one single framework, namely that of understanding causal relationships in databases. |
BeliefDB: managing conflicting beliefs inside a standard DBMS
Data management is becoming increasingly social. Questions arise about how to best model inconsistent and changing opinions in a community of users inside a DBMS. First, we propose an annotation semantics based on modal logic, which allows users to engage in a structured discussion. Second, we propose a principled solution to the automated conflict resolution problem. While based on the certain tuples of all stable models of a logic program, our algorithm is still in PTIME. |
VENTex (Visualized Element Nodes Table Extraction)
Web tables contain a vast amount of semantically explicit information, which makes them a worthwhile target for automatic information extraction and knowledge acquisition from the Web. However, extracting and interpreting web tables is difficult, because of the variety in which tables can be encoded in HTML and style sheets. Our approach VENTex can extract arbitrary web tables by focusing on table representations in the "Visual Web" instead of the "Syntactic Web" as used by previous approaches. |
Unique Recall: modeling the limits of knowledge acquisition from redundant data
Often in life, 20% of effort can achieve roughly 80% of the desired effects. Interestingly, this does not hold in the context of web information extraction. We develop an analytic model for information acquisition from redundant data (in the limit of infinitely large data), and use it to derive a new 40-20 rule (crawling 20% of the Web will help us learn less then 40% of its content). We further describe a new family of power law functions that remains invariant under sampling, and use its properties to give a second rule of thumb. |
CQMS (Collaborative Query Management System)
While modern database management systems (DBMSs) provide sophisticated tools for managing data, tools for managing queries over these data have remained relatively primitive. As analysts today share and explore increasingly large volumes of data, they need assistance for repeatedly issuing their queries. This project develops the essential techniques for a Collaborative Query Management System (CQMS) which provides new capabilities ranging from query browsing to automatic query recommendations. |
Economics of deregulated electricity markets
We revisit commonly accepted assumptions about the economics of deregulated electricity markets. First, we disprove that, in theory and under the condition of perfect information, decentralized and centralized unit commitment would lead to the same power quantities trade and the same optimal social welfare. Second, we show that a generator owner's optimum bid sequence for an auction market is generally above marginal cost, even where absolutely no abuse of market power is involved. |
Graz Cycle: a zero emission power cycle of highest efficiency
The Graz Cycle is a thermodynamical combustion cycle that allows to retain and capture CO2 emissions stemming from combustion processes. It burns fossil fuels with pure oxygen which enables the cost-effective separation of the combustion CO2 by condensation. The efforts for the oxygen supply in an air separation plant are partly compensated by cycle efficiencies far higher than 65%. The combined efficiency is equal in thermodynamic performance to any other proposal in the field of Carbon Capture and Storage (CCS). |