Provenance, Datalog, and The One Semiring to Rule Them All

April 29, 2021

  • Why is this tuple in my query result?
  • Why is this tuple not in my query result?
  • Which datasets were used to create this value?
  • How does this input affect my query output?

Provenance

How does the input data relate to a query output?

Types of Provenance

  • Why Provenance (Lineage): What's the smallest fragment of my input needed to produce some row?
  • Why-Not Provenance: What's the least I can add to my input to get a desired row?
  • How Provenance: An execution trace of the result. How were the tuples combined?
  • Where Provenance: Which cell(s) was a given output value taken from?
  • Taint: Was the output affected by any "tainted" input cell/row?

R   A  B
1   1  2
2   1  3
3   2  3
4   2  4

S   B  C
1   2  5
2   2  6
3   3  6

Why is $\left<1\right>$ in $\pi_A (R \bowtie S)$?

R   A  B
1   1  2

S   B  C
1   2  5

... but that's not the only reason

$\left\{ R_1, S_1 \right\}$, $\left\{ R_1, S_2 \right\}$, $\left\{ R_2, S_3 \right\}$

Witness: Any subset of the original database that still produces the output tuple.
(Generally we only want 'minimal' witnesses: ones from which no tuple can be dropped.)
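
As a rough illustration of the definition, the sketch below enumerates subsets of the example database by brute force and keeps the minimal ones that still produce $\left<1\right>$. The tuple IDs `R1`..`R4` and `S1`..`S3` and the helper names are assumptions made for this example, and the exponential search is only meant for tiny inputs.

```python
from itertools import combinations

# Tables from the running example, keyed by (hypothetical) tuple IDs.
R = {"R1": (1, 2), "R2": (1, 3), "R3": (2, 3), "R4": (2, 4)}  # (A, B)
S = {"S1": (2, 5), "S2": (2, 6), "S3": (3, 6)}                # (B, C)

def query(r_rows, s_rows):
    """pi_A(R join S) over the given rows, returned as a set of A values."""
    return {a for (a, b) in r_rows for (b2, c) in s_rows if b == b2}

def minimal_witnesses(target):
    """Try subsets from smallest to largest; skip any that contain a known witness."""
    ids = list(R) + list(S)
    found = []
    for size in range(len(ids) + 1):
        for subset in combinations(ids, size):
            chosen = set(subset)
            if any(w <= chosen for w in found):
                continue  # not minimal: a smaller witness is already inside
            r_rows = [R[i] for i in chosen if i in R]
            s_rows = [S[i] for i in chosen if i in S]
            if target in query(r_rows, s_rows):
                found.append(chosen)
    return found

witnesses = minimal_witnesses(1)
print(sorted(sorted(w) for w in witnesses))
# [['R1', 'S1'], ['R1', 'S2'], ['R2', 'S3']]
```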

How is $\left<1\right>$ derived in $\pi_A (R \bowtie S)$?

$(R_1 \bowtie S_1) \oplus (R_1 \bowtie S_2) \oplus (R_2 \bowtie S_3)$

Outline

  • Datalog
  • Reasoning about provenance
  • Implementing provenance queries
  • Cool things to do with provenance

Datalog

[head] :- [body]

$$Q(A) :-~~ R(A, B), S(B, C)$$

like SELECT DISTINCT A FROM R NATURAL JOIN S

  • A: Head Variable (appears in the head and body)
  • B, C: Existential Variables (appear only in the body)

Stop thinking about relations as collections of records, and instead think of them as collections of facts

R   A  B
1   1  2
2   1  3
3   2  3
4   2  4

The fact $R(1, 2)$ is true.

The fact $R(2, 1)$ is false (or unknown).

A table contains all facts that are provably true.

$$Q(A) :-~~ R(A, B), S(B, C)$$

For any $A$, the fact $Q(A)$ is true if...
  • there is some $B$ and $C$ for which...
  • the fact $R(A, B)$ is true, and...
  • the fact $S(B, C)$ is true.

$\forall A : \big( \exists B, C : R(A, B) \wedge S(B, C) \big) \rightarrow Q(A)$
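
A minimal sketch of that reading, assuming the facts of $R$ and $S$ from the running example are stored as Python sets of tuples: the head variable $A$ is collected, while $B$ and $C$ only need to exist.

```python
# Facts from the running example: R(A, B) and S(B, C).
R = {(1, 2), (1, 3), (2, 3), (2, 4)}
S = {(2, 5), (2, 6), (3, 6)}

def q():
    """Q(A) :- R(A, B), S(B, C). A is returned; B and C are existential."""
    return {a for (a, b) in R for (b2, c) in S if b == b2}

print(sorted(q()))  # [1, 2]
```

The result is a set, so $Q(1)$ is derived once even though several choices of $B$ and $C$ satisfy the body.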

$$Q(A) :-~~ R(A, B), S(B, C)$$ $$Q(A) :-~~ R(A, B), R(B, C)$$

Treat multiple rules as a disjunction.
($Q(A)$ is true if any rule is satisfied)

As powerful as Set-RA

Projection
  RA: $Q := \pi_A(R)$
  Datalog: $Q(A) :-~~ R(A, \ldots)$
Union
  RA: $Q := R \cup S$
  Datalog: $Q(\ldots) :-~~ R(\ldots)$
           $Q(\ldots) :-~~ S(\ldots)$
Join
  RA: $Q := R \bowtie S$
  Datalog: $Q(\ldots) :-~~ R(\ldots), S(\ldots)$
Selection (Equality)
  RA: $Q := \sigma_{R.A = R.B}(R)$
  Datalog: $Q(A) :-~~ R(A, A)$
Selection (Equality')
  RA: $Q := \sigma_{R.A = 1}(R)$
  Datalog: $Q(B) :-~~ R(1, B)$
Selection (Other)
  RA: $Q := \sigma_{A > B}(R)$
  Datalog: $Q(A,B) :-~~ R(A, B), [[ A > B ]]$

$[[ A > B ]]$
A   B
1   0
2   0
3   0
...
2   1
...

Relations are Sets of Facts. We can have a relation consisting of all pairs $A, B$ where $A$ is bigger.

  • A Finite Relation: declares a finite number of true facts
  • An Infinite Relation: declares an infinite number of true facts

Safety Property: Every variable must appear in at least one finite relation in a rule body.

Recursion

Recursive datalog: The body can reference the head atom

$$Q(A, B) :-~~ R(A, B)$$ $$Q(A, C) :-~~ Q(A, B), R(B, C)$$

(Transitive closure of $R$, ~Dijkstra's algorithm)
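
A minimal sketch of how such a recursive program can be evaluated, assuming naive bottom-up iteration to a fixpoint (one of several possible strategies; the function name is made up for this example). Each pass applies the second rule to everything derived so far and stops when no new facts appear.

```python
R = {(1, 2), (1, 3), (2, 3), (2, 4)}

def transitive_closure(edges):
    q = set(edges)  # Q(A, B) :- R(A, B)
    while True:
        # Q(A, C) :- Q(A, B), R(B, C)
        derived = {(a, c) for (a, b) in q for (b2, c) in edges if b == b2}
        if derived <= q:
            return q  # fixpoint: the rules produce nothing new
        q |= derived

print(sorted(transitive_closure(R)))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4)]
```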

Datalog, Top-Down

$$Q(A, C) :-~~ R(A, B), S(B, C)$$

... is like a very large number of queries with no head variables

$$Q_{1, 1}() :-~~ R(1, B), S(B, 1)$$ $$Q_{1, 2}() :-~~ R(1, B), S(B, 2)$$ $$Q_{1, 3}() :-~~ R(1, B), S(B, 3)$$

...

The fact $Q(1, 1)$ is true if $\exists B : R(1, B) \wedge S(B, 1)$

Think of the relation as a function from potential facts to their truthiness.

R   A  B
1   1  2   → T
2   1  3   → T
3   2  3   → T
4   2  4   → T
5   1  1   → F
6   ...    → F

Every row not explicitly listed is mapped to False

$Q(A) :-~~ R(A, B), S(B, C)$

$Q(1) :-~~ R(1, B), S(B, C)$

$Q(1) \equiv R(1, 1) \wedge S(1, 1)$

$~~~~~~~~ \vee ~~ R(1, 2) \wedge S(2, 1)$

$~~~~~~~~ \vee ~~ R(1, 3) \wedge S(3, 1)$

...

$~~~~~~~~ \vee ~~ R(1, 1) \wedge S(1, 2)$

$~~~~~~~~ \vee ~~ R(1, 2) \wedge S(2, 2)$

$~~~~~~~~ \vee ~~ R(1, 3) \wedge S(3, 2)$

...

R   A  B
1   1  2
2   1  3
3   2  3
4   2  4

S   B  C
1   2  1
2   2  2
3   3  3

$Q(1) \equiv R(1, 1) \wedge S(1, 1)$

$~~~~~~~~ \vee ~~ ~~~~~T~~~~ \wedge S(2, 1)$

$~~~~~~~~ \vee ~~ ~~~~~T~~~~ \wedge S(3, 1)$

...

$~~~~~~~~ \vee ~~ R(1, 1) \wedge S(1, 2)$

$~~~~~~~~ \vee ~~ ~~~~~T~~~~ \wedge S(2, 2)$

$~~~~~~~~ \vee ~~ ~~~~~T~~~~ \wedge S(3, 2)$

...

$Q(1) \equiv ~~~~~F~~~~ \wedge S(1, 1)$

$~~~~~~~~ \vee ~~ ~~~~~T~~~~ \wedge S(2, 1)$

$~~~~~~~~ \vee ~~ ~~~~~T~~~~ \wedge S(3, 1)$

...

$~~~~~~~~ \vee ~~ ~~~~~F~~~~ \wedge S(1, 2)$

$~~~~~~~~ \vee ~~ ~~~~~T~~~~ \wedge S(2, 2)$

$~~~~~~~~ \vee ~~ ~~~~~T~~~~ \wedge S(3, 2)$

...

$Q(1) \equiv ~~~~~F~~~~ \wedge ~~~~F~~~~~$

$~~~~~~~~ \vee ~~ ~~~~~T~~~~ \wedge ~~~~T~~~~~$

$~~~~~~~~ \vee ~~ ~~~~~T~~~~ \wedge ~~~~F~~~~~$

...

$~~~~~~~~ \vee ~~ ~~~~~F~~~~ \wedge ~~~~F~~~~~$

$~~~~~~~~ \vee ~~ ~~~~~T~~~~ \wedge ~~~~T~~~~~$

$~~~~~~~~ \vee ~~ ~~~~~T~~~~ \wedge ~~~~F~~~~~$

...
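
The expansion above can be sketched in code by treating each relation as a function from potential facts to truth values. The finite `DOMAIN` is an assumption of this example: in principle the disjunction ranges over every possible value, but only listed facts can contribute a `True` term.

```python
from itertools import product

R_FACTS = {(1, 2), (1, 3), (2, 3), (2, 4)}
S_FACTS = {(2, 1), (2, 2), (3, 3)}

def R(a, b):
    """Truthiness of the potential fact R(a, b); unlisted rows map to False."""
    return (a, b) in R_FACTS

def S(b, c):
    return (b, c) in S_FACTS

DOMAIN = range(1, 5)  # assumed finite stand-in for "all values"

def Q(a):
    """Q(a) is the OR, over every B and C, of R(a, B) AND S(B, C)."""
    return any(R(a, b) and S(b, c) for b, c in product(DOMAIN, repeat=2))

print(Q(1), Q(3))  # True False
```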

R   A  B
1   1  2
2   1  2
3   1  2
4   1  3
5   2  3
6   2  3
7   2  4

R   A  B
1   1  2   → 3
2   1  3   → 1
3   2  3   → 2
4   2  4   → 1

S   B  C
1   2  1   → 1
2   2  2   → 2
3   3  3   → 3

$Q(1) =~?$

$Q(1) = 3\times 1 + 3 \times 2 + 1 \times 3 = 12$
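
The same count can be reproduced with a short sketch that stores each relation as a map from rows to multiplicities (here a `collections.Counter`, chosen for this example because missing rows default to 0).

```python
from collections import Counter

# Row -> number of copies, as in the annotated tables.
R = Counter({(1, 2): 3, (1, 3): 1, (2, 3): 2, (2, 4): 1})
S = Counter({(2, 1): 1, (2, 2): 2, (3, 3): 3})

def Q(a):
    """Multiplicity of <a> in pi_A(R join S) under bag semantics."""
    return sum(
        R[(a2, b)] * S[(b2, c)]
        for (a2, b) in R
        for (b2, c) in S
        if a2 == a and b == b2
    )

print(Q(1))  # 12 = 3*1 + 3*2 + 1*3
```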

$Q(1) \equiv R(1, 1) \wedge S(1, 1)$

$~~~~~~~~ \vee ~~ R(1, 2) \wedge S(2, 1)$

$~~~~~~~~ \vee ~~ R(1, 3) \wedge S(3, 1)$

...

$~~~~~~~~ \vee ~~ R(1, 1) \wedge S(1, 2)$

$~~~~~~~~ \vee ~~ R(1, 2) \wedge S(2, 2)$

$~~~~~~~~ \vee ~~ R(1, 3) \wedge S(3, 2)$

...

$Q(1) \equiv R(1, 1) \times S(1, 1)$

$~~~~~~~~ + ~~ R(1, 2) \times S(2, 1)$

$~~~~~~~~ + ~~ R(1, 3) \times S(3, 1)$

...

$~~~~~~~~ + ~~ R(1, 1) \times S(1, 2)$

$~~~~~~~~ + ~~ R(1, 2) \times S(2, 2)$

$~~~~~~~~ + ~~ R(1, 3) \times S(3, 2)$

...

$Q(1) \equiv 0 \times 0$

$~~~~~~~~ + ~~ 3 \times 1$

$~~~~~~~~ + ~~ 1 \times 0$

...

$~~~~~~~~ + ~~ 0 \times 0$

$~~~~~~~~ + ~~ 3 \times 2$

$~~~~~~~~ + ~~ 1 \times 0$

...

R   A  B
1   1  2   → a
2   1  3   → b
3   2  3   → c
4   2  4   → d
5   ...    → $\mathbf{0}$

S   B  C
1   2  1   → e
2   2  2   → f
3   3  3   → g
4   ...    → $\mathbf{0}$

$Q(1) \equiv R(1, 1) \otimes S(1, 1)$

$~~~~~~~~ \oplus ~~ R(1, 2) \otimes S(2, 1)$

$~~~~~~~~ \oplus ~~ R(1, 3) \otimes S(3, 1)$

...

$~~~~~~~~ \oplus ~~ R(1, 1) \otimes S(1, 2)$

$~~~~~~~~ \oplus ~~ R(1, 2) \otimes S(2, 2)$

$~~~~~~~~ \oplus ~~ R(1, 3) \otimes S(3, 2)$

...

$Q(1) \equiv \mathbf{0} \otimes \mathbf{0}$

$~~~~~~~~ \oplus ~~ a \otimes e$

$~~~~~~~~ \oplus ~~ b \otimes \mathbf{0}$

...

$~~~~~~~~ \oplus ~~ \mathbf{0} \otimes \mathbf{0}$

$~~~~~~~~ \oplus ~~ a \otimes f$

$~~~~~~~~ \oplus ~~ b \otimes \mathbf{0}$

...

$(a\otimes e) \oplus (a \otimes f) \oplus (b \otimes g) \oplus \mathbf{0} \oplus \ldots$

$(T\wedge T) \vee (T \wedge T) \vee (T \wedge T) \vee F \vee \ldots$

$(3\times 1) + (3 \times 2) + (1 \times 3) + 0 + \ldots$

... and more

Ground Rules for $\oplus$, $\otimes$

  • Commutative, Associative
  • Must be some $\mathbf{0}$ s.t. $a \oplus \mathbf{0} = a$
  • Must be some $\mathbf{1}$ s.t. $a \otimes \mathbf{1} = a$
  • $a \otimes \mathbf{0} = \mathbf{0}$
  • $a \otimes (b \oplus c) = (a \otimes b) \oplus (a \otimes c)$

Any pair of operators (along with their domain) that follows these rules is called a commutative semiring
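
To make the ground rules concrete, here is a small sketch that spot-checks them on a handful of sample values. The `Semiring` class and `obeys_ground_rules` helper are names invented for this example, and passing the check on samples is evidence, not a proof, that the rules hold.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any

def obeys_ground_rules(k, samples):
    """Check every ground rule on all triples drawn from `samples`."""
    for a, b, c in product(samples, repeat=3):
        ok = (
            k.plus(a, b) == k.plus(b, a)                              # commutative
            and k.times(a, b) == k.times(b, a)
            and k.plus(a, k.plus(b, c)) == k.plus(k.plus(a, b), c)    # associative
            and k.times(a, k.times(b, c)) == k.times(k.times(a, b), c)
            and k.plus(a, k.zero) == a                                # identities
            and k.times(a, k.one) == a
            and k.times(a, k.zero) == k.zero                          # annihilator
            and k.times(a, k.plus(b, c))
                == k.plus(k.times(a, b), k.times(a, c))               # distributive
        )
        if not ok:
            return False
    return True

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
NATURAL = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
# Counterexample: with + as "plus" and max as "times", a * 0 is not 0.
NOT_A_SEMIRING = Semiring(lambda a, b: a + b, lambda a, b: max(a, b), 0, 0)

print(obeys_ground_rules(BOOLEAN, [False, True]))        # True
print(obeys_ground_rules(NATURAL, list(range(4))))       # True
print(obeys_ground_rules(NOT_A_SEMIRING, list(range(4))))  # False
```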

Commutative Semirings

$$\left< \mathbb S, \oplus, \otimes, \mathbf{0}, \mathbf{1} \right>$$
  • $\left< \mathbb N^0, +, \times, 0, 1 \right>$ (Natural Arithmetic)
  • $\left< \mathbb B, \vee, \wedge, F, T \right>$ (Boolean Algebra)
  • $\left< \text{Set}, \cup, \cap, \emptyset, \infty \right>$ (Set Algebra)
  • $\left< \text{Bag}, \uplus, \cap, \emptyset, \infty \right>$ (Bag Algebra)
  • $\left< \mathbb N^{0,\infty}, \max, \min, 0, \infty \right>$ (Access Control)
  • $\left< \mathbb R^{\infty}, \min, +, \infty, 0 \right>$ (Tropical)

Bringing it Back to RA

If a Table is a function, so is a query result!

$$[[\pi_A R(A, B)]](x)$$
$$[[\sigma_\phi R(A, B)]](x, y)$$
$$[[R(A, B) \cup S(A, B)]](x, y)$$
$$[[R(A, B) \times S(B, C)]](x, y, z)$$
$$[[\pi_A R(A, B)]](x) = \bigoplus_{y} R(x, y)$$

Sum over all projected-away variables

$$[[\sigma_\phi R(A, B)]](x, y) = \begin{cases} R(x, y) & \textbf{if } \phi(x, y) \\ \mathbf{0} & \textbf{otherwise}\end{cases}$$

Truncate filtered rows to 0.

$$[[R(A, B) \cup S(A, B)]](x, y) = R(x, y) \oplus S(x, y)$$

Sum annotations through union.

$$[[R(A, B) \times S(B, C)]](x, y, z) = R(x, y) \otimes S(y, z)$$

Multiply annotations through cross product.
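
The four definitions can be sketched as functions over annotated relations, here represented as Python dicts from rows to annotations where a missing row stands for $\mathbf{0}$. The representation, the `Semiring` class, and the function names are assumptions of this example rather than any particular system's API.

```python
class Semiring:
    def __init__(self, plus, times, zero, one):
        self.plus, self.times, self.zero, self.one = plus, times, zero, one

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
NATURAL = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)

def project(k, rel, keep):
    """Sum (k.plus) the annotations of all rows that agree on the kept positions."""
    out = {}
    for row, ann in rel.items():
        key = tuple(row[i] for i in keep)
        out[key] = k.plus(out.get(key, k.zero), ann)
    return out

def select(k, rel, phi):
    """Rows failing phi are truncated to zero, i.e. dropped from the dict."""
    return {row: ann for row, ann in rel.items() if phi(*row)}

def union(k, r, s):
    """Sum annotations of rows present in either input."""
    out = dict(r)
    for row, ann in s.items():
        out[row] = k.plus(out.get(row, k.zero), ann)
    return out

def join(k, r, s):
    """[[R(A,B) x S(B,C)]](x, y, z) = R(x, y) times S(y, z)."""
    return {(x, y, z): k.times(ra, sa)
            for (x, y), ra in r.items()
            for (y2, z), sa in s.items() if y == y2}

def q(k, r, s):
    """Q(A) :- R(A, B), S(B, C)"""
    return project(k, join(k, r, s), keep=[0])

R_rows = [(1, 2), (1, 3), (2, 3), (2, 4)]
S_rows = [(2, 1), (2, 2), (3, 3)]
r_bool = {row: True for row in R_rows}
s_bool = {row: True for row in S_rows}
r_nat = dict(zip(R_rows, [3, 1, 2, 1]))
s_nat = dict(zip(S_rows, [1, 2, 3]))

print(q(BOOLEAN, r_bool, s_bool)[(1,)])  # True
print(q(NATURAL, r_nat, s_nat)[(1,)])    # 12
```

The query code is written once; only the semiring passed in changes between the set and bag readings.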

Mimicking RA

  • Set-Relational Algebra: Plug in the Boolean Algebra Semiring
  • Bag-Relational Algebra: Plug in the Natural Arithmetic Semiring

Domain of Tuple IDs: $\mathbb T$

A Set of Tuple IDs: $2^{\mathbb T}$

A Set of Sets of Tuple IDs: $2^{2^{\mathbb T}}$

e.g., $\{ \{t_1, t_5\}, \{t_1, t_6\}, \{t_2, t_7\} \}$

Adding sets of sets

$$\{ \{t_1, t_5\}, \{t_1, t_6\} \} \cup \{ \{t_2, t_7\} \}$$
$$ = \{ \{t_1, t_5\}, \{t_1, t_6\}, \{t_2, t_7\} \}$$
$$A \cup \{ \} = A$$

Multiplying sets of sets

$$\{ \{t_1\}, \{t_2\} \} \times \{ \{t_3\}, \{t_4\} \}$$
$$ = \{ \{t_1, t_3\}, \{t_2, t_3\}, \{t_1, t_4\}, \{t_2, t_4\} \}$$
$$A \times \{ \{\} \} = A$$
$$A \times \{ \} = \{ \}$$
$$\left< 2^{2^{\mathbb T}}, \cup, \times, \{ \}, \{ \{\} \} \right>$$

R   A  B
1   1  2   → $\{\{t_1\}\}$
2   1  3   → $\{\{t_2\}\}$
3   2  3   → $\{\{t_3\}\}$
4   2  4   → $\{\{t_4\}\}$

S   B  C
1   2  1   → $\{\{t_5\}\}$
2   2  2   → $\{\{t_6\}\}$
3   3  3   → $\{\{t_7\}\}$

$$Q(1) = \{\{t_1\}\}\times\{\{t_5\}\} \cup \{\{t_1\}\}\times\{\{t_6\}\} \cup \{\{t_2\}\}\times\{\{t_7\}\}$$
$$Q(1) = \{\{t_1, t_5\}\} \cup \{\{t_1, t_6\}\} \cup \{\{t_2, t_7\}\}$$
$$Q(1) = \{\{t_1, t_5\}, \{t_1, t_6\}, \{t_2, t_7\}\}$$
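
A sketch of this semiring using Python `frozenset`s, with tuple IDs written as strings; the helper names `w_plus`, `w_times`, and `t` are invented for the example. Addition is plain union, and multiplication unions every pair of witnesses drawn from the two operands.

```python
def w_plus(x, y):
    """Adding sets of sets: union."""
    return x | y

def w_times(x, y):
    """Multiplying sets of sets: pairwise union of their members."""
    return frozenset(p | q for p in x for q in y)

ZERO = frozenset()              # { }
ONE = frozenset({frozenset()})  # { {} }

def t(name):
    """Annotation of a base tuple: a single witness containing only itself."""
    return frozenset({frozenset({name})})

q1 = w_plus(
    w_plus(w_times(t("t1"), t("t5")), w_times(t("t1"), t("t6"))),
    w_times(t("t2"), t("t7")),
)
print(sorted(sorted(w) for w in q1))
# [['t1', 't5'], ['t1', 't6'], ['t2', 't7']]
```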

Polynomials are also a semiring

R   A  B
1   1  2   → a
2   1  3   → b
3   2  3   → c
4   2  4   → d

S   B  C
1   2  1   → e
2   2  2   → f
3   3  3   → g

$ae + af + bg$

Plug in boolean annotations: $T$

Plug in multiplicities: $12$

Plug in tuple IDs:
$\{\{t_1, t_5\}, \{t_1, t_6\}, \{t_2, t_7\}\}$

$ae + af + bg$

  • $a: T \rightarrow F$ (Booleans): $Q(1) = bg = T$
  • $a: 3 \rightarrow 4$ (Naturals): $Q(1) \texttt{ += } e + f$, since $\frac{d}{da}Q(1) = e+f$
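
The "plug in" step can be sketched by keeping $ae + af + bg$ symbolic and evaluating it under different operators and valuations. The function `q1` and the valuation dicts are illustrative names; the multiplicities are the ones from the annotated tables earlier.

```python
def q1(plus, times, v):
    """Evaluate ae + af + bg with the given operators and valuation v."""
    return plus(
        plus(times(v["a"], v["e"]), times(v["a"], v["f"])),
        times(v["b"], v["g"]),
    )

b_or, b_and = (lambda x, y: x or y), (lambda x, y: x and y)
n_add, n_mul = (lambda x, y: x + y), (lambda x, y: x * y)

bools = dict.fromkeys("abcdefg", True)
counts = {"a": 3, "b": 1, "c": 2, "d": 1, "e": 1, "f": 2, "g": 3}

print(q1(b_or, b_and, bools))    # True
print(q1(n_add, n_mul, counts))  # 12

# Deleting tuple a (T -> F): Q(1) survives through bg.
print(q1(b_or, b_and, {**bools, "a": False}))  # True

# Adding one copy of a (3 -> 4): Q(1) grows by e + f.
before = q1(n_add, n_mul, counts)
after = q1(n_add, n_mul, {**counts, "a": 4})
print(after - before, counts["e"] + counts["f"])  # 3 3
```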