Differential Dataflow for Mere Mortals

rstuart4133 · on June 17, 2021

I love clear, simple language like this. Kudo's to the author:

> The jargon-free version: You write code that manipulates collections using familiar operations like map, join and loop. You run the code on some input and get some output. Then when the input changes, you get changes to the output, much faster than recomputing the whole thing from scratch. (And the outputs will be correct!)

A complete newbie can read those two sentences and get the complete picture.

zozbot234 · on June 17, 2021

That description is clear and simple indeed, but these are generally called "dynamic" https://en.wikipedia.org/wiki/Dynamic_algorithm in the literature. "Differential" refers to something else altogether, namely being able to automatically compute the gradient of some general calculation on continuous-valued inputs! (The intersection of differential and dynamic problems, where the point is to recompute answers quickly when some continuous-valued input changes, is called "kinetic".)

maweki · on June 17, 2021

Differential Dataflow is (and has been for ages) the name of calculating the result of a Datalog query and yielding delta-sets of the result given delta-sets of the database.

The name "differential" is not exclusive to gradients of continuous functions.

zozbot234 · on June 17, 2021

Could you provide a source that shows "differential" being used in such a context by the Datalog community? Regardless, the proper term for this in the general CS/algorithms literature is "dynamic" as I've shown above. "Differential" is used for something else entirely, and "kinetic" covers the notable shared aspects of the two problem domains.

maweki · on June 17, 2021

https://raw.githubusercontent.com/TimelyDataflow/differentia...

zozbot234 · on June 17, 2021

So, the earliest source you can find for this term is a paper from 2012, whereas the wiki article on dynamic algorithms was started in 2007(!), and the sources are of course much older - https://en.wikipedia.org/wiki/Dynamization cites a source from the mid-1980s! It doesn't look like there's any reason so far to assume "differential" should be given priority - even more so when it's already in use for something else!

maweki · on June 17, 2021

1. This here project is based on the paper I mentioned. Therefore it is of utmost relevance.

2. Differential Relational Calculus is an even older theory https://doi.org/10.1109/69.683760 and is is a kind of derivative on Relations using delta-sets. No need for continuous functions. Using this name is fine and describes the problem really well.

3. Dynamic Programming on the other hand does not describe the problem well. It is about re-use of computation but usually not on the side of a changing input.

Edit: Boolean Differential Calculus is from 1959. I don't remember the booleans being continuous.

zozbot234 · on June 17, 2021

> ... Differential Relational Calculus is an even older theory https://doi.org/10.1109/69.683760 and is is a kind of derivative on Relations using delta-sets. ...

That paper while prima facie relevant (so I appreciate that you mentioned it!), is about solving the problem of integrity maintenance, that is the only sense in which "differential" or "differentially" is used there, and then only as a matter of informal convenience and not as actively proposing any new terminology.

It does not define any true "differential" of a general computation as Differentiable Programming does in its own context, but only shows how to dynamize the checking of integrity constraints. So, I still maintain that uncritical adoption of such informal terminology, when clearly the proper alternative already exists in the literature, can only bring confusion and should be avoided.

maweki · on June 17, 2021

I'm sorry, but differential describes pretty good what's happening here. Given a Query Q and a database instance I, we calculate a Result R

Q(I) = R

And indeed we derive some function D for the differential and are given a set of changes C where

D_Q(I)(C) = R' with Q(I+C)=R+R'

This D is clearly a differential function.

And if you look at the wikipedia entry for dynamic programming https://en.wikipedia.org/wiki/Dynamic_programming and indeed the important sentence "there is a relation between the value of the larger problem and the values of the sub-problems."

In this case here there is no splitting into sub-problems going on. We only have a function (the Query) and we calculate how the result changes when the input changes. If that's not a differential, I don't know what is.

missingstep3 · on June 17, 2021

With the jargon-free version of this concept I can think of an example. To recompute the sum of the elements of a vector when one of the elements change you just add the new - old value to the sum. I'm I right?

Garlef · on June 17, 2021

Yes; Or: If you have a very long list of integers, for example [1, 2, ..., 12351234] and your output would be the same list but all numbers increased by one (keeping the index/order): [2, 3, ..., 1235123]. If you now add a new element to the end of the initial list, you'd only have to add one element to your output instead of iterating over the whole list.

user3939382 · on June 17, 2021

I started studying SQL when I was a freshman in high school circa 2000, the book I had was called SQL Queries for Mere Mortals. Is this a play on that, or was that title an homage to something even earlier?

z3t4 · on June 17, 2021

It's a saying that you do not have to be a genius to understand it. For example the em edtior (editor for mere mortals) where you could see what you wrote on the screen instead of having to keep it all in your head... https://en.wikipedia.org/wiki/Vi

outside1234 · on June 17, 2021

I think it is a slight swipe at how complicated the Materialize OSS projects make everything in this space sound

thelastbender12 · on June 17, 2021

That is because doing incremental computation efficiently is complicated!

I'd think the project idea is that differential dataflow is too useful a technology to only be understood by a few people. By writing it a second time around, it might be feasible to trade off some of its power for a more lean and accessible implementation.

FridgeSeal · on June 17, 2021

To be fair, whilst the underlying theory is pretty conceptually straightforward (as illustrated here), fully-featured/non-toy implementations, especially when it intersects with a database are somewhat more complex.

modeless · on June 17, 2021

I wonder if you could implement a compiler with this.

FridgeSeal · on June 17, 2021

Someone who knows more can probably correct/refine me, but I believe Rusts compiler actually has a little “differential datalog” engine it, which uses differential data flow under the hood.

akiselev · on June 17, 2021

They used to but Frank McSherry (author of differential dataflow) wrote them a specialized version without all the dataflow infrastructure [1]. It's part of the rust-lang nursery [2] now but hasn't been updated in a while, so I'm not sure what happened to it.

Edit: Looks like the rustc_mir crate that implements the non-lexical borrow checker uses `polonius_engine` [3] which uses `datafrog` so it's still used by the compiler!

[1] https://github.com/frankmcsherry/blog/blob/master/posts/2018...

[2] https://github.com/rust-lang/datafrog

[3] https://crates.io/crates/polonius-engine

sriram_malhar · on June 17, 2021

Most language structures are acyclic, so a simple dataflow pipeline is sufficient.

Mutual references (class A referring to class B and vice versa) are resolved by separating the signature from method implementation (for example) to break cycles.