Probabilistic programming does in 50 lines of code what used to take thousands (newsoffice.mit.edu)
164 points by ub on April 13, 2015 | 78 comments



If anyone wants to jump into this, Josh Tenenbaum and Noah Goodman put together this amazing interactive book for learning probabilistic programming with Church: https://probmods.org/


Great project but very weird choice of language. How probable is it (pun intended) that the person coming to learn about probabilistic programming would already know functional programming?


Isn't functional programming a standard part of any computer science curriculum? Why would you expect programmers not to know it?


Even in places where it is part of the curriculum, it is often such a small part that unless people specifically take courses related to functional programming, you can't expect them to be able to actually use it. Or remember much of it, for that matter.

Heck, I spent months on a binge reading functional programming research papers, and it still doesn't mean I know any functional languages other than very superficially.


That's true of any language though. If you know Java, you don't automatically know C#, but the core concepts will still mostly transfer. You'll still know Object-Oriented Programming, even if you've never used C# and can't write HelloWorld without looking something up.


I was very surprised to learn from two Stanford CS alums that functional programming was not a requirement in that program.


Not everyone does a computer science degree.


No it's not, in most of the world.


I actually consider that book to be as good an introduction to functional programming as any other resource.


Is that 50 lines of code, or 50 lines of using a library that's thousands of lines of code?


It's a fair question, but from my heyday doing ML research I can say that even good libraries can be pretty revolutionary. The engineering side of ML is still in relative infancy.

For a great example look at all the papers Andrew McCallum's group has been able to publish by building on top of FACTORIE: they get to focus their time on the problem at hand rather than all the math required to solve it. Basically they can write code that generates a model dynamically but the framework handles all the inference. Compare that to how these things are built without such a framework: you spend most of your time painstakingly hand-deriving update rules and then implementing them as code.

IMHO, the exciting thing is that ML is getting closer and closer to being an everyday tool for engineers rather than something that requires you to be a full-time math person to use effectively.


> ML is getting closer and closer to being an everyday tool for engineers rather than something that requires you to be a full-time math person

I so wish! As a Sr Data Scientist, I interview potential candidates quite often; many of them are 10x engineers.

    Me: (2,3,4) is a vector.  
    Eng: Ok.
    Me: Gimme a unit vector in the same direction.
    Eng 1: ???
    Eng 2: "It can be done. I don't know how, but with Spark it can be done".
    Eng 3: I will need R. ( given R, he fiddles with it for 5 minutes getting nowhere fast )
There are actual humans out there with self-professed ML expertise who cannot compute the eigens of a tiny 2 by 2 diagonal matrix. I kid you not. These people make 150k salaries, have "heard of an eigen vector", but cannot find one to save their lives.


"to save their lives" is a little dramatic, considering they were in a closed tech interview. I'm sure if you gave them 20 seconds on the internet (you know, the resource they would have in a real life problem solving situation) they could find one.


My graduate adviser used to say: if you ask "how do you add 1+1?", a student with a bachelor's degree answers "2". A person with a master's degree will write a 10-line C program to calculate it, then tell you "2". A person with a PhD just says "I don't know, but I can tell you where to find the answer."


Hmmm... I don't think I remember how to compute the eigenvectors of a 2x2 matrix, other than constructing a system of linear equations representing the outcome of the matrix multiplication and solving it.

But come on, everyone knows how to use the definition of the standard (l2) vector norm to normalize a vector! Don't joke!


Wouldn't you just take the length of the vector with sqrt(4+9+16) and divide each component by it? Or is this a different kind of 'vector' than graphics people are familiar with?
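
i.e. something like this quick numpy sketch (my own, just to spell out the arithmetic):

    import numpy as np

    v = np.array([2.0, 3.0, 4.0])
    length = np.sqrt(np.sum(v ** 2))   # l2 norm: sqrt(4 + 9 + 16) ~= 5.385
    unit = v / length                  # divide each component by the length
    print(unit)                        # ~[0.371, 0.557, 0.743]
    print(np.linalg.norm(unit))        # 1.0, as a unit vector should be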


But obviously. The same chap can invoke the SparseVector and DenseVector Java APIs from Spark, but doesn't know what a simple unit vector is or how to compute one. I think we are entering a bleak era where there are umpteen built-in ML APIs for everything under the sun, but the engineers who use them have no idea what these things really are - they just invoke the constructor, write some code, and think they are now doing ML.


Seems hard to believe that an "engineer" with a degree in engineering or CS wouldn't know what a unit vector is.


Why? There are plenty of directions you can take through a CS degree that will never have you dealing with vectors at all. And consequently there are plenty of CS-heavy careers where you will never deal with them.

What little I remember about vectors is from my high school maths classes, and in my 20 years of doing software engineering it's come up exactly once (for a GIS related project).


You are probably right about CS. I only have experience with engineering and wouldn't expect any engineer (i.e. a major in engineering) to not know that.

I wouldn't expect anyone working with machine learning to not know these concepts either.


Well, um, a diagonal matrix is already diagonalized (lol), so its P matrix is {{1,0},{0,1}}, and hence you can use those two columns as the eigenvectors; the diagonal entries are the eigenvalues.
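
Or, if you'd rather let a library confirm it (a toy numpy check, obviously overkill for a 2x2 diagonal matrix):

    import numpy as np

    A = np.diag([3.0, 7.0])        # a 2x2 diagonal matrix
    vals, vecs = np.linalg.eig(A)
    print(vals)                    # [3. 7.]  -- just the diagonal entries
    print(vecs)                    # identity columns: [1,0] and [0,1]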


What do we say about languages built on C? Is it 100 lines of code, or are there hundreds of thousands of lines of code behind the higher-level language you just coded in?

I don't think libraries count in terms of code. We all use code to program, standing on the shoulders of those who preceded us. Using a library or a function should just count, for the most part.


True, but the parent commenter is getting at something important. The article suggests that researchers have found a new, much more concise way to express the solutions to difficult problems. That's different from a library, which merely packages pre-built solutions to a finite set of problems.

It's like the difference between a complete kitchen that fits in your pocket and an iPhone app that lets you order a burrito. The article suggests something like the former. A library which encapsulates 1000 lines of code into a single function call is like the latter.


Even a "kitchen that fits in your pocket" isn't general enough. A library that parses a DSL, such that you can then code in that DSL, is still a lame duck if that library, plus the encodings of all the useful solutions in the domain, add up to more code than just the solutions would be when expressed in a general language. The ROI of (learning the DSL + maintaining the library) would be negative.

On the other hand, there are things like Prolog. You can think of Prolog as a backtracking constraint-solving library, and then another library that parses a DSL for expressing facts and procedural constraints and feeds it to the first library. But Prolog's language isn't really a DSL, because it isn't particular to any domain: there's no closed solution-space where Prolog applies. The efficiency gains you get from Prolog's elision of proceduralized constraint-solution code can apply to any program you write. And so its value is unbounded; its ROI is certainly positive, whatever the cost was to implement it.

That's the comparison that's useful here, I think. Is this something that only solves problems in one domain? Or is this something that could be applied to (at least some little bits of) any problem you encounter?


Some languages are almost UNUSABLE without libraries. Looking at you, R, with the dplyr, ggplot2, and reshape2 packages being mandatory for my workflow.

Python also has some mandatory libraries if you want to do anything specific. Numpy and pandas are effectively required for statistical analysis.

My code is concise and clean but it is because of these libraries.

I am also certain that there are exceptions.


I think stevenspasbo did not mean to criticize the progress in this area, but rather just point out the misleading title of the article. Unless you program everything in pure ASM, BIOS and up, you always stand on the shoulders of others; that's how science works. But the title of the article, "Probabilistic programming does in 50 lines of code what used to take thousands," makes it sound as if there was a solution to these complex problems that could be expressed in 50 lines of code all along, and only now did we figure it out. The truth is, people have written thousands of lines of code and put them into libraries, and now those tools can be used and leveraged to express complex solutions more easily.


Thank you, that's exactly what I meant.


How about C itself? Hello World is about 850 lines of code after you run the preprocessor on it.


You do not need 850 lines of code to output "hello world" unless you actually include a header. You could just stub printf instead of including and get it done with like 10 lines of code.


Assembly: Microsoft Macro Assembly "Hello World"

        .MODEL  TINY
        .CODE
        CODE SEGMENT BYTE PUBLIC 'CODE'
        ASSUME  CS:CODE,DS:CODE
        ORG     0100H
        DB      'HELLO WORLD$', 0
        INC     DH
        MOV     AH,9
        INT     21H
        RET


And then the interrupt calls into DOS code, which then calls into BIOS code, which then has the hardware do the work. The hardware itself will do stuff like translating instructions into microcode..


If dependencies don't count as code, then I can implement the entire Linux kernel in one line.


I agree -- I think lines of code is the wrong measure. A better measure would be how large the set of problems is that occur in practice, normally take thousands of lines, and can now be solved with 50 lines of code using a probabilistic programming language.


Generally it's anything that you can express using a probabilistic graphical model. It's mainly useful in engineering (real engineering), control systems, robotics, inference, machine learning and artificial intelligence circles.


Once a library becomes a first-class syntactic feature of a language runtime, it shouldn't count towards the line count anymore.


I was thinking the same thing. When is something just a library, and when is it something more? Think about concurrency in Go. Goroutines are cool, but what makes them more than just a cooperative threading library? Channels? Those are just concurrent data structures. The special sauce in Go is the select structure. It allows you to easily express control flow that cannot be expressed without select (or at least not easily). What is the special sauce in these probabilistic programming languages? What could not be implemented using a library?


You could make a similar comparison, for example, between 50 lines of C and 1000 bytes of microcode. (I just made those numbers up, but you get the point)


It makes sense to talk about lines of code when thinking about engineering hour efficiency. It costs the same amount for a programmer to write one line of C as one line of Assembly, and the rate of bugs tracks lines of code as well. The underlying libraries and abstractions should be considered if you're thinking about computing efficiency.


Using lines of code as a metric makes no sense at all. Bad programmers produce more lines of code. Does that make them more efficient?


A good programmer who writes fewer lines of code is more efficient than a bad programmer who writes more to do the same work, because you pay for each line of code. A library that allows both to write fewer lines of code will increase efficiency for everyone.


The time it takes to learn and master the language isn't irrelevant either.


That's what I was thinking. A good framework can do something comparable if your application fits its standard model.


The point is that the library is generic - it can be used for a wide variety of difficult problems.


> “It goes beyond image classification — the most popular task in computer vision — and tries to answer one of the most fundamental questions in computer vision: What is the right representation of visual scenes?”

Can someone knowledgeable in graphics research explain the context that this question comes from?

If I am reading the question correctly, I infer that the question suggests that there exists a right way to reproduce the visual experience of reality. To me, this sounds like a question that is equally valid to have no answer (or many answers) in aesthetics, art, and philosophy, etc.


Think about dreaming. "Seeing" during a dream state works by experiencing a pure data representation of the real world. People fluent in lucid dreaming can tell you something funny happens when you try to thoroughly examine objects while sleeping. Constructed worlds tend to be skin deep, and fall apart when poked. Everything is built with ideas drawn from your experience.

It's Plato's Allegory of the Cave all the way down.

Imagine "watching" a movie compressed using your very own prior knowledge. Every scene could be described in couple of hundred lines of plaintext. Today we do this by reading a book :) What if we could build an algorithm able to render movies from books?


> What if we could build an algorithm able to render movies from books?

Bob Coyne has been working on a system for generating images of still scenes from text descriptions for about 15 years now:

https://www.wordseye.com/ http://www.cs.columbia.edu/~coyne/papers/wordseye_siggraph.p...


Fascinating Captain, machine intelligence meets philosophy.

"The world is such and such or so and so, only because we talk to ourselves about its being such and such and so and so..." Carlos Castaneda


Makes me think of Vocaloid.


The wiggle word here is "right", I suppose. It's easy to ascribe meanings to that word which are very difficult to use---my limited understanding of Philosophy makes me think that this is the realm of ideas like "qualia" and the like.

For a long time statisticians wrangled over this word in a reduced context. The "art" of statistics is to build a model of the world which is sufficiently detailed to capture interesting data but not so detailed to make it difficult to interpret as a human decision-maker. Statisticians usually solve this problem by building a lot of models, getting lucky, presenting things to people and seeing what sticks.

For a long time this lack of a notion of "rightness" was so powerful that it precluded advancement of the field in certain ways.

With the advent of computers, however, we discovered a new, even more precise form of "right", and this formed the bedrock of Machine Learning. The "right" ML is concerned with is predictive power. A model is "right" when it leads to a training and prediction algorithm which is "probably approximately correct", i.e. you can feed real data in and end up with something useful (with a high degree of probability).

So with respect to computer vision we know that it is very difficult to build "efficient" algorithms, ones which work well while using a reasonable amount of training data. CV moved forward when it realized that there were representations of the visual field which led to better predictive power---these were originally generated by studying the visual center of human and animal brains, but more recently have been generated "naively" by computers.

So, there's a reasonably well-defined way that we can find the "right" representation of visual scenes: if we find one which ultimately is best-in-class of all representations for any choice of ML task then it's "right".


I like this definition; it's almost equivalent to the one I give below: if you have a good predictor you can compress the information well, but not optimally. To compress optimally, you need more than an optimal (single-outcome) predictor; you need a predictor that outputs probabilities of various events close to the true probabilities.

So in some sense optimal compression gives the best you could hope for, up to the limitations of the probabilistic models, which is why I like this explanation.


The question is fundamental to all kinds of recognition: recognizing the invariants of the scene, the data that distinguishes it from other scenes, which is very close to the definition of Shannon information.

For example, if you can extract a 'Mesh' from a 2D picture, you can generate many other view points, and that mesh can be considered a good representation. If you are more sophisticated however (and perhaps have a larger "dictionary"), you can instead extract 'There are two wooden chairs 1m from each other, ...'.

That's the sense in which the representation is fundamental to computer vision -- it distills what the system knows (or what it wants to know) about scenes. The more concise the representation without loss of information, the smarter your system is (and past a point this becomes a general AI problem).


You can figure out the statistical structure of natural images (or just faces) and derive efficient representations with properties similar to those observed in the visual system of the brain.

See for example:

Natural Image Statistics — A probabilistic approach to early computational vision https://www.cs.helsinki.fi/u/ahyvarin/natimgsx/


Yes, you can specify extremely powerful statistical models in only a few lines of code using probabilistic programming.

However, at this point, unless you design your program in a very specific way and use a lot of tricks, your sampler is very unlikely to converge, and you won't get any meaningful result without a gargantuan amount of computing power.
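
To make that concrete, here's a bare-bones sketch of the paradigm (my own toy in plain Python, nothing to do with Picture): the generative model is a couple of lines, and a completely generic Metropolis-Hastings loop does the conditioning. Whether that loop actually mixes on a realistic model is the hard part.

    import random, math

    data = [2.1, 1.9, 2.3, 2.0, 2.2]            # observations we condition on

    def log_prob(mu):
        # prior: mu ~ Normal(0, 10); likelihood: x_i ~ Normal(mu, 1)
        lp = -0.5 * (mu / 10.0) ** 2
        lp += sum(-0.5 * (x - mu) ** 2 for x in data)
        return lp

    # generic random-walk Metropolis-Hastings: knows nothing about the model
    mu, samples = 0.0, []
    for _ in range(10000):
        prop = mu + random.gauss(0, 0.5)        # proposal step size is a "trick"
        if math.log(random.random()) < log_prob(prop) - log_prob(mu):
            mu = prop
        samples.append(mu)

    print(sum(samples[2000:]) / len(samples[2000:]))   # posterior mean, ~2.1

Even here the proposal step size of 0.5 is one of those tricks: make it much larger or smaller and the chain mixes far more slowly, and that's for a single Gaussian parameter, not an inverse-graphics model.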



No experience but looks like they are organizing a summer school on probabilistic programming languages. http://ppaml.galois.com/wiki/wiki/SummerSchools/2015/Announc...



Thanks for that. As far as I can tell the compiler for the Picture language hasn't been published?


Anybody here on HN have experience with probabilistic programming? This looks quite disruptive if it works.


I've played with it a bit and, in my opinion, the principles behind it at least (the streamlined and optimized simulation of Bayesian generative models) are the best chance we have of solving artificial general intelligence.

Reading probmods.org and dippl.org made me go from being very pessimistic I would see it in my lifetime to a solid maybe.


It's a very powerful statistical technique, yes, but I doubt it will be enough for AI. The problem is that you need to sample efficiently from the posterior distribution, and for anything AI related, MCMC isn't going to cut it.

Let's take a step back and consider a logical problem. Can you put N socks in N-1 boxes such that no box contains more than one sock? Obviously not, it's the pigeonhole principle.

Convert that question into a Boolean circuit, and throw a SAT solver at it. It will die. In fact, using only first order logic, a proof of the pigeonhole principle requires an exponential number of terms. Looking at logical propositions alone is too myopic to solve the problem, you have to formulate higher level theories about the structure of the problem to solve it (in this case, natural numbers).

The same goes for probabilistic programming. As long as the paradigm is to treat the problem as a black box energy function to sample from, it is doomed to be inefficient. Try writing a simple HMM and the system will choke, even though there are efficient algorithms to sample from such a model.
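
To illustrate the HMM point: the structure a black-box sampler ignores is exactly what makes exact inference cheap. The forward pass below computes the marginal likelihood of a toy HMM in O(T*K^2) with no sampling at all (a generic numpy sketch, not tied to any particular PPL):

    import numpy as np

    # toy 2-state HMM: transition A, emission B[state, symbol], initial pi
    A   = np.array([[0.9, 0.1], [0.2, 0.8]])
    B   = np.array([[0.7, 0.3], [0.1, 0.9]])
    pi  = np.array([0.5, 0.5])
    obs = [0, 1, 1, 0, 1]

    # forward algorithm: exact likelihood, one matrix-vector product per step
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    print(alpha.sum())   # p(observations), computed exactly

A black-box sampler over the hidden state sequence has to rediscover that chain structure one proposal at a time.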

If you look at deep learning techniques, they take an interesting approach which is to learn to approximate the generative distribution and the inference distribution at the same time. This is the basis of the work around autoencoders, deep belief networks, and it guarantees that you can tractably sample from your latent representation.


I have been out of the machine learning field for years and I haven't looked into deep learning methods even though they are intriguing (just not sufficiently Bayesian to pique my curiosity enough to spend my limited free time on them). I do believe that automatically building a hierarchical structure (as I assume happens in deep learning) is the way to abstract away complexity, but I think this is achievable within a generative Bayesian Monte Carlo approach.

For example, I have been toying with using MCMC similarly to how it is used in dippl.org to write a kind of probabilistic program that generates other programs by sampling from a probabilistic programming language grammar. A bit like the arithmetic generator here: https://probmods.org/learning-as-conditional-inference.html#... but for whole programs.
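
To give a flavor of what I mean by sampling programs from a grammar, here is a deliberately tiny sketch (my own toy, nowhere near a full system); the rule weights are the kind of hyperparameters I'd want to re-tune on the programs that turn out to produce good output:

    import random

    # toy probabilistic grammar over arithmetic expressions;
    # the rule weights are the tunable hyperparameters
    RULES = {
        "E": [(0.5, ["N"]), (0.25, ["E", "+", "E"]), (0.25, ["E", "*", "E"])],
        "N": [(1.0, [str(d)]) for d in range(10)],
    }

    def sample(symbol="E", depth=0):
        if symbol not in RULES:                 # terminal symbol
            return symbol
        weights, expansions = zip(*RULES[symbol])
        if symbol == "E" and depth > 4:         # crude depth cap to keep programs small
            expansion = ["N"]
        else:
            expansion = random.choices(expansions, weights=weights)[0]
        return "".join(sample(s, depth + 1) for s in expansion)

    expr = sample()
    print(expr, "=", eval(expr))                # e.g. 3+4*2 = 11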

After the MCMC has converged to a program that generates approximately correct output, you can tune the grammar hyperparameters on the higher-likelihood program samples so that next time it will converge faster.

I don't know if this counts as approximating "the generative distribution and the inference distribution at the same time" under your definition but my hope is that the learned grammar rules are good abstractions for hierarchical generators of learned concepts.

Of course the worry is that my approach will not converge very fast, but there are reasons to think that a suitably arranged, hyperparametrized probabilistic grammar might use the Occam's razor inherent in Bayesian methods to produce exponentially fewer, simpler grammar rules that generate approximations when there isn't enough data to converge to more complex and precise programs. These simpler rules, which rely on fewer parameters, might then provide the necessary intermediate steps to pivot to more complex rules, and those smaller steps help MCMC find a convergence path. Not sure how well it will work for complex problems, however. My idea is still half baked. I have tons of loose ends and details I have not figured out, some of which I might not even have a solution for, as well as little time to work on this (this is just a hobby for me).

Anyways, all that to say that probabilistic programming can go beyond just hardcoding a generative model and running it.


Using a probabilistic program to generate and interpret a program is an interesting idea, but I think you're missing the main problem.

How do you sample from the set of programs that produce the correct (or approximately correct) output?

You could use rejection sampling, but that would take very long as only a tiny fraction of all possible programs produce the desired output.

You could use a MCMC method, but the problem here is designing a proposal density. There is no reason to expect programs which are almost correct to be syntactically close to programs which aren't. Changing a single bit in a program tends to radically alter its behavior, in a very non linear way. Program space does not map smoothly to output space.

You mention hyperparameters, but what would those be? If you're building a probabilistic grammar, they might be things like emissions probabilities... but tweaking the proportion of "if" statements you produce, as opposed to, say, function calls, is hardly going to make you better at sampling from the set of correct programs.

In general, the idea of meta-learning is a powerful one, but without a way to guide the search, I don't see it as feasible.


" There is no reason to expect programs which are almost correct to be syntactically close to programs which aren't. Changing a single bit in a program tends to radically alter its behavior, in a very non linear way. Program space does not map smoothly to output space."

You correctly identify the main hurdle. I admit I am not sure it will work.

However, I think it might be possible to design the programming language such that changing a few bits doesn't usually radically alter the output.

For example, let's say you are trying to learn a grammar to draw a blue irregular polygon. If there are primitives with fewer parameters that can get you an approximate output, say a circle, this makes possible a learning path that looks like "circle"->"blue circle"->"simple blue polygon"->"complex blue polygon". In addition to that, if the grammar rules that generate similar output can be clustered in the space of grammar rules, small jumps in this space may give you small output changes. Using bayesian priors will naturally use the simpler, fewer parameter shapes first and pivot to more complex ones as enough information is learned while, I think, creating these more complex rules close to the simple ones in the space of grammar rules. That is my hope anyways. I got it working as expected-ish with a simple vision example like I just described.


>You could use a MCMC method, but the problem here is designing a proposal density. There is no reason to expect programs which are almost correct to be syntactically close to programs which aren't. Changing a single bit in a program tends to radically alter its behavior, in a very non linear way. Program space does not map smoothly to output space.

Ahaha, "this sounds like a job for topology!"


For the record, proving the pigeonhole principle in Coq is sufficiently doable to be a mere four-star textbook exercise (http://www.cis.upenn.edu/~bcpierce/sf/current/MoreLogic.html). No exponential explosion of terms required!

>The problem is that you need to sample efficiently from the posterior distribution, and for anything AI related, MCMC isn't going to cut it.

Which is precisely why so much of the actual research work in probabilistic programming is to add more and faster inference strategies. One paper already got a 600x speedup just by using some techniques from the compilers/VM world related to caching whole and partial computation traces.


> me go from being very pessimistic I would see it in my lifetime to a solid maybe

Very probabilistic..


If you want to know about the history of similar language paradigms, look at logic programming. Similar to probabilistic programming, it is declarative, utilises an engine at run time, the user builds a model of the world and then queries it, and it uses fewer lines of code to express solutions to problems within a limited domain.

Hardly "distruptive". You should really think of them as DSLs that people actually use.

If you're interested in such things, perhaps you should take a look at PRISM, which embeds a probabilistic framework within B-Prolog. http://rjida.meijo-u.ac.jp/prism/


Probabilistic programming is a way to increase the flexibility of statistical modelling. You only write the code that generates the data (which gives the statistical model), plus possibly some parameters to fix the approximations used in estimation, and you get a fitted model out of that.

For example, the Stan language supports MCMC modelling using Hamiltonian dynamics.

http://mc-stan.org/
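
If the Hamiltonian dynamics part sounds mysterious, the core idea fits in a few lines: treat the negative log-posterior as potential energy, integrate the resulting dynamics with a leapfrog scheme, then do a Metropolis accept step on the total energy. A stripped-down one-parameter sketch in plain Python (purely illustrative; Stan adds automatic gradients, adaptation and much more):

    import random, math

    def U(q):  return 0.5 * q * q          # negative log-density of a N(0,1) target
    def dU(q): return q                    # its gradient

    def hmc_step(q, eps=0.1, steps=20):
        p = random.gauss(0, 1)             # resample momentum
        q_new, p_new = q, p
        p_new -= 0.5 * eps * dU(q_new)     # leapfrog integration
        for _ in range(steps):
            q_new += eps * p_new
            p_new -= eps * dU(q_new)
        p_new += 0.5 * eps * dU(q_new)     # correct the last update back to a half step
        h_old = U(q) + 0.5 * p * p         # Metropolis test on total energy
        h_new = U(q_new) + 0.5 * p_new * p_new
        return q_new if math.log(random.random()) < h_old - h_new else q

    q, samples = 0.0, []
    for _ in range(5000):
        q = hmc_step(q)
        samples.append(q)
    print(sum(samples) / len(samples))     # ~0 for the standard normal target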


Just like UML and Prolog.


Really? "Disruptive"?


Constraint solvers and similar tools can look pretty magical to people who use them for the first time, so I don't blame the OP.

What's more interesting is that we're seeing heightened interest in these techniques again after they were ostensibly sidetracked in favor of statistical methods.


I'm no expert, but afaik probabilistic programming isn't a new method or technique. It is just wrappers around existing statistical techniques, an attempt to divorce the details of inference algorithms from model specifications.

I'm not buying in just yet, because although it's nice to talk about model specification as a completely independent process, the availability of fast inference algorithms sometimes dictates what models you should choose. Sometimes less exact models with a larger parameter space that allow you to crunch orders of magnitude larger datasets (with approximate inference algorithms) yield more useful results than better-specified models... and sometimes not. The thing is, one still needs to know the whens and whys of picking certain models over others, and can't just gloss over the inference details.


Maybe I just don't get the terminology right, but isn't this exactly a statistical method?


The questions are "when does it fall off a cliff" and "does it work for more than one thing". Normally the answers are "as soon as you stop doing toy problems" or "no".

But, this (and Church) look very interesting.


Sure. If someone here has nothing more than high school math under their belt, they may not understand that this isn't a huge deal.

For example, I refer you to the paper wherein some biologists "discovered" integrals in 1993. http://care.diabetesjournals.org/content/17/2/152.abstract


Ah, so they wrote a DSL just for doing Bayesian inverse-vision models, and apparently they've now got it to accuracy rates competitive with most other major vision methods?

Good job!


Probabilistic prediction: this is primarily going to be used for robots that monitor, kill or assist with killing people.


Alternatively, one could monitor, save, or assist with saving people.



