
Anybody here on HN have experience with probabilistic programming? This looks quite disruptive if it works.



I've played with it a bit and, in my opinion, the principles behind it at least (the streamlined and optimized simulation of Bayesian generative models) are the best chance we have of solving artificial general intelligence.

Reading probmods.org and dippl.org made me go from being very pessimistic that I would see it in my lifetime to a solid maybe.


It's a very powerful statistical technique, yes, but I doubt it will be enough for AI. The problem is that you need to sample efficiently from the posterior distribution, and for anything AI-related, MCMC isn't going to cut it.

Let's take a step back and consider a logical problem. Can you put N socks in N-1 boxes such that no box contains more than one sock? Obviously not; it's the pigeonhole principle.

Convert that question into a Boolean circuit and throw a SAT solver at it. It will die. In fact, in resolution, the proof system underlying most SAT solvers, any proof of the pigeonhole principle requires an exponential number of clauses. Looking at logical propositions alone is too myopic to solve the problem; you have to formulate higher-level theories about the structure of the problem to solve it (in this case, natural numbers).
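
To make that concrete, here is a sketch of the encoding in Python (the python-sat package is my assumption; any CNF solver would do). Variable x[s][b] means "sock s is in box b":

    # A minimal sketch of the pigeonhole CNF, assuming the python-sat
    # package (pip install python-sat) provides the solver.
    from itertools import combinations
    from pysat.solvers import Glucose3

    def pigeonhole_cnf(n):
        """Clauses for 'n socks in n-1 boxes, at most one sock per box'."""
        var = lambda sock, box: sock * (n - 1) + box + 1  # 1-based ids
        # Every sock goes into some box.
        clauses = [[var(s, b) for b in range(n - 1)] for s in range(n)]
        # No box holds two socks.
        for b in range(n - 1):
            for s1, s2 in combinations(range(n), 2):
                clauses.append([-var(s1, b), -var(s2, b)])
        return clauses

    solver = Glucose3()
    for clause in pigeonhole_cnf(12):   # already painful for n in the teens
        solver.add_clause(clause)
    print(solver.solve())               # False (UNSAT), after a long wait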

The same goes for probabilistic programming. As long as the paradigm is to treat the problem as a black box energy function to sample from, it is doomed to be inefficient. Try writing a simple HMM and the system will choke, even though there are efficient algorithms to sample from such a model.
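
For comparison, exact inference in an HMM is a few lines once you exploit the chain structure; all the probabilities below are made up for illustration:

    import numpy as np

    # Forward algorithm for a 2-state discrete HMM: exact likelihood
    # in O(T * K^2). This is the structure a black-box sampler never sees.
    trans = np.array([[0.7, 0.3],
                      [0.4, 0.6]])      # state transition matrix
    emit = np.array([[0.9, 0.1],
                     [0.2, 0.8]])       # emission probabilities
    init = np.array([0.5, 0.5])         # initial state distribution
    obs = [0, 1, 1, 0]                  # observed symbols

    alpha = init * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    print("p(obs) =", alpha.sum())      # exact, no sampling required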

If you look at deep learning techniques, they take an interesting approach, which is to learn to approximate the generative distribution and the inference distribution at the same time. This is the basis of the work around autoencoders and deep belief networks, and it guarantees that you can tractably sample from your latent representation.
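
A toy numpy version of that idea: a linear autoencoder whose encoder (inference direction) and decoder (generative direction) are fit jointly. Sizes, data, and learning rate are arbitrary choices for the sketch:

    import numpy as np

    # Linear autoencoder: encoder and decoder trained jointly, so
    # mapping data to the latent code stays cheap by construction.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))                  # fake data
    W_enc = rng.normal(scale=0.1, size=(10, 3))     # inference map
    W_dec = rng.normal(scale=0.1, size=(3, 10))     # generative map

    lr = 0.01
    for step in range(2000):
        Z = X @ W_enc                   # encode
        err = Z @ W_dec - X             # reconstruction error
        # Gradient descent on mean squared error w.r.t. both maps.
        g_dec = Z.T @ err / len(X)
        g_enc = X.T @ (err @ W_dec.T) / len(X)
        W_dec -= lr * g_dec
        W_enc -= lr * g_enc
    print("reconstruction MSE:", (err ** 2).mean())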


I have been out of the machine learning field for years and I haven't looked into deep learning methods, even though they are intriguing (just not sufficiently Bayesian to pique my curiosity enough to spend my limited free time on them). I do believe that automatically building a hierarchical structure (as I assume happens in deep learning) is the way to abstract away complexity, but I think this is achievable within a generative Bayesian Monte Carlo approach.

For example, I have been toying with using MCMC, similarly to how it is used in dippl.org, to write a kind of probabilistic program that generates other programs by sampling from a probabilistic programming language grammar. A bit like the arithmetic generator here: https://probmods.org/learning-as-conditional-inference.html#... but for whole programs.

After the MCMC has converged to a program that generates approximately correct output, you can tune the grammar hyperparameters on the higher-likelihood program samples so that next time it will converge faster.
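
Concretely, a toy version of the inner loop might look like this (the grammar, the Gaussian scoring, and all the constants are invented for illustration):

    import math, random

    # Metropolis-Hastings over tiny arithmetic programs drawn from a
    # weighted grammar, scored by closeness of their output to a target.
    TARGET = 42.0

    def sample_expr(depth=0):
        # Expr -> number | (Expr + Expr) | (Expr * Expr)
        if depth > 3 or random.random() < 0.5:
            return float(random.randint(1, 10))
        return (random.choice('+*'), sample_expr(depth + 1), sample_expr(depth + 1))

    def evaluate(e):
        if isinstance(e, float):
            return e
        op, a, b = e
        x, y = evaluate(a), evaluate(b)
        return x + y if op == '+' else x * y

    def log_like(e):                     # Gaussian noise around the target
        return -0.5 * (evaluate(e) - TARGET) ** 2

    current = sample_expr()
    for _ in range(10000):
        proposal = sample_expr()         # independence proposal from the prior
        if math.log(random.random()) < log_like(proposal) - log_like(current):
            current = proposal
    print(current, '->', evaluate(current))

The hyperparameter-tuning step would then reweight the grammar probabilities (the 0.5 leaf probability, the choice between + and *) toward whatever the accepted programs use most.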

I don't know if this counts as approximating "the generative distribution and the inference distribution at the same time" under your definition but my hope is that the learned grammar rules are good abstractions for hierarchical generators of learned concepts.

Of course the worry is that my approach will not converge very fast, but there are reasons to think it might. A suitably arranged hyperparametrized probabilistic grammar might use the Occam's razor inherent in Bayesian methods to produce exponentially fewer, simpler grammar rules that generate approximations when there isn't enough data to converge to more complex and precise programs. These simple rules, relying on fewer parameters, might provide the necessary intermediate steps to then pivot to more complex rules, and these smaller steps help the MCMC find a convergence path. Not sure how well it will work for complex problems, however. My idea is still half baked: I have tons of loose ends and details I have not figured out, some of which I might not even have a solution for, as well as little time to work on this (this is just a hobby for me).

Anyways, all that to say that probabilistic programming can go beyond just hardcoding a generative model and running it.


Using a probabilistic program to generate and interpret a program is an interesting idea, but I think you're missing the main problem.

How do you sample from the set of programs that produce the correct (or approximately correct) output?

You could use rejection sampling, but that would take a very long time, as only a tiny fraction of all possible programs produce the desired output.
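
A quick back-of-envelope check makes the point; the grammar and the target function 3x+2 here are invented for illustration:

    import random

    # Rejection sampling over random expression trees on {x, 1, +, *};
    # accept only programs computing 3*x + 2 on five test inputs.
    def sample(depth=0):
        if depth > 3 or random.random() < 0.4:
            return random.choice(['x', 1])
        return (random.choice('+*'), sample(depth + 1), sample(depth + 1))

    def run(e, x):
        if e == 'x':
            return x
        if e == 1:
            return 1
        op, a, b = e
        return run(a, x) + run(b, x) if op == '+' else run(a, x) * run(b, x)

    trials = 100_000
    hits = sum(all(run(e, x) == 3 * x + 2 for x in range(5))
               for e in (sample() for _ in range(trials)))
    print(f"{hits} of {trials} sampled programs accepted")  # almost always ~0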

You could use an MCMC method, but the problem here is designing a proposal density. There is no reason to expect programs that are almost correct to be syntactically close to programs that are fully correct. Changing a single bit in a program tends to radically alter its behavior, in a very non-linear way. Program space does not map smoothly onto output space.

You mention hyperparameters, but what would those be? If you're building a probabilistic grammar, they might be things like emission probabilities... but tweaking the proportion of "if" statements you produce, as opposed to, say, function calls, is hardly going to make you better at sampling from the set of correct programs.

In general, the idea of meta-learning is a powerful one, but without a way to guide the search, I don't see it as feasible.


" There is no reason to expect programs which are almost correct to be syntactically close to programs which aren't. Changing a single bit in a program tends to radically alter its behavior, in a very non linear way. Program space does not map smoothly to output space."

You correctly identify the main hurdle. I admit I am not sure it will work.

However, I think it might be possible to design the programming language such that changing a few bits doesn't usually radically alter the output.

For example, let's say you are trying to learn a grammar to draw a blue irregular polygon. If there are primitives with fewer parameters that can get you an approximate output, say a circle, this makes possible a learning path that looks like "circle" -> "blue circle" -> "simple blue polygon" -> "complex blue polygon". In addition, if the grammar rules that generate similar output can be clustered in the space of grammar rules, small jumps in this space may give you small output changes. Bayesian priors will naturally favor the simpler shapes with fewer parameters first and pivot to more complex ones as enough information is learned, while (I think) creating these more complex rules close to the simple ones in the space of grammar rules. That is my hope, anyway. I got it working as expected-ish on a simple vision example like the one I just described.


>You could use an MCMC method, but the problem here is designing a proposal density. There is no reason to expect programs that are almost correct to be syntactically close to programs that are fully correct. Changing a single bit in a program tends to radically alter its behavior, in a very non-linear way. Program space does not map smoothly onto output space.

Ahaha, "this sounds like a job for topology!"


For the record, proving the pigeonhole principle in Coq is sufficiently doable to be a mere four-star textbook exercise (http://www.cis.upenn.edu/~bcpierce/sf/current/MoreLogic.html). No exponential explosion of terms required!

>The problem is that you need to sample efficiently from the posterior distribution, and for anything AI-related, MCMC isn't going to cut it.

Which is precisely why so much of the actual research work in probabilistic programming is about adding more and faster inference strategies. One paper already got a 600x speedup just by using techniques from the compilers/VM world related to caching whole and partial computation traces.
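
The flavor of the trick, in a toy form (this is my illustration, not the paper's actual system): memoize deterministic sub-computations so that a single-site proposal only re-runs what actually changed.

    from functools import lru_cache

    # Toy trace caching: after one random choice changes, unchanged
    # deterministic sub-computations are served from the cache.
    calls = 0

    @lru_cache(maxsize=None)
    def likelihood_term(param):
        global calls
        calls += 1                       # count real evaluations
        return -(param - 3.0) ** 2       # stand-in for costly work

    trace = [1.0, 2.0, 3.0, 4.0]
    old_score = sum(likelihood_term(p) for p in trace)
    trace[2] = 3.5                       # single-site MH proposal
    new_score = sum(likelihood_term(p) for p in trace)
    print(calls, "evaluations for 8 likelihood terms")   # 5, not 8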


>made me go from being very pessimistic that I would see it in my lifetime to a solid maybe

Very probabilistic..


If you want to know about the history of similar language paradigms, look at logic programming. Similar to probabilistic programming, it is declarative, utilises an engine at run time, the user builds a model of the world and then queries it, and it takes fewer lines of code to express problems within a limited domain.

Hardly "distruptive". You should really think of them as DSLs that people actually use.

If you're interested in such things, perhaps you should take a look at PRISM, which embeds a probabilistic framework within B-Prolog. http://rjida.meijo-u.ac.jp/prism/


Probabilistic programming is a way to increase the flexibility of statistical modelling. You write only the code that generates the data, supply a statistical model and possibly some parameters to fix the approximations used in estimation, and you get a fitted model out of that.

For example, the Stan language supports MCMC using Hamiltonian dynamics.

http://mc-stan.org/
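
For a sense of what Hamiltonian dynamics buys you, here is a bare-bones HMC sampler for a 1-D standard normal target (a sketch of the idea, not Stan's implementation; step size and path length are arbitrary):

    import numpy as np

    # Minimal Hamiltonian Monte Carlo for a standard normal target.
    rng = np.random.default_rng(1)

    def grad_U(q):                   # U(q) = q^2/2, so grad U = q
        return q

    def hmc_step(q, eps=0.1, steps=20):
        p = rng.normal()             # resample momentum
        q_new, p_new = q, p
        for _ in range(steps):       # leapfrog integration
            p_new -= 0.5 * eps * grad_U(q_new)
            q_new += eps * p_new
            p_new -= 0.5 * eps * grad_U(q_new)
        # Accept or reject on the change in the Hamiltonian.
        h_old = 0.5 * q ** 2 + 0.5 * p ** 2
        h_new = 0.5 * q_new ** 2 + 0.5 * p_new ** 2
        return q_new if np.log(rng.random()) < h_old - h_new else q

    q, samples = 0.0, []
    for _ in range(5000):
        q = hmc_step(q)
        samples.append(q)
    print("mean:", np.mean(samples), "var:", np.var(samples))  # ~0 and ~1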


Just like UML and Prolog.


Really? "Disruptive"?


Constraint solvers and similar tools can look pretty magical to people who use them for the first time, so I don't blame the OP.

What's more interesting is that we're seeing heightened interest in these techniques again after they were ostensibly sidetracked in favor of statistical methods.


I'm no expert, but afaik probabilistic programming isn't a new method or technique. It is just a set of wrappers around existing statistical techniques, an attempt to divorce the details of inference algorithms from model specifications.

I'm not buying in just yet, because although it's nice to talk about model specification as a completely independent process, the availability of fast inference algorithms sometimes dictates which models you should choose. Sometimes less exact models with a larger parameter space that let you crunch orders-of-magnitude larger datasets (with approximate inference algorithms) yield more useful results than better-specified models... and sometimes not. The thing is, one still needs to know the whens and whys of picking certain models over others, and can't just gloss over the inference details.


Maybe I just don't get the terminology right, but isn't this exactly a statistical method?


The questions are "when does it fall off a cliff" and "does it work for more than one thing". Normally the answers are "as soon as you stop doing toy problems" or "no".

But, this (and Church) look very interesting.


Sure. If someone here has nothing more than high school math under their belt, they may not understand that this isn't a huge deal.

For example, I refer you to the paper wherein some biologists "discovered" integrals in 1994. http://care.diabetesjournals.org/content/17/2/152.abstract



