
I have been out of the machine learning field for years and I haven't looked into deep learning methods, even though they are intriguing (just not sufficiently Bayesian to pique my curiosity enough to spend my limited free time on them). I do believe that automatically building a hierarchical structure (as I assume happens in deep learning) is the way to abstract away complexity, but I think this is achievable within a generative Bayesian Monte Carlo approach.

For example, I have been toying with using MCMC similarly to how it is used on dippl.org, to write a kind of probabilistic program that generates other programs by sampling from the grammar of a probabilistic programming language. A bit like the arithmetic generator here: https://probmods.org/learning-as-conditional-inference.html#... but for whole programs.
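
A minimal Python sketch of what I mean, with a made-up toy grammar (all names and weights here are illustrative, not from any real system):

    import random

    # A hypothetical PCFG over a toy arithmetic language, in the spirit of
    # the probmods arithmetic-generator example. The production weights are
    # the "hyperparameters" mentioned below.
    GRAMMAR = {
        "EXPR": [
            (0.5,  ["NUM"]),
            (0.25, ["(", "EXPR", "+", "EXPR", ")"]),
            (0.25, ["(", "EXPR", "*", "EXPR", ")"]),
        ],
        "NUM": [(0.1, [str(d)]) for d in range(10)],
    }

    def sample_program(symbol="EXPR", max_depth=6, depth=0):
        """Expand the grammar top-down into a token list (a random program)."""
        if symbol not in GRAMMAR:
            return [symbol]                          # terminal token
        rules = GRAMMAR[symbol]
        if depth >= max_depth:                       # force termination via the simplest rule
            rhs = rules[0][1]
        else:
            weights = [w for w, _ in rules]
            rhs = random.choices([r for _, r in rules], weights=weights)[0]
        return [t for s in rhs for t in sample_program(s, max_depth, depth + 1)]

    prog = "".join(sample_program())
    print(prog, "=>", eval(prog))                    # a sampled program and its output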

After the MCMC has converged to a program that generates approximately correct output, you can tune the grammar hyperparameters on the higher-likelihood program samples so that next time it converges faster.
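
Concretely, the tuning step could be as simple as this sketch (the (symbol, rule_index) trace format is an assumption of mine):

    from collections import Counter

    def tune_grammar(grammar, good_traces):
        """Nudge production weights toward their empirical usage in the
        high-likelihood samples. A trace is a list of (symbol, rule_index)
        choices made while sampling; add-one smoothing keeps every rule alive."""
        for symbol, rules in grammar.items():
            counts = Counter(idx for trace in good_traces
                                 for sym, idx in trace if sym == symbol)
            total = sum(counts.values()) + len(rules)
            grammar[symbol] = [((counts[i] + 1) / total, rhs)
                               for i, (_, rhs) in enumerate(rules)]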

I don't know if this counts as approximating "the generative distribution and the inference distribution at the same time" under your definition, but my hope is that the learned grammar rules are good abstractions: hierarchical generators of learned concepts.

Of course, the worry is that my approach will not converge very fast. But there are reasons to think that a suitably arranged, hyperparameterized probabilistic grammar could exploit the Occam's razor inherent in Bayesian methods: when there isn't enough data to converge to more complex, precise programs, it would produce exponentially fewer, simpler grammar rules that generate approximations. These simple rules, relying on fewer parameters, might then provide the intermediate steps needed to pivot to more complex rules, and those smaller steps help the MCMC find a convergence path. I'm not sure how well it will work for complex problems, though. My idea is still half-baked: I have tons of loose ends and details I haven't figured out, some of which I might not even have a solution for, as well as little time to work on this (it's just a hobby for me).
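
To make the Occam's razor part concrete, the score I have in mind is roughly a size-based prior plus a data-fit term; everything below (the per-token cost, the Gaussian noise model) is an assumption of the sketch:

    import math

    def log_score(program_tokens, predictions, targets, noise_sd=1.0,
                  per_token_cost=math.log(10)):
        """Posterior score = size prior + Gaussian data fit. The size prior
        is the Occam's razor: a short approximate program can outscore a
        long exact one until enough data accumulates."""
        log_prior = -per_token_cost * len(program_tokens)
        log_lik = sum(-0.5 * ((p - t) / noise_sd) ** 2
                      - math.log(noise_sd * math.sqrt(2 * math.pi))
                      for p, t in zip(predictions, targets))
        return log_prior + log_lik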

Anyway, all that to say that probabilistic programming can go beyond just hardcoding a generative model and running it.




Using a probabilistic program to generate and interpret a program is an interesting idea, but I think you're missing the main problem.

How do you sample from the set of programs that produce the correct (or approximately correct) output?

You could use rejection sampling, but that would take a very long time, as only a tiny fraction of all possible programs produce the desired output.
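
To see why, here's a toy version (assuming a sample_program generator like the one sketched in the parent comment); the number of tries explodes as soon as the target gets at all specific:

    def rejection_sample(target_output, max_tries=1_000_000):
        """Draw programs until one evaluates to the target output.
        Returns the try count, which shows the acceptance rate collapsing."""
        for tries in range(1, max_tries + 1):
            prog = "".join(sample_program())     # assumed: the PCFG sampler above
            try:
                if eval(prog) == target_output:
                    return prog, tries
            except Exception:
                continue                         # broken programs are rejections too
        return None, max_tries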

You could use an MCMC method, but then the problem is designing a proposal density. There is no reason to expect a correct program to be syntactically close to an almost-correct one: changing a single bit in a program tends to radically alter its behavior, in a highly nonlinear way. Program space does not map smoothly onto output space.
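
For concreteness, here is a sketch of a Metropolis-Hastings step with the usual subtree-regeneration proposal (regenerate_random_subtree is an assumed helper, and I'm ignoring the Hastings correction for the asymmetric proposal). The objection stands: even this "local" move can change the output arbitrarily.

    import math, random

    def mh_step(tree, log_score, regenerate_random_subtree):
        """One MH step over syntax trees: resample a random subtree from the
        grammar, accept with the usual ratio. A real sampler also needs the
        proposal-density correction, omitted here for brevity."""
        proposal = regenerate_random_subtree(tree)   # assumed helper
        if math.log(random.random()) < log_score(proposal) - log_score(tree):
            return proposal                          # accept
        return tree                                  # reject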

You mention hyperparameters, but what would those be? If you're building a probabilistic grammar, they might be things like emission probabilities... but tweaking the proportion of "if" statements you produce, as opposed to, say, function calls, is hardly going to make you better at sampling from the set of correct programs.

In general, the idea of meta-learning is a powerful one, but without a way to guide the search, I don't see it as feasible.


" There is no reason to expect programs which are almost correct to be syntactically close to programs which aren't. Changing a single bit in a program tends to radically alter its behavior, in a very non linear way. Program space does not map smoothly to output space."

You correctly identify the main hurdle. I admit I am not sure it will work.

However, I think it might be possible to design the programming language such that changing a few bits doesn't usually radically alter the output.

For example, let's say you are trying to learn a grammar that draws a blue irregular polygon. If there are primitives with fewer parameters that can get you an approximate output, say a circle, this opens a learning path that looks like "circle" -> "blue circle" -> "simple blue polygon" -> "complex blue polygon". In addition, if grammar rules that generate similar output can be clustered in the space of grammar rules, small jumps in that space may give you small output changes. Bayesian priors will naturally favor the simpler, fewer-parameter shapes first and pivot to more complex ones as enough information is learned, while, I think, placing those more complex rules close to the simple ones in the space of grammar rules. That is my hope, anyway. I got it working more or less as expected with a simple vision example like the one I just described.
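
The toy grammar behind that vision example was shaped roughly like this (all names are illustrative, not from a real library): primitives ordered by parameter count, with prior mass concentrated on the simpler ones.

    SHAPE_GRAMMAR = {
        "SHAPE": [
            (0.5, ("circle",          ["radius"])),             # 1 free parameter
            (0.3, ("regular_polygon", ["radius", "n_sides"])),  # 2 free parameters
            (0.2, ("polygon",         ["vertices"])),           # unbounded parameters
        ],
        "STYLE": [
            (0.7, ("fill",    ["color"])),
            (0.3, ("outline", ["color", "width"])),
        ],
    }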


>You could use an MCMC method, but then the problem is designing a proposal density. There is no reason to expect a correct program to be syntactically close to an almost-correct one: changing a single bit in a program tends to radically alter its behavior, in a highly nonlinear way. Program space does not map smoothly onto output space.

Ahaha, "this sounds like a job for topology!"





