Using a probabilistic program to generate and interpret a program is an interesting idea, but I think you're missing the main problem.
How do you sample from the set of programs that produce the correct (or approximately correct) output?
You could use rejection sampling, but that would take a very long time, as only a tiny fraction of all possible programs produce the desired output.
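To make this concrete, here's a toy sketch of rejection sampling over random programs (my own construction, not anything from this thread): programs are tiny arithmetic expression trees, and we keep drawing until one maps the input to the target output. The acceptance rate is driven by how rarely a random program happens to be correct.

```python
import random

def random_program(depth=0):
    """Generate a random expression tree over x, digits, and +, -, *."""
    if depth >= 3 or random.random() < 0.4:
        return random.choice(["x", random.randint(0, 9)])
    op = random.choice(["+", "-", "*"])
    return (op, random_program(depth + 1), random_program(depth + 1))

def run(prog, x):
    """Evaluate an expression tree on input x."""
    if prog == "x":
        return x
    if isinstance(prog, int):
        return prog
    op, a, b = prog
    a, b = run(a, x), run(b, x)
    return a + b if op == "+" else a - b if op == "-" else a * b

def rejection_sample(target_in, target_out, max_tries=1_000_000):
    """Draw programs until one produces target_out on target_in."""
    for tries in range(1, max_tries + 1):
        p = random_program()
        if run(p, target_in) == target_out:
            return p, tries
    return None, max_tries

random.seed(0)
prog, tries = rejection_sample(3, 10)
print(tries, prog)
```

Even in this tiny language the sampler typically burns through many candidates for a trivial target; for any realistically sized program space the acceptance rate collapses toward zero.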
You could use an MCMC method, but the problem there is designing a proposal density. There is no reason to expect programs that are almost correct to be syntactically close to programs that are fully correct. Changing a single bit in a program tends to radically alter its behavior, in a very nonlinear way. Program space does not map smoothly to output space.
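For reference, the standard move here is Metropolis-Hastings with a "regenerate a random subtree" proposal, scored by distance to the target output. This is my own toy sketch (reusing hypothetical expression-tree programs, and treating the proposal as roughly symmetric, which is an assumption), just to make the setup concrete:

```python
import math
import random

def random_program(depth=0):
    """Generate a random expression tree over x, digits, and +, -, *."""
    if depth >= 3 or random.random() < 0.4:
        return random.choice(["x", random.randint(0, 9)])
    op = random.choice(["+", "-", "*"])
    return (op, random_program(depth + 1), random_program(depth + 1))

def run(prog, x):
    """Evaluate an expression tree on input x."""
    if prog == "x":
        return x
    if isinstance(prog, int):
        return prog
    op, a, b = prog
    a, b = run(a, x), run(b, x)
    return a + b if op == "+" else a - b if op == "-" else a * b

def mutate(prog):
    """Proposal: regrow one randomly chosen subtree."""
    if not isinstance(prog, tuple) or random.random() < 0.3:
        return random_program()
    op, a, b = prog
    if random.random() < 0.5:
        return (op, mutate(a), b)
    return (op, a, mutate(b))

def log_score(prog, x, target):
    """Log-score peaked at programs whose output matches the target."""
    return -abs(run(prog, x) - target)

def mh(x, target, steps=20_000):
    """Metropolis chain over programs; returns the best program seen."""
    cur = random_program()
    cur_s = log_score(cur, x, target)
    best, best_s = cur, cur_s
    for _ in range(steps):
        prop = mutate(cur)
        prop_s = log_score(prop, x, target)
        # Metropolis accept rule (proposal assumed roughly symmetric).
        if prop_s >= cur_s or random.random() < math.exp(prop_s - cur_s):
            cur, cur_s = prop, prop_s
        if cur_s > best_s:
            best, best_s = cur, cur_s
    return best, best_s

random.seed(1)
best, score = mh(3, 10)
print(score, best)
```

The chain does find near-matches in this toy space, but notice that progress comes almost entirely from the full-regeneration moves, not from local edits: a one-node change to an almost-correct program usually throws the output far from the target, which is exactly the non-smoothness problem.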
You mention hyperparameters, but what would those be? If you're building a probabilistic grammar, they might be things like emission probabilities... but tweaking the proportion of "if" statements you produce, as opposed to, say, function calls, is hardly going to make you better at sampling from the set of correct programs.
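To spell out what those hyperparameters would be, here is a minimal probabilistic-grammar sketch (my construction; `p_if` and `p_call` are made-up knobs): the tunable quantities are just the production probabilities, which shift the mix of syntactic constructs without saying anything about semantic correctness.

```python
import random

def sample_stmt(p_if=0.3, p_call=0.5, depth=0):
    """Sample one statement; the production probabilities are the knobs."""
    u = random.random()
    if depth < 2 and u < p_if:
        return ("if", sample_stmt(p_if, p_call, depth + 1),
                      sample_stmt(p_if, p_call, depth + 1))
    if u < p_if + p_call:
        return ("call", random.choice(["f", "g", "h"]))
    return ("const", random.randint(0, 9))

random.seed(0)
print([sample_stmt() for _ in range(3)])
```

Raising `p_if` gives you more conditionals on average, but the distribution over *behaviors* is barely affected, which is the point of the objection.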
In general, the idea of meta-learning is a powerful one, but without a way to guide the search, I don't see it as feasible.
" There is no reason to expect programs which are almost correct to be syntactically close to programs which aren't. Changing a single bit in a program tends to radically alter its behavior, in a very non linear way. Program space does not map smoothly to output space."
You correctly identify the main hurdle. I admit I am not sure it will work.
However, I think it might be possible to design the programming language such that changing a few bits doesn't usually radically alter the output.
For example, say you are trying to learn a grammar to draw a blue irregular polygon. If there are primitives with fewer parameters that can produce an approximate output, say a circle, that opens up a learning path like "circle"->"blue circle"->"simple blue polygon"->"complex blue polygon". On top of that, if grammar rules that generate similar outputs can be clustered in the space of grammar rules, then small jumps in that space may give you small changes in output. Bayesian priors will naturally favor the simpler, fewer-parameter shapes first and pivot to more complex ones as enough information is learned, while, I think, placing those more complex rules close to the simple ones in the space of grammar rules. That is my hope, anyway. I got it working as expected-ish with a simple vision example like the one I just described.
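A minimal sketch of the prior part of this idea (my own toy, not the author's code; the rules and parameter counts are invented): a description-length-style prior that charges each parameter a fixed cost makes the simple primitive dominate early sampling, with complex rules paying their way in only when the data demands them.

```python
import math
import random

# Hypothetical shape rules with illustrative parameter counts.
RULES = {
    "circle":   {"params": 2},
    "triangle": {"params": 6},
    "polygon8": {"params": 16},
}

def log_prior(rule, alpha=0.3):
    """Description-length prior: each parameter costs alpha nats."""
    return -alpha * RULES[rule]["params"]

def sample_rule():
    """Sample a rule proportionally to its prior weight."""
    weights = {r: math.exp(log_prior(r)) for r in RULES}
    z = sum(weights.values())
    u = random.random() * z
    for r, w in weights.items():
        u -= w
        if u <= 0:
            return r
    return r

random.seed(0)
counts = {r: 0 for r in RULES}
for _ in range(10_000):
    counts[sample_rule()] += 1
print(counts)
```

With these made-up costs the circle is drawn roughly three quarters of the time and the 8-gon about one percent, which is the "start simple, earn complexity" behavior described above.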