AlphaCode does some analysis of training data copying in their paper (Sections 6...

ahgamut · on Feb 7, 2022

> It does not seem to be copying from the training data in any meaningful way.

My point is, I would like to verify this claim with different metrics, because we probably have different interpretations of the word "meaningful".

AlphaCode measures similarity between programs via longest common substrings. That's better than nothing, but that would mean two programs that differ only in variable naming would not be considered similar. If two programs differed only in the names of the variables/functions, I would consider that copying.

I think there are better comparisons of structural similarity: compare the ASTs, or the bytecode/assembly code generated, the control flow graphs, or perform SSA and compare the blocks generated. Each of these might have weaknesses as well, but they won't be as obvious as variable renaming, and so we'd get a better idea of what AlphaCode is copying, and therefore a better idea of its full capabilities.

I expect AlphaCode performs well on Python because the training data is dominated by Python, but Python isn't ideal for comparing program structure. I wonder which programming language (given enough training data) would be best suited for language model generation and analysis.

moyix · on Feb 8, 2022

Sure, I think this would be a really interesting study to do! I have been meaning to scrape a big chunk of GitHub in order to do this kind of analysis. I think I would be surprised (just based on my own use of Codex and Copilot) if they were copying at the level of "same code but renamed variables" either. Past that I think comparisons get pretty difficult, and to some degree I would actually be more impressed if it understood enough about programs to be able to mimic higher-level structure without just memorizing text.

Why do you think Python isn't good for comparing program structure? Certainly you could compare ASTs and bytecode pretty easy; I think it's actually much easier to do so than in C/C++ since you don't have to deal with preprocessor junk and compile flags influencing the meaning of the code. There's less available for classical analyses like data and control flow analysis, in part because those are much harder in dynamic languages, but there are some tools out there like the ones used in PyPy for getting SSA CFGs [1].

I feel like many people are equivocating a bit on what they mean by "regurgitating" here. It seems clear that the models have at best a shaky grasp on code semantics (e.g., they are bad at things like predicting what the output of some code will be: https://arxiv.org/abs/2112.00114), and struggle with problems that are very different from anything they've seen before.

[1] https://foss.heptapod.net/pypy/pypy/-/tree/branch/default/rp...

ahgamut · on Feb 8, 2022

> I would actually be more impressed if it understood enough about programs to be able to mimic higher-level structure without just memorizing text.

Agreed.

> Why do you think Python isn't good for comparing program structure?

From recent experience I prefer control flow analysis, or something that results in a graph structure. As you said, that's harder with dynamic languages. I also think some Python features (english-like syntax, f-strings, division converts int to float, whitespace indentation, loops vs generator expressions) make structural comparisons messy, but that may just be bias.

The ideal language would be one with minimal syntax, where we can target a decent range of programs, and obtain as much info as possible about program structure without actually running the program. I've come across LISP-without-macros in Dreamcoder (https://arxiv.org/abs/2006.08381), BF++ (https://arxiv.org/abs/2101.09571), and a couple of others which I can't remember right now. I think the APL family (APL/J/K) would interesting because fewer characters to generate, but each character has a lot of meaning.

Right now I'm looking at flow-based programming (FBP) for this: In FBP the control flow is explicit - the program code describes a directed acyclic graph (DAG), so comparing program structure becomes straightforward (subgraph isomorphism with some heuristics). I'm writing a toy FBP language that draws images (https://github.com/mayahq/flatland), with which I aim to test what these models understand.