I tried a basic version of that 1.5 weeks ago. At first it was generating random...

Retric · on March 18, 2013

If you actually want to go down this path, then you want a language with as little redundancy as possible. Comments may be useful to humans, but they needlessly complicate the search space. Abstractly, you want to think about how your language is parsed so that you don't, for example, try different variable names.

Though if you spend enough time, you're probably going to be reimplementing some type of http://en.wikipedia.org/wiki/Genetic_programming

omra · on March 18, 2013

[Slash/A](https://github.com/arturadib/slash-a) is specially designed for genetic programming / random generation. No matter how you piece together the instructions, it is still a valid program.
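To illustrate the "every sequence is a valid program" property, here is a minimal Python sketch. The four-instruction accumulator machine below is a made-up toy, not Slash/A's actual instruction set; the names `run`, `program` and `result` are hypothetical. The point is only that every byte decodes to some instruction and no instruction can fail, so random generation never wastes time on invalid programs.

```python
import random

# Hypothetical four-instruction accumulator machine (NOT Slash/A's real
# instruction set). Every byte maps to some instruction and no instruction
# can fail, so any byte string is a runnable program.
def run(program: bytes, x: int) -> int:
    acc = x
    for byte in program:
        op, arg = byte % 4, byte // 4  # decode: opcode and small operand
        if op == 0:
            acc += arg
        elif op == 1:
            acc -= arg
        elif op == 2:
            acc *= 2
        else:
            acc = -acc
    return acc

rng = random.Random(0)
program = bytes(rng.randrange(256) for _ in range(10))
result = run(program, 1)  # always succeeds, whatever the bytes are
```

Because the toy has no loops or jumps, every program also terminates; a real instruction set with control flow would need a step limit on top of this.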

Falling3 · on March 18, 2013

Like maybe Brainfuck?

dmitshur · on March 18, 2013

Yep, I was considering directly generating the AST.

nathell · on March 18, 2013

Try this in Forth, or Factor. And be sure to read up on genetic programming.

http://en.wikipedia.org/wiki/Genetic_programming

dmitshur · on March 18, 2013

I will read more about GP for future attempts, thanks.

recuter · on March 18, 2013

This is an exciting exchange for me because, all joking aside, I had this idea in the past and believe it should be a more active area of research.

Why Forth?

I doubt it is possible to solve "serious" problems with this approach, but there is a whole class of work that is solved by mediocre programmers with copy-pasting. We strive to automate other jobs, why not these? :)

We should come up with some sort of Turing-like test for this. Like some small, simple Wordpress job.

vidarh · on March 19, 2013

Genetic programming is an active area of research, and there have been a number of successes. See John R. Koza's work: http://www.genetic-programming.com/johnkoza.html for starters.

So far, at least, it's not really suitable for replacing work done by mediocre programmers, because the setup cost is so high: the hard work is defining the fitness function, the symbol set and other factors. To pay off, this requires problems where it is easier to recognise a good result than to write the algorithm that achieves it.

E.g. a sort function does not fall in that space: Once you've specified how you want your data sorted, you've usually done most of the work.

But once you've specified the fitness function sufficiently well, and figured out the inputs etc., there are a lot of other search algorithms that often will perform better.

I'm very fascinated by GP too (though I've never had time to truly delve into it), but without combining it with mechanisms to take a large chunk of the specification work out of the equation, it remains confined to fairly specific types of problems.
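To make the setup cost concrete, here is a minimal Python sketch of what "defining the fitness function and symbol set" looks like for a toy problem. Everything in it is an assumption chosen for illustration: the symbol set, the target behaviour (x*x + 1), and the names `run`, `fitness`, `mutate` and `evolve`. It is also a mutation-only evolutionary search over flat token lists with truncation selection, which is simpler than tree-based GP with crossover.

```python
import random

SYMBOLS = ["x", "1", "dup", "+", "*"]           # the symbol set (an assumption)
CASES = [(x, x * x + 1) for x in range(-5, 6)]  # target behaviour: x*x + 1

def run(program, x):
    """Evaluate a token list on a stack; tokens that would underflow are skipped."""
    stack = []
    for tok in program:
        if tok == "x":
            stack.append(x)
        elif tok == "1":
            stack.append(1)
        elif tok == "dup" and stack:
            stack.append(stack[-1])
        elif tok in ("+", "*") and len(stack) >= 2:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if tok == "+" else a * b)
    return stack[-1] if stack else 0

def fitness(program):
    """Total absolute error over the test cases; 0 means every case passes."""
    return sum(abs(run(program, x) - want) for x, want in CASES)

def mutate(program, rng):
    """Replace one randomly chosen token with a random symbol."""
    child = list(program)
    child[rng.randrange(len(child))] = rng.choice(SYMBOLS)
    return child

def evolve(generations=200, pop_size=50, length=5, seed=0):
    rng = random.Random(seed)
    pop = [[rng.choice(SYMBOLS) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        if fitness(pop[0]) == 0:
            break
        survivors = pop[: pop_size // 2]  # truncation selection
        pop = survivors + [mutate(p, rng) for p in survivors]
    return min(pop, key=fitness)

best = evolve()
```

Note that `CASES` already pins down the answer almost completely, which is the point made above: once the fitness function is specified this precisely, most of the work is done, and writing `x dup * 1 +` by hand is quicker than searching for it.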

RodgerTheGreat · on March 18, 2013

Forth, Factor or any other concatenative language (and APL, which is not concatenative but supports a similar tacit style) provides the benefit of a "point free" style in which there are no explicitly named variables; that's one way to reduce the possibility space of programs. In the case of Factor (or Forth with an appropriate DSL) you could further use type information to ensure that your program generator only used words in sequence whose stack effects match up properly, i.e. a 'valid' program. Concatenative languages also tend to have an extremely simple grammar: Forth is just a sequence of tokens and numbers.
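A rough Python sketch of the stack-effect idea, under simplifying assumptions: it tracks only stack depth (how many items each word consumes and produces), not the full type information Factor has, and the `EFFECTS` table, `generate` and `final_depth` are hypothetical names for illustration. The generator only picks words that cannot underflow and that still leave enough steps to finish with exactly one item on the stack.

```python
import random

# (items consumed, items produced) for a few Forth-style words.
EFFECTS = {
    "dup": (1, 2), "drop": (1, 0), "swap": (2, 2),
    "+": (2, 1), "mod": (2, 1), "0=": (1, 1), "or": (2, 1),
    "3": (0, 1), "5": (0, 1),
}

def generate(length, rng, start_depth=1, end_depth=1):
    """Random word sequence that never underflows and ends at end_depth (>= 1)."""
    words, depth = [], start_depth
    for i in range(length):
        remaining = length - i - 1
        choices = []
        for word, (ins, outs) in EFFECTS.items():
            new_depth = depth - ins + outs
            # Keep the word only if it cannot underflow and the target depth
            # is still reachable (depth changes by at most 1 per word).
            if depth >= ins and abs(new_depth - end_depth) <= remaining:
                choices.append((word, new_depth))
        word, depth = rng.choice(choices)
        words.append(word)
    return words

def final_depth(words, depth=1):
    """Return the stack depth after running words, or None on underflow."""
    for word in words:
        ins, outs = EFFECTS[word]
        if depth < ins:
            return None
        depth += outs - ins
    return depth

program = generate(8, random.Random(0))
```

Every sequence this produces takes one input and leaves one output, so the search never spends time on programs that would crash with a stack underflow; whether the output is *useful* is still up to the fitness function.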

fizx · on March 18, 2013

Perhaps worth noting: Java bytecode is a concatenative language. There are probably more commercially interesting Java corpora than Forth corpora.

If you want to get paid to play around with this thought, my contact info is in my HN profile.

RodgerTheGreat · on March 18, 2013

Java bytecode is stack-based, but it isn't really concatenative. The term "concatenative" refers to how code fragments can be composed via concatenation. This means that simple textual substitution is sufficient for inlining code or breaking code into functions. For example, consider a Forth word which determines whether a number is divisible by three or five:

  : fizzbuzzable  dup 3 mod 0= swap 5 mod 0= or ;

There's a repeated pattern here, so we can textually excise it and make it into a named word without changing any of the structure of the surrounding program:

  : /?            mod 0=                ;
  : fizzbuzzable  dup 3 /? swap 5 /? or ;

Or we could break it down a different way by excising different fragments:

  : /3?           3 mod 0=            ;
  : /5?           5 mod 0=            ;
  : fizzbuzzable  dup /3? swap /5? or ;

(Obviously not the best real-world example, but hopefully it illustrates the idea.)
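For anyone without a Forth at hand, the same point can be checked with a tiny Python sketch. The interpreter below is a hypothetical toy supporting only the handful of words used above, with user-defined words stored as plain source text and expanded on use; it is not a real Forth. It shows that all three factorings compute the same result, precisely because defining a word is just naming a substring.

```python
def run(source, stack, words=None):
    """Run a space-separated token string against a stack (a list of ints)."""
    words = words or {}
    stack = list(stack)
    for tok in source.split():
        if tok in words:                  # user-defined word: run its body
            stack = run(words[tok], stack, words)
        elif tok == "dup":
            stack.append(stack[-1])
        elif tok == "swap":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif tok == "mod":
            b, a = stack.pop(), stack.pop()
            stack.append(a % b)
        elif tok == "0=":                 # Forth convention: true is -1
            stack.append(-1 if stack.pop() == 0 else 0)
        elif tok == "or":
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)
        else:
            stack.append(int(tok))        # anything else is a number
    return stack

ORIGINAL = "dup 3 mod 0= swap 5 mod 0= or"
FACTORED_A = ("dup 3 /? swap 5 /? or", {"/?": "mod 0="})
FACTORED_B = ("dup /3? swap /5? or", {"/3?": "3 mod 0=", "/5?": "5 mod 0="})

for n in range(1, 31):
    expected = run(ORIGINAL, [n])
    assert run(FACTORED_A[0], [n], FACTORED_A[1]) == expected
    assert run(FACTORED_B[0], [n], FACTORED_B[1]) == expected
```

Note that `run` never inspects what a definition contains: the bodies are arbitrary fragments cut out of the original token stream, which is exactly what does not work in a language with named locals.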

Java bytecode, on the other hand, uses local variable references and activation records. This means that inlining or breaking out a procedure has pretty much the same problems as inlining or breaking out a procedure manually in C: variable names may clash, new arguments have to be threaded around, parts of expressions may need to be stored in temporary variables, etc. The JVM additionally enforces many constraints on "well-formed" bytecode at class load time[1], which could make it hard to generate valid programs by chance. Overall, trying to "harvest" Java bytecode from the wild could be useful, but I think that would be much harder than it sounds at first.

[1] http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.htm... (and below)

randomsearch · on March 19, 2013

"...believe it should be a more active area of research."

8000 papers so far:

http://www.cs.bham.ac.uk/~wbl/biblio/

morphics · on March 18, 2013

If you want to learn about GP, the Genetic Programming Field Guide - http://www.gp-field-guide.org.uk/ - is an awesome book, it taught me a lot. In fact, I liked it so much, I bought a hard copy. Highly recommended!

randomsearch · on March 19, 2013

I'd also recommend Wolfgang Banzhaf's book on GP:

http://www.amazon.co.uk/Genetic-Programming-Introduction-Art...

veaviticus · on March 18, 2013

Check out this article. Same concept, using genetic programming

http://www.primaryobjects.com/CMS/Article149.aspx

dmitshur · on March 18, 2013

That's pretty sweet, thanks.

Yeah, I can see why using Brainfuck is a good idea. You're basically restricting yourself to generating only the programs that compile rather than wasting time on gibberish.

errnoh · on March 18, 2013

Hah, fun idea.

Modified it a bit to run in parallel. I guess I'll leave it on for a couple of hours and see if it gets anywhere.

current output:

    1363637914 2013-03-18 22:18:34.118051 +0200 EET Stats: 0/53659699 (0%) good/tries, 223559.91320127275 ops/sec

dmitshur · on March 18, 2013

Cool, let me know the best you get (and submit a PR with your parallel patch if you want).

You can tweak the range of the number of generated characters, the length of the Markov chains, the minimum number of main body clauses, etc. With my original config, it should give you a valid program roughly once every one to a few hours.
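For readers unfamiliar with the "length of the Markov chains" knob, here is a minimal Python sketch of character-level Markov generation. It is not the code from the project being discussed; the functions `train` and `generate`, the toy corpus and the parameter names are all assumptions for illustration. The `order` parameter is the chain length: how many preceding characters are used to pick the next one.

```python
import random
from collections import defaultdict

def train(corpus, order):
    """Map each `order`-character context to the characters that follow it."""
    model = defaultdict(list)
    for i in range(len(corpus) - order):
        model[corpus[i:i + order]].append(corpus[i + order])
    return model

def generate(model, seed_text, length, rng):
    """Extend seed_text one character at a time, sampling from the model."""
    order = len(seed_text)
    out = seed_text
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:  # context never seen in the corpus: stop early
            break
        out += rng.choice(followers)
    return out

# Toy stand-in corpus; a real run would train on actual source files.
corpus = "print(a)\nprint(b)\nprintln(c)\n"
model = train(corpus, order=3)
sample = generate(model, "pri", 40, random.Random(0))
```

A longer chain makes the output look more like the corpus (and more likely to compile) but less novel; a shorter one explores more but produces mostly gibberish, which is presumably the trade-off behind tuning it.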