>Interesting to think about what structures human intelligence has that these models don't.
If we get into the gritty details of what gradient descent is doing, we've got a "frame", i.e. a matrix or some array of weights that contains the possible solutions to a problem; then, given training data, we adjust those weights to minimize a loss function, fitting a probability distribution so that our solution takes shape inside the "frame". That works for something like image recognition, where the "frame" is just the matrix of pixels, or in language models, where we're trying to find the next word-vector given a preceding input.
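The loop described above can be sketched in a few lines. The concrete setup here (a linear model fit by minimizing mean squared error with NumPy) is my own illustrative choice, not anything specific from the post; the point is just that the "frame" is a fixed array of weights and learning is iterative adjustment within it:

```python
import numpy as np

# Minimal sketch of gradient descent: fit weights w so that X @ w
# approximates y by minimizing mean squared error. The "frame" is the
# fixed-shape weight vector w; (X, y) plays the role of training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)   # initial guess inside the "frame"
lr = 0.1          # learning rate
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of MSE w.r.t. w
    w -= lr * grad

print(np.round(w, 3))  # converges toward true_w
```

Note what the loop can and cannot do: it searches within a pre-specified parameter space, which is exactly the limitation at issue in the Hamilton example below.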
But take something like what Sir William Rowan Hamilton was doing back in 1843. He knew that complex numbers could be represented as points in a plane and that arithmetic could be performed on them, and he wanted to extend this in a similar way to points in space. With triples it is easy to define addition, but the problem was multiplication. In the end he made an intuitive jump, a piece of pattern recognition: he realized he could easily define multiplication using quadruples instead, and thus was born the quaternion that's a staple of 3D graphics today.
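For concreteness, here is a minimal sketch of the multiplication Hamilton found on quadruples, following his rules i² = j² = k² = ijk = −1 (the function name and tuple representation are just illustrative choices, not from the original story):

```python
# Hamilton's quaternion product on quadruples (w, x, y, z),
# derived from the rules i^2 = j^2 = k^2 = ijk = -1.
def quat_mul(a, b):
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 + y1*w2 + z1*x2 - x1*z2,
        w1*z2 + z1*w2 + x1*y2 - y1*x2,
    )

i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
k = (0, 0, 0, 1)
print(quat_mul(i, j))  # i*j = k  -> (0, 0, 0, 1)
print(quat_mul(j, i))  # j*i = -k -> (0, 0, 0, -1)
```

The price of making multiplication work was giving up commutativity (i·j ≠ j·i), which is itself the kind of trade-off that required a conceptual leap, not a search within a fixed solution space.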
If we want to generalize this kind of problem solving into something gradient descent can solve, where do we even start? First of all, we don't even know whether a solution is possible or coherent, or what "direction" we're heading in. It isn't a systematic solution; rather, a pattern from one branch of mathematics was recognized in another. So perhaps you might reach for something like Category Theory, but then how are we going to represent that in terms of numbers and convex functions, and is Category Theory even practical enough to do this easily?