
There is also optical neuromorphic computing, as an alternative to electronic neuromorphic computing like memristors. It's a fascinating field, where you use optical signals to perform analog computing. For example:

https://www.nature.com/articles/s41566-020-00754-y

https://www.nature.com/articles/s44172-022-00024-5

As far as I understand, you can only compute fairly small neural networks before the noise gets too large, and only a very limited set of computations works well in photonics.


The issue with optical neuromorphic computing is that the field has been doing the easy part, i.e. the matrix multiplication. We have known for decades that imaging/interference networks can do matrix operations in a massively parallel fashion. The problem is the nonlinear activation function between your layers. People have largely been ignoring this, or have just converted back to the electrical domain (at which point you are limited again by the cost/bandwidth of the electronics).

Seems hard to imagine there’s not some non-linear optical property they could take advantage of

The problem is intensity/power. As discussed previously, photon-photon interactions are weak, so you need very high intensities to get a reasonable nonlinear response. The issue is that optical matrix operations work by spreading the light out over many parallel paths, i.e. reducing the intensity in each path. There might be some clever ways to overcome this, but so far everyone has avoided that problem. They say they did "optical deep learning", but what they really did was an optical matrix multiplication; saying that, however, would not have resulted in a Nature publication.
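
To put rough numbers on that (my own back-of-the-envelope sketch, not from any of the linked papers): with a fixed optical power budget, spreading the light over N parallel paths leaves roughly 1/N of the power per path, which is exactly what starves an intensity-dependent nonlinearity.

    # Toy illustration of the power-splitting problem (assumed numbers).
    P_total = 1.0                  # total optical power at the input (arbitrary units)
    for N in (8, 64, 512, 4096):   # fan-out of the matrix-multiply network
        p_path = P_total / N       # power left in each parallel path
        print(f"N={N:5d}  power per path ~ {p_path:.1e}")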

There is, and people have trained purely optical neural networks:

https://arxiv.org/abs/2208.01623

The real issue is trying to backpropagate through those nonlinear optics. You need a second nonlinear optical component that matches the derivative of the first one. In the paper above, they approximate the derivative by slightly perturbing the parameters, but that means the training time scales linearly with the number of parameters in each layer.

Note: the authors claim it takes O(sqrt(N)) time, but they're forgetting that the learning rate mu = o(1/sqrt(N)) if you want to converge to a minimum:

    Loss(theta + dtheta) = Loss(theta) + dtheta . grad Loss(theta) + O(|dtheta|^2)
                         = Loss(theta) + mu * sqrt(N) * C    (assuming Lipschitz continuity)
    ==>     min(Loss)    = mu * sqrt(N) * C / 2
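
For context, the parameter-perturbation training described above looks roughly like the following finite-difference sketch (my own illustration, not the paper's code); the per-parameter loop is why one gradient estimate costs O(N) loss evaluations for N parameters:

    import numpy as np

    def loss(theta):
        # stand-in for the measured loss of the optical network at parameters theta
        return float(np.sum((theta - 1.0) ** 2))

    def perturbation_gradient(theta, eps=1e-3):
        # estimate dLoss/dtheta by nudging one parameter at a time;
        # each parameter needs its own forward evaluation
        grad = np.zeros_like(theta)
        base = loss(theta)
        for i in range(theta.size):        # <-- the O(N) loop
            nudged = theta.copy()
            nudged[i] += eps
            grad[i] = (loss(nudged) - base) / eps
        return grad

    theta = np.zeros(8)
    for _ in range(100):
        theta -= 0.1 * perturbation_gradient(theta)
    print(theta)  # converges toward the minimum at 1.0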

His goal was not just to solve Atari games. That was already done.

His goal is to develop generic methods. You could work with more complex games or the physical world for that, as that is what you want in the end. However, his insight is that you can even modify the Atari setting to test this, e.g. to work in real time, and the added complexity of more complex games doesn't really give you any new insights at this point.


But how is this different from what NVIDIA has already done? They have robots that can achieve arbitrary and fluid actions in the real world by training NNs in very accurate GPU-simulated environments using physics engines. Moving a little Atari stick around seems like not much compared to sorting through your groceries etc.

The approach NVIDIA (and other labs) are using clearly works. It's not going to be more than a year or two now before robotics is as solved as NLP and chatbots are today.


I think he argues that they would not be able to play Atari games this way (I don't know; maybe I also misunderstood).

But also, he argues a lot about sample efficiency. He wants to develop algorithms/methods/models which can learn much faster / with much less data.


> Google's first LLM to use diffusion in place of transformers.

But this is a wrong statement? Google never made this statement? You can have Transformer diffusion models. Actually, Transformers are very standard for all of the discrete diffusion language models, so I would expect Gemini Diffusion also uses Transformers.

Edit: Ah sorry, I missed that this was already addressed and also linked in the post: https://news.ycombinator.com/item?id=44057939 Maybe the rest of my post is still useful to some.

The difference is that it's an encoder-only Transformer, not a decoder-only Transformer. I.e. it gets fed a full (but noisy/corrupted) sequence and predicts the full correct sequence. And then you can iterate on that. All positions in the sequence can be calculated in parallel, and if you need only a few iterations, this is faster than the sequential decoding in decoder-only models (although speculative decoding also gets you some speedup for similar reasons). Those discrete diffusion models / encoder-only Transformers are usually trained with BERT-like masking, but that's actually an active field of research. It's really a pity that they don't provide any details here (on training and modeling).
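
As a rough sketch of what such an iterative parallel decoding loop can look like (hypothetical pseudocode for a generic masked discrete diffusion model; `model`, `mask_id` and the unmasking schedule are my assumptions, not details of Gemini Diffusion):

    import torch

    def diffusion_decode(model, prompt_ids, seq_len=64, num_steps=8, mask_id=0):
        # Start from a fully masked sequence, conditioned on the prompt.
        ids = torch.full((1, seq_len), mask_id)
        ids[0, :prompt_ids.numel()] = prompt_ids
        still_masked = ids[0] == mask_id

        for step in range(num_steps):
            logits = model(ids)                        # one parallel forward pass over all positions
            probs, preds = logits.softmax(-1).max(-1)  # per-position confidence and prediction
            remaining = int(still_masked.sum())
            if remaining == 0:
                break
            k = max(1, remaining // (num_steps - step))  # unmask this many positions now
            conf = torch.where(still_masked, probs[0], torch.tensor(-1.0))
            keep = conf.topk(k).indices                  # most confident still-masked positions
            ids[0, keep] = preds[0, keep]
            still_masked[keep] = False
        return ids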

I wonder how this relates to Gemini. Does it use the same modeling? Was the model checkpoint even imported from Gemini, and then further finetuned for discrete diffusion? Or knowledge distillation? Or is it just branding?


4GL and 5GL are already taken. So this is the 6GL.

https://en.wikipedia.org/wiki/Programming_language_generatio...

But speaking more seriously, how do you make this deterministic?


Fair enough, I should have taken a look; I stopped counting when the computer-magazine buzz about 4GLs faded away.

Probably some kind of formal-methods-inspired approach, declarative maybe, with less imperative coding.

We should take an Alan Kay and Bret Victor-like point of view of where AI-based programming is going to be a decade from now, not where it is today.


That future is far from inevitable; the first question we SHOULD ask is whether it's a good idea to go down this path.

It seems the main complaint is about confusing shapes / dimensions. Xarray has already been mentioned, but this is a broader concept, called named dimensions, sometimes also named tensors, named axes, labeled tensors, etc., which has often been proposed before, and many implementations exist (a small usage sketch follows the list of links below):

https://nlp.seas.harvard.edu/NamedTensor

https://namedtensor.github.io/

https://docs.pytorch.org/docs/stable/named_tensor.html

https://github.com/JuliaArrays/AxisArrays.jl

https://github.com/ofnote/tsalib

https://github.com/google-deepmind/tensor_annotations

https://github.com/google-deepmind/penzai

https://github.com/tensorflow/mesh/

https://github.com/facebookresearch/torchdim

https://xarray.dev/

https://returnn.readthedocs.io/en/latest/getting_started/dat... (that's my own development; unfortunately the doc is very much outdated on that)
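
A minimal example of the idea, using xarray (just one of the libraries above; the dimension names here are made up for illustration):

    import numpy as np
    import xarray as xr

    # A batch of token embeddings with named dimensions instead of bare axis numbers.
    x = xr.DataArray(np.random.rand(2, 5, 8), dims=("batch", "time", "feature"))

    # Reductions and broadcasting refer to names, not positions, so a transposed
    # or reordered array cannot silently change the meaning of an operation.
    mean_over_time = x.mean(dim="time")    # dims: (batch, feature)
    normalized = x / x.sum(dim="feature")  # broadcasts by dimension name
    print(mean_over_time.dims, normalized.dims)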


Someone just needs to write a new named multi-dimensional array processing library in D with Mir, since it's more intuitive and faster than the venerable Fortran libraries widely used by array languages including Matlab, Julia, etc. [1], and base the implementation on the methodology presented in the seminal book by Hart [2]. Problem solved.

[1] Numeric age for D: Mir GLAS is faster than OpenBLAS and Eigen (2016):

http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...

[2] Multidimensional Analysis: Algebras and Systems for Science and Engineering (1993):

https://link.springer.com/book/10.1007/978-1-4612-4208-6


Just by the mathematical definition of interpolation (https://en.wikipedia.org/wiki/Interpolation), any function which approximates the given regression points (training data), and is defined for values in between (unseen data), will interpolate it. Maybe you are thinking of linear interpolation specifically, but there are many types of interpolation, and any mathematical function, any neural network, is just another form of interpolation.

Interpolation is also related to extrapolation. In higher dimensional spaces, the distinction is not so clear. In terms of machine learning, you would call this generalization.

The question is rather whether it is a good type of interpolation/extrapolation/generalization. You measure that on a test set.

And mathematically speaking, your brain is also just doing another type of interpolation/extrapolation.
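
A toy illustration of that point (my own example): two very different functions both "interpolate" the same training points, and they only really start to disagree once you leave the training range, i.e. once you extrapolate.

    import numpy as np

    xs = np.array([0.0, 1.0, 2.0, 3.0])   # training data ("regression points")
    ys = np.sin(xs)

    linear = lambda x: np.interp(x, xs, ys)       # piecewise-linear interpolation
    poly = np.poly1d(np.polyfit(xs, ys, deg=3))   # cubic polynomial through the same points

    print(linear(1.5), poly(1.5))   # inside the training range: both are "interpolation"
    print(linear(5.0), poly(5.0))   # outside: extrapolation, where they disagree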


Are there any details on what they changed to improve over other existing models?


It links to this file: https://drive.google.com/file/d/1j1ofmm8iBaVreGC5TSF1oLsrOqB...

But all pages except the first seem to be missing from this PDF? There is just an abstract and a (partial) outline.


That link is dead now?

But I guess you mean this? https://www.sciencedirect.com/science/article/pii/S000437022...

"Reward is enough", David Silver, Satinder Singh, Doina Precup, Richard S. Sutton, 2021.


Why do you say FlexAttention is too buggy? I have heard about a lot of successful uses of it, and never heard about any such problems.

Also note that, depending on your model dimensions and sequence lengths, the attention computation often plays only a minor role (maybe 10% overall or so), and the MLP computation dominates.
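
A back-of-the-envelope estimate of that claim (standard per-token FLOP counting for one Transformer layer with a 4x MLP expansion; the example dimensions are my own assumptions):

    d = 4096        # model dimension
    n = 2048        # sequence length

    attn_scores = 4 * n * d   # QK^T plus attention-weighted sum of V, per token
    attn_proj   = 8 * d * d   # Q, K, V and output projections, per token
    mlp         = 16 * d * d  # the two matmuls of the 4x-expanded MLP, per token

    total = attn_scores + attn_proj + mlp
    print(f"quadratic attention part: {attn_scores / total:.1%}")  # ~8% at these sizes
    print(f"MLP part:                 {mlp / total:.1%}")          # ~62%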


Last time I tried it, I encountered both showstopper bugs (it was completely, obviously broken) and subtle correctness bugs (it looked like it was working, but since I'm paranoid I have unit tests for everything, and numerically the errors were too big compared to what you'd get with eager attention or Flash Attention). It was also too slow for my taste compared to Flash Attention, so I just dropped it. And I wasn't even doing anything super exotic with it.

Maybe it's better now, but I'd still consider using FlexAttention without a corresponding unit test checking its accuracy against an equivalent eager implementation completely irresponsible.


What unit tests do you use for nn modules and how do you come up with them?


Unit tests which test random inputs across different sizes (e.g. with different numbers of heads, head sizes, embedding dimensions, etc.) and compare two different implementations' outputs to each other (e.g. attention implemented manually in an eager fashion vs a bunch of accelerated attention libraries).

Also more integration-like tests, where I take an already pretrained model, load it using an established library (e.g. Huggingface Transformers), load the very same checkpoint into my reimplementation (where I vary the implementation, e.g. swap the attention implementation), and compare the outputs. Funnily enough, I recently even found a bug in HF's Transformers this way when I updated to a newer version and my previously matching output was not matching anymore.
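
A minimal sketch of the first kind of test (not the parent's actual code; here PyTorch's scaled_dot_product_attention stands in for whichever accelerated kernel is under test):

    import torch

    def eager_attention(q, k, v):
        # reference implementation: softmax(Q K^T / sqrt(d)) V
        scale = q.shape[-1] ** -0.5
        attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        return attn @ v

    def test_attention_matches_eager():
        torch.manual_seed(0)
        for heads in (1, 4, 8):
            for head_dim in (32, 64):
                for seq_len in (7, 128):
                    q, k, v = (torch.randn(2, heads, seq_len, head_dim) for _ in range(3))
                    expected = eager_attention(q, k, v)
                    # swap in the implementation under test here (FlexAttention, xformers, ...)
                    actual = torch.nn.functional.scaled_dot_product_attention(q, k, v)
                    torch.testing.assert_close(actual, expected, atol=1e-4, rtol=1e-4)

    test_attention_matches_eager()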


I would like to know too

