
I've sort of run adjacent to the field of machine learning in the last few years, but haven't really dived into the existing literature. This seems to be a pretty interesting overview.

Out of curiosity, do many implementations of convolutional neural networks take advantage of FFT, DCT, or some other fast orthonormal transform to compute the transition between layers, or are the kernel sizes small enough that there isn't a great advantage to that?



Facebook actually does something like that: https://research.facebook.com/blog/879898285375829/fair-open...

They have a patent on it but did open-source the code. They claim it's up to 24x faster than the standard approach, but that's only true for an extreme use case; it's only 2x faster on average.


The bigger the convolution kernel, the bigger the speedup, because a convolution in real space is a multiplication in frequency space.
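(Rough sketch of that equivalence, just the convolution theorem in NumPy/SciPy rather than anyone's actual conv layer, glossing over padding/cropping details:)

    import numpy as np
    from scipy.signal import convolve2d

    rng = np.random.default_rng(0)
    image = rng.standard_normal((64, 64))
    kernel = rng.standard_normal((5, 5))

    # Direct (spatial) convolution, 'full' output.
    direct = convolve2d(image, kernel, mode='full')

    # Same thing via the convolution theorem: zero-pad both operands to the
    # full output size, transform, multiply pointwise, transform back.
    shape = (image.shape[0] + kernel.shape[0] - 1,
             image.shape[1] + kernel.shape[1] - 1)
    via_fft = np.real(np.fft.ifft2(np.fft.fft2(image, shape) *
                                   np.fft.fft2(kernel, shape)))

    print(np.allclose(direct, via_fft))  # True, up to floating-point error

The catch is the constant factor: direct convolution costs roughly N^2 * K^2 multiply-adds for an NxN image and KxK kernel, while the FFT route costs roughly N^2 * log N plus a pointwise product, so small kernels don't amortize the transforms.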

It breaks even at 5x5 or so and gets dramatically better shortly thereafter. However, most of the convolutional nets in use rely on 3x3 convolutions because I guess reasons:

http://arxiv.org/pdf/1409.1556.pdf (all 3x3)

http://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf (3x3 and 5x5)

There's probably a new ImageNet winner in this somewhere IMO...


One thing I wonder about: is it possible to somehow reflect symmetries in the input data in the structure of the neural network?

For example, the usual way to have a DNN learn rotation/scaling/translation is to do data augmentation and simply learn with all the data rotated/scaled/shifted.
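(For concreteness, that augmentation route looks roughly like the sketch below; the angle/scale/shift ranges are made up for illustration and nothing here is specific to any particular framework.)

    import numpy as np
    from scipy import ndimage

    rng = np.random.default_rng(0)

    def augment(image):
        """Return a randomly rotated, scaled, and shifted copy of `image`."""
        angle = np.deg2rad(rng.uniform(-30, 30))      # illustrative ranges
        scale = rng.uniform(0.8, 1.2)
        shift = rng.uniform(-4, 4, size=2)            # pixels
        c, s = np.cos(angle), np.sin(angle)
        matrix = np.array([[c, -s], [s, c]]) / scale  # rotation + isotropic scale
        center = (np.array(image.shape) - 1) / 2
        offset = center - matrix @ center + shift     # transform about the image center
        return ndimage.affine_transform(image, matrix, offset=offset, mode='nearest')

    image = rng.standard_normal((28, 28))
    batch = np.stack([augment(image) for _ in range(8)])  # 8 augmented training views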

But there must be a way to have these input-space symmetries reflected somehow in the structure of the network?

I tried googling this a bit but wasn't really successful - does anyone know whether this has been done?


So, convolution is by itself an attempt to exploit translation-invariance in the visual world, and typical deep convnets end up picking up a certain amount of scaling tolerance (though I would not call it invariance) by having features that are sensitive to larger and larger patches of the input as you go up the hierarchy of features. This is not real scale-invariance, and many people run a Laplacian pyramid of some sort at test time to get real scale-invariance when eking out the best possible numbers.
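(That test-time trick, in its most generic multi-scale-averaging form rather than any specific paper's recipe, looks something like this, with `model` standing in for any classifier that accepts variable input sizes and returns class probabilities:)

    import numpy as np
    from scipy import ndimage

    def predict_multiscale(model, image, scales=(0.75, 1.0, 1.25)):
        """Average a classifier's predictions over several rescaled copies of the input."""
        preds = [model(ndimage.zoom(image, s)) for s in scales]
        return np.mean(preds, axis=0)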

Rotation-invariance is probably not really a thing you want. The visual world is not, in fact, rotation-invariant: the "up" direction in Earth-bound, naturally-occurring images has different statistics than the "down" direction, and you'd like to exploit those differences. Animal visual systems are not rotation-invariant either; an entertainingly powerful demo of this is "the Thatcher Effect" (https://en.wikipedia.org/wiki/Thatcher_effect).

Reflection across a vertical axis, on the other hand, often is exploitable, at least in image recognition contexts (as opposed to, say, handwriting recognition). If you look at the features image recognition convnets are learning, they are often symmetric around some axis or other, or sometimes come in "pairs" of left-hand/right-hand twins. As far as I know nobody has tried to exploit this architecturally in any way other than just data augmentation, but it's a big world out there and people have been trying this stuff for a long time.


I was thinking more about a machine vision context with e.g. different parts coming in at any rotation angle.

I know that some translation invariance comes from e.g. the usual conv+maxpool layer structure, but doesn't the first hidden layer of the network still end up holding several representations, one for each translation shift?

Rotation in particular looks like something that should produce a lot of symmetry and shared parameters, but it also looks difficult enough that I'd rather hear from someone with serious math/group theory(?) skills who has looked at it.

But thank you for the detailed reply anyway!


The usual way to get translation invariance is through the structure of the network, resulting in what's called a convolutional neural network (it's the conv+pool combination that actually achieves this).
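(A quick sketch of the property that structure buys you, with scipy's correlate2d standing in for a conv layer: the conv part is exactly translation-equivariant, and pooling on top of that is what turns equivariance into approximate invariance.)

    import numpy as np
    from scipy.signal import correlate2d

    rng = np.random.default_rng(0)
    kernel = rng.standard_normal((3, 3))
    image = np.zeros((16, 16))
    image[5:8, 5:8] = rng.standard_normal((3, 3))   # a small pattern away from the borders

    fmap = correlate2d(image, kernel, mode='same')
    fmap_of_shifted = correlate2d(np.roll(image, 2, axis=1), kernel, mode='same')

    # Shifting the input just shifts the feature map (translation equivariance).
    print(np.allclose(np.roll(fmap, 2, axis=1), fmap_of_shifted))  # True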

There have been papers about scale-/rotation-invariant convnets (again at the structure level) and also networks that learn invariances without encoding them into the structure.


> There have been papers about scale-/rotation-invariant convnets (again at the structure level) and also networks that learn invariances without encoding them into the structure.

The former I am very interested in! Do you have any links?


CNNs aren't rotation-invariant on purpose. If they were, you would lose information about how features are oriented. Typically they learn features that are very sensitive to rotation, like vertical edges and horizontal edges.
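(As a toy illustration of that orientation sensitivity, here are hand-written Sobel-style stand-ins for the kind of filters a first conv layer typically learns; an image with a single vertical edge excites one and not the other.)

    import numpy as np
    from scipy.signal import correlate2d

    vertical_edge   = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    horizontal_edge = vertical_edge.T

    image = np.zeros((8, 8))
    image[:, 4:] = 1.0   # an image containing one vertical edge

    v = np.abs(correlate2d(image, vertical_edge,   mode='valid')).sum()
    h = np.abs(correlate2d(image, horizontal_edge, mode='valid')).sum()
    print(v, h)   # the vertical-edge filter responds strongly, the horizontal one not at all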


I am rather thinking about an on/off 'object detected' signal for objects at any rotation angle. Surely that symmetry must be exploitable somehow, through shared parameters in the DNN or similar?

My gut feeling is that the first convolutional layer's kernels, for example, would probably have a 'some are orthogonal' constraint due to this symmetry.


I don't really understand it, but I suspect there is some trick going on somewhere. I don't see how you can avoid doing at least one computation for each weight/pixel pair in each convolution.



I read it, I just don't understand it.


Most use an im2col/col2im-based implementation. I've been experimenting with CUDA-based FFT for ours as well, though.
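(For anyone reading along: the im2col trick is just unrolling every patch into a row so the whole convolution becomes one big matrix multiply. A toy stride-1, no-padding, single-channel sketch; real implementations also handle batches, channels, strides, padding, and use the col2im transpose for the backward pass.)

    import numpy as np

    def im2col(image, kh, kw):
        """Unroll every kh x kw patch of a 2-D image into a row (stride 1, no padding)."""
        H, W = image.shape
        out_h, out_w = H - kh + 1, W - kw + 1
        cols = np.empty((out_h * out_w, kh * kw))
        for i in range(out_h):
            for j in range(out_w):
                cols[i * out_w + j] = image[i:i + kh, j:j + kw].ravel()
        return cols

    rng = np.random.default_rng(0)
    image = rng.standard_normal((8, 8))
    kernels = rng.standard_normal((4, 3, 3))           # 4 filters, 3x3 each

    # (out_h*out_w, kh*kw) @ (kh*kw, n_filters): the convolution is one GEMM.
    out = im2col(image, 3, 3) @ kernels.reshape(4, -1).T
    fmaps = out.T.reshape(4, 6, 6)                     # back to 4 feature maps of 6x6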



