That's a fair point. My assumption was that the neural network isn't going to be as efficient as the FFT, but the article doesn't include that kind of assessment.
A NN is doing matrix multiplications anyway. That's basically what it does.
The key property of matrix multiplication is that it's associative, i.e. `X * (Y * z) = (X * Y) * z`, so a chain of linear transforms collapses into a single matrix - in the end you only need one.
So what this means in practice is that you get the FFT for free. The NN is doing a matrix multiplication anyway, and the Discrete Fourier Transform (the thing the FFT computes) is itself a matrix multiplication. So the DFT can simply be folded into the network's first weight matrix along with whatever other linear transform it is doing - it doesn't cost anything.
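A minimal sketch of that folding, assuming NumPy and a single stand-in weight matrix `W` (my own illustration, not from the article): applying the DFT matrix and then the weights gives the same result as one pre-multiplied matrix, so the "Fourier step" disappears into the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 8                                   # signal length
x = rng.standard_normal(N)              # input signal

# DFT as an explicit N x N matrix (what the FFT computes efficiently)
n = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N)

# Some other linear transform the network would apply anyway
# (complex here for simplicity; a real-valued net would split re/im parts)
W = rng.standard_normal((4, N)) + 1j * rng.standard_normal((4, N))

# Two-step version: Fourier transform, then the layer's transform
two_step = W @ (F @ x)

# Folded version: pre-multiply once, then it's a single matmul per input
folded = (W @ F) @ x

assert np.allclose(two_step, folded)
assert np.allclose(F @ x, np.fft.fft(x))   # sanity check: F really is the DFT
```

Here `W` is a hypothetical stand-in for the first layer's weights; the fold only works across purely linear steps, i.e. before the first nonlinearity.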