I'm a firm believer that resources which raise the bar for math and computing education are huge catalysts for innovation. The fact that resources like BetterExplained, Paul's Math Notes, 3Blue1Brown's YouTube Channel, Ben Eater's YouTube channel, and the entire body of high quality MOOCs are just available for free over the Internet is probably one of my favorite accomplishments of the human race in the 21st century. While researchers, inventors, startup founders, and all other types of entrepreneurs are pushing the envelope on what's possible, there are literally thousands of creators and educators who are doing their best to bring some of that knowledge to the common man.
My own experience with academia has led me to believe that there are problems that are truly difficult and complex, but there are also problems that are described as difficult because so many people were exposed to the topic through a resource (typically another person) who had little experience (or interest) in effectively teaching the subject.
Definitely agree - I struggled beyond basic integration in High School, and my "why" and "what is it" type questions about calculus were met with annoyed "read the book" replies and young me lost interest in higher maths at a time when my mind was more plastic.
As an adult having these personalities enthusiastically breaking down the fundamentals and using analogies to help cement the concepts intuitively has been a taste of what I imagine having a brilliant and engaging math professor would be like.
> because so many people were exposed to the topic through a resource (typically another person) who had little experience (or interest) in effectively teaching the subject.
I used to homeschool and one of my sons had terrible baggage about math by the time I pulled him out of public school when he was 11. And somewhere along the way, I learned that a lot of elementary school teachers are women who weren't doing well in math and got encouraged to go become an elementary school teacher because, basically, sexism is alive and well and the world goes "Oh, I know what a woman who can't cut it in college is qualified for: Dealing with small kids all day!"
So a lot of elementary school teachers are women who suck at math and were also never told 'Get over it and get a degree you want.' Instead, they were told 'Oh, go teach kiddies. That's what you are qualified for.'
And I know this is a gender difference in terms of the world treating men and women differently -- a social difference, not a biological one -- because I'm a woman who is good at math (or was at one time) and was married to a man who was terrible at it. No one ever told him (a career soldier) "Oh, go become an elementary school teacher!"
Instead, he got his wife to tutor him in college math even though he found it so difficult he threw his pencil across the room one day when I showed him the right way to do it and then I told him "You can be nice or go find another tutor."
It's interesting you mention the soldier aspect because the US Marine Corps went through a similar evaluation.
For a long time, being a drill instructor (the guy yelling at you in boot camp) was considered a low status position and therefore a career dead end. This in turn led to the DI spots being filled primarily by people who were in trouble, didn't care or were "bad apples".
At some point, the questions were raised: "Wait, why are we having some of our worst Marines train new Marines? Wouldn't it make more sense to have our BEST Marines train the next generation?". The follow-up was then: so how do we actually make that happen?
The answer was to make being a DI one of the stepping stones to higher level ranks. This dramatically increased the quality of DIs and in turn, raised the quality of the training and finally the Marines.
There is a similar story in some countries with socialized medicine. E.g. to get a license to practice medicine privately, you first have to spend N years working in a public hospital. This ensures that you can get top quality people working even in a public healthcare based system.
> a lot of elementary school teachers are women who weren’t doing well at math and got encouraged to go become an elementary school teacher because, basically sexism is alive and well
I hear your frustration, and I’m glad that there has been progress in this regard. However I think that what the GP was saying and what you’re saying are two different issues:
The GP is referring to particular teaching styles which might be overly formal (in the mathematical sense) or declarative (this is convolution, that's how it is, remember it) as opposed to intuitive (which might not convey the whole truth but helps some if not most people understand and remember the concept better). This issue can and often does take place in higher education, where the teacher or lecturer is expected to be highly qualified.
I remember being frustrated with the way my linear algebra and multivariable calculus classes were taught. Both generalize to n dimensions but many operations and theorems can often be understood intuitively in two or three dimensions. The courses I took were extremely formal with not a minute spent on the graphical interpretation which would have helped so much. When I raised the issue with two separate professors, both were of the opinion that the graphical approach was frivolous because it wasn’t precise. But that argument misses the point of the lecture theatre space ...
To my mind, a system that shunts people into teaching elementary school in a way that actively creates early educators with a math phobia has got to have serious long term consequences, even for a lot of people who are good at it. We all get told "math is so hard," not just the bad students.
People who think a thing is hard and must be dealt with in a serious manner will use formal methods to ward off their fear of the topic or their fears of social perception. It takes a great deal of confidence to say "We are just going to have fun today and play with this wonderful topic!"
I agree that in general it's an issue when basics are being taught by any person who does not really understand what they are teaching. They end up teaching the rote method and not the intuition. Happens all the time in maths.
I think the larger issue here is that a lot of elementary school teachers are people who have a math phobia because their difficulties with math derailed their career dreams. So they not only are bad at teaching it, the primary thing they teach their students is that "Math is scary stuff and something to be feared."
And they teach it to really young kids who are very impressionable and don't have some means to avoid being brainwashed into believing it.
My one and only goal with homeschooling my son was to get him past the math phobia he had developed and teach him "Math is your friend." He's genuinely bad with numbers, but he no longer has a math phobia and has a better grasp of the concepts than a lot of people, even though he can't calculate anything in his head and routinely asks me or his brother to calculate it for him.
A math phobia is crippling and many people never get over it.
I wholeheartedly agree on the point about decentralizing and modernizing education, and thank you for reminding me of this. We still have a massive bias towards academic institutions to work against, but the fact that knowledge is so easily accessible, and of such a high quality, is pretty amazing.
Why do you think academic institutions are something to work against? I'm in academia and here everybody is enthralled and really happy about the explosion of high-quality, freely available learning resources.
The only kind of complaint that I have ever heard (and very rarely so) is that some of this material is supported by videos on a platform with shitty terms of service.
But who has that bias? Is it you? People in academia certainly not. It would be absurd. As if medical doctors and nurses would be against the education of the population for lifesaving techniques and for healthy eating.
I'm split on this. On one hand I appreciate these kinds of posts because they bring math with intuition to broader masses.
On the other hand, it makes people believe that they have understood what's there to understand and move on. In a sense, they contribute to the Dunning-Kruger effect.
The article illustrates this quite well. The intuition is presented well, but it is no substitute for actually learning the math. The first two questions at the end:
> Why is one function reversed?
> Why is convolution commutative?
Having studied math "properly", it's immediately obvious that they are answers to each other. Defining it with the "reversal" on one function is what makes the operation commutative by introducing a somewhat hidden symmetry, and that's a quite common trick in mathematics.
The other two questions are also quite clear with some actual math background. Not M.Sc. level math, just some proper calculus. You just need to sit down and do the math, which can take half a page of pen-on-paper. That's what math is, you need to work it.
Again, don't get me wrong, I appreciate the effort. But those types of posts don't make you more fluent in math any more than reading a popular science article about Alpha Centauri makes you an astronomer or an exposé about Covid-19 makes you an epidemiologist. It's popular science and can't replace the actual study of the subject.
It's a good goal from just a humanistic standpoint and because learning should be recreational, but from an innovation standpoint it's honestly likely overrated.
Innovation is the result of very particular institutions and resources being coordinated and of people networking and clustering tightly, which is why, despite the incredible dispersion of knowledge over the internet, innovation and VC money actually still are so geographically concentrated.
I think these online learning resources are good but the expectations should be correct, they're not tools to kickstart innovation, which is much more tacit and reliant on close interaction between people rather than knowledge.
The Internet has some amazing learning resources these days. Not just articles like these and videos, but also interactive explorables. Even my 7yo was able to understand basic aspects of game theory and complex systems with the help of explorables made by NickyCase, MeltingAsphalt etc. Most recently, I taught myself to not just solve a 4x4 Rubik's cube but understand its connections to ideas in abstract algebra such as commutators. I feel extreme gratitude for all the people who create such enlightening material and offer it freely, without asking for a paid subscription to their Udemy course.
> I remember as a graduate student that Ingrid Daubechies frequently referred to convolution by a bump function as "blurring" - its effect on images is similar to what a short-sighted person experiences when taking off his or her glasses (and, indeed, if one works through the geometric optics, convolution is not a bad first approximation for this effect). I found this to be very helpful, not just for understanding convolution per se, but as a lesson that one should try to use physical intuition to model mathematical concepts whenever one can.
> More generally, if one thinks of functions as fuzzy versions of points, then convolution is the fuzzy version of addition (or sometimes multiplication, depending on the context). The probabilistic interpretation is one example of this (where the fuzz is a probability distribution), but one can also have signed, complex-valued, or vector-valued fuzz, of course.
Tao's explanation seems to be the deepest one so far.
Understanding convolution in different contexts (geometry, vector algebra, statistics, signal processing, control theory, functional analysis, etc.) just deepens the understanding.
Instead of finding one definition and then seeing everything through it, one should try to find how different viewpoints are the same and learn to switch.
I might be wrong (so happy to be corrected by a native speaker) but I liked when I heard that the German word for convolution is "Faltung", which translates to "folding". This is a much nicer word for metaphorically understanding the operation. Maybe it still confuses the hell out of German undergrads though?
Probably due to the really idiosyncratic way I think about things, and from the post it is not at all clear what I mean, sorry for the imprecision! I think of cross-correlation of 1D signals with a kernel as sliding one of the functions over the other [0]. In my head I imagine drawing the signal and kernel next to each other on a big piece of paper and literally folding it over and sliding it back over the signal to intuit the convolution (with an imaginary see-through piece of paper). I really don't think that was obvious from what I posted, so definitely not dyscalculia!
Tao is likely inviting us (just as many physical/probabilistic laws do) to view any arbitrary function as relatively "thicker"/"fuzzier" than an infinitely-thin, infinitely-tall spike function at a certain value: the Dirac delta function (https://en.wikipedia.org/wiki/Dirac_delta_function). If you convolve ≡ integrate this Dirac delta function (located at some value x) against any function g(t), by construction the integral is zero everywhere except at t=x, so the result is an infinitely thin slice of g at x, exactly g(x) (the 'sifting property,' https://math.stackexchange.com/questions/1015498/convolution...).
Now imagine you begin to thicken/fuzz the spike; now you begin to accumulate the behavior of g(t) not just exactly at x, but also at points nearby, getting a schmeared representation of g. Deforming our spike to an arbitrary function of interest in this way gives an arbitrary convolution schmear.
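Here's a rough numerical sketch of that picture (a sketch only, assuming numpy; the signal and the kernel widths are made up). A kernel that is nearly a Dirac spike hands back g almost unchanged (the sifting property), while a wider bump accumulates g's nearby values and smears it:

    import numpy as np

    dt = 0.01
    t = np.arange(-5, 5, dt)
    g = np.sin(t) + 0.3 * np.sin(7 * t)        # some signal of interest

    def unit_area_bump(width):
        k = np.exp(-0.5 * (t / width) ** 2)
        return k / (k.sum() * dt)              # unit area, like a probability density

    spike = unit_area_bump(0.01)               # almost a Dirac delta
    bump = unit_area_bump(0.5)                 # a wide, "fuzzy" bump

    g_sifted = np.convolve(g, spike, mode="same") * dt    # ~ g itself
    g_blurred = np.convolve(g, bump, mode="same") * dt    # smeared version of g

    interior = slice(100, -100)                # ignore boundary effects
    print(np.max(np.abs(g_sifted[interior] - g[interior])))   # tiny: the spike just sifts out g
    print(np.max(np.abs(g_blurred[interior] - g[interior])))  # larger: the bump blurs g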
Convolution is a correlation with a reversed signal.
Correlation is a generalized dot product: multiplying corresponding pairs of values from two signals, and then adding the factors together. The result is zero if the signals are orthogonal (like the dot product of two vectors in 2D or 3D that are at 90 degrees).
The intuition behind the reversed signal comes from processing in the time domain. There are applications in which we convolve a new signal with a segment based on a captured historic signal. That function is reversed so that the newest sample from the new signal correlates with the least recently captured sample in the historic signal.
For instance, we can use an impulse signal (like a clap) to capture the echoes from an acoustic space. That signal can then be convolved with audio data in order to give it a reverberation similar to what would be produced by that acoustic space. The captured impulse response has to be reversed, because echoes come in reverse. In the impulse capture, the longest echo is found on the far right. When we apply the impulse to simulate the echo, it has to be on the far left, because it has to correlate with an old sample of the input.
E.g. if we are at t = 0, and want to hear the 500 ms echo, that echo corresponds to a past sound sample that is now at t = -500. When we are capturing the impulse response, we issue the gunshot or clap at t = 0, and the 500 ms echo comes at t = 500.
The earliest reflections have to apply to more recent signal, but they are based on the least recently captured data.
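A toy sketch of that reverb idea, assuming numpy (the impulse response below is made up: a direct sound plus two echoes). The time reversal described above is exactly what np.convolve's definition builds in, so applying the captured response is a single convolution:

    import numpy as np

    fs = 1000                                  # samples per second (toy rate)
    impulse_response = np.zeros(fs)            # 1 second of captured room response
    impulse_response[0] = 1.0                  # direct sound of the clap
    impulse_response[int(0.25 * fs)] = 0.5     # echo after 250 ms
    impulse_response[int(0.50 * fs)] = 0.25    # fainter echo after 500 ms

    dry = np.zeros(2 * fs)
    dry[0] = 1.0                               # a single "clap" in the dry signal

    wet = np.convolve(dry, impulse_response)   # the dry signal placed in the simulated room

    # The output contains the clap plus its echoes at 250 ms and 500 ms.
    print(np.nonzero(wet)[0] / fs)             # -> [0.   0.25 0.5 ]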
I don't think you can really call it a generalized dot product, because it doesn't map to a scalar. The inner product is the well accepted definition of a generalized dot product, and convolution does not follow the axioms that an inner product must follow.
It's nearly the same thing, isn't it? If you denote by Tx the left-shift operator defined by (Tx f)(y) = f(y+x), then the correlation of f and g evaluated at x is precisely the dot product of f and Tx g. If you evaluate your function at a certain point, you obtain a scalar product.
> (...) then the correlation of f and g evaluated at x
It really isn't the same, and oddly enough you unknowingly show that off, by mentioning that convolution is the function that maps input functions to the output function, but the dot product is at best a single point evaluated with the output function.
You should be more careful and precise in your attempts to describe these things. Inverting the phase of a signal is entirely different than inverting the signal. Time-reversed signals are not the same as phase-reversed signals.
> The result is zero if the signals are orthogonal (like the dot product of two vectors in 2D or 3D that are at 90 degrees).
Here is a bit more rambling in the direction of linear algebra and probability: you can define a vector space on the set of all random variables, wherein each random variable is a vector. Once you do so, you can further impose an inner product space with the traditional covariance function as the inner product. If the two random variables are orthogonal with respect to this inner product and this space, by definition their covariance is equal to 0. It follows by the definition of correlation from covariance that their correlation will also be 0, i.e. orthogonality implies two variables are uncorrelated in this space :)
Strictly speaking the correlation function is not the inner product in this instance, but practically the result is the same.
> Convolution is a correlation with a reversed signal.
Interesting - when I learnt it, convolution came first.
I like the visual 'sliding' explanation best. Perhaps partly because that's what I was shown, but also because, I think, how else would you explain any integral? I mean, you could, but would you?
Even to a blind person (so I couldn't draw a diagram) I'd explain integration as 'picture a graph ... it's that bit ... it's adding this and this and this ... just like when you multiply two sides of a rectangle ...'
And multiplication is a fancy convolution (with carries).
Fast convolution algorithms use FFT and run in O(n log n) time, which is why fast multiplication algorithms are based on FFT and why everyone was looking for O(n log n) one, which was found like a year ago.
(Though from what I can tell, in the "Faster Convolutions" section the article claims you can easily build O(n log n) multiplication using FFT, which doesn't work as it ignores carries.)
You can build an O(n log n) multiplication with FFT-based convolutions.
The trick is to only put as many bits into each number as are guaranteed not to overflow. So if the convolution is n long and you put B bits into each "digit", then the output needs to be able to hold log(n) + 2*B bits.
So if your convolution is done with double precision with 56 bits of precision and n = 1 million, log(n) = 20 so you put (56 -20)/2 = 18 bits into each "digit". You do your convolution, and ripple the carries at the end.
This is the way Prime95 does its maths, though it uses an IBDWT which, through a bit of extra maths magic, halves the length of the FFT needed. While it is rippling the carries at the end it checks to see if each number is pretty nearly an integer - if it isn't, then something has gone wrong and it blows up with "FATAL ERROR: Rounding was 0.5, expected less than 0.4".
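A minimal sketch of the idea, assuming numpy (base 10 for readability rather than the packed binary "digits" described above; fft_multiply is a made-up name): convolve the digit sequences with an FFT, then ripple the carries at the end.

    import numpy as np

    def fft_multiply(a_digits, b_digits, base=10):
        """Multiply two numbers given as little-endian digit lists."""
        n = len(a_digits) + len(b_digits)
        size = 1 << (n - 1).bit_length()            # FFT length: next power of two
        fa = np.fft.rfft(a_digits, size)
        fb = np.fft.rfft(b_digits, size)
        conv = np.fft.irfft(fa * fb, size)          # linear convolution of the digit sequences
        conv = [int(round(x)) for x in conv[:n]]    # should be near-integers if nothing overflowed
        out, carry = [], 0                          # ripple the carries at the end
        for c in conv:
            carry, digit = divmod(c + carry, base)
            out.append(digit)
        while carry:
            carry, digit = divmod(carry, base)
            out.append(digit)
        while len(out) > 1 and out[-1] == 0:        # drop leading zeros (stored little-endian)
            out.pop()
        return out

    # 123 * 456 = 56088; digits are little-endian, so the answer reads [8, 8, 0, 6, 5]
    print(fft_multiply([3, 2, 1], [6, 5, 4]))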
Basically, you can't. It's the overflows that make it so hard. It's complex, even though we all had the intuition that it should be doable (and that it should be simple to do).
This requires padding out an O(n)-digit number to memory size O(n log n), with addition operations of numbers with log(n) bits deemed to take O(1) time.
If this is allowed by your model of computation, then you could use this magical O(1) addition to implement multiplication in O(n).
For example, starting from decimal representation, you can calculate the product of five digit numbers, ("abcde" * "fghij") with the following, where capitalized variables X and S are the magic O(1)-addition integer types:
    let X = to_magic_integer("abcde")     # running value of abcde * 10^d
    let y = "fghij".reverse()             # y[0] is the ones digit of fghij
    let S = to_magic_integer("0")         # accumulator for the product
    for d = 0 to 4
        for i = 1 to y[d]                 # add X into S, y[d] times
            S += X
        end
        X = X+X+X+X+X + X+X+X+X+X         # multiply X by 10 using only additions
    end
    return S
(You can implement to_magic_integer with basically the same algorithm in O(n).)
> addition operations of numbers with an arbitrary number of bits deemed to take O(1) time
?
I was picturing a more restrictive model of computation in which your claim is along the lines of
> There's an O(n)-time algorithm, which, given two numbers with n digits and a magic black box that can add log(n)-digit numbers together in O(1) time, will output the product of those two numbers.
Fun fact, you can construct and store a lookup table for sums of all possible pairs of O(log(n))-digit numbers in O(n) space. (And O(n) time, probably?)
So if your model of computation lets you look things up in a table like that in O(1) time, you can kind of get the situation I'm describing above. Maybe your model of computation shouldn't let you do this, but some do. Often if an algorithm takes up O(foo) space, we don't worry about the fact that lookups into that space should take Theta(log(foo)) time (to even read the address) rather than O(1) time, because things are assumed to "fit into memory".
Another fun way to be fussy along these lines is that the usual algorithms for sorting a list of n distinct items only give bounds of O(n (log n)^2) time. Each item takes around log(n) bits to even store, so comparing two of those can't be done in time O(1) in the worst case (in a model that forbids the kind of "cheating" I'm describing above). The best uniform bound you can hope for is time O(log n) per comparison. (Though in some sense the extra log(n) factor might "drop back out", because it takes O(n log(n)) space to even write down an input to this algorithm.)
I was meaning arbitrary numbers of digits, yes. In the sense that {ceil(log2(n)) : n >= 1} = {n : n >= 0}, and generally taking for granted that the model doesn't change based on the value n.
I was introduced to convolution in an undergraduate signals/systems course in continuous time, where it's basically magic that you memorize to pass your course. I think a better introduction would be through discrete convolution, by reexamining polynomial multiplication (which is convolution through a different lens - the coefficients of the product of two polynomials are the convolution of their coefficients).
That serves as a less magical introduction to the operator. You can then point out that the polynomials whose coefficients one convolves can be considered power series, which has a nice interlude into the Z transform and its usefulness as an analytical tool when working with convolutions (and then on to the Fourier transform, etc).
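A quick numerical illustration of that lens, assuming numpy and listing coefficients lowest-degree first: convolving the coefficient sequences gives exactly the coefficients of the product polynomial.

    import numpy as np

    p = np.array([1, 2, 3])        # 1 + 2x + 3x^2
    q = np.array([4, 5])           # 4 + 5x

    # (1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3
    print(np.convolve(p, q))                       # -> [ 4 13 22 15]

    # np.polymul does the same thing, but expects highest-degree-first coefficients.
    print(np.polymul(p[::-1], q[::-1])[::-1])      # -> [ 4 13 22 15]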
I was rather surprised that it wasn't discrete convolution in the article, after the hospital analogy. Perhaps even with finite summation first, and generalized afterwards; otherwise there's a bit of a leap.
My first taste of convolution was in a discrete signals class while studying for a B.Eng. in Electrical Engineering. You're right, it was an excellent approach!
Convolution is digit-based multiplication, so why not start with that?
100101 * 23 = 2302323. If you defer the carry over to the end, the digit-based steps we do is convolution. That said, giving lots of examples like those in the article is useful.
Next up looking at multiplying two polynomials in a single variable.
Edit: changed digit-wise to digit-based to make it clear that I'm not asking to multiply corresponding digits.
This example is great; hope you don't mind if I spell out how your impulse weight f(d) and impulse response g(d) are functions over d (each associated with 10^d):
You can see how, when there is an impulse weight "x" in digit-space entry d, the impulse response "3x" comes at d, and the impulse response "2x" comes at d+1. Note how f and g can be anything and their convolution (f * g) is still their multiplication (except for the obvious rub when (f * g)(d) is greater than 9, so you need to add a rule about that).
A polynomial with a single variable is more or less what you would mean by "use base-infinity digits" in the first place.
Roughly, in base 10, the number represented by the sequence of digits d_0, d_1, ..., d_n is d_0 + d_1 * 10 + d_2 * 10^2 + ... + d_n * 10^n.
In "base infinity", the "number" represented by the sequence of digits d_0, d_1, ..., d_n is the polynomial d_0 + d_1 * x + d_2 * x^2 + ... + d_n * x^n.
Multiplying these "numbers" (polynomials) is like decimal multiplication, except the digits can be arbitrarily high and you do no carrying at all.
You really aren't. That assertion is like commenting that elementary arithmetic is discrete signal analysis because 1+1=2, which happens to match the amplitude of a resonance of a normalized vibration mode.
It isn't. It's a cherry picked example. A broken clock which coincides with the current time.
> It isn't. It's a cherry picked example. A broken clock which coincides with the current time.
No. It really is. Look at the example.
Nobody can convince you if you are already set up against that idea, but look at what happens when you multiply two polynomials P(z)Q(z). You obtain a polynomial whose coefficients are the convolution of those of P and Q. Now, a decimal number is "just" a polynomial evaluated at z=10. And a vibrating string is "just" a polynomial evaluated at z=exp(it). The algebra of convolutions and products is exactly the same in all three cases.
> The only cherry picking is that the numbers need to be small enough that the decimal digit does not overflow
Aren't you restating my point?
> So, if you convert to a digit of a large enough base you can do any multiplication with just a convolution and no carry.
There is no convolution at all. There's only a cherry-picked example of how plain old multiplication feels similar to a sliding dot product, which for some reason some people confuse with convolution. This is not helpful, and only adds to the confusion already expressed.
You're right that the analogy between elementary-school multiplication and convolution only works for multiplication without carrying. That still leaves a large class of multiplications, too large for the term "cherry picking" to apply here.
You’re wrong to suggest that elementary-school-multiplication and (discrete) convolution are so dissimilar that the analogy is so unhelpful that it only adds to everyone’s confusion. Maybe to yours, but not to everyone’s.
If I understand your objection correctly (other than the already conceded fact that the analogy does not work in case of multiplication with carrying), you already have internalized that convolution is an operator on 2 functions, resulting in a function. And since you have already internalized that, the analogy with any operator on two natural numbers resulting in a natural number confuses you — although I assume you can see the similarity in the sliding operation.
For people who have not internalized what a convolution is, the visualisation of the sliding operation and the analogy with elementary-school multiplication are the helpful bits.
To clarify why the operations are really similar: consider a natural number as a function that takes a natural number as argument, and returns the digit (between 0 and 9 inclusive) for that decimal position.
So f = 2 031 is considered to be: n → f(n) = if n = 3 then return 2 else if n = 1 then return 3 else if n = 0 then return 1 else return 0 (end if).
And g = 320 is considered to be: n → g(n) = if n = 2 then return 3 else if n = 1 then return 2 else return 0 (end if).
What is then the discrete convolution f ∗ g?
It’s f ∗ g = n → (f ∗ g)(n) = f(3) × g(n – 3) + f(1) × g(n – 1) + f(0) × g(n)
= n → if n = 5 then return 6 else if n = 4 then return 4 else if n = 3 or n = 2 then return 9 else if n = 1 then return 2 else return 0 (end if).
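A quick check of that worked example, assuming numpy and listing the digits lowest place first (f = 2 031 becomes [1, 3, 0, 2] and g = 320 becomes [0, 2, 3]):

    import numpy as np

    f = [1, 3, 0, 2]               # 2031, lowest place first
    g = [0, 2, 3]                  # 320, lowest place first
    print(np.convolve(f, g))       # -> [0 2 9 9 4 6], i.e. 649920 = 2031 * 320, no carries needed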
> You’re wrong to suggest that elementary-school-multiplication and (discrete) convolution are so dissimilar (...)
They are way more than dissimilar, they are radically different concepts altogether.
Convolution is a function, or an operator that outputs a function given two input functions that share the same domain.
Plain old algebra over real numbers is nothing of the sort.
You're trying to compare a function, with its domain and codomain, with an operation between scalars that results in a scalar. It's way more than apples to oranges. It's apples to orange tree plantations.
The standard multiplication algorithm is an algorithm for taking base-b representations of two numbers and outputting the base-b representation of the product of those numbers. A base-b representation of a number is a sequence of digits. A sequence is a function.
This is not a large conceptual chasm. It is boilerplate for actually talking about decimal (or binary or hex or whatever) representations of numbers. Here is one version of that boilerplate, spelled out:
Think of a base-10 representation of a natural number as a function with domain N (the set of natural numbers) and codomain {0, ..., 9}, where f(0) is the ones digit, f(1) is the tens digit, f(2) is the hundreds digit and so on. (This function will be finitely supported, i.e. all but finitely many inputs to this function will give output 0.)
If f and g are the representations of two numbers n and m, then one can say the following about the representation of their product n * m:
(1) Extend the codomain of f and g, to N, i.e. think of them as functions N -> N instead of functions N -> {0, ..., 9}.
(2) Compute the convolution of those two functions, giving you another (still finitely supported) function h: N -> N. Usually at this point h will have values that are larger than 10.
(3) Do all the carrying (e.g. repeatedly take the first n for which h(n) >= 10, and subtract 10 from h(n) and add 1 to h(n+1)). Now all the values of h are in {0, ..., 9}, so you can think of h as a function N -> {0, ..., 9}.
The resulting function h is the base-10 representation of the product of the two numbers that f and g represent.
The first few Google results for "sliding dot product" basically describe it as a convolution with the domain of one of the functions reversed. (Note as far as whether or not to reverse one of the functions, multiplication in base-b agrees with "convolution" and not with "sliding dot product".)
Elsewhere in the thread you seem to be suggesting that convolution is a way to convert between a frequency domain and a time domain, which suggests that you are confusing convolution with Fourier transforms. There's some nice relationships between these things, so they are often discussed together, but even then a single convolution happens either entirely in the time domain or entirely in the frequency domain, not as a way of going from one of those domains to the other. E.g. the pointwise product (in the frequency domain) of the Fourier transforms of two functions is the Fourier transform of the convolution (in the time domain) of those two functions.
There's also a chance you are focused on the difference between functions with a discrete domain and functions with a continuous domain. The term "convolution" is often applied to both.
It really isn't, and I don't understand the level of confusion that leads to this assertion.
Is it because the elementary school algorithm for multiplication decomposes a number into its decimal places before multiplying?
> so why not start with that?
Because it would be wrong, misguided, and it would fail to demonstrate the logic behind convolution.
Let's take a step back and think about it. Convolution is mainly and primarily used as a convenient way to convert signals between the time and frequency domains, within the scope of Fourier transforms. Who in their right mind associates elementary school multiplications with conversions to/from the frequency domain?
> Convolution is mainly and primarily used as a convenient way to convert signals between the time and frequency domains
Btw that isn't true. That operation is called a "transform" as in "Fourier transform" and "z-transform". The property of those transforms is that element-wise multiplication in one domain becomes convolution in the transformed domain. Element-wise multiplication in "time" domain is convolution in "frequency" domain and element-wise multiplication in "frequency" domain is convolution in "time" domain.
Further to that, if you replace x with complex variable z^-1, you get what is usually called the z-transform. Set z to a specific complex root of unity exp(i2pi/N) and you have your discrete FT.
One of the algorithms for multiplying large arbitrary precision numbers uses multiplication of the discrete transform of the digit sequences (in a different base) iirc.
There's a brilliant (and free) book called The Scientist and Engineer's Guide to Digital Signal Processing which covers this and other DSP topics. It's very readable and clear - I found it super useful in understanding DSP and the maths surrounding it.
Most of the time the actual stuff happening is quite simple, but if you're not living and breathing mathematical notation the conventional explanations can be quite impenetrable. This book puts everything out in code, so you can follow the steps and truly understand what's happening.
Suppose f and g are normalized distributions and consider the function f(x)·g(y). If we "collapse" (integrate) in the y-dimension we recover f(x). If we collapse in the x-dimension we recover g(y).
But if we collapse along the lines x+y = v, we obtain the convolution f⋆g(v).
This picture is a little more advanced, but it makes clear two key properties of convolution: symmetry (commutativity) and the fact that the total integral of the convolution is the total integral of f(x)·g(y).
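A small discrete sanity check of that picture, assuming numpy: build the outer-product grid f(x)·g(y) and add up its entries along each line x + y = v; the result matches np.convolve, and the total mass is preserved.

    import numpy as np

    f = np.array([0.2, 0.5, 0.3])              # a normalized distribution
    g = np.array([0.1, 0.6, 0.3])              # another one

    grid = np.outer(f, g)                      # the 2D function f(x)·g(y)
    collapsed = np.zeros(len(f) + len(g) - 1)
    for x in range(len(f)):
        for y in range(len(g)):
            collapsed[x + y] += grid[x, y]     # collapse along the line x + y = v

    print(collapsed)                           # -> [0.02 0.17 0.39 0.33 0.09]
    print(np.convolve(f, g))                   # same thing
    print(collapsed.sum(), f.sum() * g.sum())  # both ~1: total mass is preserved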
This generalizes to convolution in an arbitrary group (and even groupoid, and even category), not necessarily one that is commutative like the reals or integers are. For group convolution over a group G, the functions take values in the complex numbers and have the elements of G as their domain. The value of the convolution of two such functions f and g at a point c (a group element) is computed by summing the products f(a)*g(b) over all pairs with ab = c.
This has applications in quantum mechanics where non-commutativity plays a prominent role.
Woah, this is really cool and an awesome point of view. Collapsing along the x-axis is just a projection along the vector [1, 0, 0] onto the y-z plane defined by it and the origin. Then if you project along the vector [1, 1, 0] onto the plane defined by [1, -1, 0] and [0, 0, 1], you get the convolution.
This is really cool. The original post is great in putting it in terms of a concrete example. It would be cool (challenging but possible) to have a 3D visualization of collapsing f(x)·g(y) like you describe in the context of that same example.
Math entries in Wikipedia generally favor completeness and correctness over clarity, which makes them of limited use for many users.
In order to explain something simply, you usually have to lie a little bit — but I speculate that those little lies bother the contributors to mathematical Wikipedia articles a great deal. So they correct those little lies, making the articles more accurate but less useful for many of us.
In "Part 2: The Calculus Definition", f and g switch roles a couple times. Before the colorized formula, "f" is described as "the plan to use", and "g" as "the list of inputs"; and after the colorized formula, the plan is referred to as a "kernel"; but in the formula itself, "g" is the inputs and "f" is the kernel.
I realize convolution is commutative ("Part 3"), but it tripped me up a little as I tried to track the narrative.
My diffeq professor explained it as the fifth form of arithmetic (first four being addition, subtraction, multiplication, and division). This fifth form is unique because you sweep two functions relative to time and each other. To be honest, I still don't fully grasp the concept and I just use it as a mathematical tool. I need a 3blue1brown video to explain this to me so I can have an "aha!" moment like I had in his linear algebra videos.
I'm sure you've seen this elsewhere in the comments, but multiply two polynomials to see a discrete convolution in action! Here your functions would be discrete.
The first example is explained a bit too quickly, so it’s not that intuitive. It seems this hospital has one patient the first day, two on the second, three on the third, and so on. I guess that’s a “schedule” but it seems more like a history of what happened?
It’s not a “list of patients” either, which had me thinking they were patient ids. A better phrase would be a “list of patient counts”.
Convolution is in fact multiplication in Fourier space (this is the convolution theorem [1]) which says that Fourier transforms convert convolutions to products.
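A quick numerical check of that theorem, assuming numpy and using the circular (DFT) version: the pointwise product of the transforms, inverted, equals the circular convolution computed directly.

    import numpy as np

    f = np.random.rand(8)
    g = np.random.rand(8)

    # Circular convolution computed directly from the definition...
    direct = np.array([sum(f[k] * g[(n - k) % 8] for k in range(8)) for n in range(8)])

    # ...and via the Fourier transforms: transform, multiply pointwise, invert.
    via_fft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

    print(np.allclose(direct, via_fft))   # True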
Okay. This is decent. I do like the idea of "fancy multiplications" because I do think you should understand convolution as well as you understand multiplication.
But I still feel like this kinda obscures and confuses the origin of the reversal.*
If you don't understand the convolution formula instinctively, read this comment enough times till you do.
The point is this.
-We have a function originBangSound(t) that maps the effect at (t) of an impulse or "bang" coming from the origin (t=0).
-We have a function bangWeights(t), which measures the distribution of impulses or "bangs" over time.
-The question is: how do we get the total allBangSounds(t)?
Simple: We make every point in time the origin, and add all the results together. Let's call (tau) the current origin. The size of the bang at this origin is bangWeights(tau). The size of the sound at (t) which is coming from this origin is originBangSound(t- tau), since we care about the position of (t) relative to the current origin (tau). Adding them up leads to an integral in the continuous case.
The point is this. Don't think of it as a flip. Think of (tau) as defining the origin point for a particular "bang".
Here's a nice sanity check: if (tau) is larger than (t) (or equivalently, (t - tau) < 0), then do you expect to hear its bang? Of course not. The bang hasn't happened yet. So unless its bang travels backward in time (which definitely does happen in spatial convolutions!) you ain't hearing it.
_____________________________
*Come to think of it, this is a very useful pun. When you think to yourself "what's the origin of the reversal again?", just remember: moving the origin is the origin of the reversal.
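Here's a discrete sketch of that "every point in time is an origin" view, assuming numpy; originBangSound and bangWeights are made-up arrays. The explicit shift-and-add never flips anything by hand, yet it matches np.convolve exactly:

    import numpy as np

    originBangSound = np.array([1.0, 0.6, 0.3, 0.1])   # sound heard after a bang at t = 0
    bangWeights     = np.array([0.0, 2.0, 0.0, 1.0])   # bang of size 2 at t=1, size 1 at t=3

    allBangSounds = np.zeros(len(bangWeights) + len(originBangSound) - 1)
    for tau, weight in enumerate(bangWeights):
        # treat tau as the origin of this bang and add its shifted, scaled sound
        allBangSounds[tau:tau + len(originBangSound)] += weight * originBangSound

    print(allBangSounds)
    print(np.convolve(bangWeights, originBangSound))    # identical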
And you flip it to get nice properties like commutativity and associativity which correlation doesn't have (it almost has them, just that small flip is missing).
Just because it's funny, here is the obligatory reminder that deep convolutional networks are actually implemented as correlation, not convolution.
True, it's just the lack of mirroring that differs, which is a linear operation, and hence the network can be considered to have learned the mirrored kernels. But it matters for initialization (bilinear init for conv^T sampling is still incorrectly mirrored in pytorch), and every time someone puts images of the kernels in a paper it's better than even odds they show the correlation kernels, not the convolution kernels.
This is great! Convolution over arrays is much easier for me to grasp intuitively, and the extension to functions over the reals is straightforward with that intuition.
So this is just discrete convolution, but continuous-time convolution AFAIK has no intuitive explanation.
The general idea is that there is a sense in which mathematical operations don't just take numbers and variables and output another number or variable... operations can take any number of anything and output any number of anything. The Fourier Transform takes in a function and outputs another function, which you can put meaning on, or you can just "nod and continue" (which is often easier than trying to put some words on abstract concepts).
Some "ideas" you can use to analyze the situation are concepts learned in things like linear algebra like abstract vector spaces, basis, eigen-whatever... and so on.
In the case of the continuous time convolution, it just turns out there is a magical thing associated with something called a Dirac Delta function with certain properties that are important for various things in controls engineering.
Kinda feels like a piece of turd that refuses to evacuate but I just stopped caring about "intuitive" explanations.
There's also a good one made by Grant Sanderson from 3blue1brown. This video was the first time I was introduced to the concept and had no trouble following it: https://www.youtube.com/watch?v=8rrHTtUzyZA
Convolution is such a core concept. In my era, it was a college sophomore/junior sort of thing.
Is it a standard topic for AP math types in high school nowadays? If not, why not? There are a few topics like that for which I'd gladly give up teaching (or learning) L'Hôpital's rule in high school.
I think it first comes up in signal processing courses at university (or nowadays in deep learning courses, which makes people think of convolution as some mysterious deep learning specific magic sauce buzzword).
I could not understand convolution in the beginning, partly because the book sucked and my professor was derisive and had a heavy Middle Eastern accent. I wish resources like these had existed back in the 90's so we would not have wasted so much time learning this stuff.
Working with image processing makes convolution easily intuitive. You can get a "feel" for what the math is doing by looking at what's happening with the images. While blurring is a common example (convolve a Gaussian function with the image), edge detection and other image operations are just a matter of changing the convolved signal (or kernel). In a simplified model, blurry photos due to poor lenses also share the same principle (the lens cannot form a perfect point, and that imperfect point is effectively what is convolved). The same can be said for audio and other 1D data, but images are more visual.
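A tiny 2D sketch of that, assuming numpy and scipy are available; the "image" is a made-up array with a bright square, and the two kernels are the classic box blur and a Laplacian-style edge detector.

    import numpy as np
    from scipy.ndimage import convolve

    image = np.zeros((7, 7))
    image[2:5, 2:5] = 1.0                        # a bright 3x3 square

    blur_kernel = np.ones((3, 3)) / 9.0          # box blur: average of the 3x3 neighborhood
    edge_kernel = np.array([[ 0, -1,  0],
                            [-1,  4, -1],
                            [ 0, -1,  0]])       # Laplacian-like edge detector

    blurred = convolve(image, blur_kernel)       # soft edges around the square
    edges = convolve(image, edge_kernel)         # highlights the border of the square

    print(blurred.round(2))
    print(edges)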
The type signature of the outer product is not correct. We are mapping two functions to a function in the same domain.
Convolution is neither a valid example of an inner product nor of an outer product. No basic geometric operation on vectors has the correct type signature and axioms for convolution to be interpreted as a generalized "x". What we'd be looking for is a bilinear mapping of vectors to vectors, and complex-number-style multiplication in R^2, or cross-product multiplication in R^3 or R^7, are the only real candidates in that category unless we start interpreting functions as matrices.
Integrating over t can be visualized as sliding g over f over the range of t. The result is a function of x, since t has been integrated out. It all made sense to me when I thought of it as sliding one function over another, fixed function.