If you find something wrong mathematically or otherwise with something I write, then by all means let me know. So far you have found nothing. Details:
The Google statement I quoted was flatly wrong. It is really important for students to be told that.
I gave some references to further material in statistics.
> the measure-theoretic issues
I didn't mention measure theory, and the statistics references I gave don't mention measure theory either. I referenced Fleming only as background on convexity, and there is no measure theory in that part of Fleming.
You mentioned the role of calculus for the math of regression: That use of calculus, for deriving the normal equations, is neither necessary nor, really, sufficient. There is another derivation, nicer, fully detailed mathematically, with no calculus at all. The core of the idea is that the minimization of the squared error has to be an orthogonal projection and, then, presto, bingo, get the normal equations. And there are more advantages to that derivation. To keep my post simple, I omitted that derivation, but a good treatment of regression without calculus could use it.
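To make that concrete, here is a small numerical sketch of my own (the toy data and the use of numpy are my illustration, not anything from the Google material): it solves the normal equations X'X b = X'y directly and then checks the projection property, that the residual y - Xb is orthogonal to every column of X.

    import numpy as np

    # Toy data of my own, just for illustration: n observations, an intercept
    # column, and p predictor columns.
    rng = np.random.default_rng(0)
    n, p = 100, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
    y = rng.standard_normal(n)

    # Normal equations: X'X b = X'y.
    b = np.linalg.solve(X.T @ X, X.T @ y)

    # Projection property: the residual y - Xb is orthogonal to every column
    # of X, i.e., X'(y - Xb) = 0, which is just the normal equations read
    # backwards.
    residual = y - X @ b
    print(np.allclose(X.T @ residual, 0.0))   # True

    # Sanity check against a library least squares solver.
    b_check, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(np.allclose(b, b_check))            # True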
I omitted the standard definition of convexity, but apparently in this discussion we need it. The Google material omitted the definition too, and that is part of why it was not so good.
Definition (convex function): For the set of real numbers R and a positive integer n, a function f: R^n --> R is convex provided for any u, v in R^n and any a in [0,1] we have
f(au + (1-a)v) <= af(u) + (1-a)f(v)
So, for a picture, on the graph of f(x), we have two points (maybe not distinct) (u, f(u)) and (v, f(v)). Then we draw the line between these two points. The number a determines where we are on that line: with a = 0 we are at point (v, f(v)), with a = 1 we are at point (u, f(u)), and as a moves from 0 to 1 we move along the line. We also have on the graph the point, say, P,

(au + (1-a)v, f(au + (1-a)v))

And on the line we drew, we have the point, say, Q,

(au + (1-a)v, af(u) + (1-a)f(v))

Well, we are asking that point Q be the same as point P or directly above point P. That is, the line we drew is on or above the graph of f. The line is sometimes called a secant line and is said to overestimate the function.
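Here is a quick numerical illustration of my own (the choice f(x) = ||x||^2 and the numpy sampling are just my toy example, not from any reference): sample u, v, and a at random and check that point Q never falls below point P.

    import numpy as np

    # Toy convex function on R^n of my own choosing: f(x) = ||x||^2.
    def f(x):
        return float(np.dot(x, x))

    rng = np.random.default_rng(1)
    ok = True
    for _ in range(10_000):
        u = rng.standard_normal(5)
        v = rng.standard_normal(5)
        a = rng.uniform(0.0, 1.0)
        P = f(a * u + (1 - a) * v)          # height of the graph at the mix
        Q = a * f(u) + (1 - a) * f(v)       # height of the secant line there
        ok = ok and (P <= Q + 1e-12)        # small tolerance for floating point
    print(ok)   # True: Q is never below P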
Definition (concave): The function -f is concave if and only if the function f is convex.
As in Fleming, a convex function is continuous. Intuitively, the proof is based on two cones, and as we approach a point we get herded between the two cones. The cones are from the convexity assumption. Draw a picture.
IIRC, there is a result in Rockafellar that a convex function is differentiable almost everywhere with respect to Lebesgue measure, but this is the only connection I would make between convexity and measure theory.
For convex f, the set of all (x, y) where y >= f(x) is the epigraph of f, that is, the region on or above the graph of f.
Definition (convex set): A subset C of R^n is convex provided for any u, v in C and any a in [0,1] the point
au + (1-a)v
is also in set C.
Well, as in Fleming, the epigraph is convex. It is also closed in the usual topology of R^(n+1) (the epigraph is a subset of R^(n+1), one dimension up from R^n).
Definition (closed set): A subset C of R^n is closed (in the usual topology of R^n) provided for any sequence x_n, n = 1, 2, ... in C that converges to y in R^n, y is also in C.
In particular, if we define the boundary of C, a closed set C contains its boundary. So, the interval [0,1] is closed and the interval (0,1) is not closed.
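To tie these definitions together, here is a small check of my own (the choice f(x) = x^2 is just my toy example): sample points in the epigraph of f(x) = x^2, form convex combinations, and verify they stay in the epigraph.

    import numpy as np

    # f(x) = x^2 on R; its epigraph is { (x, y) : y >= x^2 }, a subset of R^2.
    def in_epigraph(x, y):
        return y >= x * x - 1e-12           # tiny tolerance for floating point

    rng = np.random.default_rng(2)
    ok = True
    for _ in range(10_000):
        # Two points in the epigraph: pick x, then any y at or above f(x).
        x1, x2 = rng.uniform(-3.0, 3.0, size=2)
        y1 = x1 * x1 + rng.uniform(0.0, 5.0)
        y2 = x2 * x2 + rng.uniform(0.0, 5.0)
        a = rng.uniform(0.0, 1.0)
        # Any convex combination of the two points stays in the epigraph.
        ok = ok and in_epigraph(a * x1 + (1 - a) * x2, a * y1 + (1 - a) * y2)
    print(ok)   # True: the epigraph is a convex set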
Well, for any closed convex set C and any point x on its boundary, there exists a hyperplane that passes through point x such that set C is a subset of the closed half space on one side of the hyperplane.
Such a hyperplane is said to be supporting for set C at point x.
Intuitively, in R^3, push convex set C to be in contact with a wall. Suppose point x on the boundary of set C is in contact with the wall. Then the wall is a supporting plane for set C at x, and set C is a subset of the room side of the wall.
Or think of a big, solid, irregular piece of cheese and John Belushi as his Samurai Tailor swinging his sword: John keeps swinging his sword in arcs that are in flat planes and cuts down the irregular cheese to a convex hunk. So, the convex C has been determined (formed) from the supporting hyperplanes from Belushi's sword.
For another way to make a convex set, take a piece of wood and press it against a belt sander until the boundary consists of only flat sides.
Let a solid rock roll around in a stream of water for a few thousand years, and it may end up as a smooth, shiny, convex rock.
Faster: a chicken egg is convex.
In general, a closed convex set is the intersection of the closed half spaces determined by its supporting hyperplanes.

Then, we can approximate a closed convex set with just a few of those half spaces. In particular, we can approximate a convex function from below with a few supporting hyperplanes of its epigraph. At times, this can be useful -- it's the main idea behind Lagrangian relaxation in constrained optimization (I used that once).

In particular, the epigraph of a convex function is the intersection of the closed half spaces determined by its supporting hyperplanes. The slope of a supporting hyperplane at a point (x, f(x)) is called a subgradient of f at x. The function is differentiable at x if and only if the subgradient there is unique, and in that case the subgradient is just the gradient of the function and the supporting hyperplane is the tangent hyperplane.
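For a concrete one-dimensional case (my own example, not from Fleming or Rockafellar): f(x) = |x| is convex, is not differentiable at 0, and at 0 every slope g in [-1, 1] gives a supporting line. This little sketch checks that each such line stays on or below the graph.

    import numpy as np

    # f(x) = |x| is convex.  At x = 0 it is not differentiable, and every
    # slope g in [-1, 1] gives a supporting line through (0, 0).
    xs = np.linspace(-5.0, 5.0, 1001)
    f = np.abs(xs)

    ok = True
    for g in np.linspace(-1.0, 1.0, 21):
        line = g * xs                        # supporting line y = g x at (0, 0)
        ok = ok and bool(np.all(line <= f + 1e-12))
    print(ok)   # True: many supporting lines at the kink, so the subgradient
                # at 0 is not unique, matching non-differentiability at 0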
In R^3, a cube is a convex set. Each of its sides is part of a supporting hyperplane. For a point on the boundary of the cube that is not on an edge, the supporting hyperplane at that point is unique. The corners and edges of the cube also have supporting hyperplanes, but they are not unique.
In R^3 a solid sphere (a ball) is convex. Then it too is the intersection of the closed half spaces determined by its supporting hyperplanes. At each point on the boundary of the ball, the supporting hyperplane is unique.
Similarly for epigraphs.
So, the function f: R^n --> R where for each x, f(x) = 0, is convex, concave, and linear, and each x is both a maximizer and a minimizer of f. So, the minimizer of a convex function need not be unique. In this case, the epigraph is just a closed half space.
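And to connect the non-uniqueness back to regression (again just an illustration of my own, not a claim about the Google material): if the design matrix has a duplicated column, the squared error is still convex in the coefficients, but many different coefficient vectors achieve the same minimal error.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 50
    x = rng.standard_normal(n)
    # Design matrix with a duplicated column, so X'X is singular.
    X = np.column_stack([np.ones(n), x, x])
    y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(n)

    def sse(b):
        r = y - X @ b
        return float(r @ r)

    # Fit on just [1, x]; c is the unique least squares fit for that smaller
    # design, and the column space of X equals the column space of [1, x].
    c, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)

    # Split the slope c[1] across the two duplicate columns in two ways;
    # both give the same fitted values, hence the same (minimal) squared error.
    b1 = np.array([c[0], c[1], 0.0])
    b2 = np.array([c[0], 0.0, c[1]])
    print(np.isclose(sse(b1), sse(b2)))   # True: two distinct minimizers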
"Look, Ma! No calculus!" And no measure theory.
Exercise: Derive the regression normal equations via perpendicular projections and without calculus.
Exercise: Argue the role of perpendicular projections in the minimization in regression.
my only goal in mentioning the statistics sequence at all was to give a familiar example where the standard sequences vary in depth depending on audience. a trivial point, yes, but i wanted to be concrete because it's the internet. apparently that was a terrible choice, as it was far too close to the topic at hand; my apologies for making you search so hard for a connection.
i made my request because it's jarring for me as a reader when you punctuate your (often delightful) expository writing with conclusions about entire fields and large organizations that seem (on the face of it) to be justified by old and/or very limited data.
but that's a selfish request, and you are of course free to tell me to get lost and post whatever you want (and i'll still read it); i'm certainly not going to pursue this further, aside from the apology and clarifying comment above.
You do understand that, for all the talk, the new terminology, and the claims of "learning" in the "machine learning" (ML) of the Google OP, what is in the OP is a poor introduction to some highly polished material on "regression analysis" in 50-year-old books. So, the ML stuff is adulterated old wine in new bottles with new labels. That is essentially intellectual theft and corruption, and, without references, essentially academic plagiarism. You should be offended.
If they are going to plagiarize, even just teach, regression analysis, then at least they shouldn't make a mess out of it, and a mess is what they made. Google should "get that MESS OFF the Internet".
Students should be told the truth: Regression is powerful stuff. Sometimes the results can be valuable. The Google OP is an introduction to regression and does have some value. But the Google material is a MESS, and students should be informed that they are getting really low quality material and should see some references to some beautifully polished material.
So, I helped any students who would be the target audience for the Google OP.
You should know this; I believe you do.
I'm offended by the mess and passing that out to students trying to learn. You should also be offended.