
I read parts of it years ago. As far as I remember, it is very theoretical (lots of statistical learning theory, including some IMHO mistaken treatment of Vapnik's theory of structural risk minimization), with a strong focus on theory and basically zero focus on applications. Which would be completely outdated by now anyway, as the book is from 2014, an eternity in AI.

I don't think many people will want to read it today. As far as I know, mathematical theories like SLT have been of little use for the invention of transformers or for explaining why neural networks don't overfit despite large VC dimension.

Edit: I think the title "From theory to machine learning" sums up what was wrong with this theory-first approach. Basically, people with interest in math but with no interest in software engineering got interested in ML and invented various abstract "learning theories", e.g. statistical learning theory (SLT). Which had very little to do with what you can do in practice. Meanwhile, engineers ignored those theories and got their hands dirty on actual neural network implementations while trying to figure out how their performance can be improved, which led to things like CNNs and later transformers.

I remember Vapnik (the V in VC dimension) complaining in the preface to one of his books about the prevalent (alleged) extremism of focussing on practice only while ignoring all those beautiful math theories. As far as I know, it has now turned out that these theories just were far too weak to explain the actual complexity of approaches that do work in practice. It has clearly turned out that machine learning is a branch of engineering, not a branch of mathematics or theoretical computer science.

The title of this book encapsulates the mistaken hope that people will first learn those abstract learning theories, get inspired, and promptly invent new algorithms. But that's not what happened. SLT is barely able to model supervised learning, let alone reinforcement learning or self-supervised learning. As I mentioned, it can't even explain why neural networks are robust to overfitting. Other learning theories (like computational/algorithmic learning theory, or fantasy stuff like Solomonoff induction / Kolmogorov complexity) are even more detached from reality.


The Conformal Prediction advocates (especially a certain prominent Twitter account) tend to rehash old frequentist-vs-Bayesian arguments with more heated rhetoric than strictly necessary. That fight has been going on for almost a century now. The Bayesian counterargument (in caricature form) would be that MLE frequentists just choose an arbitrary (flat) prior, and that penalty hyperparameters (common in NNs) are a de facto prior. The formal guarantees only have bite in the asymptotic setting or require convoluted statements about probabilities over repeated experiments; and asymptotically, the choice of prior doesn't matter anyway.

(I'm a moderate who uses both approaches, seeing them as part of a general hierarchical modeling method, which means I get mocked by either side for lack of purity.)

Bayesians are losing ground at the moment because their computational methods haven't been advanced as fast by the GPU revolution for reasons having to do with difficulty in parallelization, but there's serious practical work (especially using JAX) to catch up, and the whole normalizing flow literature might just get us past the limitations of MCMC for hard problems.

But having said that, Conformal Prediction works as advertised for UQ as a wrapper on any point-estimating model. If you've got the data for it - and in the ML setting you do - and you don't care about things like missing data imputation, error in inputs, non-iid spatio-temporal and hierarchical structures, mixtures of models, evidence decay, or unbalanced data where small-data islands coexist with big data - all the complicated situations where Bayesian methods just automatically work and other methods require elaborate workarounds - then yup, use Conformal Prediction.

Calibration is also a pretty magical way to improve just about any estimator. It's cheap to do and it works (although hard to guarantee anything with that in the general case...)

And don't forget quantile regression penalties! Awkward to apply in the NN setting, but an easy and effective way to do UQ in XGBoost world.
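To make the "wrapper on any point estimator" claim concrete, here is a minimal sketch of split conformal prediction on toy data; the polynomial model and all the names are illustrative, not from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy 1-D regression data; the point model could be anything at all
x = rng.uniform(-3, 3, 600)
y = np.sin(x) + 0.3 * rng.standard_normal(600)

x_tr, x_cal, x_te = x[:300], x[300:500], x[500:]
y_tr, y_cal, y_te = y[:300], y[300:500], y[500:]

coef = np.polyfit(x_tr, y_tr, deg=5)   # the point estimator (arbitrary choice)
pred = lambda t: np.polyval(coef, t)

# split conformal: absolute residuals on a held-out calibration set,
# then take the ceil((n+1)*(1-alpha))-th smallest as the band half-width
alpha = 0.1
scores = np.sort(np.abs(y_cal - pred(x_cal)))
n = len(scores)
q = scores[int(np.ceil((n + 1) * (1 - alpha))) - 1]

# the interval pred(t) +/- q covers new points with probability ~ 1 - alpha
covered = np.mean(np.abs(y_te - pred(x_te)) <= q)
print(covered)
```

The marginal coverage guarantee holds for any point model, which is exactly the appeal; what it does not give you is the conditional, hierarchical, or missing-data machinery listed above.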


i completely understand your point-of-view. tough to give a silver-bullet answer because things move so quickly.

i personally found this course to be a good place for deep learning (again not a survey course that covers classical ML for context) - https://uvadlc-notebooks.readthedocs.io/en/latest/index.html

however, the strategy that should give good results is going for the open-source coursework from one of the major universities. these courses may lag a semester or two from SOTA, but they often give a good overview. pick your poison and go from there. for example the above link was found the same way.


A few things I wish I knew when took Statistics courses at university some 25 or so years ago:

- Statistical significance testing and hypothesis testing are two completely different approaches with different philosophies behind them, developed by different groups of people; they kinda do the same thing but not quite, and textbooks tend to completely blur this distinction out.

- The above approaches were developed in the early 1900s in the context of farms and breweries where 3 things were true - 1) data was extremely limited, often there were only 5 or 6 data points available, 2) there were no electronic computers, so computation was limited to pen and paper and slide rules, and 3) the cost in terms of time and money of running experiments (e.g., planting a crop differently and waiting for harvest) were enormous.

- The majority of classical statistics was focused on two simple questions - 1) what can I reliably say about a population based on a sample taken from it and 2) what can I reliably say about the differences between two populations based on the samples taken from each? That's it. An enormous mathematical apparatus was built around answering those two questions in the context of the limitations in point #2.


(Former AI researcher + current technical founder here)

I assume you’re talking about the latest advances and not just regression and PAC learning fundamentals. I don’t recommend following a linear path - there’s too many rabbit holes. Do 2 things - a course and a small course project. Keep it time bound and aim to finish no matter what. Do not dabble outside of this for a few weeks :)

Then find an interesting area of research, find their github and run that code. Find a way to improve it and/or use it in an app

Some ideas.

- do the fast.ai course (https://www.fast.ai/)

- read karpathy’s blog posts about how transformers/llms work (https://lilianweng.github.io/posts/2023-01-27-the-transforme... for an update)

- stanford cs231n on vision basics (https://cs231n.github.io/)

- stanford cs324 on language models (https://stanford-cs324.github.io/winter2022/)

Now, find a project you’d like to do.

eg: https://dangeng.github.io/visual_anagrams/

or any of the ones that are posted to hn every day.

(posted on phone in transit, excuse typos/formatting)


With over-parameterized neural networks, the problem essentially becomes convex and even linear [1], and in many contexts provably converges to a global minimum [2], [3].

The question then becomes: why does this generalize [4], given that the classical theory of Vapnik and others [5] becomes vacuous, no longer guaranteeing lack of over-fitting?

This is less well understood, although there is recent theoretical work here too.

[1] Lee et al (2019). Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. https://proceedings.neurips.cc/paper/2019/hash/0d1a9651497a3...

[2] Allen-Zhu et al (2019). A convergence theory for deep learning via over-parameterization. https://proceedings.mlr.press/v97/allen-zhu19a.html

[3] Du et al (2019). Gradient Descent Finds Global Minima of Deep Neural Networks. http://proceedings.mlr.press/v97/du19c.html

[4] Zhang et al (2016). Understanding deep learning requires rethinking generalization. https://arxiv.org/abs/1611.03530

[5] Vapnik (1999). The nature of statistical learning theory.
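The flavor of these results can be seen in a toy over-parameterized linear model (my own illustration, not the NTK construction from the cited papers): the squared loss is convex, gradient descent from zero interpolates the training data, and it implicitly converges to the minimum-norm interpolant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 100                       # far more parameters than data points
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# squared loss is convex in w; run plain gradient descent from zero
w = np.zeros(d)
lr = 0.01
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y) / n  # gradient of mean squared error

train_residual = np.linalg.norm(X @ w - y)
w_min_norm = np.linalg.pinv(X) @ y   # the minimum-norm interpolating solution
gap = np.linalg.norm(w - w_min_norm)
print(train_residual, gap)  # both ~0: GD interpolates and picks the min-norm fit
```

The min-norm bias is one candidate explanation for why interpolating models can still generalize, which is the question [4] raises.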


There are areas of control theory where you can learn the dynamics ("adaptive control"). The advantage over RL is that in control theory, you generally assume the dynamics are described by differential equations (sometimes difference equations), not by Markov decision processes. MDPs are more general, but basically any physical mechanism you're going to control doesn't need that generality.

There is a surprising amount of structure imposed by the assumption that the dynamics are differential equations, even if you don't know what the differential equations look like. As a consequence, adaptive control laws generally converge a lot faster (like, orders of magnitude faster) than MDP-based RL approaches on the same system being controlled.
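As a concrete illustration of the structure you get from differential-equation dynamics, here is the textbook scalar adaptive regulator for x' = a·x + u with unknown a (the gains and time horizon are illustrative):

```python
a = 2.0           # unknown, unstable plant pole (used only to simulate the plant)
gamma = 5.0       # adaptation gain
dt = 1e-3         # Euler integration step
x, k = 1.0, 0.0   # plant state and adaptive feedback gain
for _ in range(int(10 / dt)):
    u = -k * x                # feedback with the current gain
    x += dt * (a * x + u)     # plant: x' = a*x + u, with a unknown to the controller
    k += dt * gamma * x * x   # adaptation law: k' = gamma * x^2
print(abs(x), k)  # x is driven to ~0 even though a was never identified
```

A Lyapunov argument with V = x²/2 + (k − k*)²/(2γ) shows x → 0 for any unknown a, which is the kind of closed-form stability guarantee the comment is pointing at.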

The other advantage is that with control theory you can prove stability and in some cases get an idea of your performance margin. This is important if you e.g. want your system to receive any sort of accreditation or if you want to fit it into the systems engineering of a more complex system. There's a reason autopilots don't use RL, and it isn't that RL can't be made to work. It's that you can't rigorously prove how robust the RL policy is to changes in the airplane dynamics.


There's LKH http://webhotel4.ruc.dk/~keld/research/LKH/ which is heuristic and the best open implementation. Adding optimality estimates is the least complicated part.

When TSP is mentioned today, unlike 50 years ago when LK heuristic got published, I assume all of the popular & practical variants, like time window constraints, pickup and delivery, capacity constraints, max drop time requirement after pickup, flexible route start, adding location independent breaks (break can happen anytime in the sequence or in a particular time window of day) etc. Some of the subproblems are so constrained that you cannot even move around that effectively as you can with raw TSP.

Some of the subproblems have O(n) or O(n log n) evaluations of best local moves, and generic solvers are even worse at handling that (Concorde's LP optimizations cannot cover that efficiently). When no moves are possible, you have to see what moves bring you back to a feasible solution and how many local changes you need to do to accomplish this.

For example, just adding time windows complicates or makes most well known TSP heuristics useless. Now imagine if we add a requirement between pairs of locations that they need to be at most X time apart (picking up and then delivering perishable goods), that the route can start at an arbitrary moment etc.
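For intuition, here is a bare-bones 2-opt local search for the unconstrained case. The point is that its basic move reverses a segment of the tour, which is exactly what time windows break: a reversal scrambles every arrival time in the segment.

```python
import math, random

def tour_length(tour, pts):
    # total length of the closed tour through pts in the given order
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, pts):
    # classic 2-opt: repeatedly reverse a segment if that shortens the tour
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(cand, pts) < tour_length(tour, pts) - 1e-12:
                    tour, improved = cand, True
    return tour

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(30)]
init = list(range(30))
best = two_opt(init, pts)
init_len, best_len = tour_length(init, pts), tour_length(best, pts)
print(init_len, best_len)  # the 2-opt tour is strictly shorter
```

Add a time window per location and the cheap candidate check above is gone: feasibility of a reversal now depends on arrival times along the whole segment, which is the comment's point.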

I personally spent quite a lot of time working on these algorithms and I'd say the biggest issue is instance representation (is it enough to have a sequence of location ids?). For example, one of my recent experiments was using zero-suppressed binary decision diagrams to easily traverse some of these constrained neighborhoods and maintain the invariants after doing local changes. Still too slow for some instances I handle (real world is 5000 locations, 100 salesmen and an insane amount of location/salesmen constraints).


For clarification, Murphy's first book is just Machine Learning: A Probabilistic Perspective. This is his newest two-volume book, Probabilistic Machine Learning, which is broken down into two parts: an Introduction (published March 1, 2022) and Advanced Topics (expected to be published in 2023, but a draft preview is available now).

To answer your question. This book is even more complete and a bit improved over the first book. I don't believe there's anything in Machine Learning that isn't well covered, or correctly omitted from Probabilistic Machine Learning. This also has the benefit of a few more years of rethinking these topics. So between the existing Murphy books, Probabilistic Machine Learning: an Introduction is probably the one you should have.

Why this over Bishop (which I'm not sure is the case)? While on the surface they are very similar (very mathematical overviews of ML from a very probability-focused perspective), they function as very different books. Murphy is much more of a reference to contemporary ML. If you want to understand how most leading researchers think about and understand ML, and want a reference covering the mathematical underpinnings, this is the book to have.

Bishop is a much more opinionated book in that Bishop isn't just listing out all possible ways of thinking about a problem, but really building out a specific view of how probability relates to machine learning. If I'm going to sit down and read a book, it's going to be Bishop because he has a much stronger voice as an author and thinker. However, Bishop's book is now more than 10 years old and misses out on nearly all of the major progress we've seen in deep learning. That's a lot to be missing and it won't be rectified in Bishop's perpetual WIP book [0].

A better comparison is not Murphy to Murphy or Murphy to Bishop, but Murphy to Hastie et al. The Elements of Statistical Learning for many years was the standard reference for advanced ML stuff, especially during the brief time when GBDT and Random Forests were the hot thing (which they still are to an extent in some communities). I really enjoy EoSL but it does have a very "Stanford Statistics" (which I feel is even more aggressively Frequentist than your average Frequentist) feel to the intuitions. Murphy is really the contemporary computer science/Bayesian understanding of ML that has dominated the top research teams for the last few years. It feels much more modern and should be the replacement reference text for most people.

0. https://www.mbmlbook.com/


Bourbaki student M. Talagrand has some work on approximate independence. If I were trying to do something along the lines of Probabilistic Machine Learning: Advanced Topics I would look

(1) carefully at the now classic

L. Breiman, et al., Classification and Regression Trees (CART),

and

(2) at the classic Markov limiting results, e.g., as in

E. Çinlar, Introduction to Stochastic Processes,

at least to be sure you are not missing something relevant and powerful,

(3) at some of the work on sufficient statistics, of course, first via the classic Halmos and Savage paper and then at the interesting more recent work in

Robert J. Serfling, Approximation Theorems of Mathematical Statistics,

and then for the most promising

(4) very carefully at Talagrand.

(1) and (2) are old but a careful look along with more recent work may yield some directions for progress.

What Serfling develops is a bit amazing.

Then don't expect the Talagrand material to be trivial.


The dense fog lifts, tree branches part, a ray of light beams down on a pedestal revealing the hidden intentions of the ancients. A plaque states "The operational semantics of the most basic primitives of your operating system are designed to simplify the implementation of shells." You hesitantly lift your eyes to the item presented upon the pedestal, take a pause in respect, then turn away slumped and disappointed but not entirely surprised. As you walk you shake your head trying to evict the after image of a beam of light illuminating a turd.

This looks like a fantastic resource. Thanks for sharing!

I really enjoy the Bayesian side of ML, but it's definitely not the most accessible. Erik Bernhardsson cites latent Dirichlet allocation as a big inspiration behind the music recommendation system he originally designed for Spotify, which is apparently still in use today[1]. I still struggle with grokking latent factor models, but it can be so rewarding to build your own and watch it work (even with only moderate success!).
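To see what a latent factor model looks like at the smallest possible scale, here is a toy matrix-factorization recommender fit by alternating least squares on synthetic low-rank "preference" data (all sizes and names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 20, 15, 3
# synthesize ratings from k hidden "tastes", as a latent factor model assumes
U_true = rng.standard_normal((n_users, k))
V_true = rng.standard_normal((n_items, k))
R = U_true @ V_true.T

# alternating least squares: fix one factor, ridge-solve for the other, swap
U = rng.standard_normal((n_users, k))
V = rng.standard_normal((n_items, k))
lam = 1e-3  # small ridge term keeps the solves well-conditioned
for _ in range(30):
    U = R @ V @ np.linalg.inv(V.T @ V + lam * np.eye(k))
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(k))

recon_err = np.abs(R - U @ V.T).max()
print(recon_err)  # small: the learned factors recover the hidden structure
```

This is the point-estimate view; LDA and the Bayesian treatments instead put priors on the factors and infer distributions over them.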

Kevin Murphy has been working on a new edition of MLaPP that is now two volumes, with the last volume on advanced topics slated for release next year. However, both the old edition and the drafts for the new edition are available on his website here[2].

The University of Tübingen has a course on probabilistic ml which probably has one of the most thorough walkthroughs of a latent factor model I've found on the Internet. You can find the full playlist of lectures for free here on YouTube[3].

In terms of other resources for deep study on fascinating topics which require some command over stats and probability:

- David Silver's lectures on reinforcement learning are fantastic [4]

- The Machine Learning Summer School lectures are often quite good, with exceptionally talented researchers / practitioners being invited to provide multi-hour lectures on their domain of expertise with the intended audience being a bunch of graduate students with intermediate backgrounds in general ML topics. [5]

1: https://www.slideshare.net/erikbern/music-recommendations-ml... 2: https://probml.github.io/pml-book/ 3: https://www.youtube.com/playlist?list=PL05umP7R6ij1tHaOFY96m... 4: https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPe... 5: http://mlss.cc


Given that PCA is heavily antiquated these days, I'd say that asking your candidates to know algebraic topology (the basis behind many much more effective non-linear DR algorithms like UMAP) is far better... But in spite of the field having long ago advanced beyond PCA, you're still using it to gatekeep.

Richard McElreath's content is a breath of fresh air for anyone who's struggled with stats.

I read both editions of his textbook and will be revisiting this new lecture material soon. I highly recommend you check out his book/course if you've been frustrated with trying to learn stats and want to practically understand things without hundreds of pages of proofs.

https://xcelab.net/rm/statistical-rethinking/


> Do people regularly run into coworkers like me during their career and simply ignore it because they find it too awkward to criticize them? Have I just been incredibly lucky and every boss I have had is too incompetent to notice? Do I have imposter syndrome and I am actually a 10x developer whose laziness makes them a 1x developer?

Lazy developers don't really bother me, if you do a couple hours of high-quality work a week I'd have no complaint. (Many weeks I do as little, some weeks I do a decent amount of real work :) The problem is developers making negative progress, usually messes that need to be cleaned up ... and it's an awkward situation, no matter your relative authority. I'm in the "we should consider them lines spent, not lines produced" camp.

An old disputed quote:

> I divide my officers into four classes as follows: The clever, the industrious, the lazy, and the stupid. Each officer always possesses two of these qualities.

> Those who are clever and industrious I appoint to the General Staff. Use can under certain circumstances be made of those who are stupid and lazy. The man who is clever and lazy qualifies for the highest leadership posts. He has the requisite nerves and the mental clarity for difficult decisions. But whoever is stupid and industrious must be got rid of, for he is too dangerous.


I scrape government sites a lot, as they don't provide APIs. For mobile proxies, I use the Proxidize dongles and mobinet.io (free, with Android devices). As stated in the article, with CGNAT it's basically impossible to block them; in my case, half the country would lose access to the sites (if you place them in several locations and use one carrier each there).

I've given a few textbook suggestions for almost all of the topics you requested, in a preferred order for learning them. But before you look at that list, consider the following:

I would strongly, strongly advise against trying to learn proof-based mathematics from a textbook (almost all of the math here will be proof-based). The absolute best way to learn mathematics is to have an experienced and competent instructor tailor their pedagogy to you. Failing that, an experienced instructor who is "just okay" but who can e.g. review and critique your work is better than a textbook.

Learning math is very unlike learning programming. It's a counterintuitive idea, but the information density of math textbooks (whether they're well or poorly written) is generally so high that you can't absorb the material unless you read only a few pages per day. Not only that, but it's usually not the case that a single textbook has the ideal level of exposition for your needs - for example, you don't have linear algebra on here despite it being a prerequisite for basically everything else. Some textbooks treat this subject in a highly theoretical manner, while others treat it at a very applied/computational level. Which suits your needs more? Have you studied it at all?

If you're actually serious about this, you need to proceed at a slow pace (2 - 5 pages per day) and complete as many exercises as possible. If the exercises are computationally focused you can do fewer, but you should aim to solve as many of the proof-based problems as possible.

If you go at a rate which will actually allow you to absorb the material, doing this "properly" will take you years. With dedication and not much talent I'd expect it to take as long as an undergraduate degree. With dedication and a lot of talent I could see this being accomplished in two, maybe three years. Once again, I strongly, strongly suggest finding a mentor or instructor.

In any case, here is a list of the textbooks most mathematicians will consider to be very good:

1. Calculus

Calculus, by Spivak

This gives you a rigorous treatment of calculus, which hopefully you have some familiarity with. After this you can move on to real analysis.

2. Real Analysis

Principles of Mathematical Analysis, by Rudin

You might be ready for this after Spivak's Calculus, but it can be rough. If you can't reproduce a proof of irrationality after reading through the first few pages, work through Tao's Analysis I first.

3. Topology

Topology, by Munkres is the absolute gold standard. You should be comfortable with calculus (and hopefully analysis) before tackling this.

4. Linear Algebra

Linear Algebra Done Right, by Axler

This is a thorough introduction to the subject at a theoretical level, with a focus on finite-dimensional vector spaces over fields R and C.

You should also work through either Linear Algebra by Friedberg, Insel, Spence or Linear Algebra by Hoffman & Kunze for the treatment of more advanced/specialized material and, in particular, determinants (which are notably de-emphasized by Axler).

Noam Elkies uses Axler for Harvard's Math 55 and has written up notes and remarks for his students; be sure to read them: http://www.math.harvard.edu/~elkies/M55a.16/index.html

5. Abstract Algebra (Groups, Rings, etc)

Abstract Algebra by Dummit & Foote is the usual reference text for a first course. It's pretty good. If it's too advanced for you, try Pinter's A Book of Abstract Algebra. For a very challenging (but comprehensive) approach to the subject, try Lang's Algebra.

6. Category Theory

Once you have abstract algebra under your belt, a good introduction to category theory is given by Aluffi's Algebra: Chapter 0. I would suggest not trying to dive into this prior to at least encountering fields, groups and rings because it's good to have both the traditional and modern (read: categorical) contexts.

Also try Category Theory in Context, by Riehl.

7. Complex Analysis

Complex Analysis, by Ahlfors. This is an excellent and concise text. You can theoretically approach this before real analysis, but I wouldn't recommend that. Also try Complex Variables, by Churchill & Brown.

8. Differential Geometry

Calculus on Manifolds by Spivak. You will want to have a thorough understanding of analysis and linear algebra before approaching this material.

9. Measure Theory

This is very advanced material in an analysis sequence; don't jump to this unless you've thoroughly worked through analysis first.

I would recommend Stein & Shakarchi's Real Analysis: Measure Theory, Integration and Hilbert Spaces.

10. Probability Theory

A really rigorous treatment of probability is measure theoretic, but even if you haven't worked with measures before you'll need (real) analysis and linear algebra. Tackle those first.

Feller's Introduction to Probability Theory is usually a good first course. If you don't like that, try Ross. For truly advanced probability theory, work through Shiryaev or Kallenberg.

The other things you've asked for are a little under-specified or outside my wheelhouse (in particular, I don't think chaos theory is still emphasized as a field distinct from dynamical systems). You should probably add ordinary and partial differential equations to your list before some of these more specialized topics.

1. Numerical Analysis

Numerical Linear Algebra, by Trefethen & Bau. This is the best all-around introduction. Once you've worked through this, try moving on to Matrix Computations by Golub & van Loan. The latter is much more of a reference text.

2. Cryptography

You haven't specified what you're looking for here, but given the mathematical bent of your question I'd recommend Goldreich's Foundations of Cryptography (two volumes). Be forewarned: cryptography is a subfield of complexity theory. You should have a strong understanding of complexity theory before embarking on Goldreich's Foundations.

If you really want to challenge yourself theoretically, work through Galbraith's Mathematics of Public Key Cryptography. The most up to date version is available for free: https://www.math.auckland.ac.nz/~sgal018/crypto-book/crypto-...

On the other hand, if you're looking for a more implementation-focused text on cryptography, try Menezes' Handbook of Applied Cryptography.

3. Optimization

This is extremely broad. There's linear programming, mixed integer programming, nonlinear optimization, stochastic optimization...I can't recommend textbooks targeted at everything here.

For a good start to the subject of optimization and constraints in general, work through Boyd & Vandenberghe's Convex Optimization. There are additional exercises available from the authors here: https://web.stanford.edu/%7Eboyd/cvxbook/bv_cvxbook_extra_ex...


At university, I generated Markov chains of the solution space from a single neuron that was being used as a binary classifier. You take n samples, average them out and look at the decision boundary. The decision boundary itself is linear but the margin of error is not.

It was really cool. Attempting to implement Hamiltonian MCMC on a single neuron really forced you to learn what a gradient is with regard to neural networks.
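A sketch of that kind of exercise, with random-walk Metropolis standing in for Hamiltonian MCMC (simpler, no gradients needed; the data and priors here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
# two Gaussian blobs as a toy binary classification problem
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.r_[np.zeros(50), np.ones(50)]

def log_post(w):
    # single neuron: sigmoid(X @ w[:2] + w[2]); standard normal prior on params
    z = X @ w[:2] + w[2]
    log_lik = np.sum(y * z - np.logaddexp(0, z))  # Bernoulli log-likelihood
    return log_lik - 0.5 * w @ w

w, lp = np.zeros(3), log_post(np.zeros(3))
samples = []
for i in range(20000):
    prop = w + 0.1 * rng.standard_normal(3)     # random-walk proposal
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:    # Metropolis accept/reject
        w, lp = prop, lp_prop
    if i >= 10000:                              # discard burn-in
        samples.append(w.copy())

w_mean = np.mean(samples, axis=0)
print(w_mean)
```

Averaging the samples gives the (linear) posterior-mean boundary, while the spread of the sampled boundaries is not linear, matching the "margin of error" observation above.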


I love how the era of Zoom lawmaking has killed decorum for so many of these state public functions. The pompous elected officials now have to connect from what often look like very pedestrian suburban homes, with either bare walls or run-of-the-mill family pics on them, bad lighting, cameras too close to them and too high or too low... They no longer project the institutional authority of big, carpeted, wainscoted, pillared halls of power, and it's just as well: I like that they should think of themselves as mere managers of public administration, not historical figures destined to leave grandiose legacies...

What poetry replaces is pip and requirements.txt. Our team had a lively discussion about this, but here are some good reasons:

* Keep test frameworks out of production (in case they've busted something; the two points below can happen to test libraries too.)

* pip recently changed the way its resolver works[1], breaking numerous projects. Yeah yeah, it's a major version bump, but lots of containers and environments installed pip latest just sort of assuming that it would always handle requirements.txt the same way. With that contract broken, now using pip directly isn't a non-decision anymore.

* Locking specific versions of dependencies lets you roll back Bad News in production. Let's say you have a project with library A which has dependency B. If A asks for the latest version of B, you can be in a situation where a new version of B breaks your project and you won't know about it until you do a release _and_ if you try to roll back you'll still be in trouble. We recently had to deal with a similar issue and had to fall back on an older container until we could figure it out.

[1] https://pyfound.blogspot.com/2020/11/pip-20-3-new-resolver.h...


Sad to see both LiveLeak and Best Gore gone. Not that I visited either much, but they really opened my eyes to the horrors of reality.

What are alternatives to LiveLeak or Best Gore?


No, that was the correct response.

If anything, you should have fired him intentionally yourself when he went over your head. He isn't looking to work with you or the team, he's actively seeking to undermine you; at that point his presence is a negative, not a positive.


You can square it with some basic philosophy. Hume's Guillotine - you can't derive an ought statement from is statements alone. Munchhausen Trilemma - you can only answer "why" with circular reasoning, infinite regress, or axioms - axioms being the only reasonable option of the three.

Throw in some other logic, and you end up with the conclusion that the world can still be logical, but that many normative conclusions (which is what the world runs on) depend on moral axioms that differ from person to person.

So there's a distinction between people that reason correctly from their moral axioms, and people who don't. And among the ones who don't, there's a distinction between those that just make a mistake, and the ones that are deliberately presenting their arguments in bad faith. But even the latter have their own internal conclusions that are honest attempts at rational derivations from their internal values.


I'm currently using a non-linear version of it, DeepSurv [*], implemented in pycox, for a predictive maintenance job. Its outputs are much more informative than a binary label and give space to make a business decision as to how close to EOL you want to take care of the asset.

[*] The underlying neural networks aren't deep at all.


Yes. Beamer LaTeX with the Metropolis theme and Fira Sans as the main font.

For control theory from a mathematician, Fleming, long at the Division of Applied Mathematics at Brown University, there is:

Wendell H. Fleming and Raymond W. Rishel, Deterministic and Stochastic Optimal Control, ISBN 0-387-90155-8, Springer-Verlag, Berlin, 1979.

There is more on optimization, especially from the Hahn-Banach theorem, and also Kalman filtering, in

David G. Luenberger, Optimization by Vector Space Methods, John Wiley and Sons, Inc., New York, 1969.


Not quite finished yet but coming soon: Speech & Language Processing (3rd ed.) https://web.stanford.edu/~jurafsky/slp3/

Reggie Foster did have a textbook come out before his death (unlike Ørberg's, which you mention, it is written in English rather than Latin), Ossa Latinitatis Sola.

https://thelatinlanguage.org/ossa/

He also had a sequel (about reading Cicero) in press which is due out in January. It's called Ossium Carnes Multae. Daniel McCarthy, the editor, has been collecting materials by Foster with the aim of bringing out a five-part series:

https://thelatinlanguage.org/latinitatis-corpus/

I'm not sure whether volumes III through V are ever going to appear. :-(


Both Bayesian inference and deep learning can do function fitting, i.e. given a number of observations y and explanatory variables x, you try to find a function so that y ~ f(x). The function f can have few parameters (e.g. f(x)= ax+b for linear regression) or millions of parameters (the usual case for deep learning). You can try to find the best value for each of these parameters, or admit that each parameter has some uncertainty and try to infer a distribution for it. The first approach uses optimization, and in the last decade, that's done via various flavors of gradient descent. The second uses Monte Carlo. When you have few parameters, gradient descent is smoking fast. Above a number of parameters (which is surprisingly small, let's say about 100), gradient descent fails to converge to the optimum, but in many cases gets to a place that is "good enough". Good enough to make the practical applications useful. In pretty much all cases though, Bayesian inference via MCMC is painfully slow compared to gradient descent.

But there is a case where it makes sense: when you have reasonably few parameters, and you can understand their meaning. And this is exactly the case of what's called "statistical models". That's why STAN is called a statistical modeling language.

How is that? Gradient descent for these smallish models is just MLE (maximum likelihood estimation). People have been doing MLE for 100 years, and they understand its ins and outs. But some models are simply unsuited for MLE; their likelihood function is called "singular": there are places where the likelihood becomes infinite despite the fit being quite poor. One way to fix that is to "regularize" the problem, i.e. to add some artificial penalty that does not allow the likelihood to become infinite. But this regularization is often subjective: you never know whether the penalty you add is small enough not to alter the final fit. Another way is to do Bayesian inference. It's very slow, but you don't get pulled towards the singular parameters.
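To make the contrast concrete, here is a toy sketch (the data, model f(x) = a*x, and all numbers are invented for illustration) that fits one parameter both ways: MLE via plain gradient descent, and a posterior distribution via a minimal Metropolis sampler, standing in for what a tool like STAN does far more efficiently.

```python
import random, math

random.seed(0)

# Synthetic data: y = 2*x + noise, so the "true" parameter is a = 2.
xs = [i / 10 for i in range(1, 51)]
ys = [2.0 * x + random.gauss(0, 0.5) for x in xs]

def neg_log_lik(a, sigma=0.5):
    # Gaussian negative log likelihood, up to an additive constant.
    return sum((y - a * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2)

# Approach 1: MLE via plain gradient descent (fast).
a, lr = 0.0, 1e-4
for _ in range(2000):
    # d/da of neg_log_lik; 0.25 is sigma**2.
    grad = -sum((y - a * x) * x for x, y in zip(xs, ys)) / 0.25
    a -= lr * grad
a_mle = a

# Approach 2: Bayesian inference via a minimal Metropolis sampler
# (flat prior) -- much slower, but yields a whole distribution.
samples, cur = [], 0.0
cur_nll = neg_log_lik(cur)
for _ in range(20000):
    prop = cur + random.gauss(0, 0.1)      # symmetric random-walk proposal
    prop_nll = neg_log_lik(prop)
    if math.log(random.random()) < cur_nll - prop_nll:
        cur, cur_nll = prop, prop_nll      # accept
    samples.append(cur)
post = samples[5000:]                      # discard burn-in
post_mean = sum(post) / len(post)
```

Both estimates land near 2, but the MCMC run also gives the spread of plausible values for a, which is the whole point of going Bayesian.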


Yes, as in the OP, S curves can be challenging to work with, in particular when used, say, early in the history of smart phones to make long term projections from the number of smart phones sold each day over the last 30 days.

But there is some good news: The data used can vary, and in some cases good projections can be easier to make. Can see, e.g., for COVID-19 the recent

https://news.ycombinator.com/item?id=22898015

https://news.ycombinator.com/item?id=22897967

https://news.ycombinator.com/item?id=22900104

https://news.ycombinator.com/item?id=22902667

In the third one of those, the projection from a FedEx case is the solution to the first order ordinary differential equation initial value problem

y'(t) = k y(t) (b - y(t))

There for data we used y(0) and b. Then we guessed at k. Had we used values of y for the past month, we could have picked a better, likely fairly good, value for k.

Lesson: Fitting an S curve does not have to be terribly bad.

The key here is the b: the S curve of the solution is the logistic curve, and it rises to be asymptotic to b from below. Knowing b helps a LOT! When we have b, we are no longer doing a projection or extrapolation but nearly just an interpolation, which is much better.

For FedEx, the b was the capacity of the fleet. For COVID-19 the b would be the population needed for herd immunity (from recovering from the virus, from therapeutics that confer immunity, and a vaccine that confers immunity).

Knowing b makes the fitting much easier/better. To know b, likely need to look at the real situation, e.g., population of candidate smart phone users, candidate TV set owners, market potential of FedEx (as it was planned at the time), or population needed for herd immunity for the people in some relatively isolated geographic area.

Then in TeX source code, the solution is

y(t) = { y(0) b e^{bkt} \over y(0) \big ( e^{bkt} - 1 \big ) + b}
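As a sanity check on that closed form, here is a small sketch (b, k, and y(0) invented for illustration, not FedEx's actual numbers) that evaluates the formula and confirms it against a naive forward-Euler integration of y'(t) = k y(t) (b - y(t)).

```python
import math

# Illustrative parameters: b is the ceiling (assumed known), k the
# word-of-mouth rate, y0 the initial value y(0).
b, k, y0 = 1000.0, 0.001, 10.0

def logistic(t):
    # Closed-form solution of y' = k y (b - y) with y(0) = y0.
    e = math.exp(b * k * t)
    return y0 * b * e / (y0 * (e - 1.0) + b)

# Forward-Euler integration of the same ODE should track the formula.
y, dt = y0, 0.01
for _ in range(int(10.0 / dt)):    # integrate out to t = 10
    y += dt * k * y * (b - y)
```

At t = 10 the curve has nearly saturated at b, and the numerical solution agrees with the closed form to within a small fraction of b.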

Can also use a continuous time discrete state space Markov process subordinated to a Poisson process. Here's how that works:

Have some states, right, they are discrete. For FedEx, that would be (i) the number of customers talking about the service and (ii) the number of target customers listening. Then the time to the next customer is much like the time to the next click of a Geiger counter, that is, has exponential distribution, that is, is the time of the next arrival in a Poisson arrival process (e.g., the time of the next arrival at the Google Web site). So at this arrival, the process moves to a new state where we have 1 more current customer and 1 less target customer. Then start again to get the next new customer.
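A minimal simulation of that jump process (population size and rate constant invented for illustration): with m current customers and N - m targets left, the next adoption arrives after an exponential holding time with rate proportional to m (N - m), and averaging a few hundred sample paths traces out the S curve.

```python
import random

random.seed(1)

N = 200      # total target population (assumed)
c = 0.005    # adoption rate per talker-listener pair (assumed)

def sample_path():
    """One sample path: (time, customers) at each jump.  With m current
    customers, the next adoption arrives after an Exponential time with
    rate c * m * (N - m), like the next click of a Geiger counter."""
    t, m, path = 0.0, 1, [(0.0, 1)]
    while m < N:
        rate = c * m * (N - m)
        t += random.expovariate(rate)
        m += 1
        path.append((t, m))
    return path

# Average 500 sample paths on a common time grid; the mean is an S curve.
grid = [i * 0.15 for i in range(101)]
mean = [0.0] * len(grid)
runs = 500
for _ in range(runs):
    path = sample_path()
    j = 0
    for i, t in enumerate(grid):
        while j + 1 < len(path) and path[j + 1][0] <= t:
            j += 1
        mean[i] += path[j][1]
mean = [m / runs for m in mean]
```

The averaged curve starts at 1, rises slowly, accelerates when there are many talkers and many listeners, then flattens as the targets run out, which is exactly the logistic shape from the ODE above.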

The Markov assumption is that the past and future of the process are conditionally independent given the present state; so that justifies our getting to the next state using only the current state -- given the current state, for predicting the future, everything before that is irrelevant.

Whether a process satisfies the Markov assumption can depend on what we select for the state -- roughly, the more we put in the state, the closer we are to Markov. In particular, if we take the whole past history of the process as the state, IIRC every process is Markov. But Markov helps in something like the FedEx application since that state is so simple.

We get to use continuous time since the time to the next change of state is from a Poisson process whose arrival times are the continuum -- that is, we don't have to make time discrete although it is true that the history of the process (one sample path) has state changes only at discrete times.

So, for state changes: for some positive integer n we have n possible states, and for i, j = 1, 2, ..., n we can have some p(i,j), the probability of jumping from state i to state j. That is, we have an n x n matrix of transition probabilities.

[p(i,j) is the conditional probability of entering state j given that the last state was i.]
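A tiny numerical illustration (the 3-state chain and its probabilities are made up): squaring the matrix gives the two-jump probabilities, and high powers show the chain piling up in an absorbing state, like the "no one is left sick" state in the epidemic reading.

```python
def matmul(A, B):
    # Plain matrix multiply for small square matrices of lists.
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# P[i][j] = p(i,j): probability of jumping from state i to state j.
# State 2 is absorbing: once there, the chain never leaves.
P = [[0.5, 0.4, 0.1],
     [0.0, 0.6, 0.4],
     [0.0, 0.0, 1.0]]

P2 = matmul(P, P)   # two-jump transition probabilities
Pn = P
for _ in range(49):
    Pn = matmul(Pn, P)   # P^50: after many jumps, absorption dominates
```

Each row of P2 still sums to 1, and by P^50 the probability of having reached the absorbing state from anywhere is essentially 1.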

For two jumps, square that matrix. Now there is a lot of pretty math -- get some limits and eigenvectors of states, etc. Actually, fairly generally there is a closed form solution to the process. Alas, often in practice that closed form is useless because the n and the n x n are so large, maybe n^2 in the trillions.

E.g., in a problem I solved for war at sea, there were Red weapons and Blue weapons, on each side some number of types and some number of weapons of each type. The states were the combinatorial explosion. Then there were the one on one Red-Blue encounters where one died, the other died, both died, or neither died. The time to an encounter was the next arrival of Poisson processes, also Poisson.

Well, that was an example where there was a closed form solution but n and n x n were wildly too large for the closed form solution; yet running off, say, 500 sample paths via Monte Carlo was easy to program and fast for the computer. So, sure, the software reported the average of the 500 sample paths. On a PC today, my software would be done before I could get my finger off the mouse button or the Enter key.

This approach is fairly general. And since what I did included attack submarines, SSBN submarines, anti-submarine destroyer ships, long range airplanes, etc., there should be no difficulty building such a model for COVID-19 that included babies, grade school kids, ..., nursing home residents, people at home, people working nearly alone on farms, ....

Back to S curves, IIRC dropping out of the math for the n x n matrix and its powers is an S curve. So, in a broad range of cases, always get an S curve although a different curve depending on, yes, the p(i,j) and the initial state. Uh, when no one is left sick, the Markov process handles that as an absorbing state -- once get there, don't leave.

For the n in the billions, the n x n is really a biggie. So, for the submarine problem I did,

J. Keilson, Green's Function Methods in Probability Theory.

asked "How can you possibly fathom that enormous state space?". That is a good question, and my answer was: "After, say, 5 days, the number of SSBNs left is a random variable. It is bounded. So it has finite variance. So, both the strong and weak laws of large numbers apply. So, run off 500 sample paths, average them, and get the expectation within a gnat's ass nearly all the time. Intuitively, Monte Carlo puts the effort where the action is.". Keilson was offended by "gnat's ass" but liked the math and approved my work for the US Navy. That question and answer are good to keep in mind.

There is more in, say,

Erhan Çinlar, Introduction to Stochastic Processes, ISBN 0-13-498089-1, Prentice-Hall, Englewood Cliffs, NJ, 1975.

For why the arrival times have exponential distribution and why we get a Poisson process, Çinlar has a nice simple, intuitive, useful axiomatic derivation. There is more via the renewal theorem in

William Feller, An Introduction to Probability Theory and Its Applications, Second Edition, Volume II, ISBN 0-471-25709-5, John Wiley & Sons, New York, 1971.

