
Non-Bayesian NN training does indeed use regularizers that are chosen subjectively, but they are then tested in validation, and the best-performing regularizer is chosen. Thus the choice is empirical, not subjective.

A Bayesian could try the same thing: try out several priors, and pick the one that performs best in validation. But if you pick your prior based on the data, then the classic theory about “principled quantification of uncertainty” doesn’t apply any more. So you’re left using a computationally unwieldy procedure that doesn’t offer theoretical guarantees.
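
For concreteness, here is a minimal sketch of what "pick the regularizer (or prior) that performs best in validation" looks like, using ridge regression since its penalty strength is exactly the precision of a zero-mean Gaussian prior on the weights; the data and candidate values are purely illustrative:

    # Minimal sketch: choosing a Gaussian prior scale by validation performance.
    # Ridge regression is MAP estimation under a zero-mean Gaussian prior, so the
    # penalty strength `alpha` plays the role of the prior precision.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    candidates = [0.01, 0.1, 1.0, 10.0]   # candidate prior precisions
    scores = {a: mean_squared_error(y_val, Ridge(alpha=a).fit(X_tr, y_tr).predict(X_val))
              for a in candidates}
    print(scores, "-> best:", min(scores, key=scores.get))

The same loop works for a Bayesian model with prior A vs. prior B; the question is what that does to the uncertainty-quantification story, not whether it can be done.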


You can, in fact, do that. It's called (aptly enough) the empirical Bayes method. [1]

[1] https://en.wikipedia.org/wiki/Empirical_Bayes_method
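
A minimal sketch of the idea, using scikit-learn's BayesianRidge, which estimates its prior precision and noise precision from the data by maximizing the marginal likelihood (type-II maximum likelihood) rather than fixing them up front; the data here is synthetic and purely illustrative:

    # Minimal sketch of empirical Bayes: the prior precision (lambda_) and noise
    # precision (alpha_) are fitted by maximizing the marginal likelihood rather
    # than being chosen up front.
    import numpy as np
    from sklearn.linear_model import BayesianRidge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)

    model = BayesianRidge().fit(X, y)
    print("estimated prior precision:", model.lambda_)
    print("estimated noise precision:", model.alpha_)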


Empirical Bayes is exactly what I was getting at. It's a pragmatic modelling choice, but it loses the theoretical guarantees about uncertainty quantification that pure Bayesianism gives us.

(Though if you have a reference for why empirical Bayes does give theoretical guarantees, I'll be happy to change my mind!)


> Non-Bayesian NN training does indeed use regularizers that are chosen subjectively, but they are then tested in validation, and the best-performing regularizer is chosen. Thus the choice is empirical, not subjective.

I'd argue the choice is still subjective, since you are still only testing over a limited (subjective) set of options. If you are doing this properly (i.e., using an independent validation set), then you can apply the same approach to a Bayesian method and obtain the same type of information ("when I use prior A vs. prior B, how does that change the generalization/out-of-bag error properties of my model?"), without violating any properties or theoretical guarantees of "Bayesianism".

> A Bayesian could try the same thing: try out several priors, and pick the one that performs best in validation. But if you pick your prior based on the data, then the classic theory about “principled quantification of uncertainty” doesn’t apply any more.

If you subjectively define a set of possible priors (i.e., distributions and parameters) to test in a validation setting, then you are not picking your prior based on the data (again, assuming a leakage-free partition of your data into training and validation sets), and you are not doing empirical Bayes. So you are not violating any supposed "principled quantification of uncertainty" (if you believe that applying a standard subjective Bayesian approach provides you with "principled quantification of uncertainty" in the first place).

My point was that, in practice, there are ways of choosing (subjective) priors such that they provide sufficient regularization while ensuring that their impact on the results is minimized, particularly when you can assume certain things about the scale of data (and, in the context of neural networks, you often can, due to things like "normalization layers" and prior scaling of inputs and outputs): "subjective" doesn't have to mean "arbitrary".

> So you’re left using a computationally unwieldy procedure that doesn’t offer theoretical guarantees.

I won't argue about the fact that training NNs using Bayesian approaches is computationally unwieldy. I just don't see how evaluating a modelling decision (be it Bayesian or non-Bayesian modelling), using a proper validation process, would violate any specific theoretical guarantees.

If you can explain to me how evaluating the generalization properties of a Bayesian training recipe on an independent dataset violates any specific theoretical guarantees, I would be thankful (note: as far as I am concerned, "principled quantification of uncertainty" is not a specific theoretical guarantee).


I like Bayesian inference for few-parameter models where I have solid grounds for choosing my priors. For neural networks, I like to ask people "what's your prior for ReLU versus LeakyReLU versus sigmoid?" and I've never gotten a convincing answer.


I choose LeakyReLU vs ReLU depending on whether it's an odd day of the week, with LeakyReLU slightly favored on the odd days because it's aesthetically nicer that gradients propagate through negative inputs, though I can't discern a difference. I choose sigmoid if I want to waste compute to remind myself that it converges slowly due to vanishing gradients at extreme activation levels. So it's empiricism retroactively justified by some mathematical common sense that lets me feel good about the choices. Kind of like aerodynamics.


I agree choosing priors is hard, but choosing ReLU versus LeakyReLU versus sigmoid seems like a problem with using neural nets in general, not Bayesian neural nets in particular. Am I misunderstanding?


Kolmogorov Arnold nets might have an answer for you!


Ah, Kolmogorov Arnold Networks. Perhaps the only model I have ever tried that managed to fairly often get AUCs below 0.5 in my tabular ML benchmarks. It even managed to get a frankly disturbing 0.33, where pretty much any other method (including linear regression, IIRC) would get >=0.99!


Why do you think they perform so poorly?


Theory-wise, I'm not convinced that the models have good approximation properties (the Kolmogorov-Arnold / Kolmogorov Superposition Theorem they base themselves on has quite a bit of nuance), and the optimization problem might be a bit tricky. I also can't see how to incorporate inductive biases other than the standard R^n / tabular regression one, and the existing attempts on this that I'm aware of are just band-aids (along the lines of feature engineering).

In practice, I've personally run some benchmarks on a collection of datasets I had lying around. The results were generally abysmal, with the method only matching simple baselines on a few datasets.

Finally, the original paper is very weird, and reads more like a marketing piece. The theory, which is touted throughout the paper, is very weak, the actual algorithm is not sufficiently well explained there, and the experiments are lacking. In particular, I find it telling that they do not include, and even go out of their way to ignore, important baselines such as boosted trees, which are the state-of-the-art solution to the problem they intended to solve (and even work very well in cases where they claim that both KANs and MLPs perform badly, e.g. in high dimensions).


Thanks for the detailed answer. So I guess the main issue with KANs is that they don't work as well. I wonder if that shortfall could be because we haven't spent as much time figuring out how to set up KANs for learning as we have for things like MLPs. I am not surprised, though, that KANs don't beat boosted trees and such. MLPs don't really either.

Only one follow up question:

> I also can't see how to incorporate inductive biases other than the standard R^n / tabular regression one, and the existing attempts on this that I'm aware of are just band-aids (along the lines of feature engineering)

A lot of the way we encode inductive biases in the traditional network setting (activations on the nodes instead of on the edges as in KANs) is by using graph-based architectures, like convolutions or transformers, or by setting up particular losses and optimizations as in equivariant networks. Can't we do the same thing for KANs?


Could you say a bit more about how so?


KANs have learnable activations based on splines parameterized by a few variables. You can specify a prior over those variables, effectively establishing a prior over your activation function.
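
A minimal sketch of that idea, treating an activation as a B-spline and putting a Gaussian prior over its coefficients; the knot layout and prior scale here are illustrative choices, not the exact parameterization used in the KAN paper:

    # Sketch: a learnable activation as a clamped cubic B-spline with a Gaussian
    # prior over its coefficients. Sampling coefficients from the prior yields
    # sampled activation functions.
    import numpy as np
    from scipy.interpolate import BSpline

    degree = 3
    knots = np.concatenate(([-2.0] * degree, np.linspace(-2, 2, 8), [2.0] * degree))
    n_coef = len(knots) - degree - 1
    prior_std = 1.0                      # illustrative prior scale
    rng = np.random.default_rng(0)

    x = np.linspace(-2, 2, 5)
    for _ in range(3):
        coef = rng.normal(0.0, prior_std, size=n_coef)  # draw from the prior
        phi = BSpline(knots, coef, degree)              # one sampled activation
        print(phi(x))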


I'm sure there is a way of interpreting a ReLU as a sparsity prior on the layer.


The article explained that there are two roughly equal drivers: (1) Water is a better heat reservoir than land, and winds tend to blow eastwards, so Europe gets air warmed by the sea and the US east coast gets colder air that's come from the land. (2) The joint effect of the altitude of the Rockies and the angular rotation of the earth means that air currents flow southeast over the Rockies and then northeast, so arctic air gets pulled down and then pushed back up over the US east coast.


Or in short: Western Europe's climate is what you would expect from a place close to an ocean. Especially when by Europe you mean the island that's Great Britain. It's the US East coast that's weird, with a climate that's a lot more continental than you would naively expect.


When looking at only 3 continental coasts, 2 West coasts and one East coast, saying "the East one is just weird" seems like it misses the point just stated.


It is worth noting that even for a West coast, Europe is very maritime, because in addition to the Atlantic it has the North Sea and the Baltic kind of in the middle of it, and the Mediterranean to its south. And because people (including this article) like to take either Spain (sitting on a peninsula) or Great Britain (an island) as points of reference. And the (upper half of the) US East coast is very continental even for an East coast, because of the effects of the Rockies mentioned in the article.

But overall the point the article tries to make is that people like to say that Europe is unusually mild when really people just compare it to the wrong places (with the US East coast being a popular and uniquely bad comparison due to the reasons summarized by GP).


Does that mean that the air in Europe is warmed by the rest of the seawater that is not part of the Gulf Stream myth, and the cold arctic air gets pushed back over the US while the warmer air stays in Europe? So in reality the mild weather in Europe is a result of the angular rotation of planet Earth in combination with the altitude of the Rockies?


I don't know how long I read looking for these answers. Thank you for your service.


It'd be fun (and a bit scary) to use an LLM as a shell replacement. We'd give it the history of our commands as per the recent post [0], as well as their outputs, and it would turn natural-language commands into proper bash. The xkcd comic [1] would be solved instantly. "Tar these files, please." "Delete all the temporary files but please please please don't delete anything else." I'm sure people have implemented this, but my searching isn't good enough to find it.

[0] https://news.ycombinator.com/item?id=38965003

[1] https://xkcd.com/1168/
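
A rough sketch of what such a wrapper could look like; ask_llm is a placeholder for whatever model API you'd plug in, and the confirmation prompt is the only thing standing between you and the comic:

    # Rough sketch of an LLM-backed shell wrapper. `ask_llm` is a placeholder for
    # whatever model/API you use; the point is the prompt construction and the
    # mandatory confirmation step before anything gets executed.
    import subprocess
    from pathlib import Path

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your model API of choice here")

    def natural_language_shell(request: str) -> None:
        history = Path.home() / ".bash_history"
        recent = history.read_text().splitlines()[-50:] if history.exists() else []
        context = "\n".join(recent)
        prompt = (
            "You translate natural-language requests into a single bash command.\n"
            "Recent commands for context:\n" + context + "\n"
            "Request: " + request + "\n"
            "Reply with the bash command only."
        )
        command = ask_llm(prompt).strip()
        print("Proposed command:\n  " + command)
        if input("Run it? [y/N] ").lower() == "y":
            subprocess.run(command, shell=True)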


I briefly told ChatGPT 3.5 about the syntax of a CLI tool I wrote, then asked it to perform a few operations. It did a surprisingly good job, even when I said "and format the result as JSON with fields named Command and Description where the latter explains what the command does".

If I were to actually use this in a real system, I'd definitely build a restricted shell to execute in, and probably run it inside a Docker container with just the essential files mapped in, because I don't trust an LLM not to describe what it's doing as "updating timestamps" or whatever while the actual command is "rm -rf ~".
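
The restricted-execution part can be sketched along these lines: run whatever the LLM proposes in a throwaway container with no network and only the working directory mounted read-only (image name and flags are illustrative, adjust to taste):

    # Sketch: run an untrusted, LLM-generated command in a throwaway container
    # with no network access and only the current directory mounted, read-only.
    import os
    import subprocess

    def run_sandboxed(command: str) -> subprocess.CompletedProcess:
        return subprocess.run(
            [
                "docker", "run", "--rm",
                "--network", "none",                 # no network access
                "-v", os.getcwd() + ":/work:ro",     # only the cwd, read-only
                "-w", "/work",
                "bash:5", "bash", "-c", command,
            ],
            capture_output=True,
            text=True,
        )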


According to the Wikipedia page for eszett [0] it evolved from "sz", as the name "eszett" suggests. (I only realized the link with "z" when I saw "tz" ligatures on street signs in Berlin.) Given that its typographic origin is sz, and given that its name literally says sz, I wish the spelling reformists had gone for sz rather than ss!

[0] https://en.wikipedia.org/wiki/%C3%9F


They use Pyodide, a full Python interpreter in WASM: https://pyodide.org/en/stable/console.html

Pyodide includes many useful Python libraries, such as numpy, pandas, and matplotlib.
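
For example, this sort of thing runs directly in that console (the console fetches bundled packages like numpy and pandas the first time they're imported; plain CPython with the same packages installed works too):

    # Tiny example of the bundled scientific stack; works in the Pyodide console
    # (which loads numpy/pandas on first import) as well as in regular CPython.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"x": np.linspace(0, 1, 5)})
    df["y"] = np.sin(2 * np.pi * df["x"])
    print(df)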


Windows 8 (Metro) used semantic zoom. It's been a while, but I do remember that one of the apps that used it very nicely was Photos. A search for "windows metro semantic zoom" comes up with lots of articles about semantic-zoom-aware GridView controls etc.

Why isn't it commonplace? I think that touchscreen laptops are still too much a minority, and keyboard + mouse + monitor are too entrenched, for anyone to seriously attempt it again for a while. (A shame -- I'm one of the few who really liked the Windows 8 Metro interface.) I think that phones are too small for it to really work well. I don't know why it's not more popular on tablets.


As an Australian I'm used to hearing "antipodean" and sometimes "antipodes", so "octopodes" sits well!


That xkcd comic highlights the problem with observational (as opposed to controlled) studies. TFA is about A/B testing, i.e. controlled studies. It’s the fact that you (the investigator) are controlling the treatment assignment that allows you to draw causal conclusions. What you happen to believe about the mechanism of action doesn’t matter, at least as far as the outcome of this particular experiment is concerned. Of course, your conjectured mechanism of action is likely to matter for what you decide to investigate next.

Also, frequentism / Bayesianism is orthogonal to causal / correlational interpretations.


I think what kevinwang is getting at is that if you A/B test a static version A against enough versions of B, at some point you will get statistically significant results just by chance.

Having a control doesn't mean you can't fall victim to this.


You control statistical power and the error rate, and choose to accept a % of false results.


AB tests are still vulnerable to p-hacking-esque things (though usually unintentional). Run enough of them and your p value is gonna come up by chance sometimes.

Observational ones are particularly prone because you can slice and dice the world into near-infinite observation combinations, but people often do that with AB tests too. Shotgun approach: test a bunch of approaches until something works. But if you'd run each of those tests at different significance levels, or for twice as long, or half as long, you could very well see the "working" one fail and a "failing" one work.
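
A quick simulation of that effect (numbers are illustrative): with no real difference between A and B at all, about 5% of tests at p < 0.05 still come out "significant", so a large enough shotgun of tests is near-guaranteed to produce spurious winners.

    # Simulate many A/B tests where A and B are identical (no true effect).
    # At alpha = 0.05, roughly 5% still come out "significant".
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    n_tests, n_users = 200, 1000

    false_positives = 0
    for _ in range(n_tests):
        a = rng.normal(size=n_users)   # metric under variant A
        b = rng.normal(size=n_users)   # metric under variant B (same distribution)
        _, p = ttest_ind(a, b)
        false_positives += p < 0.05

    print(f"{false_positives}/{n_tests} significant at p<0.05 despite no effect")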


The xkcd comic seems more about the multiple comparisons problem (https://en.wikipedia.org/wiki/Multiple_comparisons_problem), which could arise in both an observational or controlled setting.


I’m curious! Why “Bayesian”?

