Seems like a good collection of standard ML techniques, introduced with a fairly unified mathematical notation and quite a few proofs. Quite the Herculean effort (600 pages!). It just seems to me like they're putting the emphasis on the stuff that is more straightforward to formalize rather than the stuff that would be interesting to understand.
Look e.g. at the SGD chapter. I picked this because I think optimization is one of the areas where mathematicians actually can and do make impactful contributions to ML. But then look at the chapter in the book: most of the proofs are fairly elementary (like bias-variance decompositions or Jensen inequalities), some of the more interesting theorems (on convergence) are cited from the literature and don't build on the lemmata, and the subchapters on the actually interesting methods like ADAM, ... are completely free of proofs or theory. It seems to me that after reading the chapter, a reader will have a good understanding of modern SGD methods and how we got there, but they won't necessarily be much wiser about why those methods work, other than having a good intuition confirmed by numerical experiments. If that's the outcome, then I wonder what all the fuss of proving the basic stuff was for. Wouldn't it be more useful to dedicate the space to convergence proofs for ADAM (which do exist) rather than to showing lots of things like E(XY) = E(X)E(Y) for independent random variables?
That's just one chapter, and I may not be doing them full justice here, although I did read through a few others as well. I first got this impression from the ANN chapter, which is rife with long proofs of rather basic and uninteresting results, and from the physics-informed neural networks paper (which I actually find really nice, although it suffers a bit from the same problem as the SGD chapter). I don't want to be too critical here; it is nice in general to move towards a more rigorous and unified exposition of ML methods, and their approach should extend to the more technical results as well. I'm just questioning where they drew the line between what to include and what to leave out.
> they won't necessarily be much wiser about why those methods work, other than having a good intuition confirmed by numerical experiments.
This is the state of the field as a whole, isn't it?
> Wouldn't it be more useful to dedicate the space to convergence proofs for ADAM (which do exist)
Convergence proofs don't really explain why Adam tends to work better than other methods.
It's hard to blame them for not being able to explain things that, currently, nobody understands. But I guess it kind of undermines the idea of a theory-heavy approach to teaching if the theory we have can't predict the things that are actually important.
ADAM is known to have better convergence bounds than other methods. Theoretical bounds may not explain the full story of why a method works well, but they are how mathematicians reason about it. I'm only blaming them for not sharing those relevant parts of what we already know.
An even bigger pain point for me is how they chose to allocate the space: the theorem statements for the most relevant results are missing, the proofs of the more interesting theorems are just citations, while the proofs of basic and arguably only tangentially relevant lemmata from e.g. probability theory take up pages and pages.
for those who want some maths-heavy stuff for deep learning, check francois fleuret's book https://fleuret.org/francois/lbdl.html. the pdf is free but the print is so cute.
Has anyone figured out a way to print the Fleuret book on A4 paper? Every other page ends up upside down when I've tried it, which is problematic with a duplexer.
Off topic but I'm replying here since the medieval cat thread is closed for comments. I found my copy of Catwatching and Desmond Morris does indeed claim on page 12 that cats were persecuted in the Middle Ages.
"These good times for cats were not to last, however. In the Middle Ages the feline population of Europe was to experience several centuries of torture, torment, and death at the instigation of the Christian church. Because they had been involved in earlier pagan rituals, cats were proclaimed evil creatures, the agents of Satan and familiars of witches. Christians everywhere were urged to inflict as much pain and suffering on them as possible. The sacred had become the dammed. Cats were publicly burned alive on Christian feast days. Hundreds of thousands of them were flayed, crucified, beaten, roasted, and thrown from the tops of church towers at the urging of the priesthood, as part of a vicious purge against the supposed enemies of Christ."
As someone who has a deeper knowledge of programming rather than math, I find the mathematical notation here to be harder to understand than the code (even in a programming language I do not know).
Does anyone with a stronger mathematical background here find the math as written easier to understand than the source code?
Sharing my experience here. My background is in math (Ph.D. and a couple of postdoc years) before switching to being a practitioner in deep learning. This year I taught a class at university (as an invited prof) in deep learning for students doing a master's in math and statistics (but with some programming knowledge, too).
I tried to present concepts in as mathematically accurate a way as reasonably possible, and in the end I cut through a lot of math, in part to avoid the heavy notation which seems to be present in this book (and in part to make sure students could use what they learnt in industry). My actual classes had way more code than formulas.
If you want to write everything very accurately, things get messy, quickly. Finding a good notation for new concepts in math is very hard, something that is sometimes managed only by bright minds, even though afterwards everybody recognizes it was “clear” (think about Einstein notation, Feynman diagrams, etc., or even just matrix notation, which Gauss was unaware of). If you just take domain A and write it in the notation of domain B, it's hard to get something useful (translating quantum mechanics to math with C*-algebras and co. was a big endeavour, still an open research field to some extent).
So I'll disagree with some of the comments below and claim that the effort of writing down this book was huge but probably scarcely useful. Those who can comfortably read these equations probably won't need them (if you know what an affine transformation is, you hardly need to see all its ijkl indices written down explicitly for a 4-dimensional tensor), and the others will just be scared off. There might be a middle ground where it helps some, but at least I haven't encountered such people…
Mathematical notation is more concise, which may take some getting used to. One reason is that it is optimized for handwriting. Handwriting program code would be very tedious, so you can see why mathematical notation is the way it is.
Apart from that, there is no “the code” equivalent. Mathematical notation is for stating mathematical facts or propositions. That’s different from the purpose of the code you would write to implement deep-learning algorithms.
The last part was a big hurdle for me as an early undergrad. I was a fairly strong programmer toward the end of high school, and was trying to think of math as programming. That worked for the fairly algorithmic high school stuff and I got good grades, but it meant I was awful at writing proofs. I also went through a phase where I used as much formal logical notation, and the rules for manipulating it, as possible in order to make proofs feel more algorithmic, but that both didn't work well for me and produced some downright unreadable results.
Mathematical notation is really a shorthand for words, like you’d read text. The equals sign is literally short for equals. The added benefit, as others have pointed out, is that a good notation can sometimes be clearer than words because it makes certain conclusions almost obvious. You’ve done the hard part in finding a notation that captures exactly the idea to be demonstrated in its encoding, and the result is a very clean manipulation of your notation.
This is essentially my problem. I started writing programs at a young age and was introduced (unknowingly) to many more advanced mathematical concepts from that perspective rather than through pure mathematics. What was it that helped break this paradigm for you?
Really trial and error and grinding through proofs. Working through Linear Algebra Done Right was a big a-ha moment for me. Since I was self-studying over the summer (to remedy my poor linear algebra skills), I was very rigorous in making sure I understood every line of the proofs in the text and trying to mimic his style in the exercises.
In hindsight, I think the issue was that trying to map everything to programming is a bad idea, and I was doing it because programming was the best tool in my tool chest. It was a real “when all you have is a hammer, everything looks like a nail” issue for me.
Ah, I think I remember bookmarking this when it was posted before. You really don't have to go very far in computing to find a frontier where almost everything is described in pure mathematics, and so it becomes a substantial barrier for undiversified autodidacts in the field. The math in these areas can often be quite advanced and difficult to approach without the proper background, and so I appreciate anyone who has taken the time to make it less formidable to others.
I appreciate that some may find the book useful, but I personally don't agree with the presentation. There are too many conceptual errors in the book that you need to unlearn to make progress. For example, the book describes R^2 as a "pair" of real numbers. This is very much untrue and that kind of thinking will lead you even further astray.
I say this as someone with a math/CS degree and a PhD who has taught these topics to hundreds of students.
>For example, the book describes R^2 as a "pair" of real numbers.
I naturally auto-corrected this to "(the set of) pairs of real numbers".
If that's the case, then I don't see how this differs from the actual definition.
What is the conceptual error? Is it the missing 'set of'?
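For reference, the definition that phrasing seems to be shorthand for is just the Cartesian product, i.e. the set of ordered pairs:

    \mathbb{R}^2 = \mathbb{R} \times \mathbb{R} = \{ (x, y) : x, y \in \mathbb{R} \}

so an element of R^2 is a pair of real numbers, and R^2 itself is the set of all such pairs.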
> For example, the book describes R^2 as a "pair" of real numbers.
From page 15:
> The one piece of new notation is the exponent on R^2. This means "pairs" of real numbers.
Your interpretation of this quote is uncharitable at best. Using it to make a blanket assertion about the book is just silly, and quite out of the spirit of mathematics.
In particular, page 19 has an example of the kind of things that my book has that other books don't: a discussion of the soft skills of learning math and the cultural acclimation process:
> Though it sometimes makes me cringe to say it, give the author the benefit of the doubt. When things are ambiguous, pick the option that doesn't break the math. In this respect, you have to act as both the tester, the compiler, and the bug fixer when you're reading math. The best default assumption is that the author is far smarter than we are, and if you don't understand something, it's likely a user error and not a bug. In the occasional event that the author is wrong, it's often a simple mistake or typo, to which an experienced reader would say, “The author obviously meant ‘foo’ because otherwise none of this makes sense,” and continue unscathed.
The course you suggested is the sort of "grab bag of topics" course, meant to cram the basics of every topic a CS major might want to know for doing the kind of CS theory research that MIT cares about. If you find math hard, I doubt that will make it much easier, but it could be good to do alongside a book like mine if you find my book too easy.
Yeah I know it’s a common challenge. I think it took me a bit longer than some of my peers because I was trying to force it to be like something I knew instead of meeting it on its own terms.
> Mathematical notation is for stating mathematical facts or propositions.
And as such it is way too often abused. Because the (original, and the most useful) purpose of mathematical notation is to enable calculation, i.e., in a general sense, to make it possible to obtain results by manipulating symbols according to certain rules.
Sure, but the whole point is to avoid the need to do that! Manipulating symbols is the way to automate reasoning, i.e. to get to a result while completely ignoring said "facts." Using the symbols to merely "state the facts" is abuse (of the reader, mostly).
So this is a book written by applied mathematicians for applied mathematicians (they state in the preface it's for scientists, but some theoretical scientists and engineers are essentially applied mathematicians). As a result, both the topics and the presentation are biased towards those types of people. For example, I've never seen a practitioner worry about the existence and uniqueness conditions for their gradient-based optimization algorithm in deep learning. However, that's the kind of result those people do care about, and academic papers are written on the topic. The title does say that this is a book on the theoretical underpinnings of the subject, so I am not surprised that it is written this way. People also don't necessarily read these books cover-to-cover, but drill into the few chapters that use techniques relevant to what they themselves are researching. There was a similarly verbose monograph I used to use in my research, but only about 20-30 pages had the meat I was interested in.
This kind of book is more verbose than I'd like, both in terms of rigor and content. For example, they include Gronwall's inequality as a lemma and prove it. The version they use is a bit more general than the one I normally see, but Gronwall's inequality is a very standard tool in analyzing ODEs, and I have rigorous control theory books that state it without proof to avoid clutter (they do provide a reference to a proof). A lot of this verbosity comes about when your standard of proof is high and the assumptions you make are small.
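For anyone who hasn't seen it, the standard integral form (simpler than the book's version) says roughly: if u and \beta are continuous on [0, T], \beta \ge 0, and \alpha is a constant with

    u(t) \le \alpha + \int_0^t \beta(s)\, u(s)\, ds \quad \text{for all } t \in [0, T],

then

    u(t) \le \alpha \exp\!\left( \int_0^t \beta(s)\, ds \right).

It's exactly the kind of workhorse estimate used to bound how quickly solutions of an ODE (or a gradient flow) can drift apart.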
Are there any books you recommend for deep learning that are written for developers who don't use math every day?
I suppose the goal would be to understand deep learning so that we know enough of what's going on but not to get stuck in math concepts that we probably don't know and won't use.
I am/was in this scenario. I'm sure there are other resources out there specifically aimed at developers, but a book I'm reading now is "Deep Learning From Scratch" by Seth Weidman. He takes a different approach, explaining each concept in three distinct ways: mathematically, with diagrams, and with code.
I like this approach because it allows me to connect the math to the problem, whereas otherwise I wouldn't have.
I think if you are truly trying to understand deep learning, you will never get to avoid the math, because that's really what it is at its core: a couple of (non-linear) functions chained together (an obvious gross oversimplification).
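To make that "chained functions" picture concrete, here is a minimal sketch in plain NumPy (layer sizes and weights made up, not from any particular book):

    import numpy as np

    def relu(x):
        # elementwise non-linearity
        return np.maximum(0.0, x)

    def mlp(x, W1, b1, W2, b2):
        # a two-layer network is literally a composition of functions:
        # f(x) = W2 @ relu(W1 @ x + b1) + b2
        h = relu(W1 @ x + b1)   # first affine map + non-linearity
        return W2 @ h + b2      # second affine map (output layer)

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)                          # toy 3-dimensional input
    W1 = rng.normal(size=(5, 3)); b1 = np.zeros(5)
    W2 = rng.normal(size=(2, 5)); b2 = np.zeros(2)
    print(mlp(x, W1, b1, W2, b2))                   # toy 2-dimensional output

Everything else (losses, gradients, fancier layers) is built on top of this kind of composition.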
All three authors are PhDs or PhD-candidates in mathematics. The notation is extremely dense. I'm curious who their target audience of "students and scientists" are for this book.
Likely graduate students with a very theoretical interest. Some theoretically-oriented scientists and engineers are also basically applied mathematicians. It is presumably targeted at people that want to further develop the theoretical aspects of learning, as opposed to applied practitioners
I have a strong mathematical background, and I found the notation completely insane. Right out of the gate in chapter 1 we get a definition that has subscripts on the subscript indices and a summation with subscripts in the superscript, all composed into a giant function chain. Later we get to subscripts nested four levels deep, invent at least 3 new infix operators, and define 30 new symbols from 3 different alphabets, and we're barely at page 100 out of 600. I have no idea who is supposed to follow and digest this.
I’m not sure what specialization of math you studied, but using superscripts for indices is pretty common where you’re dealing with multi-dimensional objects. I used it in a lot of the courses in my degree.
They are not complaining about superscripts for indices, but about having subscripts in those superscripts. Basically like x² but with the ² having a subscript of its own. That is very dense and graphically hard to follow, as notations go.
I’m just wrapping up a PhD in ML. The notation here is unnecessarily complex IMO. Notation can make things easier, or it can make things more difficult, depending on a number of factors.
Really? Coming from physics (B.Sc only) the notation is refreshingly familiar and straightforward. My topology and analysis classes were basically like this.
In fact, this pdf is literally the resource I've been searching for as many others are far too ambiguous and handwavey focusing more on libraries and APIs than what's going on behind the scenes.
If only there were a similar one for microeconomics and macroeconomics, I'd have my curiosity satiated.
As a PhD econ student, the mathematics just comes down to solving constrained optimization problems. Figuring out what to consider as an optimand and the associated constraints is the real kicker.
It depends on what you’re doing. That is accurate for, say, describing the training of a neural network, but if you want to prove something about generalization, for example (which the book at least touches on from my skimming), you’ll need other techniques as well
Most economists (who write these sort of textbooks) have some sort of math background. The push to find the most general "math" setting has been an ongoing topic since the 50's and so you can probably find what you are looking for. It's not part of undergraduate textbooks since adding generality gives better proofs but often adds "not that much" to insight.
Nevertheless, the standard micro/macro models are just applications of optimization theory (lattice theory typically for micro, dynamical systems for macro). Game theory (especially mechanism design) is a bit of different topic, but I suppose that's not what you are looking for.
E.g., micro models are just constrained optimization based on the idea of representing preference relations over abstract sets with continuous functions. So obviously, the math is then very simple. This is considered a feature. You can also use more complex math, which helps with certain proofs (especially existence and representation).
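As a toy illustration of "micro model = constrained optimization" (a hypothetical Cobb-Douglas consumer with made-up prices and income, solved numerically with scipy rather than by hand):

    import numpy as np
    from scipy.optimize import minimize

    # Toy consumer problem: max x1^0.3 * x2^0.7  s.t.  p1*x1 + p2*x2 <= income
    p = np.array([2.0, 3.0])
    income = 10.0

    def neg_utility(x):
        # minimize the negative of the Cobb-Douglas utility
        return -(x[0] ** 0.3) * (x[1] ** 0.7)

    budget = {"type": "ineq", "fun": lambda x: income - p @ x}  # income - p.x >= 0
    res = minimize(neg_utility, x0=[1.0, 1.0],
                   bounds=[(1e-6, None)] * 2, constraints=[budget])
    print(res.x)  # analytic answer: x1 = 0.3*income/p1, x2 = 0.7*income/p2

The interesting economics is in choosing the utility and the constraints; the optimization itself is standard.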
You could grab one of the higher-level "math for economists" textbooks, which typically include the models as examples, and skip over the math.
For example, for micro, you can get the following:
https://press.princeton.edu/books/hardcover/9780691118673/an...
I think it treats the typical micro model (up to oligopoly models) in the first 50 or so pages, while explaining set theory, lattices, monotone comparative statics with Tarski/Topkis, etc.
Bishop’s Pattern Recognition and Machine Learning is one example that has tremendous depth and much clearer notation. Deep Learning by Goodfellow et al. is another example, albeit with less depth than Bishop.
I’m glad you’re enjoying the book. The approach is ideal for a very small subset of the ML population, no doubt that was their intention. I’m just weighing in that it’s entirely possible to cover this material with rigour yet much simpler notation. Even as someone who could parse this I’d go with other options.
Thanks for highlighting Bishop to me! I've self-taught through various resources esp. Goodfellow et al 2016. It's taken me a number of years to rebuild my math knowledge so that I feel comfortable with Goodfellow's treatment and look forward to learning from the Bishop book. Fwiw, I've found the math notation in the Goodfellow textbook to be among the best I've ever seen in terms of consistency and clarity. Some other books I enjoy, for example, do not seem to make any typographic indication of whether an object is a vector, scalar, or other. :(
I appreciated the notation in the Goodfellow book as well; it was easy enough for me to follow without having a strong mathematics background. I'll agree with others, however, that this text is aimed at a different audience and purpose.
Re your question on economics books, I think Advanced Macroeconomics by David Romer could fit your bill. It goes a lot into why the math is the way it is (arguably more interesting, like another poster said). Modern macroeconomics is also built on microeconomics, and to that extent it's covered in the book, so you're sort of getting two-for-one here.
As someone that’s in the later stages of a PhD in math, given the title starts with “Mathematical Introduction…”, the notation feels pretty reasonable for someone with a background in math.
Sure I might want some slight changes to the notation I found skimming through on my phone, but everything they define and the notation they choose feels pretty familiar and I understand why they did what they did.
Mirroring what someone else said, this is exactly the kind of intro I’ve been looking for for deep learning.
Is it fair to call something an introduction if it uses math from an upper-division undergrad math curriculum, such as metric theory? My opinion is that it is context driven, e.g. Introduction to Differential Geometry or Introduction to Homotopy Theory. But here I don't think you can look at the title and infer prerequisites that are even in the right ballpark. I'd wager most people outside math and some physics students aren't familiar with Galerkin methods at the undergraduate level (maybe a handful of engineers are). I don't think most outside math and physics even learn PDEs (my engineering friends mostly didn't, and my uni's CS program doesn't even require DE).
Looking at the theory as a whole it’s a very small minority.
I’m trying to think if it’s 0 percent outside of backprop…
Arguably high school math gets you quite a bit of understanding. After that in descending order I’d guess Linear Algebra, Statistics/Probability, Basic Calculus, Partial Derivatives…
In other words it’s not all or nothing. The easiest stuff gets you a lot of bang for your buck.
Yes, it's easier for mathematicians, because a lot of background knowledge and intuition is encoded in mathematical conventions (eg "C(R)" for continuous functions on the reals etc...). Note that this is probably a book for mathematicians.
Mathematical notation usually has a problem with preferring single-letter names. We usually prefer to avoid highly abbreviated identifier names in software, because they make the program harder to read. But they’re common in Math, and I think that it makes for a lot of work jumping back and forth to remind oneself what each symbol means when trying to make sense of a statement.
I think the main difference is that in programming you typically use names from your domain, like "request" or "student". But math objects are all very abstract, they don't denote any domain. For example, if I have a triangle and I want to name its vertexes so I can refer to them later, what would be a good name? Should I call them vertexA, vertexB, and vertexC just so it's not a single letter?
Screenshot the math, crop it down to the equation, paste into the chat window.
It can explain everything about it, what each symbol means, and how it applies to the subject.
It’s an amazing accelerator for learning math. There’s no more getting stuck.
I think it's underrated because people hear “LLMs aren't good at math”. They are not good at certain kinds of problem solving (yet), but GPT4 is a fantastic conversational tutor.
Don't suggest this. While I agree it can be helpful, the problem is that if you're a novice you won't be able to distinguish hallucinations. Which in my experience are fairly common, especially as you get into more advanced topics. If you have good mathematical rigor then it's extremely helpful, because things are often hard to search for exactly, but it's a potential trap for novices. But if you have no better resource, then I can't blame anyone; I'll just give a warning to take care.
> That’s kind of like telling people not to go online because you can’t believe everything you read on the Internet.
Uhhh... it's like telling people to trust SO over reddit, especially a subreddit known to lie.
> What proportion of the problems you’ve encountered were with the free version vs premium? It’s a huge difference and the topic here is GPT4.
Both. Can we stop doing this? This is a fairly well established principle with tons of papers written about it, especially around math. Just search arxiv, there's a new one at least every week
> I don’t get the relevance those seem to be security related?
And?
> My main point consistently has been that GPT4 can be an invaluable resource specifically for learning math subjects.
This can also be true. I use it a lot. Don't confuse openly discussing limitations with calling it a pile of shit. No need to have only two extremes.
> I am not aware of any papers, studying people using it as a conversational tutor for learning math and having problems with hallucinations.
That's a very bad-faith requirement, unless you have good evidence that GPT hallucinates in many domains (as exemplified by said security report) but NOT in math tutoring. If you have this really strong evidence that math tutoring is specifically unique then I suggest writing a paper. I'll help if you really can do it, and I'd be happy to give you first author and be proven wrong. But a much easier explanation is that math tutoring is not special with regard to GPT generating hallucinations. If you truly believe you need an extremely specific example, you may need to pull the wool off your eyes. But I'm hoping you don't and are just arguing.
A lot of negativity comes from people who goofed around with 3.X for a while, came away unimpressed, muttered something under their breath about stochastic parrots or Markov chains that sounded profound (at least to them), and never bothered to look any further. 4 is different. 4 is starting to get a bit scary.
The real pedagogical value comes when you try to reconcile what it tells you about the equations with the equations themselves. Ask for clarification when something seems wrong, and there is an excellent chance it will catch its own mistakes.
That answer isn't very compelling, as it is one of the most well-known equations in ML. There are some very minor errors but nothing that changes the overall meaning. But you even seem to agree with me in your followup: don't rely on it, but use it. My position is only slightly stronger than yours.
And stop all this 3.5 vs 4 nonsense. We all know 4 is much better. But there's plenty of literature that shows its limits, especially around memorization. You also don't understand stochastic parrots, but in fairness, it seems like most people don't. LLMs start from compression algorithms and they are that at their core. But this doesn't mean it is a copy machine despite the NYT article, and it also doesn't mean it is a thinking machine like the baby-AGI people claim. The truth is in between, but we can't have a real conversation because hype primed us to just bundle people into two camps and make us all true believers. Just please stop gaslighting people when they say they have run into issues. The machine is sensitive to prompts, so that can be a key difference, or sometimes they might just see mistakes you don't. It's not an oracle, so don't treat it like one. And don't confuse this criticism as saying LLMs suck, because I use them almost every day and love them. I just don't get why we can't be realistic about their limits and can only believe they are a golden goose or pile of shit. It's, again, neither.
You have a parrot that can paint original pictures, compose original songs and essays, and translate math into both English and program code?
I would like to buy your parrot. I'll keep it in my Chinese room. There used to be a guy in there, but he ran away screaming something about a basilisk.
> You have a parrot that can paint original pictures, compose original songs and essays, and translate math into both English and program code?
Kinda, kinda, yes, and yes.
I think there's far less originality than most people think. But that's not surprising when your job doesn't lead you to look at thousands of pictures a day. I have yet to see a generative model that isn't pulling heavily towards the training data, and you might have noticed the memorization rates are getting higher. But yes, a stochastic parrot doesn't mean memorization; it is about generalization and the stability of the p-norm ball around the training data.
Btw, what's wrong with a stochastic parrot? They are absolutely fucking useful. I use them every day. Hell, I even use things that are complete memorizations and all compression every day. What's with everyone equating powerful statistical systems with uselessness. Anyone saying that they aren't extremely useful is pulling wool over their eyes (but the same is true for anyone claiming baby AGI).
I'd also appreciate it if you discussed in good faith. The snarkiness is not appreciated.
I'm not being snarky! I genuinely feel I'm the one being gaslighted, by people telling me I shouldn't be utterly blown away by answers like the earlier example, or the one I just received:
I regularly get downvoted and criticized for suggesting this tool to other students, in defiance of what I can clearly see happening with my own eyes. I see a tool that, if developed further, will answer much deeper questions, including original ones, just as accurately and effectively. One that appears capable of taking humanity to the next level so fast it will make the monolith in 2001 look like an abacus by comparison.
Meanwhile, you tell me, "Don't suggest this to other students, it might hallucinate." Other people say, "Shut this down at once (or nerf it beyond any possible utility), it might hurt somebody's feelings." Another contingent warns, "Shut this down at once, it might start a nuclear war." Still other people say, "Shut this down at once, it violates copyright law." The objections just get dumber from there, yet gain traction by the day.
There's never been a time when standing in the way of something like this was right. Why should I think it's time to do so now? (And yes, I acknowledge that you're not personally 'standing in the way', but it really bugs me when people who claim they aren't 'standing in the way' of the technology tell other people not to use it.)
> I have yet to see a generative model that isn't pulling heavily towards the training data
When's the last time you saw a human mind that didn't work that way? (Or, for that matter, a parrot's mind.) The real truth behind the stochastic-parrot metaphor is that parrots, stochastic or otherwise, are nothing all that special, and neither are we. We're just better at using tools than the birds are, that's all.
Or at least we were up until now. But muh COPYRITE!!!11! ...
> I genuinely feel I'm the one being gaslighted, by people telling me I shouldn't be utterly blown away by answers like the earlier example, or the one I just received:
I think people in my camp (which is often confused with the Gary Marcus camp) aren't saying you shouldn't be blown away. Those people wouldn't say this:
> And don't confuse this criticism as saying LLMs suck, because I use them almost every day and love them. I just don't get why we can't be realistic about their limits and can only believe they are a golden goose or pile of shit. It's, again, neither.
Fwiw, I give those people an even harder time. They deny utility that is quite apparent. They also have these silly contrived doomer arguments that don't make any sense, as if one day AGI is just going to unexpectedly appear out of nowhere and, like you suggest, somehow jump the airgap and get control of the world's nuclear weapons without anyone noticing. What an insane hypothesis: it doesn't have any substantial evidence behind it and is entirely based on "but what if!" It is conspiratorial and a distraction from the real harm these systems can do, which is far more subtle and not really an existential crisis (at least arguably not in the same way, but let's not get into that). Some of these people are shills and some are useful idiots/true believers. You're right not to pay attention to them.
I'll also mention that I too am blown away. But you can be blown away and still have criticism and be wary of a thing too. The answer is quite impressive, without a doubt. I mean we are literally putting lightning into rocks and making them capable of doing math and speaking human languages. If you're not blown away by any single one of those things then it is simply a lack of imagination.
> When's the last time you saw a human mind that didn't work that way?
Quite frequently. Same with even my cat, and she's dumb as shit. Probably ran into too many walls while chasing toys, but I think that's just a feedback loop lol. She's dumb as shit, but I'm also absolutely blown away by her brilliance. It may be hard to see how both of those can be true, but that's the true state. And I disagree that there isn't anything special about stochastic parrots, any animal, or humans. They are all mind-bogglingly impressive; it's just that our brains are designed to normalize things so as not to be overburdened by the computational load (which is itself impressive!).
You are absolutely right, though, that there's a ton of exploitation that humans do (referring to exploration vs exploitation). I said memorization is incredibly useful. But creativity is far more subtle. I should put it this way: chimps (very impressive creatures) are far better at memorization than most humans. But they are nowhere near as creative. Certainly some creativity is leveraging prior works for inspiration. But a subtle aspect of this is that when this form is considered brilliant, it often crosses domains, which is something no ML model seems to even have the capacity to do. This can be hard to know, though, because unless you have domain knowledge you may not have heard about how someone like Einstein was called a mathematician and not a physicist, or how Nash was said to have "just used topology". This type of lore is important if we're going to discuss actual intelligence, but not important for tools or our everyday lives. The devil is in the details when we care about details.
It can be really hard to understand these distinctions. You have to look REALLY closely at details. One thing I'll mention is that I know I have looked at the datasets we use in our group far more than anyone else I know. This is, unsurprisingly, an uncommon thing, because it is boring to look at the raw data and investigating things like LAION takes herculean effort (something I haven't even approached). But your example is actually remarkably relevant to this topic. You couldn't have picked a better one, because most people rely on distance measures like cosine similarity or L2 to determine duplicates or near-duplicates. But ask GPT this (you should get the right answer): "How does the curse of dimensionality relate to distance measurements in higher dimensions? Are there any problems this creates?" Or ask it another one, a fact that blew me away the first time I heard it, despite being absolutely obvious once I took a moment to think about it: "If I have an n dimensional space, where n is very large, what is the expected angle between any two random tensors? What is it as n approaches infinity?" I'm positive it will again give you the correct answer.
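If anyone wants to check that second prompt numerically rather than take GPT's word for it, here is a quick sketch (mine, nothing official): sample pairs of random Gaussian vectors and look at the angles between them.

    import numpy as np

    rng = np.random.default_rng(0)

    def angles_deg(n, trials=2000):
        # angles between pairs of independent standard Gaussian vectors in R^n
        a = rng.normal(size=(trials, n))
        b = rng.normal(size=(trials, n))
        cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
        return np.degrees(np.arccos(cos))

    for n in (2, 10, 100, 10_000):
        th = angles_deg(n)
        print(n, round(th.mean(), 1), round(th.std(), 1))

The mean stays at 90 degrees by symmetry, and the spread collapses as n grows: in high dimensions, independent random vectors are nearly orthogonal, which is exactly why naive distance/similarity measures get tricky.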
But you also have to realize that this is frequently written about and without a doubt in the training data. You can absolutely overfit models and have them be incredibly useful. But the difference is that this won't be generalization, and it will be brittle. For a long time GPT was not able to correctly answer "Which weighs more, a pound of feathers or a kilogram of bricks" because it was too sensitive to the expected answer (it'll work now, btw). It still has problems with a variation of the corn, goose, fox river-crossing puzzle if you change it to allow all items in the boat at once (at least when I checked a month ago). But these are not the actions of sentient creatures, ones that can think and comprehend. You're going to have to think really hard about how you think, and especially about how you think when you think really hard, to get a good understanding of this. But it comes down to the reason why someone can be absolutely brilliant while shockingly idiotic. This is not the quip from I, Robot with the "can you?" about art and symphonies. There is something deeper, and truth be told, many animals do things for no good reason (reasons that can't be clearly defined by our perceived loss functions, which may accurately be called emergent behaviors). Every mammal is also able to run complex simulations in its mind, at incredibly low computational cost. Even the small rat will twitch its legs while it sleeps, or your dog may bark, being unable to distinguish reality from a dream, just as you do. That is truly a world model. Something we aren't remotely close to in AI, but that's okay. Why would it not be okay?
But in some way you are being gaslit, just not by what I intended to say (maybe by how you read it, for which I apologize; I am trying to work on communicating better, but it is hard when we have a diverse global audience with many different base assumptions, and I don't always know which set of assumptions to direct my message at). There are plenty of people with a vested interest in selling you these tools as far more than they are. I've written a few comments before that what's going on is as if we made a chocolate factory. One that sells the best goddamn chocolate you've ever tasted. But then they started selling the chocolate as a cure for cancer. At that point, it doesn't matter how good the chocolate is, people will feel disenfranchised. Some people are responding by saying that the chocolate tastes like shit while others are saying it cured their cancer. But neither of these is true. It's damn good chocolate, but it isn't going to cure cancer. (ML certainly will be a very useful tool for tackling cancer. That was not the intent of this analogy.) I just think there's this fear people have that if something isn't a literal gift from god then it is a pile of shit, and I don't get it. Nothing we have fits that description, but we have done and created so many incredible things as humans and made such leaps and bounds with these half-baked, incomplete things. There is nothing wrong with just-okay chocolate, but the chocolate we have is, without a doubt, better than just okay.
Honestly, because the very first sentence of the preface is "This book aims to provide an introduction to the topic of deep learning algorithms." Really? LOL. If you're going to pitch 600 pages of dense mathematical notation as "introductory," you're going to have to expect some people to call BS.
What's interesting/unfortunate is that their Python code samples really are easy to follow and pedagogically useful to a beginner. I think a lot of people will be turned off by the text unnecessarily.
It should have been promoted as a rigorous reference textbook, which is what it is, and not any sort of tutorial or primer.
I've seen quite a few of these books attempting to explain deep learning from a mathematical perspective, and it always surprises me. Deep learning is clearly an empirical science for the time being, and there is very little theoretical work that has been impactful enough that I would think to include it in a book. Of the books of this kind I've seen, this one seems like actively the worst one. A significant amount of space is dedicated to proving lemmas that provide no additional understanding and are only loosely related to deep learning. And a significant chunk of the code I see is just the plotting code, which I don't even understand why you'd include. I'm confident that very few people will ever read significant chunks of this.
I think the best textbooks are still Deep Learning by Goodfellow et al. and the more modern Understanding Deep Learning (https://udlbook.github.io/udlbook/).
This book is not aimed at practitioners but I don’t think that means it deserves to be called „actively the worst one”.
Even though the frontier of deep learning is very much empirical, there’s interesting work trying to understand why the techniques work, not only which ones do.
I'm sorry, but saying proofs are not a good method for gaining understanding is ridiculous. Of course it's not great for everyone, but a book titled „Mathematical Introduction to x” is obviously for people with some mathematical training. For that kind of audience, lemmas and their proofs are a natural way of building understanding.
Just read the section on ResNets (Section 1.5) and tell me if you think that's the best way to explain ResNets to literally anyone. Tell me if, from that description, you take away that the reason skip connections improve performance is that they improve gradient flow in very deep networks.
Neither do the authors in the book, and I'd argue that after (only) reading the book, the reader wouldn't be equipped to attempt this either (see my other post in this thread), so I think the parent poster has a point.
Yes, I have a very good point in fact. But the above comment purposely chooses not to argue with it, because it's easier to ignore it entirely and argue something else.
The problem is you presented something as a fact while it’s just a guess. Some people guess it’s an improved gradient flow, others guess it’s a smoother loss surface, someone else guesses it’s a shortcut for early layer information to reach later layers, etc. We don’t actually know why resnets work so well.
The point of that comment doesn't have anything to do with how ResNets actually work. You missed the actual point.
> We don’t actually know why resnets work so well.
Yes, actually, we do. We know, from the literature, that very deep neural networks suffered from vanishing gradients in their early layers in the same way traditional RNNs did. We know that was the motivation for introducing skip connections, which gives us a hypothesis we can test. We can measure, using the test I described, the differences in the size of gradients in the early layers with and without skip connections. We can do this across many different problems for additional statistical power. We can analyze the linear case and see that the repeated matmuls should lead to small gradients if their singular values are small. To ignore all of this and say that, well, we don't have a general proof that satisfies a mathematician, so I guess we just don't know, is silly.
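To spell out the linear-case argument (a standard back-of-the-envelope calculation, not something specific to this thread): in a depth-L linear network y = W_L \cdots W_1 x, the gradient of the loss with respect to the first layer picks up a product of all the later weight matrices,

    \frac{\partial \mathcal{L}}{\partial W_1} = \left( W_L W_{L-1} \cdots W_2 \right)^{\top} \frac{\partial \mathcal{L}}{\partial y}\, x^{\top},

so if the singular values of the W_i sit below 1 that product shrinks roughly exponentially with depth (and explodes if they sit above 1). In this linear picture a residual layer turns W_i into W_i + I, which keeps the product from collapsing to zero.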
You're doing it again - presenting guesses as facts. Why would a resnet - a batch-normalized network using ReLU activations - suffer from the vanishing gradient problem? Does it? Have you actually done the experiment you've described? I have, and I didn't see gradients vanish. Sometimes gradients exploded - likely from bad weight initialization (to be clear - that's a guess) - and sometimes they didn't, but even when they didn't, the networks never converged. The best we can do is to say: "skip connections seem to help training deep networks, and we have a few guesses as to why, none of which is very convincing".
> We know, from the literature
Let's look at the literature:
1. Training Very Deep Neural Networks: Rethinking the Role of Skip Connections: https://orbilu.uni.lu/bitstream/10993/47494/1/OyedotunAl%20I... they're making a hypothesis that skip connections might help prevent transformation of activations into singular matrices, which in turn could lead to unstable gradients (or not, it's a guess).
2. Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization: https://openreview.net/pdf?id=LJohl5DnZf they are making some hypothesis about an optimal information flow through the network, and that a particular form of regularization helps improve this flow (no skip connections are needed).
3. Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers https://arxiv.org/abs/2203.08120: they focus on initial conditions and propose better activation functions.
Clearly the issues are a bit more complicated than the vanishing gradients problem, and each of these papers offers a different explanation of why skip connections help.
It's similar to people building a bridge in the 15th century - there was empirical evidence and intuition about how bridges should be built, but very little theory explaining that evidence or intuition. Your statements are like "next time we should make the support columns thicker so that the bridge doesn't collapse", when in reality it collapsed due to the resonant oscillations induced by people marching on it in unison. Thicker columns will probably help, but they do nothing to improve understanding of the issue. They are just a guess.
That's why we need mathematicians looking at it, and attempting to formalize at least parts of the empirical evidence, so that someone, some day, will develop a compelling theory.
Empirically yes, I can consider a very deep fully-connected network, measure the gradients in each layer with and without skip connections, and compare. I can do this across multiple seeds and run a statistical test on the deltas.
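A minimal sketch of that experiment in PyTorch (toy depth/width, random data, a single seed; only meant to show the measurement, not a proper study):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    class DeepMLP(nn.Module):
        def __init__(self, depth=50, width=64, skip=False):
            super().__init__()
            self.skip = skip
            self.layers = nn.ModuleList(nn.Linear(width, width) for _ in range(depth))

        def forward(self, x):
            for layer in self.layers:
                h = torch.tanh(layer(x))
                x = x + h if self.skip else h   # residual vs plain
            return x.sum()

    def first_layer_grad_norm(skip):
        model = DeepMLP(skip=skip)
        x = torch.randn(32, 64)              # toy batch
        model(x).backward()
        return model.layers[0].weight.grad.norm().item()

    print("plain:", first_layer_grad_norm(False))
    print("skip: ", first_layer_grad_norm(True))

Repeat over seeds, depths, and datasets and compare the distributions of the first-layer gradient norms; that is the statistical test being described.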
Empirical studies are only useful until the system is mathematically understood. For example, I can construct transformer circuits where the skip connection (provably) purely adds noise.
I can also prove in particular cases the MLP's sole purpose is to remove the noise added from the skip connection.
Math isn't just about proofs. It's a way to communicate. There are several different ways to communicate how a neural net functions. One is with pictures. One is with some code. One is with words. One is with some quite dense math notation.
I would say UDL should be very accessible to any undergrad from a strong program.
I would not call the notation ‘dense’; rather, it's ‘abused’ notation. Once you have seen the abused notation enough times, it just makes sense. Aka “mathematical maturity” in the ML space.
My views on this have changed: as a first-year PhD in ML I got annoyed by the shorthand. Now, as someone with a PhD, I get it — it's just too cumbersome to write out exactly what you mean, and you write like you're writing for peers, plus or minus a level.
I agree with that, I think UDL uses the necessary amount of math to communicate the ideas correctly. That is obviously a good thing. What it does not do is pretend to be presenting a mathematical theory of deep learning. Basically UDL is exactly how I think current textbooks should be presented.
I think the mathematical background starts making sense once you get a good understanding of the topic, and then people make the wrong assumption that understanding the math will help with learning the overall topic, but that's usually pretty hard.
Rather than trying to form an intuition based on the theory, it's often easier to understand the technicalities after getting an intuition. This is generally true in the exact sciences, especially mathematics. That's why examples are helpful.
This makes me wonder. Is deep learning as a field an empirical science purely because everyone is afraid of the math? It has the richness of modern-day physics, but for some reason most of the practitioners seem to want to keep thinking of it as the wild west.
No, there are many very mathematically inclined deep learning researchers. It's an empirical science because the mathematical tools we possess are not sufficient to describe the phenomena we observe and make predictions under one unified theory. Being an empirical science does not mean that the field is a "wild west". Deep learning models are subjectable to repeatable controlled experiments, from which you can improve your understanding of what will happen in most cases. Good practitioners know this.
>It's an empirical science because the mathematical tools we possess are not sufficient to describe the phenomena we observe and make predictions under one unified theory.
To me, deep learning is actually itself a [long-awaited] tool (which has well-established, and simple at that, math underneath - gradient-based optimization, vector-space representation and compression) for making good progress toward mathematical foundations of the empirical science of cognition.
In the '90s there were works showing, for example, that the Gabors in the first layer of the biological visual cortex are optimal for the kind of feature-based image recognition we do. And as it happens, in DL vision NNs the convolution kernels in the first layers also converge to Gabor-like filters. I see [signs of] similar convergence in the other layers (and all those semantically meaningful vector operations in the embedding space of LLMs are also very telling). Proving optimality or anything similar is much harder there, yet to me those "repeatable controlled experiments" (i.e. stable convergence) provide a strong indication that it will be the case (something does drive that convergence, and when there is such a drive in a dynamical system, you naturally end up asymptotically near ("attracted" to) something either fixed or periodic), and that would be a (or even "the") mathematical foundation for understanding cognition (divergence from real biological cognition, i.e. the emergence of a completely different yet comparable type of cognition, would also be a great result, if not an even greater one).
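For readers who haven't met the reference: a 2D Gabor filter is a Gaussian-windowed sinusoid, roughly

    g(x, y) = \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \cos\!\left(\frac{2\pi x'}{\lambda} + \psi\right), \qquad x' = x\cos\theta + y\sin\theta, \quad y' = -x\sin\theta + y\cos\theta,

i.e. an oriented, localized edge/texture detector, which is what the first-layer convolution kernels of trained vision networks tend to resemble.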
A little bit of A and B. You can do a lot with very little math beyond linear algebra, calculus, and undergraduate probability, and that knowledge is mainly there to provide intuition and formalize the problem that you're solving a bit. You can also churn out results (including very impressive ones) without doing any math.
A result of the above is that people are empirically demonstrating new problems and solving them very quickly — much more quickly than people can come up with theoretical results explaining why they work. The theory is harder to come by for a few reasons, but many of the successful examples of deep learning don’t fit nicely into older frameworks from, e.g., statistics and optimal control, to explain them well.
Is anyone using any of this math? My guess is no. At best it provides "moral support" for deep learning researchers who want to feel reassured that what they are attempting to do is not impossible.
There's something I tell my students. You don't need math to make good models, but you do need to know math to know why your models are wrong.
So yes, math is needed. If you don't have math you're going to hoodwink yourself into thinking you can get to AGI by scale alone. You'll just use transformers everywhere because that's what everyone else does, and you'll get confused between activation functions. You'll make models, and models that work, but there's a big difference between having working models and knowing where to expect your models to fail and understanding their limitations.
I feel a lot of people just look at test set results and expect that to mean that the model isn't overfitting. (not to mention tuning hps based on test set results)
> If you don't have math you're going to hoodwink yourself into thinking you can get to AGI by scale alone.
There are very smart people who think we can get to AGI by scale alone - they call that the "the scaling hypothesis", in fact. I think they're wrong but I thought they knew a fair amount of math.
What math would you use to describe the limitations of deep learning? My impression is there aren't any exact theorems that describe either its limits or its behavior/possibilities; there are just suggestive theorems and constructions combined with heuristics.
Oh boy, don't get me started.... I first off should say that by no means do I think any of these people (at least those publishing) are dumb. You can also be a genius in one direction and a fucking idiot in another, and that's okay. Certainly describes me haha (well less on the genius side and more on the functioning idiot side. So take everything I say with a grain of salt). Don't get me wrong, scale is incredibly important and is certainly the reason for our recent advancements. But scale taking us to AGI is fairly naive to me. The idea here has a few clear assumptions being made. First is that the data can accurately explain all phenomena if the machine is capable of sufficient imputation. I just don't even know how to tackle this one because it is so well established as false in the statistics literature. Another is that RLHF is enough for alignment. I like to say that RLHF is like Justice Stewart's definition of porn: I know it when I see it. This is certainly a useful tool, but we shouldn't be naive about its limitations. Just go on any reddit discussion on what constitutes NSFW and you'll find tons of disagreement or even the HN discussions of "Is This A Vehicle"[0]. Those comments are just beautiful and crazygringo (top comment) demonstrates this all perfectly. There's a powerful inference and imputation game going hand in hand and this is the issue. There needs to be more time spent thinking about one's brain and questioning assumptions we've made. As you advance, details become more and more important. We get tricked because you can often get away without nuance in the beginning of studying something but with sufficient expertise nuance ends up dominating the discussion and you might often actually see that naivety doesn't take a step in the right direction but rather can take you a step in the wrong direction (but often moving is more important). I'll reference Judea Pearl and Ilya on this one[1]. Pearl is absolutely correct, even if not conveyed well (it is Twitter after all). His book will give a good understanding of this though.
> What math would you use to describe the limitations of deep learning?
This is hard, because there isn't as much research in it as there is in demonstrations. I wouldn't go as far as saying that there's no work, but it is just far less popular and advancements are slower. Some optimal transport people really get into this stuff as well as people that work on Normalizing Flows. Aapo Hyvarinen is a really good person to read and you'll find foundations for many things like diffusion in his works that far predate the boom. I'd also really suggest looking at Max Welling and any/all of his students. If you go down that path you'll find many more people but this is a good place to enter that network.
But honestly, the best math to get started on to learn this stuff isn't "ML math". It's statistics, probability, metric theory, topology, linear algebra, and many specialized domains within these. I'd even go as far as to say that category theory and set theory are very useful. It's all the math you learn for a lot of other things; you just need to have the correct lens. There is a problem in math education that we're often either far too application-focused or too abstraction-focused, and we forget to be generalists and have that deeper understanding[2]. But this is a lot, and I'm not sure of a single resource that pulls it all together in a way that is good for introductions (this paper certainly has many of the things I'd mention, but it is not introductory). After all, things are simpler after they are understood.
I've written a lot and feel like I may not have given a sufficient answer. There's a lot to say, and it is hard to convey in general language to general audiences. But I think I have given enough to find the path you're asking about; I just wouldn't suggest you're going to get a complete answer in a comment, unfortunately (maybe someone is a better communicator than me).
[2] I think the theory-focused people do often understand this more, but that's usually after going through the gauntlet, and it likely isn't even seen by them along that journey, especially prior to the point where many people stop. Certainly Terry Tao understands how math is just models and that something like "the wave equation" isn't specifically about waves and is far more general. You'll also find a lot of breakthroughs where the key ingredient is taking something from one domain and shoving it into another. Patchwork is often needed, but sometimes it gets more generalized (or they derive a generalization, then show that the two are specific instances of that general form).
ML researchers saying they need "category theory" sounds like a way to try to convince mathematicians that their work is cool.
You absolutely do not need category theory.
The parent didn't say category theory is necessary to conducting ML research, just that it could be useful. This point isn't particularly controversial. If you're interested in this niche of the field, I find Tai-Danae Bradley's work to be pretty cool! She has a site: https://www.math3ma.com/
Thanks for the reply. I'm glad my comment is no longer flagged.
What do you mean that "this point isn't particularly controversial?" If you just mean that "X may be useful", then of course. But the particular X matters, and "could be useful" is much different than "is useful".
People who like category theory want it everywhere. I don't know your mathematical background, but spend any time in a math department, or even classes, and you'll find people ready to explain any topic in the language of CT.
It may be useful, but it has to be justified. It's clear in some mathematical contexts, but definitely not in ML (let alone analysis).
ML has a problem in that no one knows why certain methods work. Just look at something like batch normalization: I can think of at least 3 different "explanations" on why it works.
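For concreteness, the operation itself is easy to write down even while the explanations compete; here is a minimal numpy sketch of the training-time forward pass (running statistics, the backward pass, and all framework details omitted; shapes are made up for illustration):

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # x: (batch, features); gamma, beta: learned per-feature scale and shift
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)  # normalize each feature over the batch
        return gamma * x_hat + beta              # re-scale and re-shift

    x = 5 * np.random.randn(32, 4) + 3           # toy activations, off-center and spread out
    out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
    print(out.mean(axis=0), out.std(axis=0))     # roughly 0 and 1 per feature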
ML people want explanations, and mathematicians need work. Category theorists therefore have work. But I don't think you should mistake this as being an explanation. You just get a "cleaner way" to present concepts.
FYI, I flagged you because the comment does not live up to the HN community standards[0]. A new account, created just to reply to me shortly after my comment with something sarcastic, does not contribute to the conversation. I decided to flag instead of commenting and continuing an unproductive exchange.
> People who like category theory want it everywhere.
This isn't surprising. It is an attempt at further generalization of mathematics. Although it can get annoying, it isn't wrong, because category theory is about looking from a high level of abstraction and making connections between differing branches of mathematics. If you don't see it everywhere you either don't have an understanding or have discovered something those people would really like to know. From personal experience, it can be a quite useful tool to describe things because of this.
> It may be useful, but it has to be justified.
The former begets the latter.
> Just look at something like batch normalization: I can think of at least 3 different "explanations" on why it works.
Are those the same thing? What are those?
> But I don't think you should mistake this as being an explanation. You just get a "cleaner way" to present concepts.
The latter is de facto the former.
And yes, math is just models. Or as Poincaré said, math is the study of relationship between numbers. One might also say "the map is not the territory" and you can find several math theorems making this point explicitly about math. You may even find one by reading my username with a little care. More than one if you take more care.
> If you don't see it everywhere you either don't have an understanding or have discovered something those people would really like to know. From personal experience, it can be a quite useful tool to describe things because of this.
Get off your high horse. I've had my share of Mac Lane. If you can describe something in terms of CT, you can talk to mathematicians who care about CT. I don't see why this helps ML.
> It may be useful, but it has to be justified.
"May be useful" does not beget "justified." CT may be useful in all areas if you ask a CT theorist. I fail to see how CT helps me build a car.
> The latter is de facto the former.
No it's not. You can take your favorite analysis topic and find a suitable category to view it from a CT perspective, but this won't tell you how to prove anything. If you did the CT correctly you can now make some analogies, but it won't tell you anything specific.
> And yes, math is just models. Or as Poincaré said, math is the study of relationship between numbers. One might also say "the map is not the territory" and you can find several math theorems making this point explicitly about math.
How do you square "math is the study of relationship between numbers" with CT? You can diagram chase without seeing a single number. I have no idea what mathematical theorem you are referring to, but if you're extrapolating philosophical points from a mathematical theorem, you're doing it wrong.
> You may even find one by reading my username with a little care. More than one if you take more care.
Ok I'll bite. You seem to be into Normalizing Flows. How does CT explain it being useful?
I'm trying not to dox myself so I can be more open on HN (though there are more concerns in the modern era...). You can find some harsh words against some ML community practices in my history, and I think it is easy to get misinterpreted as calling people dumb, or to have criticism of academic practice confused with criticism of utility (I criticize LLMs and diffusion a lot because I like them, not the other way around). So yes and no. But the lectures I have aren't recorded and public (Zoom for my uni; I'm ABD in my PhD). My lecture slides and programs should be publicly visible, though, but I don't go into this with them because I've been specifically asked not to teach this way :/ In all fairness, our ML course only has Calc 1 as a pre-req, and CS students aren't required to take Lin Alg (most do, though first courses are never really that great ime) or differential equations. TBH, to get into this stuff you kinda need some metric theory. If you actually poke through this paper you'll find that comes up very quickly, and this is common in the optimal transport community. But I think if you get into metric theory a lot of this will make sense pretty quickly. So if you can, maybe start with Shao's Mathematical Statistics?
But the particular spin on this book makes it look to non-experts that this is the math you need to do something useful with deep learning. And that's just not true.
Certainly you need to understand what you're optimizing, how your optimizer works, what your objective function is doing, etc. But the vast majority of people don't need to know about theoretical approximation results for problems that they will never actually encounter in real life. For example, I have never used anything like "6.1.3 Lyapunov-type stability for GD optimization" in a decade of ML research. I'm sure people do! But not on the kinds of problems I work on.
Just look at the comments here. People are complaining about the lack of context, but this is fine for the audience the book is aimed at. It's just that the average HN reader isn't that audience.
I think it would be better if the authors chose a different title. As it stands, non-experts will be attracted and then be put off, and experts will think the book is likely to be too generic.
Yeah, I would have a very hard time recommending this book too. It is absurdly math-heavy. I'm not sure I've ever seen another book this math-dense, and I've read some pretty dense review-style books. So I'm not even sure what audience this book is aimed at. Citations? And I fully agree that the title doesn't fit whoever that audience is.
>If you don't have math you're going to hoodwink yourself into thinking you can get to AGI by scale alone.
There are many researchers who "have math" and still believe this.
Appeal to Authority is a fallacy at the best of times but it's usually a convincing one. Not so much when the authority hasn't formed consensus on the appeal.
Describing it as "moral support" really sells it short.
Imagine computer science without sorting algorithms, search algorithms, etc that have been proven correct and have known proven properties. This math serves the same purpose as CS theory.
So yes, if you're just fitting a model from a library like Keras, you're not really "using" the math. If you're working with data sets below a certain size, problems below a certain level of complexity, and models that have been deployed for many years and have well studied properties, you can do a lot with only a cursory understanding of the math, much like you can write perfectly functional web apps in Python or Java without really understanding how the language runtime works at a deep level.
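To make that concrete, "just fitting a model from a library like Keras" looks roughly like the sketch below; the data, shapes, and architecture are invented for illustration, and the point is only that the math stays hidden behind the API:

    import numpy as np
    from tensorflow import keras

    # Toy data: 1000 samples, 20 features, binary labels
    X = np.random.rand(1000, 20).astype("float32")
    y = (X.sum(axis=1) > 10).astype("float32")

    model = keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)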
But if you don't actually know how it works, you're going to get stuck pretty badly if you encounter a situation that isn't already baked into a library.
If you want to see what happens when you don't know the underlying math, look at the current generation of "data science" graduates, who don't know their math or statistics fundamentals. There are plenty of issues on the hiring side of course, but ultimately the reason those kids aren't getting jobs is that they don't actually know what they're doing, because they were never forced to learn this stuff.
In the latter part of the book that covers PINNs and other PDE methods, it helps to frame these using the same kind of functional analysis that is used to develop more traditional numerical methods. In this case, it provides a way for practitioners to verify the physical consistency between the various methods.
According to the abstract it covers different ANN architectures, optimization algorithms, probably backpropagation... so, um, yes? That is stuff anyone in machine learning uses every day?
This is the first time I've seen one of these books where I wished there were more words and less math. Usually it is quite the opposite. But this book seems written as if they wanted to avoid natural language at all costs.
This is in TensorFlow.
Would rather see a numpy version or something along those lines so that students can better understand what each step looks like in code.
I concur with the comments noting the lack of explanation for the notation/lemmas/proofs.
I second this. Numpy would be the way to go, so students can switch to JAX or PyTorch trivially. Or they could use a mix: start with numpy, build the layers from scratch, then hand over to the abstraction. Pyro would be really good for this too.
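As a rough sketch of the "build the layer from scratch" idea (the names, initialization, and plain-SGD update here are illustrative choices, not taken from the book):

    import numpy as np

    rng = np.random.default_rng(0)

    class Dense:
        def __init__(self, n_in, n_out):
            self.W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
            self.b = np.zeros(n_out)

        def forward(self, x):
            self.x = x                       # cache the input for the backward pass
            return x @ self.W + self.b

        def backward(self, grad_out, lr=1e-2):
            grad_W = self.x.T @ grad_out     # dL/dW
            grad_b = grad_out.sum(axis=0)    # dL/db
            grad_x = grad_out @ self.W.T     # dL/dx, passed to the previous layer
            self.W -= lr * grad_W            # plain SGD update
            self.b -= lr * grad_b
            return grad_x

    # Toy usage: one layer, mean-squared-error loss on random data
    layer = Dense(3, 1)
    x = rng.normal(size=(8, 3))
    y = rng.normal(size=(8, 1))
    for _ in range(100):
        pred = layer.forward(x)
        grad = 2 * (pred - y) / len(x)       # gradient of the MSE loss w.r.t. pred
        layer.backward(grad)

Once that clicks, handing over to a framework layer is mostly a matter of deleting code.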
I don't think the content of the comments in this thread is limited to ML. I think there is lot of applied math research out there (almost all of it?) that hardly anyone outside of academia actually reads.
I think there's some useful stuff but my impression is that research papers are mostly dead ends so I stick to graduate textbooks. Maybe other people have other approaches? I'm not a math researcher so I don't need to be at the cutting edge.
It's hard to call it comprehensive. Transformers get one page. A picture would be nice. No "prompt engineering", no "double deep". In fact the words "prompt" and "double" aren't used at all. "Recognition" is used only once outside of the bibliography, just for reference. Looks like theory will not catch up with practice any time soon. With the looming singularity, that's a bit worrying.
Right in the title it explains it’s a book on theory. “Prompt engineering” doesn’t really fit in any theoretical framework I’m aware of and, while I also like graphics, most theory publications are light on them. You might be looking for a different kind of book, which is fine, but I think the content matches the title.
Also, a book on theory is going to lag quite a bit in terms of topics. The general process is that people discover something new and interesting empirically and publish articles on it. Other people develop theory explaining why those things work and publish articles on that. Once the theory gets crystallized, the big ideas get distilled into a book.
It's an introductory book though? I don't think it aims at being comprehensive
That said, I do agree that more on transformers would be nice since they're becoming quite central in every field of machine learning.
Prompt engineering is extremely new, vastly empirical, and theory on it is still only beginning (though I do remember seeing some nice papers passing by). It would probably be a mistake to include it in an introductory book.
I have never heard of "double deep", what is that?
It looks to me like most of the space is taken up with a plot of the sine function and the python code to generate the plot. Maybe it's a little fluffy, but it might be good for somebody self-taught, or a young person learning all of this stuff for the first time and wanting a quick reference.
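For reference, the kind of plot-plus-code page being described is roughly the following (not the book's actual listing, just an illustrative matplotlib snippet):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 2 * np.pi, 200)   # one period of the sine function
    plt.plot(x, np.sin(x))
    plt.xlabel("x")
    plt.ylabel("sin(x)")
    plt.title("The sine function")
    plt.show()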
There's a lot to critique, but this is a really weird one (page 49 if anyone is following). The whole thing is 5 sentences, and all the space is taken by a diagram and a code block. The 5 sentences should be the thing to complain about.
I only skimmed but I get the impression that sort of thing is common in the text.
I think it's got the problem that deep learning "isn't really math" - in the sense that deep learning does indeed use very elaborate computational structures that can be specified mathematically, but it doesn't prove theorems about them - not theorems that characterize what's happening, anyway. The theorems are just hints about what might be happening.
The key deep learning knowledge is in papers that basically only show that X approach works best on Y (plus maybe some suggestive theorem) - for example Attention Is All You Need.
> The theorems are just hints about what might be happening.
Isn't this true everywhere? Certainly it is just, in the words of Asimov, the relativity of wrongness. I mean even physics is "just a hint" despite being an incredibly strong one. Maybe a lot of people won't agree, but I think a lot of people aren't as aware of all the research that still goes on in everyday physics. Like studying ocean waves/currents, wind, explosions, materials, and so much more that is not quantum or relativity. But quantum and relativity get far more attention, so: perception bias.
> The key deep learning knowledge is in papers that basically only show that X approach works best on Y (plus maybe some suggestive theorem)
I very much disagree. Those are certainly the most visible, but not the most foundational. Actually I believe this approach is holding us back, and diffusion is my best example of this. Big steps in diffusion and GANs were made around the same time, but GANs were easier to implement and less resource heavy. Sohl-Dickstein certainly is a key player despite Ho being more well known. Same with Aapo Hyvarinen. I think we got too captivated by GANs, and that made it harder to publish anything else. I've had some experience with this personally: I've given up trying to publish in Normalizing Flows because reviewers will ask why my works are not better than GANs (or now diffusion) despite being better than other Flows, and I even got this on a distillation-based paper (multiple times, before we abandoned it). If there's too heavy a concern about metrics (not used as guides/hints, but as targets), then how can other things advance in a normal way? You'd have to take leaps and bounds instead of incrementalism (which we've established is fine for popular paths: formerly GANs, now diffusion). Leaps and bounds, because the community sizes are exceptionally disproportionate, which means far more time and research being put into one than the other. I'd argue that we have pretty good evidence for the hypothesis that, counterfactually, diffusion would have emerged as a strong player sooner if this weren't how we measured publication criteria (SOTA chasing). I believe this problem has only become worse. But this is how technology always advances: it isn't one technology getting better and better, but the composition of different technologies, where the replacement almost always starts out significantly worse than the existing status quo. So I'd argue we're leaving a lot of good work on the table by doing this. Certainly we have enough people working in ML that we can adequately do both, which would be much better. You need both, but the problem is we just compare <new thing, or not as established thing> to <current popular and SOTA thing> as if benchmarks are the only component of the story here.
> > The theorems are just hints about what might be happening.
> Isn't this true everywhere? Certainly it is just, in the words of Asimov, the relativity of wrongness. I mean even physics is "just a hint" despite being an incredibly strong one.
Well, sure, there is a relativity of wrongness, but the relativity is to a context, and in a given context an agent (say you or I) has to judge whether the relative difference in the wrongness of two things means they're the same or different. In the context of the ideal, the laws of physics are limited. Relative to astrology or other new age theories, they're essentially true.
So, expanding my point: relative to many contexts, the distinction between a system you can reason about and one you can't tends to be a big distinction, even if you have mathematical analogies. A rocket can be sent to the moon because we can reason about the laws of physics. A self-driving car, after also many years of trying, and with an interactive map etc., can often but not always get to the other side of town.
> ...where I've given up trying to publish in Normalizing Flows because reviewers will ask why my works are not better than GANs...
Your efforts seem like the exception that proves the rule.
People tend to hate things they don't understand more than things that are wrong. The comments below reflect this observation to some degree.
I have noticed that most HN discussions on math topics seem to devolve into complaints about notation. There seems to be a fairly large contingent of people from the programming side of things who don't have a formal background in mathematics.
I get the feeling that it isn't just frustration from a lack of understanding. If the topic was from some other technical field like organic chemistry or medicine I don't think you'd see this kind of response. I believe where the frustration comes from is adjacency: people have an expectation that they ought to be able to understand it because programming (via computer science) is adjacent to math. This expectation combined with a lack of understanding is what leads to cognitive dissonance. And that cognitive dissonance is what leads to the complaints about notation.
The notation isn't the problem. Every field has its own notation, jargon, and conventions. Mathematics notation is very simple and terse but the underlying concepts can be very abstract. Understanding comes from a lot of mathematical practice, not a glossary of terms. To paraphrase Euclid: "there is no royal road to mathematics (geometry)."
Not just understanding, but how to parse it. For example, I can read a theorem that introduces a bunch of variables in the first sentence that don’t get used until a few lines later and say “yeah those are probably just some constants they’ll use in an upcoming equation for an upper bound or something.” Before I was familiar with higher math, I’d say “what the hell are these where did they come from?” After reading enough books and papers, you can see a few steps ahead a lot of the time due to past experience (similar to what veteran chess players do). I can also jump to the middle of the document and see some notation that was defined earlier and guess what it is. Different authors might denote the space of continuous functions in a few different ways but they tend to be very similar for common objects.
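A made-up example of that pattern, with the constants introduced in the first sentence and only reappearing at the end in the bound (this is just Hoeffding's inequality for bounded i.i.d. variables, chosen purely for illustration):

    % Requires amsmath/amssymb; a sketch, not a statement from the book.
    \textbf{Theorem.} Let $M > 0$ and $\delta \in (0,1)$, and let
    $x_1, \dots, x_n$ be i.i.d. random variables taking values in $[0, M]$.
    Then, with probability at least $1 - \delta$,
    \[
      \left| \frac{1}{n} \sum_{i=1}^{n} x_i - \mathbb{E}[x_1] \right|
      \le M \sqrt{\frac{\log(2/\delta)}{2n}} .
    \]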
That said, there’s a lot of poorly written theory publications out there in my opinion. A big sin to me is that a lot of them will do stuff without explaining in simple English beforehand how they’re going to do it, why they’re doing it, and why it is important. The first item you can get if you’re going through the math line by line at least, but that can be arduous and often I want to understand the big picture before I dig into the details. It doesn’t take much — just a couple of sentences before or after the result can go a long way.
> A big sin to me is that a lot of them will do stuff without explaining in simple English beforehand how they're going to do it, why they're doing it, and why it is important.
I don't fault theoreticians for not using simple English. In many cases you're dealing with objects that are built upon a tower of abstractions with which you and your colleagues are already intimately familiar. This is true in any technical field. Sit in a hospital cafeteria long enough and I'm sure you can overhear surgeons talking shop over lunch. I wouldn't expect them to use simple English either. They have a huge corpus of terminology for every muscle, tendon, ligament, and bone in the body. Skipping past the simple English allows them to be brief and fluid in their communication style, at the expense of leaving laypeople out of the loop.
If you have no mathematical background at all, this isn't the book for you, I think. That is not really advanced mathematics, although it is a little notationally dense.
There are many good materials, such as the fantastic fast.ai course, that don't require such a mathematical background.
If you are motivated to learn about ML, then studying the topic can gradually be a route into more mathematical knowledge, so that equations like this would not seem intimidating.
All I am saying is that every deep learning book I have ever opened is filled with mathematical stuff like this. I want to learn the mathematics for it, but I need a starting point. Isn't there at least one book in the entire world written with this in mind?
"Dive Into Deep Learning" may be good in that it usually has code alongside any mathematical notation: https://d2l.ai/index.html
I have not actually looked at it in detail, but the legendary Gilbert Strang, in addition to his classic linear algebra course, also has a course that aims to teach enough linear algebra to explain deep learning called "Linear Algebra and Learning From Data". Maybe this is also helpful.
This is because the authors aren't trying to teach you anything. They are trying to show how smart they are and make a name for themselves with their peers. They couldn't care less whether you learn. Almost no one doing deep learning will learn much from this.
I tried to look at your screenshot and holy hell what happened to imgur? That site used to be great for sharing images. Now it is enshittified to the max. I couldn’t even zoom in to look at the equations without some random animated GIFs popping up over the entire screen. I even use an ad blocker!
It will also make an attempt at turning the expressions into Python. It bombed out at first but caught itself and retried without any additional prompting:
Not being familiar with SymPy, though, and not having time to think it through myself, this might be a bunch of hallucinated gobbledygook. Caveat lector.
I find that a programming language is a compromise between the need for a human to communicate with the (dumb) machine on the one hand, and the need to keep this communication more or less readable by humans on the other. Mathematical notation was invented to calculate (to automate reasoning) and as a way of communicating between (smart) humans.