mfn's comments

I believe the key realization here is that we're applying a rotation matrix as part of the encoding. Why does this work? I've found it helpful to consider just the two-dimensional case. Say we have two vectors at different positions in a sequence. When designing a positional encoding scheme, what we're trying to do is modify each vector so that it carries some information about its position.

The idea is that we can simply rotate each vector by an angle proportional to its position. This has the property that the dot product of two vectors encoded this way depends only on the difference in their positions, not on the absolute positions themselves (since the dot product depends only on the vectors' magnitudes and the angle between them). And the dot product is what we care about, since that's what the attention operation ultimately computes between the vectors.
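
Here's a minimal numpy sketch of that property in the 2D case (my own illustration, not code from the post; the rotate helper and theta=0.1 are arbitrary choices):

    import numpy as np

    def rotate(v, pos, theta=0.1):
        # Rotate a 2D vector by an angle proportional to its position.
        angle = pos * theta
        R = np.array([[np.cos(angle), -np.sin(angle)],
                      [np.sin(angle),  np.cos(angle)]])
        return R @ v

    q = np.array([1.0, 2.0])   # a query vector
    k = np.array([0.5, -1.0])  # a key vector

    # The dot product depends only on the positional offset (here, 4):
    a = rotate(q, 3) @ rotate(k, 7)
    b = rotate(q, 13) @ rotate(k, 17)
    print(np.isclose(a, b))  # True

Rotating both vectors shifts each of their angles by pos * theta, so the angle between them changes by exactly (difference in positions) * theta - the absolute positions cancel.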

I've written up a _somewhat_ first principles derivation of this here: https://mfaizan.github.io/2023/04/02/sines.html, if interested!


Sinusoidal positional embeddings have always seemed a bit mysterious - even more so since papers don't tend to delve much into the intuition behind them. For example, from Vaswani et al., 2017:

> That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
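
For reference, the encoding the paper defines is (pos is the position, i indexes the dimension pairs, d_model is the embedding size):

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))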

Inspired largely by the RoFormer paper (https://arxiv.org/abs/2104.09864), I thought I'd write a post that dives a bit into how intuitive considerations around linearity and relative positions can lead to the idea of using sinusoidal functions to encode positions.

Would appreciate any thoughts or feedback!


(2/2) > I have no idea what's going on here. Why would you measure different points of the field with different coordinate systems and expect sensical results? I'm imagining a surveyor walking in a line starting from the origin: he takes a measurement at the origin, then at (1,0), then at (2,0), then at (3,0), etc. (Imagine that the underlying field is frozen in time so we aren't dealing with the Lagrangian yet.) Since we know the field equations we can predict what he'll measure at each of those points in the line.

> But if the coordinate system changes with every step, he's still moving in a straight line as seen from a bird flying overhead, but at his first step he's at (1,0), then at the next step (2,0) turns into (5,1), then at the next step (6,1) (aka (3,0)) turns into (12,-3), etc, because the coordinate system changes each step. It's still (1,0),(2,0),(3,0) if you measure in the original coordinate system. But the underlying field wouldn't change in that case. Sure, if you put (5,1) into the field equation you'll get a different result than if you put in (2,0), but if you're only changing the coordinate system then that has to be compensated for in the field equation itself and you're not going to get different results for the same physical point. I mean, you should get the same result if you do f((2,0), coordinate system a) as if you did f((5,1), coordinate system b)

So I guess a simpler way to see this is to note that changing coordinates should never affect the predictions of a physical theory. For example, if I measure things with the origin at x = 0 but you put yours at x = 5, all our measurements of position will differ by 5 units. But when we apply the equations of motion to some object we're both looking at - F = ma - our predictions of how the object will move will agree; my positions will simply differ from yours by 5 units. This is because the equation of motion, F = ma, does not care about translations of the coordinate system: the acceleration is a second derivative, and a constant shift vanishes under differentiation. If we had an equation like F = ma + x, we would no longer be coordinate invariant, and the equation would be unphysical - you could look at the same system in a different way (i.e. using different coordinates) and get completely different results.
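
To spell that out with the shifted origin (a worked sketch, my addition):

    x' = x − 5                        (same point, measured from your origin)
    a' = d²x'/dt² = d²x/dt² = a       (the constant shift vanishes under differentiation)
    F = ma      →  F = ma'            (identical form in both frames: invariant)
    F = ma + x  →  F = ma' + x' + 5   (the form changes with the frame: not invariant)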

> Why is it called that? I assume someone chose that name because it made sense to them for good reasons

This I'm not sure about - I haven't actually seen the reason spelled out anywhere, other than an indication that it's historical. My understanding is that the name traces back to Hermann Weyl, who originally considered invariance under a literal change of measuring scale or 'gauge' (Eichinvarianz), and the name stuck even after the idea was reinterpreted in terms of phase transformations.

> Wait, "our field" and "the new field"? What fields are those? We were talking about a field defined by the function ϕ(x,t) and thinking about its Lagrangian. We added a term A to the Lagrangian and that was it. What's the "new field"? Why does ϕ(x,t) have an interaction with it? Is A the new field?

Yup! A is the 'new field' - the terminology could be clearer here. So by requiring that our toy Lagrangian with phi be gauge invariant, we're forced to introduce another field, A, and the way this field appears in the Lagrangian (multiplying phi) is what will end up acting like a force.

> What is "the way we're measuring it"? I think it's, basically, the coordinate system of the surveyor changes each step, so that's a different "way" of measuring it? I still don't see why changing the coordinate system makes new stuff pop out of the equation

Yes - by 'the way we're measuring it' I mean that we are using a different coordinate system at each point, just to see what the effects of doing so are. And this is the magic bit - if we don't have this other field A, then the equation will produce different results depending on which coordinate system you use. So any theory without A will not consistently give the same results regardless of coordinate system.

> How do you expand it? Are you giving the definition of W(x, x+δx)? How'd you get that? What's O(δx^2)?

Power series expansion. So we are defining W(x, y) as a function that allows us to compare the field at x and y. We don't know what this function is; we just assume it exists. We then expand it out in powers of δx. Since we are eventually going to take the limit, any higher-order term - involving δx² or higher - will be too small to matter, so we only care about the constant term and the linear term. (O(δx²) is just shorthand for all of those higher-order terms.)
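
Written out, the expansion looks schematically like this (a sketch of the standard comparator argument; signs and factors vary by convention):

    W(x, x + δx) ≈ 1 − i e δx A(x) + O(δx²)

and it's the coefficient of the linear term that gets identified with the gauge field A.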

> Where'd the second set come from? Wait, I see, it's just saying that you can say that "multiplying a number by a 1x1 matrix" is the same thing as saying "you can compose a number with a rotation". It's literally the same thing, just said in a less clear manner. Does the new terminology get us anything useful?

Once we make the connection between a symmetry of our Lagrangian and some abstract group like SU(3), we can immediately bring in group theoretic results about that group. For example, since we know (from group theory) that SU(3) has eight generators, we can now use that result and infer that we need eight gauge bosons to make the theory gauge invariant.
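
The counting itself is standard group theory - SU(N) has N² − 1 generators, and gauge invariance demands one gauge boson per generator:

    dim SU(N) = N² − 1
    SU(3): 3² − 1 = 8   →  8 gluons
    SU(2): 2² − 1 = 3   →  3 gauge bosons (W¹, W², W³)
    U(1):  1 generator  →  1 gauge boson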

> What? How do you know how many generators there are? Why does SU(2) have two generators but three gauge bosons?

Typo - will fix! Should be three generators.

> Yeah, I think the part I'm not getting is how changing coordinate systems affects the equation. I think I can see that if you insist on doing something ridiculous like this you'd need some math to correct for it and if the new correction functions are fields then it looks like new particles popping out, but I don't see how that doesn't result in an infinite number of new particles. Like, I can add a function f(x) = x^2 to the Lagrangian, then a g(x) = -x^2 to compensate for it, but those don't represent new particles do they? Why do those cancel out but A doesn't? I just don't see how changing coordinate systems results in different results

So anytime something gets added to the Lagrangian, you effectively have a new theory - it's a proposal. The point here is that you don't have infinite flexibility in adding things to the Lagrangian: if you add a complex scalar field and demand local gauge invariance, then you must also add gauge bosons. (And note that adding f(x) = x² together with g(x) = −x² changes nothing at all - the two terms cancel identically, so you have the same Lagrangian and the same theory; A doesn't cancel out like that, which is why it represents something physical.) So symmetry doesn't constrain everything - you still need to figure out what the Lagrangian should be - but it will force you to add other fields to make things gauge invariant.

> Despite my questions, I think I have a better idea of what's going on. You have a function; it should spit out the same numbers when you rotate it; you need a function to correct for the rotation; in physics the new function looks like a particle. I can kinda sorta see how it works now. Thanks for the article!

Appreciate the thorough review! Lots of things that I should have been more thorough about - I will fix :) Thanks again.


Thanks again for the article! I did learn a lot from it, and I really appreciate your time answering my questions


(1/2) Hey, thanks for the in-depth review!

> The next couple of examples only minimize the Lagrangian; are there any systems in this article that maximize it?

So it's not really relevant whether we minimize or maximize it - the action principle, more precisely, says that physical paths are those that make the action (the time integral of the Lagrangian) stationary, which can be a minimum, a maximum, or a saddle point. I'm actually not sure why there always happens to be a unique path in the theories that physicists use - I'm guessing the argument would be that a Lagrangian that gives multiple paths would be unphysical? Not sure here.
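
For reference, 'stationary' is exactly what the Euler-Lagrange equation expresses (the standard result): paths that make S = ∫ L dt stationary satisfy

    d/dt (∂L/∂q̇) − ∂L/∂q = 0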

> What in this case are the objects? Just particles? Is mass in the first example (of classical motion) an object? I'm trying to figure out what kind of objects to use to build an equation, or basically, what type (in the programming sense) an object is

I'm using 'objects' as a superclass/base class (in the programming sense) of both fields and particles. In classical mechanics you have two kinds of objects: particles, which are described by their position (along with its derivatives), and fields (electric, magnetic), which are described by the amplitude of the field at every point in space (along with derivatives).

> Why? Are there other theories, maybe from earlier in the development of physics, that used a different approach?

It's surprising at first glance because it's hard to imagine an intuitive reason why every theory we've developed - classical mechanics, quantum mechanics, quantum field theory - can be reformulated so that some function L is made stationary. I suppose there could be theories that can't be framed in such a way, but AFAIK all theories 'in use' can.

> I assume that in this passage "building blocks" is equivalent to "objects" in the last passage? Why are fields more useful? Is there an example of what using particles as objects would look like? In particular, a field looks to me like a function; if you used particles as an object, would you represent a particle as a function using its position and velocity? Would that function have time as a parameter like fields do? Typing that out I can kind of see why you'd use fields

This is a really good point and something I definitely should clarify. The jump from particles to "everything is a field" happens when we go from classical mechanics to quantum mechanics (and quantum field theory). In quantum mechanics you never really know the 'position' of a particle anymore - it's not even defined. A particle is instead represented by a probability distribution over all of space, which is where its field nature comes in: you're assigning a probability density to each point in space.

This is definitely something that I hand waved away for simplicity, but can definitely see how this is confusing.

> What does the output represent? Anything in particular? If not, then it seems like you could define the field function to be anything since the output doesn't represent anything, then when you feed the field function into the Lagrangian eventually you'd get massively different results

What the output represents is completely up to you, as the 'author' of the theory. If the field is a complex number encoding the probability density of the particle, then you end up with a theory of scalar particles, such as the Higgs boson. If the field represents spinors (vector-like objects that transform differently), you get electrons. If the field spits out vectors, you get a theory of the electromagnetic field.

You bring up an interesting point - we can have fields with arbitrarily exotic objects, so which ones do we use? I believe this just comes down to experiment.

> How is the Lagrangian constructed? This "simplest" Lagrangian is the derivative of the field with respect to time, along with the derivative of the field's complex conjugate with respect to time, but how'd you know to do that? What makes this the simplest possible Lagrangian? Calling this the "simplest Lagrangian" hints that there are other equally valid ways to create a Lagrangian; is that correct? What are the rules for that? Why would you make a more complex Lagrangian?

Another really good point. Schwichtenberg's books flesh out the argument in more detail, but you're right that there are many ways to create a Lagrangian. In principle, any theory you come up with that follows the action principle can (by definition) be expressed by a Lagrangian - L can be whatever you want.

Now, why this particular choice of L? One argument: assume that L can be expanded out as a series. In that series we'll have terms involving time derivatives of the field, and terms involving just the field. This is where the 'simple' part comes in - we chop off all time derivatives except the first, multiplied by nothing else. The only other consideration is that the term with the time derivative needs to be of even order, because otherwise you won't have a stable theory. So the 'simplest' time-derivative term is the first derivative of phi squared, or multiplied by its complex conjugate.
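
Concretely, for a complex field in the post's toy setup (a sketch, up to conventions):

    L = (dϕ/dt)* (dϕ/dt) = |dϕ/dt|²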

The key thing to note is that there aren't really precise rules that conclusively dictate exactly what the Lagrangian must be. The Standard Model uses a particular set of terms, and there are heuristic reasons why those terms are used (Lorentz invariance being an important one), but at the end of the day you can build any Lagrangian you want for your theory. The one used in this piece is just a simple Lagrangian with enough 'interestingness' that it somewhat reproduces how symmetry shapes the actual, full-scale Standard Model Lagrangian.

> What is V(ϕ)? My initial assumption would be velocity, but how do you take velocity of a field? Actually, I can see what they're doing: velocity of a particle is the derivative of it's position with respect to time, so I guess V(ϕ) aka velocity of the field is the derivative of the field with respect to time. That could've stood to have been spelled out

Apologies for the confusion here - V(ϕ) just represents any function that doesn't depend on the time derivative of phi. It's called potential energy because it depends on the value of the field itself - where it 'is', not how it's changing.
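
The simplest concrete example is the mass term of a scalar field (standard convention, my addition):

    V(ϕ) = m² ϕ*ϕ

which involves the field's value but no derivatives.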


Yes, that's an excellent book, along with his book on QFT.

I also can't recommend this course enough, Susskind has done a remarkably good job at making advanced physics concepts accessible: https://theoreticalminimum.com/

Also, "Symmetry and the Standard Model: Mathematics and Particle Physics" by Matthew Robinson does a great job of developing the group theory needed before diving into the physics.


Yeah I wasn't sure how to set things up there - if I kept a single space dimension + a time dimension, then I'd have to explain the negative sign on one of the terms, and probably also talk about the Einstein summation convention to keep things clean. Whereas with a single time dimension, it's not really 'spacetime' as you pointed out.

What motivated this post was that I wanted to give a concrete example of what it really means for some symmetry to 'dictate' the structure of a physical theory, but do so in the simplest way possible - i.e. not deal with spinors, gamma matrices, quantum fields - and the rest of the actual machinery of the standard model. The core idea is so profound that I felt like there has to be a way to get a taste of it across in a way that's accessible.

Turned out to be a lot harder than I thought - I had to skip quite a few steps in the post to keep it from becoming too long, but I'm hoping the model still conveys the essence of how a symmetry + action principle can 'predict' particles.


I think what you did is standard. But what I question is, if we can solve our problem without assuming spacetime, why do we need the abstraction called spacetime? Spacetime looks like a historical quirk that physicists feel obligated to carry.

For instance, the bending-of-light experiment is not done in spacetime but in space and time.


But you can't rip them apart in our physical theories. In relativity, different observers will have different ideas about space and time separately, yet will agree about certain invariants that take both space and time into account. A few years after Einstein's original formulation, Minkowski recast special relativity very elegantly in terms of a four-dimensional spacetime, and that view is basic to general relativity, whose foundation is the spacetime metric and the energy-momentum tensor.
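
A concrete example of such an invariant is the spacetime interval (standard special relativity):

    s² = (cΔt)² − Δx² − Δy² − Δz²

Two inertial observers will disagree about Δt and Δx individually, but both compute the same s².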


Regarding how to derive Maxwell’s equations from QED, I’d recommend this lecture: https://theoreticalminimum.com/courses/special-relativity-an...

This derivation is in the context of classical field theory, but QED is only a short hop away through path integrals.

It’s quite remarkable how the complexity of Maxwell’s equations can be reduced to a single term in the Lagrangian - F_μν F^μν, assuming no charges. That’s really it!
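
Written out (standard classical field theory; the conventional normalization carries a factor of −1/4):

    L = −(1/4) F_μν F^μν,   where  F_μν = ∂_μ A_ν − ∂_ν A_μ

Feeding this into the Euler-Lagrange equations for A gives the source-free Maxwell equations, ∂_μ F^μν = 0; the remaining two follow automatically from the definition of F (the Bianchi identity).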


Thanks for the link! I'm a huge fan of Peter Woit - his blog especially.

So my understanding is that while the way Lagrangians are used in classical theory (as input to the Euler-Lagrange equations) doesn't extend directly to quantum mechanics, the concept of a Lagrangian remains useful because it can be fed into the path integral formulation instead. Also, in quantum field theory the starting point for canonical second quantization is typically a Lagrangian, with the fields promoted to operator fields.
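
Schematically, the Lagrangian enters the quantum theory through the phase of each path (standard notation, my addition):

    amplitude ∝ ∫ Dϕ e^(iS[ϕ]/ħ),   with  S[ϕ] = ∫ L d⁴x

where L is the Lagrangian density and S the action.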

Also, something interesting I came across - the Euler-Lagrange equations do have a quantum analogue as well: https://en.wikipedia.org/wiki/Schwinger%E2%80%93Dyson_equati...


I'm really no expert - decades have gone by.

A while ago I read Klaas Landsman's book, and it is very nice: "This book studies the foundations of quantum theory through its relationship to classical physics."

https://www.dbooks.org/foundations-of-quantum-theory-3319517...

(beware of the Bohr Topos)


That’s the approach I used as well in the second half of the article - I just mentioned the transformation law in the beginning since that’s what most physics students encounter first.

Most of the article tries to provide some intuition for why multilinear maps, which sound like a fairly abstract concept, might be relevant in physics - the key link being the importance of coordinate invariance.

I didn’t go into deriving the coordinate transforms from the multilinear map definition as I didn’t feel that it’d provide much better intuition, but I did mention the equivalence near the end.


Yeah sorry you’re right - I should have read the rest of your post, which is excellent and describes precisely why the coordinates/transformations focused definition is bad for one’s intuition.


Thoughts on the switch? I decided to become a PM full-time after a couple of dev internships, thinking that given the pace of change in tech, it would be a better idea to spend time building people skills instead as those seemed to be more 'durable'.

After a couple of years, I've completely changed my mind. Though I've done well in my role, I find that it's hard to point to a concrete skill set (such as deep domain knowledge in some field), and the fact that the PM role varies so much across companies makes it difficult to transition.


It's a mixed bag. Product management and project management are broader roles that let you see the bigger picture, and let you interact with more people at your company. If you're a PM or PJM at a smallish (100-300 person) company, you'll know everyone in the company within your first year. So it's a good career move if you're more outgoing/extroverted. It's also nice to not have to constantly be running the "hot technology" treadmill to stay relevant. I can pick and choose what languages/frameworks to learn that I think will be most valuable, instead of having to dabble in everything because "Framework X is outdated--Everyone's using Framework Y now!"

Downside is you generally have to let go of direct control of what code goes into the products and let your talented engineers do it. My first few months as a PM I still wanted to commit code and had to stop myself. There is also far less demand for PMs and PJMs at tech companies. Everyone's hiring developers by the truckload, but PM positions are few and far between. The whole "it's easy to change jobs in Silicon Valley" thing only really applies to developers. Pay is also not as good as engineering (currently).

