I'm going to guess that the difference is that Tailscale lets your machines find each other within a managed flat virtual network, whereas Iroh lets your applications talk to each other without any regard to which machine anything is running on.
Not sure about the Tailscale coordination server, but once you establish a connection to a headscale server, the clients don't strictly need headscale after that (although it's recommended to keep it active). So maybe the only difference is that headscale only needs to act as a relay once.
Headscale is just an open-source implementation of the Tailscale coordination server.
The coordination server just provides the IPs that you then use to connect over WireGuard. It can see that metadata (which machines are in a tailnet), but nothing else.
My rough guess is that they set up a few workflows combining analytical and ML-based image manipulations to generate the training set. For instance, you can get a long way by having a segmentation model identify and mask various objects, then applying simple analytical manipulations to the masked areas, such as changing their color, or diffusing new content into that area using masked guidance with another image diffusion model. In this way, you can create training pairs that your editing model learns to invert, such as “turn the woman’s hair into blonde hair” (start with a blonde-haired woman, mask the hair, and get a diffusion model to turn it brown; this gives you a scene you can now invert as a training pair).
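As a sketch of what one such workflow could look like (the segmentation call is a stub, and the inpainting checkpoint name is just a commonly used diffusers model, not a claim about what they actually used):

```python
# Rough sketch of one synthetic-pair workflow: mask a region, diffuse new
# content into it, then invert the edit to form (input, instruction, target).
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

def segment_region(image: Image.Image, label: str) -> Image.Image:
    """Stand-in for a real segmentation model; returns an empty mask."""
    return Image.new("L", image.size, 0)

def make_hair_pair(source_path: str) -> dict:
    source = Image.open(source_path).convert("RGB")   # e.g. a blonde-haired woman
    hair_mask = segment_region(source, label="hair")

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-inpainting"   # any inpainting checkpoint works
    )
    # Diffuse new content only inside the masked area: brown hair replaces blonde.
    edited = pipe(prompt="a woman with brown hair",
                  image=source,
                  mask_image=hair_mask).images[0]

    # Invert the edit: the edited image plus an instruction maps back to the original.
    return {
        "input_image": edited,
        "instruction": "turn the woman's hair blonde",
        "target_image": source,
    }
```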
Image editing model training is fascinating. One method for training image editing models involves using a second model to apply the inverse of the change you want the model to learn. Typically, the task you’re asking the second model to perform is easy, whereas the inverse task is difficult.
For example, you might ask the second model to cover the person’s face with a black square; a VLM notes that the person is a man with brown hair and round glasses. Then, during training, the resulting image is presented along with the prompt, “Remove the black square from the man’s face. He has brown hair and round glasses.”
The model now learns how to remove black squares and replace them with a man’s face with brown hair and round glasses.
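A toy version of that pair-construction step might look something like this; the face box and the caption are hard-coded stand-ins for what a face detector and a VLM would produce:

```python
# Sketch of synthesizing one training pair for the black-square example.
from PIL import Image, ImageDraw

def make_black_square_pair(photo_path: str) -> dict:
    original = Image.open(photo_path).convert("RGB")

    # Easy forward task: cover the face with a black square.
    face_box = (180, 60, 320, 200)  # placeholder coordinates
    degraded = original.copy()
    ImageDraw.Draw(degraded).rectangle(face_box, fill="black")

    # Placeholder for what a VLM would report about the hidden region.
    caption = "brown hair and round glasses"

    # The hard inverse task becomes the training example.
    return {
        "input_image": degraded,
        "instruction": f"Remove the black square from the man's face. He has {caption}.",
        "target_image": original,
    }
```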
Since the training data is easily synthesized using existing models, you can generate enormous amounts of it, often very cheaply. For specialized editing tasks, this technique is really powerful. Build a training set for your special-purpose task, fine-tune an existing image editing model such as Qwen Image Edit to produce a new checkpoint or LoRA (often a LoRA is more than good enough), and you end up with a special-purpose model that performs whatever narrow editing task you need on your image data.
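If you go the LoRA route, the adapter side of that fine-tune is only a few lines with PEFT. The tiny stand-in module and the target module names below are assumptions; with the real Qwen Image Edit backbone you'd point target_modules at its actual attention projections:

```python
# Minimal LoRA-attachment sketch with PEFT. TinyAttention is a stand-in for
# the editing model's backbone; the module names are assumptions.
import torch.nn as nn
from peft import LoraConfig, get_peft_model

class TinyAttention(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

backbone = TinyAttention()

lora_config = LoraConfig(
    r=16,                                     # low rank keeps the adapter small
    lora_alpha=32,
    target_modules=["to_q", "to_k", "to_v"],  # typical attention projections
    lora_dropout=0.05,
)

model = get_peft_model(backbone, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices will train
# ...then train the adapter on your synthesized (input, instruction, target) pairs.
```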
Are these models built atop models that already understand natural language?
If the commands all follow the same syntax, it's easy to imagine how you can generate a good training set.
But how do they fully grasp natural language well enough to perform tasks worded in unexpected ways, which would be easy to parse if they understood natural language?
"But how to they fully grasp natural language to be able to perform tasks worded unexpectedly, which would be easy to parse, if they understood natural language?"
A Large Language Model. Pardon me for spelling out the full acronym, but it is what it is for a reason.
I think a lot of the whiz-bang applications of LLMs have drowned it out, but LLMs are effectively the solution to the long-standing problem of natural language understanding, and that alone would be enough to make them a ground-breaking technology. Taking English text and translating it with very high fidelity into the vector space these models understand is amazing and I think somewhat underappreciated.
Yes, the newer image and video editing models have an LLM bolted onto them. The rich embeddings from the LLM are fed into a diffusion transformer (DiT) alongside a tokenized version of the input image. These two streams “tell” the model what to do.
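As a rough, model-agnostic illustration of that wiring (a toy block, not the actual architecture of any particular editing model):

```python
# Toy DiT-style block: image tokens self-attend, then cross-attend to the
# LLM's text embeddings, which carry the editing instruction.
import torch
import torch.nn as nn

class TinyDiTBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_tokens, text_embeddings):
        x = image_tokens + self.self_attn(image_tokens, image_tokens, image_tokens)[0]
        x = x + self.cross_attn(x, text_embeddings, text_embeddings)[0]  # the text stream "tells" it what to do
        return x + self.mlp(x)

image_tokens = torch.randn(1, 256, 512)    # tokenized/latent patches of the input image
text_embeddings = torch.randn(1, 77, 512)  # rich embeddings from the bolted-on LLM
print(TinyDiTBlock()(image_tokens, text_embeddings).shape)  # torch.Size([1, 256, 512])
```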
My impression is that `us-east-1` has the worst reliability track record of any region. We've always run our stuff in `us-west-2` and there has never been an outage that took us down in that region. By contrast, a few things that we had in `us-east-1` have gone down repeatedly.
There is always that point you reach where someone has to get on a plane with their hardware token and fly to another data centre to reset the thing that maintains the thing that gives keys to the thing that makes the whole world go round.
It depends on what you want the model to do for you. If you want the model to complete text, then you would provide the input text unmasked followed by a number of masked tokens that it's the model's job to fill in. Perhaps your goal is to have the model simply make edits to a bit of code. In that case, you'd mask out the part that it's supposed to edit and the model would iteratively fill in those masked tokens with generated tokens.
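As a toy illustration of that fill-in-the-masked-span setup (the "model" here just picks random tokens; a real text diffusion model would predict all masked positions jointly and refine them over steps):

```python
# Toy masked-infilling loop: the surrounding code stays fixed, only the
# [MASK] positions get (randomly, here) filled in over a few refinement steps.
import random

MASK = "[MASK]"
vocab = ["return", "a", "+", "b", "-", "*"]

def toy_denoise_step(tokens):
    # A real model would predict every masked position; this stub fills
    # roughly half of the remaining masks per step with random vocab words.
    return [random.choice(vocab) if t == MASK and random.random() < 0.5 else t
            for t in tokens]

# Editing setup: keep the code you want untouched, mask only the span to rewrite.
tokens = ["def", "add", "(", "a", ",", "b", ")", ":", MASK, MASK, MASK, MASK]
for _ in range(8):
    tokens = toy_denoise_step(tokens)
print(" ".join(tokens))
```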
One of the supposedly powerful applications of text diffusion models is coding. Auto-regressive LLMs don't inherently come with the ability to edit; they can generate instructions that another system interprets as editing commands. Being able to literally unmask the parts you want to edit is a pretty powerful paradigm that could improve, not just speed up, many coding tasks.
I suspect that elements of text diffusion will be baked into coding models like GPT Codex (if they aren't already). There's no reason you couldn't train a diffusion output head specifically designed for code editing and have the same model make use of that head when it makes the most sense to do so.
This is a great summary. If you think about it a bit, text is an expanded representation of concepts meant for display on a two-dimensional surface that can then be read back by human eyes; our brains convert the two-dimensional information into concepts again.
So to me it’s not a surprise that you can transform the two-dimensional representation of the same information into concepts again without losing much.
The paper talks about using this approach to generate large amounts of LLM training data rapidly. That’s intriguing. It suggests that one of the best ways of training models on a wide variety of input data with very long context is to provide them with an image representation instead of text tokens.
Text is actually one-dimensional; writing is two-dimensional.
To a pure LLM, characters 15 and 16 on line 1 are considered adjacent, but there's no relationship between character 15 of line 1 and character 15 of line 2.
To a vision model (which sees text as squiggles, not UTF-8 code points), such a relationship does exist.
My view of the PC dev era was through the lens of a kid growing up in the 1980s with a dad who programmed for a living. My dad was a big fan of the Borland IDEs, starting with Turbo Pascal and then moving on to the world of C and C++ by the late 1980s. As kids, my friend and I spent hundreds of hours in Quick Basic’s TUI, always trying to remake Super Mario Bros but never coming close to succeeding.
These early IDEs were fantastic at their job and worked so well given the constraints of the DOS environment of the time. It’s a shame that Borland the company eventually faded to black in 2015, but that’s how these things go. I wonder where all the geniuses behind the Borland IDEs ended up.