SAM 2 doesn't just focus on speed; it actually performs better than SAM 1, whereas the other models always trade performance for speed. SAM 2 achieves this thanks to its Hiera MAE encoder: https://arxiv.org/abs/2306.00989
Does anyone have experience applying these models to rendered content (PDFs, webpages, etc.)? Seems like a really promising area of research for LLM agents.
Doesn’t work well for screen-based content in general. One of the authors of SAM 2 talked about this explicitly on the most recent Latent Space pod: it’s not a focus of theirs, since it’s not foundational in the research space.
I appreciate this overview, but something that isn’t clear to me is how SAM 2 compares to EfficientSAM and the other improvements built on SAM 1. Is SAM 2 better across the board, or is it better than SAM 1 but not a slam dunk compared to EfficientSAM and the others? Especially as it relates to speed and model size. Should we wait for someone to make an EfficientSAM 2?
SAM 2's key contribution is adding time-based segmentation so it can be applied to videos. Even on images alone, the authors note [0] that it exceeds SAM 1 on image-based segmentation benchmarks. There have been some weaknesses exposed in SAM 2 vs SAM 1 in certain areas, like potentially medical images [1]. EfficientSAM trades some SAM 1 accuracy for a ~40x speedup. I suspect we will soon see an EfficientSAM 2.
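For intuition about what "time-based segmentation" adds: the video setting means each frame's masks are associated with the object being tracked from previous frames, rather than segmented independently. Here's a toy sketch of that idea with a simple IoU matcher. To be clear, this is not SAM 2's actual mechanism (SAM 2 uses a learned memory attention module); the matcher and masks below are made up for illustration:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def track(prev_mask, candidate_masks):
    """Pick the candidate mask in the new frame that best overlaps
    the object we were tracking in the previous frame."""
    scores = [iou(prev_mask, m) for m in candidate_masks]
    return int(np.argmax(scores))

# Frame t: the tracked object is a 3x3 square near the top-left.
frame_t = np.zeros((8, 8), dtype=bool)
frame_t[1:4, 1:4] = True

# Frame t+1: two candidate masks; the object moved one pixel right.
moved = np.zeros((8, 8), dtype=bool)
moved[1:4, 2:5] = True
distractor = np.zeros((8, 8), dtype=bool)
distractor[5:8, 5:8] = True

print(track(frame_t, [distractor, moved]))  # → 1 (the moved object)
```

Per-frame overlap matching like this breaks under occlusion and fast motion, which is roughly why a learned memory over past frames (SAM 2's approach) is the interesting part.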
Seeing some of the examples of these SAM models, I am concerned about the possibility that some military/militant group might use them to build an unjammable guided weapon (i.e. a killer drone or missile). Given these models' apparent ability to track objects in real time, it's probably not much of a stretch to convert that into coordinates?
Hopefully by that time there will be better defences against this type of thing, maybe a SAM-powered anti-drone/anti-missile system.
Drone maybe, but you underestimate the speed of a rocket.
Also, onboard computation adds payload weight, or else makes your system dependent on a server-side comms link.
I am not sure what the solution is, but restricting these models from open source usually just means denying access to the public, while bad actors will still find a way to use or recreate them (with just slightly more effort).
yea, but why? If existing CV works, what does SAM add? You just need to spot the tank; you don't need to perfectly outline it. It's enough to just identify it.
But how expensive are these systems? That is, the ones not vulnerable to jamming, which can guide themselves independently of the operator even if the signal is lost?
Very cheap. The computer vision part is pretty basic: just a camera and software running simple object-detection algorithms (which we've had for years) that can identify tanks, trucks, soldiers, etc.