Segment Anything Model and Friends (lightly.ai)
110 points by sauravmaheshkar 12 months ago | 23 comments


SAM 2 not only focuses on speed, it actually performs better than SAM (1); the other models, by contrast, always trade performance for speed. SAM 2 achieves this thanks to its Hiera MAE encoder: https://arxiv.org/abs/2306.00989
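
For the curious, here's a minimal sketch of point-prompted image segmentation with the SAM 2 repo's Python API (the checkpoint and config file names below are assumptions; check the facebookresearch/segment-anything-2 README for the exact ones):

    # Minimal sketch: point-prompted image segmentation with SAM 2.
    # Assumes the `sam2` package is installed and a checkpoint/config
    # pair has been downloaded (names below are assumptions).
    import numpy as np
    from PIL import Image

    from sam2.build_sam import build_sam2
    from sam2.sam2_image_predictor import SAM2ImagePredictor

    predictor = SAM2ImagePredictor(
        build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")
    )

    image = np.array(Image.open("example.jpg").convert("RGB"))
    predictor.set_image(image)

    # One positive click (x, y) on the object of interest.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),  # 1 = foreground, 0 = background
    )
    print(masks.shape, scores)  # (num_masks, H, W) masks with confidences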


Does anyone have experience applying these models to rendered content (PDFs, webpages, etc.)? Seems like a really promising area of research for building LLM agents.


Doesn't work well for screen-based content in general. On the most recent Latent Space pod, one of the SAM 2 authors explicitly said this isn't a focus of theirs, as it's not foundational in the research space.


> Doesn't work well for screen-based content in general.

It's not perfect, but it works: https://github.com/OpenAdaptAI/OpenAdapt/pull/610

> the most recent Latent Space pod

Link: https://www.latent.space/p/sam2


We are using the Segment Anything Model at OpenAdapt for exactly this purpose: https://github.com/OpenAdaptAI/OpenAdapt/pull/610

It works surprisingly well despite the fact that the model was not trained on this type of data.
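
For anyone curious what that looks like in practice, a rough sketch (simplified, not OpenAdapt's actual pipeline) is to run SAM's automatic mask generator over a screenshot and treat the resulting boxes as candidate UI elements:

    # Rough sketch: segmenting a UI screenshot with SAM 1's automatic
    # mask generator. Simplified illustration, not OpenAdapt's real code.
    import numpy as np
    from PIL import ImageGrab  # or load a saved screenshot instead

    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    mask_generator = SamAutomaticMaskGenerator(sam)

    screenshot = np.array(ImageGrab.grab().convert("RGB"))
    masks = mask_generator.generate(screenshot)

    # Each mask dict has a boolean "segmentation", a "bbox" (XYWH), and
    # an "area"; the boxes map naturally onto clickable UI elements.
    for m in sorted(masks, key=lambda m: m["area"], reverse=True)[:10]:
        print(m["bbox"], m["area"])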



I appreciate this overview, but something that isn't clear to me is how SAM 2 compares to EfficientSAM and the other improvements based on SAM 1. Is SAM 2 better across the board, or is it better than SAM 1 but not a slam dunk compared to EfficientSAM and the others? Especially as it relates to speed and model size. Should we wait for someone to make an EfficientSAM 2?


SAM 2's key contribution is adding time-based segmentation for video. Even on images alone, the authors note [0] that it exceeds SAM 1 on image segmentation benchmarks. Some weaknesses of SAM 2 vs. SAM 1 have been exposed in certain areas, potentially including medical images [1]. EfficientSAM trades some SAM 1 accuracy for a ~40x speedup. I suspect we will soon see an EfficientSAM 2.

[0] https://x.com/josephofiowa/status/1818087122517311864 [1] https://x.com/bowang87/status/1821021898928443520?s=46&t=9K-...
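
For context on what the video side actually adds, here's a minimal sketch of prompting one frame and propagating masklets through a clip with the SAM 2 repo's video predictor (the paths, config name, and frame-directory layout are assumptions; see the repo's notebooks for canonical usage):

    # Sketch: prompt one frame, propagate masks through a video with SAM 2.
    # Assumes the `sam2` package plus a downloaded checkpoint/config pair.
    import numpy as np

    from sam2.build_sam import build_sam2_video_predictor

    predictor = build_sam2_video_predictor(
        "sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt"
    )

    # The predictor expects a directory of JPEG frames; "frames/" is a
    # hypothetical path.
    state = predictor.init_state(video_path="frames/")

    # One positive click on object 1 in frame 0.
    predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Memory attention carries the masklet through the remaining frames.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        print(frame_idx, obj_ids, (mask_logits[0] > 0).sum().item())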


Seeing some of the examples of these SAM models, I am concerned that some military/militant group might use them to build an unjammable guided weapon (i.e. a killer drone or missile). Given these models' apparent ability to track objects in real time, it's probably not much of a stretch to convert that into coordinates?

Hopefully by that time there will be better defences against this type of thing, maybe a SAM-powered anti-drone/anti-missile system.


> track objects in real time

A drone, maybe, but you underestimate the speed of a rocket.

Also, compute power adds payload weight or makes your system dependent on a server-side comms link.

I am not sure what the solution is, but restricting these models from open source usually just means denying access to the public, while bad actors will still find a way to use or obtain them (with just slightly more effort).


You don't need SAM for that. These systems already exist in Ukraine.


But you could use it for that, and would most likely get SotA results.

There are plenty of Nvidia Jetson boards in the Ukrainian skies these days, not necessarily for SAM but for other signal processing and CV tasks.


Yeah, but why? If existing CV works, what does SAM add? You just need to spot the tank; you don't need to perfectly outline it. It is enough to just identify it.


It's not that different in defence. SAM2 might be more robust in some cases.

Not everything is just a guided bomb. Sometimes you might want to count and track objects over time.


But how expensive are these systems? That is, the ones not vulnerable to jamming, that can guide themselves independently of the operator even if the signal is lost?


These systems already exist in Ukraine, and no, they are not expensive.

Simply running SAM would already be more expensive.


Very cheap. The computer vision part is pretty basic: just a camera and software running simple object detection algos (that we've had for years) to identify tanks, trucks, soldiers, etc.


I would love to learn more about Grounded-Segment-Anything, along with its speed implications, in an article similar to this one.


We interviewed the SAM2 lead author on our pod last week; it goes into more detail on the technical background and challenges: https://news.ycombinator.com/item?id=41185647


This is a really interesting article. Thanks a lot for sharing! :-)


Cool article, thanks for sharing!


Is anyone aware of any GUI-driven tools that leverage SAM2 yet? Especially with video.


There are a bunch of demos in the form of HF Spaces:

* Pure CPU Inference for Point and Box Prompting on Images: https://huggingface.co/spaces/lightly-ai/SAMv2-Mask-Generato...

* GPU-powered Inference for Point and Box Prompting on Images: https://huggingface.co/spaces/SkalskiP/segment-anything-mode...

* Video Segmentation: https://huggingface.co/spaces/fffiloni/SAM2-Video-Predictor
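
And if you'd rather roll your own GUI, a few lines of Gradio get you a point-and-click image demo. A rough sketch, assuming `predictor` is a SAM2ImagePredictor set up as in the repo docs:

    # Rough sketch: DIY point-prompt GUI with Gradio around SAM 2.
    # Assumes `predictor` (a SAM2ImagePredictor) is already constructed.
    import gradio as gr
    import numpy as np

    def segment_on_click(image, evt: gr.SelectData):
        # evt.index is the (x, y) pixel the user clicked.
        predictor.set_image(image)
        masks, scores, _ = predictor.predict(
            point_coords=np.array([list(evt.index)]),
            point_labels=np.array([1]),
        )
        best = masks[np.argmax(scores)].astype(bool)
        overlay = image.copy()
        # Tint the highest-confidence mask blue.
        overlay[best] = (
            0.5 * overlay[best] + 0.5 * np.array([30, 144, 255])
        ).astype(np.uint8)
        return overlay

    with gr.Blocks() as demo:
        inp = gr.Image(label="Click an object")
        out = gr.Image(label="Mask overlay")
        inp.select(segment_on_click, inputs=inp, outputs=out)

    demo.launch()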



