SAM 2 doesn't just focus on speed; it actually performs better than SAM 1, whereas the other models always trade performance for speed. SAM 2 achieves this thanks to its Hiera MAE encoder: https://arxiv.org/abs/2306.00989
Does anyone have experience applying these models to rendered content (PDFs, webpages, etc.)? Seems like a really promising area of research for LLM agents.
Doesn’t work well for screen-based content in general. One of the authors of SAM 2 talked about this explicitly on the most recent Latent Space pod: it’s not a focus of theirs, since it’s not foundational in the research space.
I appreciate this overview, but something that isn’t clear to me is how SAM 2 compares to EfficientSAM and the other improvements built on SAM 1. Is SAM 2 better across the board, or is it better than SAM 1 but not a slam dunk compared to EfficientSAM and the others? Especially as it relates to speed and model size. Should we wait for someone to make an EfficientSAM 2?
SAM 2's key contribution is adding time-based segmentation so it can be applied to videos. Even on images alone, the authors note [0] that it exceeds SAM 1 on image-based segmentation benchmarks. There have been some weaknesses exposed in SAM 2 vs SAM 1 in certain areas, like potentially medical images [1]. EfficientSAM trades some SAM 1 accuracy for a ~40x speedup. I suspect we will soon see an EfficientSAM 2.
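For intuition about what "time-based segmentation" adds: the video setting means each frame's masks are associated with the object being tracked from previous frames, rather than segmented independently. Here's a toy sketch of that idea with a simple IoU matcher. To be clear, this is not SAM 2's actual mechanism (SAM 2 uses a learned memory attention module); the matcher and masks below are made up for illustration:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def track(prev_mask, candidate_masks):
    """Pick the candidate mask in the new frame that best overlaps
    the object we were tracking in the previous frame."""
    scores = [iou(prev_mask, m) for m in candidate_masks]
    return int(np.argmax(scores))

# Frame t: the tracked object is a 3x3 square near the top-left.
frame_t = np.zeros((8, 8), dtype=bool)
frame_t[1:4, 1:4] = True

# Frame t+1: two candidate masks; the object moved one pixel right.
moved = np.zeros((8, 8), dtype=bool)
moved[1:4, 2:5] = True
distractor = np.zeros((8, 8), dtype=bool)
distractor[5:8, 5:8] = True

print(track(frame_t, [distractor, moved]))  # → 1 (the moved object)
```

Per-frame overlap matching like this breaks under occlusion and fast motion, which is roughly why a learned memory over past frames (SAM 2's approach) is the interesting part.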
Seeing some of the examples of these SAM models, I am concerned about the possibility that some military/militant group might use them to build an unjammable guided weapon (i.e. a killer drone or missile). Given these models' apparent ability to track objects in real time, it's probably not much of a stretch to convert that into coordinates?
Hopefully by that time there will be better defences against this type of thing, maybe a SAM-powered anti-drone/anti-missile system.
Drone maybe, but you underestimate the speed of a rocket.
Also, onboard computation adds payload weight, or else makes your system dependent on a server-side comms link.
I am not sure what the solution is, but restricting these models from open source usually just means denying access to the public, while bad actors will still find a way to use or recreate them (with just slightly more effort).
yea, but why? If existing CV works, what does SAM add? You just need to spot the tank; you don't need to perfectly outline it. It's enough to just identify it.
But how expensive are these systems? That is, the ones not vulnerable to jamming, which can guide themselves independently of the operator even if the signal is lost?
Very cheap. The computer vision part is pretty basic: just a camera and software running simple object-detection algorithms (which we've had for years) that can identify tanks, trucks, soldiers, etc.