Show HN: Build an open-source computer vision model in seconds using text (usezeroshot.com)
64 points by nharada on Nov 28, 2023 | 14 comments
Hello HN! I want to share something a few friends and I have been working on for a while now: Zeroshot, a web tool that builds image classifiers using text-image models and autolabeling. What does this mean in practice? You can put together an image classifier in about 30 seconds that’s faster and more accurate than CLIP, and that you can deploy yourself however you’d like. It’s open source, commercially licensed, and doesn’t require you to pay anyone per API call.

Here's a 2 minute video that shows it off: https://www.youtube.com/watch?v=S4R1gtmM-Lo

How/why does it work? We believe that with the rise of foundation vision models, computer vision will fundamentally change. These powerful models will let any dev “compile” a model ahead of time with a subset of the foundation model’s characteristics, using only text and a web tool. The days of teams of MLEs building complex models and pipelines are ending.

Zeroshot works by using two powerful pre-trained models together: CLIP and DINOv2. The web app lets users quickly create their training sets via text search. Using pre-cached DINOv2 features, we generate a simple linear model that can be trained and deployed without any fine-tuning. Since you can see what’s going into your training set, you can tune your prompts to get the type of performance or detail you want.
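To make the "simple linear model on cached features" idea concrete, here is a minimal sketch of a linear classifier trained on fixed feature vectors. It is not Zeroshot's actual code: the function names are hypothetical, the features are small synthetic vectors standing in for cached DINOv2 embeddings, and plain logistic regression is assumed as the linear model.

```python
import math
import random


def sigmoid(score):
    # Clamp the score so math.exp cannot overflow on extreme values.
    score = max(-30.0, min(30.0, score))
    return 1.0 / (1.0 + math.exp(-score))


def train_linear_classifier(features, labels, lr=0.1, epochs=100):
    """Fit binary logistic regression on fixed feature vectors with plain SGD."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y  # gradient of the log loss w.r.t. the score
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(x, weights, bias):
    """Return 1 if the linear score is positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Synthetic stand-in for cached embeddings: two clusters, one per class.
# In the real tool these would be features of images found by text search.
random.seed(0)
class_a = [[random.gauss(-1.0, 0.5) for _ in range(4)] for _ in range(20)]
class_b = [[random.gauss(1.0, 0.5) for _ in range(4)] for _ in range(20)]
features = class_a + class_b
labels = [0] * 20 + [1] * 20

weights, bias = train_linear_classifier(features, labels)
correct = sum(predict(x, weights, bias) == y for x, y in zip(features, labels))
accuracy = correct / len(labels)
```

Because the feature extractor stays frozen, only the weights and bias are learned, which is why training is fast and the resulting model is small.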

CLIP Small -- Size: 335 MB, Latency: 35ms

CLIP Large -- Size: 891 MB, Latency: 276ms

Zeroshot -- Size: 85 MB, Latency: 20ms

What’s next? We want to see how people use (or would use) the tool before deciding what to do next. On the list: clients for iOS and NodeJS, speeding up GPU inference times via TensorRT, offering larger Zeroshot models for better accuracy, easier results refining, support for bringing your own data lake, and model refinement using GPT-V. We’ve got plenty of ideas.



Awesome project. Exactly what I was looking for.

When it produces a set of images for a given prompt, wouldn't it be better if we could remove a set of images from the possible selection? Does it not work this way? Another idea would be to provide a few different kinds of prompts and, based on those, select all the images that matter for the given "class".

Some other things that would be good to know:

1. Can we keep adding items to the classifier, and get newer versions of the classifier with the newly added items?

2. How do we deploy and host this kind of model? Are there any guidelines on how to deploy this on AWS or GCS for production use cases?


You mean you want to remove the images because you get false positives? We've thought about dataset curation and how to manage that, ranging from full on "you can build your own dataset from scratch" to "refine results using ChatGPT/V".

Deployment guidelines are a good idea! It's fairly straightforward to deploy since it's just a Python package and you can run it via CPU or GPU. With CPU we deploy using ONNX which means the dependency list is quite small (compared to torch). For example, the part on the web app which tests your model is just deployed to AWS Lambda.

Would having us host the models be useful or something worth paying for? Obviously we couldn't offer that for free, but may be able to offer an endpoint for your model that is pay-per-call.


This is so impressive and can be used by a semi-technical person quite fast. Well done. How often does it go off the rails?


Thank you! The goal is to make it easier for devs without a bunch of ML background to add CV features to their apps. It goes off the rails when the underlying models have weird quirks, so for example if you search "ginger" you get a bunch of cat photos, since "ginger" can be a type of cat. In those cases a little prompt tuning will fix it, so "ginger root" solves it.

Did you solve your error? If it's still happening, do you have a code snippet I can try and repro?


I was naming my python file zeroshot.py which is the same as the import name so it was creating a dependency error, my bad. Also needed to update my python version. However, I'm running into this error now:

/opt/homebrew/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:69: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'AzureExecutionProvider, CPUExecutionProvider'

Not really sure what this means but I could do some searching.

Could also be interesting to:

- Allow base64 images

- Build in real-time video classifying

- Generate models in terminal


Ahh yeah that one is just a warning but I should see if I can improve it. It's basically saying that it's using CPU not GPU. EDIT: Fixed in 0.1.8
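For anyone curious what the warning above means in code terms, here is a minimal sketch of picking execution providers from the ones that are actually installed. The `choose_providers` helper is hypothetical and is not necessarily how the 0.1.8 fix works; the available list is copied from the warning text. With ONNX Runtime installed, the list would come from `onnxruntime.get_available_providers()` and the result would be passed as the `providers` argument of `onnxruntime.InferenceSession`.

```python
def choose_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Return the preferred providers that are actually available, in priority order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"None of {list(preferred)} found in {list(available)}")
    return chosen


# The providers reported in the warning: no CUDA build is installed.
available = ["AzureExecutionProvider", "CPUExecutionProvider"]
providers = choose_providers(available)  # only the CPU provider remains
```

Filtering the list up front avoids requesting a GPU provider that is not installed, which is what triggers the warning.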

Good ideas too, especially base64 should be fairly easy to implement. I've wanted to try doing a video to see how it looks as well.
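As a rough idea of what base64 support could involve, here is a sketch that turns a base64 string (optionally wrapped in a data URI) back into raw image bytes using only the standard library. The `decode_base64_image` name is hypothetical and this is not part of the zeroshot package; the resulting bytes would still need to be handed to whatever image loader the classifier uses.

```python
import base64
import binascii


def decode_base64_image(payload: str) -> bytes:
    """Decode a base64 string, or a base64 data URI, into raw image bytes."""
    if payload.startswith("data:"):
        # Data URIs look like "data:image/png;base64,<encoded bytes>".
        header, sep, payload = payload.partition(",")
        if not sep or not header.endswith(";base64"):
            raise ValueError("expected a base64 data URI")
    try:
        return base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("payload is not valid base64") from exc


# Round trip using the 8-byte PNG signature as a tiny stand-in for an image.
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = base64.b64encode(png_signature).decode("ascii")
from_plain = decode_base64_image(encoded)
from_data_uri = decode_base64_image("data:image/png;base64," + encoded)
```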


Also add option for multiple classifiers perhaps. Crying, Smiling, Crying AND Smiling for example.
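One way to picture this is multi-label output, where each label gets its own independent score and any number of labels can be active at once. The sketch below assumes per-label scores between 0 and 1 already exist; the scores, the threshold, and the `active_labels` helper are made up for illustration and say nothing about how Zeroshot would implement it.

```python
def active_labels(scores, threshold=0.5):
    """Return every label whose independent score clears the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical scores from separate binary classifiers run on one image.
scores = {"crying": 0.81, "smiling": 0.64, "sleeping": 0.05}
labels = active_labels(scores)  # both "crying" and "smiling" are active
```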


neontomo - thanks for your enthusiasm. do you actually want to use zeroshot in python if you had a choice? or would you prefer another language or framework?

as popular as the language is getting, not everyone knows python, or has it set up.

and even if they do, arguably python is not even a good language to build real applications


Python is a good choice imo, the other one I would suggest is working it into an API but I can work around it. My use case is not there yet so don't build it for me.


Can a user upload their own images in addition to what your system already does? Like to identify all pictures of your dog rather than just dogs?


Not yet! This is something we're thinking about, but couldn't afford to offer for free since there's some amount of pre-processing/caching with GPUs required. Maybe we could instead have users somehow pre-process/host that themselves and provide hooks


That would be a pretty interesting way to make specific models. Local processing would work for the custom use cases.


This is very cool -- thank you for sharing, Nate


Wow this is very cool. Would love to see some of the ideas in the video demo’d



