
A question for SD LoRA trainers: is this usable for making captions, and what are you using, apart from BLIP?

Also, can your model of choice understand your requests to include/omit particular nuances of an image?



I like Qwen2-VL 7B because it outputs shorter captions with less fluff. But if you need to do anything advanced that relies on reasoning and instruction following, the model completely falls flat on its face.

For example, I have a couple of way-too-wordy captions made with another captioner, which I'd like to cut down to the essentials while correcting any mistakes. With this approach Qwen2 completely ignores the image and focuses only on the given caption, which makes it unable to even remotely fix issues in said caption.
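
Roughly what I'm attempting, for reference (a minimal sketch following the Qwen/Qwen2-VL-7B-Instruct model card recipe; the image path and the wordy caption are placeholders):

  from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
  from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

  model_id = "Qwen/Qwen2-VL-7B-Instruct"
  model = Qwen2VLForConditionalGeneration.from_pretrained(
      model_id, torch_dtype="auto", device_map="auto")
  processor = AutoProcessor.from_pretrained(model_id)

  old_caption = "..."  # the too-wordy caption from the other captioner
  messages = [{
      "role": "user",
      "content": [
          {"type": "image", "image": "file:///path/to/image.png"},
          {"type": "text", "text":
              "Rewrite this caption so it matches the image, keeping only the "
              "essentials and fixing any mistakes:\n" + old_caption},
      ],
  }]

  # standard Qwen2-VL chat-template + generate flow
  text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  image_inputs, video_inputs = process_vision_info(messages)
  inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                     padding=True, return_tensors="pt").to(model.device)
  out = model.generate(**inputs, max_new_tokens=128)
  trimmed = out[:, inputs.input_ids.shape[1]:]  # drop the prompt tokens
  print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])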

I'm really hoping Pixtral will be better at instruction following, but I haven't been able to run it because they didn't prioritize transformers support, which in turn has held up quantized versions that would fit on consumer hardware.


I’m no expert, but Florence2 has been my go-to. It’s pretty great at picking up art styles and IP stuff: “The image depicts Goku from the anime series Dragonball Z…”

I don’t believe you can really prompt it, though; the other models I could prompt also didn’t work well on that front anyway.
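
For anyone curious, captioning with it is basically just picking a task token, which is also why free-form prompting doesn’t really apply. A minimal sketch along the lines of the microsoft/Florence-2-large model card:

  import torch
  from PIL import Image
  from transformers import AutoModelForCausalLM, AutoProcessor

  model_id = "microsoft/Florence-2-large"
  model = AutoModelForCausalLM.from_pretrained(
      model_id, torch_dtype=torch.float16, trust_remote_code=True).to("cuda")
  processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

  image = Image.open("image.png").convert("RGB")
  task = "<MORE_DETAILED_CAPTION>"  # or "<CAPTION>", "<DETAILED_CAPTION>"
  inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)

  generated_ids = model.generate(
      input_ids=inputs["input_ids"],
      pixel_values=inputs["pixel_values"],
      max_new_tokens=256,
      num_beams=3,
  )
  raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
  result = processor.post_process_generation(
      raw, task=task, image_size=(image.width, image.height))
  print(result[task])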

TagGui is an easy way to try out a bunch of models.


Yeah, BLIP mostly ignores prompts too. I tried to disassemble it and feed it my prompts, to no avail. I did find, though, that the default kohya GUI arguments are not even remotely the best. Here are my args:

  finetune/make_captions.py ... \
    --num_beams=12 \
    --top_p=0.9 \
    --max_length=75 \
    --min_length=24 \
    --beam_search \
    ...
With this, I very often just take its caption as-is, or add only a little.
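
On the prompt side: as far as I can tell, BLIP only supports conditional captioning, where your text is just a prefix the caption continues rather than an instruction, which would explain why it gets ignored. A rough sketch with the transformers port (kohya's script wraps the original BLIP code, but the knobs map over), assuming the blip-image-captioning-large checkpoint:

  from PIL import Image
  from transformers import BlipProcessor, BlipForConditionalGeneration

  model_id = "Salesforce/blip-image-captioning-large"
  processor = BlipProcessor.from_pretrained(model_id)
  model = BlipForConditionalGeneration.from_pretrained(model_id).to("cuda")

  image = Image.open("image.png").convert("RGB")
  # the "prompt" is only a prefix the model completes, not an instruction
  inputs = processor(image, "a photo of", return_tensors="pt").to("cuda")
  out = model.generate(**inputs, num_beams=12, max_length=75, min_length=24)
  print(processor.decode(out[0], skip_special_tokens=True))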

> TagGui

Oh, interesting, thanks!



