Extremely curious that PaLM-E, PaLI, and GPT-4 were trained to be multimodal (accept non-text inputs, such as images) but the released APIs are text-only. In GCP's case here, they've released PaLM 2, which, unlike PaLM-E and PaLI, is not multimodal. This prevents using it for visual reasoning [0].
I'm just wondering why multiple parties seem reluctant to allow the public to use this.
The image compression/decompression from their special token system wouldn't be free; it would be just as expensive as any other per-pixel transformation on an image file, and it would be entirely custom software that they would have to run on their servers. Image upload and download is a very significant increase in network traffic compared to just text, and could make the whole venture cost a lot more. And finally, even a downsized image is going to decompose into a lot of tokens, so that's a lot of computational cost just to run inference on it. If they haven't implemented statefulness (which many haven't yet, despite the simplicity of the technique; the field is still very new), that computational cost must be repeated with every fresh API call.
Basically, multimodal functionality means roughly an order-of-magnitude increase in compute, traffic, and storage requirements for anyone providing it, compared to a text-only model (or a text-only-input model).
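As a back-of-envelope sketch of the "lots of tokens per image" point, assuming a ViT-style tokenizer that maps each square pixel patch to one token (the resolution and patch size below are illustrative assumptions, not any vendor's actual configuration):

```python
# Rough token count for a single image under a hypothetical
# ViT-style patch tokenizer. All numbers are illustrative.
IMAGE_EDGE = 224   # downsized image edge, in pixels
PATCH_EDGE = 14    # patch edge, in pixels (one token per patch)

image_tokens = (IMAGE_EDGE // PATCH_EDGE) ** 2  # 16 x 16 grid
text_tokens = 50   # a typical short text question, for comparison

print(image_tokens)                # 256 tokens for one small image
print(image_tokens / text_tokens)  # several times the text prompt
```

So even a single small image can dominate the prompt's token budget, and that inference cost recurs on every stateless API call.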
I wish they would just open the floodgates. The vultures will realize that their extractive problems won't be solved by a generative model, no matter how "multimodal" its inputs are. Of course, that won't happen, because it would require certain charlatans to admit that their models won't hold up in half the places the even greedier vultures are vying for.
Presumably they're harder to censor or enforce ideological constraints on. I can't see any reason other than them being worried about bad press because someone made the model do something that could be played up as bad.
I can think of two very important reasons just off the top of my head.
1. It will kill captchas for good. Half of the internet is protected by Cloudflare or Google captchas at this point. Spam, fraud, and other trouble have a maximum possible volume because you can only pay a human in India so little to solve them for you. If you have an algorithm that can complete a captcha, the game is up; sites may as well not have one at all. Prevention then becomes much more Orwellian, with hardware TPM attestation solutions, and the internet as we know it is forever changed.
2. It will show corporations and governments just how all-seeing video surveillance could be. Human-level (or, by some reports, above-human) computer vision is a Pandora's box all by itself.
OpenAI might simply want to avoid opening any more family-size cans of worms than there already are.
Quality of output is at the same level as GPT.
The biggest issues for us:
1. text-bison is limited to 1024 output tokens.
2. We ask for JSON output, but many times it is not valid JSON (trailing comma after the last element, missing } after an element, etc.). We ended up writing our own parsing code to work around these JSON format issues.
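The workaround parsing code above might look something like this sketch, which patches the two glitch classes the comment mentions (trailing commas and unclosed braces); `repair_json` is a hypothetical name, and real model output will have failure modes this doesn't cover:

```python
import json
import re

def repair_json(text: str) -> dict:
    """Best-effort repair of common LLM JSON glitches:
    markdown fences, trailing commas, and unclosed braces/brackets."""
    # Strip markdown code fences the model sometimes wraps output in.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    # Remove trailing commas before a closing brace/bracket.
    # (Naive: could also touch commas inside string values.)
    text = re.sub(r",\s*([}\]])", r"\1", text)
    # Append any closing braces/brackets the model dropped,
    # tracking string literals so braces inside strings are ignored.
    stack, in_string, escape = [], False, False
    for ch in text:
        if escape:
            escape = False
            continue
        if ch == "\\":
            escape = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch in "{[":
                stack.append("}" if ch == "{" else "]")
            elif ch in "}]" and stack:
                stack.pop()
    text += "".join(reversed(stack))
    return json.loads(text)
```

For example, `repair_json('{"a": [1, 2,], "b": {"c": 1')` recovers `{"a": [1, 2], "b": {"c": 1}}`. A more robust alternative is to constrain generation itself (few-shot examples of exact output, or lower temperature) so repair is rarely needed.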
I've been demoing it and have found it struggles to reliably output structured JSON at the moment. I'm curious if folks have had different experiences and if so what their prompts were.
We fine-tuned bison with the input set to doc content and the output as JSON, but the generation keeps getting prefixed with some of the input. Waiting to hear back from Google about what we might be doing wrong. The JSON itself looks great, though.
Edit: sorry, that was a different experiment. The one that worked well was an address splitter, trained off Google Address Validator output, funnily enough. Still, the output JSON got prefixed with some of the address input.
Interesting statement. I'd be keen to see whether businesses will trust Google to try out these capabilities, or prefer other smaller recent services given their flexibility of integration with existing cloud choices.
It seems we may soon find companies on all major cloud providers, to guarantee access to the unique proprietary services that providers are starting to use to differentiate themselves from their competitors.
I really, really wonder how the price of Vertex AI compares, in practice, to the OpenAI API for a startup with unpredictable and non-sustained usage. The multitenancy assumptions baked into the OpenAI API's cost structure might make it much cheaper. Has anybody modeled this? I realize the LLMs aren't equivalent today, but long term they could be.
They seem to be discounting it heavily right now. I haven't seen much of a charge on my bill yet even though I have used it quite a lot. But things might change, so I'm not really sure what the bill will be at the end of the month.
[0] https://visualqa.org