Yes, I definitely want to improve the search. It is currently very text-heavy, and I only recently got image similarity indexing working. I'm hoping to leverage this to do something like what you mentioned!
I'd also like to figure out how to turn an image into a description of what's in it. My ML/TensorFlow knowledge is very weak though, so I still have a lot to learn here.
Have you tried something based on deep learning that uses Transformers?
https://github.com/roatienza/deep-text-recognition-benchmark (the available weights are for tasks that seem similar to OCR, so there is a good chance you can use it out of the box). With a good GPU it should process hundreds to thousands of images per second, so you can likely build your index in less than a day. (Maybe you can even port it to your iPhone stack :) )
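To put rough numbers on that "less than a day" estimate, here is a back-of-the-envelope sketch. The collection size of 5 million images is purely hypothetical (I don't know how big your index is), and the throughput figures are just the "hundreds to thousands per second" range above, not measurements.

```python
def index_build_hours(num_images: int, images_per_second: float) -> float:
    """Hours needed to run OCR over num_images at a steady throughput."""
    if images_per_second <= 0:
        raise ValueError("images_per_second must be positive")
    return num_images / images_per_second / 3600


def max_images_per_day(images_per_second: float) -> int:
    """How many images fit into 24 hours at a steady throughput."""
    return int(images_per_second * 86_400)


if __name__ == "__main__":
    # ASSUMPTION: 5 million images is a made-up collection size.
    hypothetical_collection = 5_000_000
    for rate in (100, 1000):  # low and high end of the guessed range
        hours = index_build_hours(hypothetical_collection, rate)
        print(f"{rate} img/s -> {hours:.1f} h, "
              f"up to {max_images_per_day(rate):,} images per day")
```

So even at the low end of that range, a single GPU gets through several million images in a day. Real throughput will depend on image size, batching, and decoding cost.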
There are tons of other freely available solutions that you can find by searching for keywords like "image to text ocr", "transformers", "visual transformers"...
You can do better than a general image-to-text model for reading memes, because memes all use the same fonts, so you want something trained on synthetic data made with those fonts.
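A minimal sketch of what generating that synthetic data could look like, assuming Pillow is installed. The word list, image size, and the `font_path` argument are all hypothetical placeholders; without a font file it falls back to Pillow's built-in default font, which is not a meme font, so you would point it at whatever font file you actually want to learn.

```python
import random
from PIL import Image, ImageDraw, ImageFont

# ASSUMPTION: a toy vocabulary; real training text would come from your captions.
WORDS = ["when", "you", "finally", "fix", "the", "bug", "nobody", "expects", "monday"]


def load_font(font_path=None, size=32):
    """Load the target font; font_path is a hypothetical path to a .ttf file."""
    if font_path is not None:
        return ImageFont.truetype(font_path, size)
    return ImageFont.load_default()  # fallback so the sketch runs anywhere


def make_sample(rng, font, size=(320, 64)):
    """Return (image, label): random upper-case text on a random background."""
    text = " ".join(rng.choice(WORDS) for _ in range(rng.randint(2, 4))).upper()
    background = tuple(rng.randint(0, 255) for _ in range(3))
    img = Image.new("RGB", size, background)
    draw = ImageDraw.Draw(img)
    x, y = 8, 16
    # Fake the usual black outline by drawing the text at small offsets first.
    for dx in (-2, 0, 2):
        for dy in (-2, 0, 2):
            draw.text((x + dx, y + dy), text, font=font, fill="black")
    draw.text((x, y), text, font=font, fill="white")
    return img, text


if __name__ == "__main__":
    rng = random.Random(0)
    font = load_font()
    for i in range(3):
        img, label = make_sample(rng, font)
        img.save(f"sample_{i}.png")
        print(i, label)
```

Each (image, label) pair is one training example for the recognizer. In practice you would also vary font size, position, and backgrounds (real photos rather than flat colors) so the model sees the kind of variation real memes have.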