It's useful to have the plain text down the line for operations that don't involve a language model (e.g. search). Also, if you have a bunch of prompts to run, it's potentially cheaper (though perhaps less accurate) to run the OCR once, saving yourself image tokens, or even to use a smaller model for the subsequent prompts.
Tons of uses: storage (text instead of images), search (a user typing in a text box who wants instant retrieval from a dataset), etc. And cost: run on the images once, then the rest of your queries only need to run on text.
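As a concrete illustration, here's a minimal sketch of that "OCR once, search forever" pattern. It uses pytesseract and Pillow as the OCR backend purely for convenience; a VLM could fill the same role, and the file names are placeholders.

```python
# Minimal sketch of the "OCR once, search many times" pattern.
# Assumes pytesseract + Pillow and a local Tesseract install;
# any OCR backend (including a VLM) could be swapped in.
import pytesseract
from PIL import Image

def build_text_index(image_paths):
    """Run OCR a single time per image and cache the extracted text."""
    return {p: pytesseract.image_to_string(Image.open(p)) for p in image_paths}

def search(index, query):
    """Instant substring search over the cached text -- no model calls."""
    q = query.lower()
    return [p for p, text in index.items() if q in text.lower()]

# OCR runs once here...
index = build_text_index(["scan_001.png", "scan_002.png"])
# ...and every later query is a cheap text-only operation.
print(search(index, "invoice"))
```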
For a VLLM, my understanding is that OCR corresponds to a subset of possible questions: those of the type 'read exactly what's written in this document'.
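For example, here's a hedged sketch of posing exactly that question to a vision-capable chat model, using the OpenAI-style message format; the model name and file path are illustrative placeholders.

```python
# Sketch: phrasing OCR as a "read exactly what's written" question
# to a vision-capable chat model (OpenAI-style API shown as one example).
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("document.png", "rb") as f:  # placeholder file
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Read exactly what's written in this document. "
                     "Transcribe the text verbatim, with no commentary."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```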