I'm not sure I understand the problem specification. You want to be able to search "law", and find documents containing "qanoon" or "kanun", right? How does your proposed solution handle that? It seems like the approach with ML TL -> Lucene would still only find one of the two, unless your model is written to return a set of possible transliterations. Or are you saying your approach doesn't currently solve this part of the problem, and that's one of the things you'd like input on?
Is the corpus the only data you have, i.e. do you need to use it for training and validation as well?
In terms of the size of the data, if you want to store the corpus on the phone anyway, won't the index and model be relatively small in comparison?
No, sorry for the confusion. I want to be able to type “kanun” or “qanoon”, and have it infer the Hindi word “कानून”, which is an indexed word.
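To make the intended behavior concrete, here is a toy sketch (the variant table is made-up illustrative data, not my actual approach) of mapping several roman-script spellings to the one indexed Devanagari word:

```python
# Toy illustration with hypothetical data: multiple roman spellings
# of the same word all map to the single indexed Hindi form.
variants = {
    "kanun": "कानून",
    "qanoon": "कानून",
    "kaanoon": "कानून",
}

def infer_indexed_word(query: str):
    """Return the indexed Hindi word for a roman-script query, if known."""
    return variants.get(query.lower())

print(infer_indexed_word("Qanoon"))  # कानून
```

The real problem, of course, is producing that mapping for unseen spellings, which is where the ML model would come in.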
There is not necessarily a one-to-one correspondence between words: sometimes two English words may represent one Hindi word, or vice versa.
I believe I can build a decent-sized training/validation set, for example from Bollywood song lyric databases written in English, mapping them to the Hindi equivalent (or Tamil, Bengali, etc.).
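For instance, a naive word-level pairing from parallel lyric lines might look like this (hypothetical data; real alignment is harder since, as noted, the correspondence is not always one-to-one):

```python
# Toy sketch: derive word-level training pairs from a parallel lyric line.
# Hypothetical example data; assumes equal word counts, which won't
# always hold in practice.
roman = "mera kanun".split()
hindi = "मेरा कानून".split()

pairs = list(zip(roman, hindi))
print(pairs)  # [('mera', 'मेरा'), ('kanun', 'कानून')]
```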
As for your last question, I don't know, since I haven't implemented an ML model in practice. I saw a tutorial on BERT this morning, where each word has 768 features. That alone sounds huge, let alone the model itself.
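To put my size worry in rough numbers, a back-of-envelope calculation (all figures below are assumptions typical of BERT-base, not measurements of any specific model):

```python
# Rough size estimate for a BERT-style embedding table alone.
# Assumed figures: ~30k vocabulary, 768 features per token (as in the
# tutorial), float32 storage.
vocab_size = 30_000
hidden_dim = 768
bytes_per_float = 4

embedding_bytes = vocab_size * hidden_dim * bytes_per_float
print(f"Embedding table alone: {embedding_bytes / 1024**2:.0f} MB")  # ~88 MB
```

And that is just the input embeddings; the transformer layers add several times more, which is why on-device size is a real concern for me.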