I'm not sure I understand the problem specification. You want to be able to search "law", and find documents containing "qanoon" or "kanun", right? How does your proposed solution handle that? It seems like the approach with ML TL -> Lucene would still only find one of the two, unless your model is written to return a set of possible transliterations. Or are you saying your approach doesn't currently solve this part of the problem, and that's one of the things you'd like input on?
Is the corpus the only data you have, i.e. do you need to use it for training and validation as well?
In terms of the size of the data, if you want to store the corpus on the phone anyway, won't the index and model be relatively small in comparison?
No, sorry for the confusion. I want to be able to type “kanun” or “qanoon”, and have it infer the Hindi word “कानून”, which is an indexed word.
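To make the intended behavior concrete, here is a toy sketch (the variant table is made-up illustrative data, not my actual approach) of mapping several roman-script spellings to the one indexed Devanagari word:

```python
# Toy illustration with hypothetical data: multiple roman spellings
# of the same word all map to the single indexed Hindi form.
variants = {
    "kanun": "कानून",
    "qanoon": "कानून",
    "kaanoon": "कानून",
}

def infer_indexed_word(query: str):
    """Return the indexed Hindi word for a roman-script query, if known."""
    return variants.get(query.lower())

print(infer_indexed_word("Qanoon"))  # कानून
```

The real problem, of course, is producing that mapping for unseen spellings, which is where the ML model would come in.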
There is not necessarily a one-to-one correspondence between words: sometimes two English words may represent one Hindi word, or vice versa.
I believe I can build a decent-sized training/validation set, for example from Bollywood song lyric databases written in English, mapping them to the Hindi equivalent (or Tamil, Bengali, etc.).
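For instance, a naive word-level pairing from parallel lyric lines might look like this (hypothetical data; real alignment is harder since, as noted, the correspondence is not always one-to-one):

```python
# Toy sketch: derive word-level training pairs from a parallel lyric line.
# Hypothetical example data; assumes equal word counts, which won't
# always hold in practice.
roman = "mera kanun".split()
hindi = "मेरा कानून".split()

pairs = list(zip(roman, hindi))
print(pairs)  # [('mera', 'मेरा'), ('kanun', 'कानून')]
```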
As for your last question, I don't know, since I haven't implemented an ML model in practice. I saw a tutorial on BERT this morning, where each word has 768 features. That alone sounds huge, let alone the model itself.
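To put my size worry in rough numbers, a back-of-envelope calculation (all figures below are assumptions typical of BERT-base, not measurements of any specific model):

```python
# Rough size estimate for a BERT-style embedding table alone.
# Assumed figures: ~30k vocabulary, 768 features per token (as in the
# tutorial), float32 storage.
vocab_size = 30_000
hidden_dim = 768
bytes_per_float = 4

embedding_bytes = vocab_size * hidden_dim * bytes_per_float
print(f"Embedding table alone: {embedding_bytes / 1024**2:.0f} MB")  # ~88 MB
```

And that is just the input embeddings; the transformer layers add several times more, which is why on-device size is a real concern for me.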