The BloomSearchEngine takes a TokenizerFunc so you can determine how JSON values are tokenized (that's why each path always returns an array of strings).
So {"name": "John Smith"} is tokenized to [{Path: "name", Values: ["john", "smith"]}], and the bloom filters will store:
- field: "name"
- token: "john"
- token: "smith"
- fieldtoken: "name:john"
- fieldtoken: "name:smith"
The same tokenizer must be used at query time too.
Fuzzy searches and sub-word searches could be supported with custom tokenizers (eg trigrams, stemming), but it's more generally targeting the "I know some exact subset of the record, I need all that have this exactly" searches
The default tokenizer is a a whitespace one: https://github.com/danthegoodman1/bloomsearch/blob/148a79967...
So {"name": "John Smith"} is tokenized to [{Path: "name", Values: ["john", "smith"]}], and the bloom filters will store:
- field: "name"
- token: "john"
- token: "smith"
- fieldtoken: "name:john"
- fieldtoken: "name:smith"
The same tokenizer must be used at query time too.
Fuzzy searches and sub-word searches could be supported with custom tokenizers (eg trigrams, stemming), but it's more generally targeting the "I know some exact subset of the record, I need all that have this exactly" searches