Good luck folks! I'm glad there are projects trying to solve enterprise search. ...

ankit219 · on Feb 23, 2024

I think search is the wrong lens for this. Yes, finding relevant information quickly is important, but the key to enterprise search tools is giving a holistic view of any topic. A typical enterprise has a lot of silos (12 out of top 15 enterprise apps on G2 are addressing this problem) and information doesn't flow between them. An enterprise search tool helps aggregate and triangulate conversations/knowledge from various sources, giving any user a sense of 1/ what is the current state of this topic? 2/ how did they get to this state? (as you may have guessed, you need a knowledge graph for the product to work well)

(Disclaimer: we started with enterprise search too, and now we think a custom model is a better way to get to those goals) Also, the output needs to be integrated into their own workflows. E.g.: seeing a conversation on Slack and using a bot to fetch all the supplemental information needed to understand the context from, say, mail, docs, etc. is mighty useful. Search using RAG is the easy part, the hard part is contextualizing it in a way it is immediately useful. That depends on understanding the company/domain lingo, understanding users, etc.

The private aspect is bad UX but not a bottleneck. Remember, this is targeted at power users looking to use it on a daily basis. Connecting the bot once to get started is fine as long as it adds value. If anyone offers data governance along with it (updating access on a daily basis and answering only from what the user can access), it could be a huge hit.

yuhongsun · on March 2, 2024

Exactly, there is a huge amount of value in being able to quickly get a holistic view of topics. Most topics don't exist in an isolated tool: most often there are the official discussions/designs which exist in one place, there are customer interactions about the topic which use a separate channel, and then there are one-off small conversations about the topic in chats like Slack. So isolated, tool-specific searches are great for finding specific documents, but less useful for getting actionable insights.

Regarding contextualizing: we're currently working on organizational understanding and we're very excited about this one! We're embedding users based on the documents they authored or interacted with, the questions they have asked, descriptions of the projects they worked on, and the org chart. The thinking is that there will always be questions that can't be fully answered via documentation alone. But in those cases we'll be able to recommend someone who might know. It also has the benefit of contextualizing the user asking so that we can surface more relevant results for them.
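
To make the idea concrete, here is a minimal sketch of "embed users, then recommend someone who might know". It is not Danswer's implementation: the function names, the choice to average document vectors into a user vector, and the tiny 3-dimensional toy embeddings are all assumptions for illustration. A real system would get its vectors from an embedding model and would mix in questions asked, project descriptions, and the org chart.

```python
import math
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend_expert(query_vec: Vector, user_vecs: Dict[str, Vector]) -> str:
    """Return the user whose embedding is closest to the query embedding."""
    return max(user_vecs, key=lambda user: cosine(query_vec, user_vecs[user]))


# Hypothetical toy data: embeddings of documents each user authored.
doc_vecs_by_user = {
    "alice": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]],
    "bob": [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]],
}

# Assumption: a user's embedding is simply the mean of their documents.
user_vecs = {user: mean_vector(vecs) for user, vecs in doc_vecs_by_user.items()}

expert = recommend_expert([1.0, 0.0, 0.0], user_vecs)
```

The same user vector could also be used to re-rank search results for the person asking, which is the second benefit mentioned above.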

baetylus · on Feb 24, 2024

>> the hard part is contextualizing it in a way it is immediately useful. That depends on understanding the company/domain lingo, understanding users, etc.

How do you control this deterministically? It sounds like the "hard part" is variation in prompting & selectively choosing the right data to include, both of which I could see being good enough right now but hard to deliver definitively.

yuhongsun · on March 2, 2024

Being able to filter down the data deterministically is a big value add, especially as the number of documents scales into the range of multiple millions. We have filters by document-set, tags, time range, and source type (i.e. only include Slack + Google Drive, or Confluence + Jira + Gong, etc.)
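
As a rough illustration of what a deterministic pre-filter over those four dimensions could look like, here is a small sketch. The `Doc` and `Filters` types and their field names are hypothetical, not Danswer's schema, and a production system would most likely push these predicates down into the search index rather than filter in Python.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Doc:
    doc_id: str
    source_type: str
    document_set: str
    updated_at: datetime
    tags: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Filters:
    # None means "do not filter on this dimension".
    source_types: Optional[FrozenSet[str]] = None
    document_sets: Optional[FrozenSet[str]] = None
    tags: Optional[FrozenSet[str]] = None  # doc must carry at least one
    updated_after: Optional[datetime] = None
    updated_before: Optional[datetime] = None


def matches(doc: Doc, f: Filters) -> bool:
    """True if the document passes every filter that is set."""
    if f.source_types is not None and doc.source_type not in f.source_types:
        return False
    if f.document_sets is not None and doc.document_set not in f.document_sets:
        return False
    if f.tags is not None and not (doc.tags & f.tags):
        return False
    if f.updated_after is not None and doc.updated_at < f.updated_after:
        return False
    if f.updated_before is not None and doc.updated_at > f.updated_before:
        return False
    return True


def apply_filters(docs: Iterable[Doc], f: Filters) -> List[Doc]:
    """Keep only matching documents, preserving input order."""
    return [doc for doc in docs if matches(doc, f)]


docs = [
    Doc("d1", "slack", "support", datetime(2024, 2, 1), frozenset({"escalation"})),
    Doc("d2", "confluence", "eng", datetime(2023, 6, 1), frozenset({"design"})),
    Doc("d3", "google_drive", "support", datetime(2024, 1, 15)),
]

# Only Slack + Google Drive, updated in 2024 or later.
recent_chat_and_drive = apply_filters(
    docs,
    Filters(
        source_types=frozenset({"slack", "google_drive"}),
        updated_after=datetime(2024, 1, 1),
    ),
)
```

Because every predicate is a plain comparison, the same query with the same filters always yields the same candidate set, which is the deterministic part of the flow.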

The challenge is with the non-deterministic portions of the flow as you pointed out. Ensuring retrieval quality in out-of-domain datasets, guardrailing the LLM generation, working with conflicting or deprecated information are some of the interesting areas we're addressing. Happy to dive deeper on any aspect you're curious about, and I'm sure we can learn from the discussion as well.

yuhongsun · on Feb 22, 2024

I think I understand your concern but if I miss the point, please follow up!

So regarding getting access to read knowledge from the different tools, it depends tool by tool but a lot of them have API keys or options for app integrations available in the free tier (GitHub, Google Drive, Confluence come to mind). Other tools don't have a free tier and you just get access to the API keys as a part of paying for the service. I think there are probably tools that require a premium fee to get integration access but I'm not aware of any personally.

For the SlackBot, it can add itself to public channels but for private channels someone needs to add it. It is what it is sadly.

About search being available for most SaaS products: SaaS tools are definitely improving their own searches. But I still think a single place to search and aggregate data has significant value. For example, as an engineer by training, I often find that getting the full picture of a customer escalation means reading Slack threads, Confluence design docs, and old pull requests on GitHub. It would be nice to get it all in one place.

BillFranklin · on Feb 22, 2024

> It is what it is sadly.

This is what I mean -- previously I built a similar search engine on top of slack, notion, etc., but didn't launch the product because I thought that requiring users to constantly add bots to private channels would be a subpar experience. I thought this would be a blocker for good UX, so didn't go further, but maybe you'll find a nice solution!

Searching over public internal data is addressed by a few existing tools, but it's the private aspect which is pretty difficult to handle and disastrous to get wrong when managed ad-hoc - e.g. someone accidentally adds the bot to a private slack group called #layoffs :) so you'd want this handled properly and centrally.

I guess you'll also need to handle privacy well, ~maybe it's OK when run as a SaaS for db admins to have access to ingested data, but if it's OSS then the people that run it probably shouldn't be able to read the private data that's ingested, so now you need to handle search over encrypted data, which is a fun problem :D

yuhongsun · on March 2, 2024

Access control is a non-glamorous but critical piece of what we're building. We're currently implementing automatic access syncing for a few sources like Google Drive, Confluence, Jira, and Notion to start. By matching document access in the source to users and groups, and then to emails, we can finally map Danswer users to document-level access. So someone searching in Danswer will only get results based on the set of documents they have access to in the source tool.

For Slack it would look something like: get the users in the Slack channel, map those Slack users to users in Danswer. Then only those users in Danswer will be able to get results from that channel.
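
The mapping described above (source access list, to users and groups, to emails, to the searching user) can be sketched roughly as follows. Everything here is an illustrative assumption rather than Danswer's actual code: the `SourceAcl` type, the function names, and the toy IDs and emails are made up, and a Slack channel is modeled simply as a document whose access list is its member list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SourceAcl:
    """Access list for one document as reported by the source tool."""
    user_ids: Set[str] = field(default_factory=set)
    group_ids: Set[str] = field(default_factory=set)


def allowed_emails(
    acl: SourceAcl,
    group_members: Dict[str, Set[str]],
    email_by_source_user: Dict[str, str],
) -> Set[str]:
    """Expand groups to users, then map source users to emails."""
    source_users = set(acl.user_ids)
    for group_id in acl.group_ids:
        source_users |= group_members.get(group_id, set())
    # Users with no known email are dropped, so access fails closed.
    return {
        email_by_source_user[user]
        for user in source_users
        if user in email_by_source_user
    }


def visible_docs(
    searcher_email: str,
    doc_ids: List[str],
    emails_by_doc: Dict[str, Set[str]],
) -> List[str]:
    """Keep only the documents the searcher may see, preserving order."""
    return [d for d in doc_ids if searcher_email in emails_by_doc.get(d, set())]


# Hypothetical data pulled from a source tool during a sync.
group_members = {"eng": {"U1", "U2"}}
email_by_source_user = {
    "U1": "ann@example.com",
    "U2": "raj@example.com",
    "U3": "kim@example.com",
}
acls = {
    "design-doc": SourceAcl(user_ids={"U3"}, group_ids={"eng"}),
    # A private Slack channel whose only member is U1.
    "private-channel": SourceAcl(user_ids={"U1"}),
}

emails_by_doc = {
    doc_id: allowed_emails(acl, group_members, email_by_source_user)
    for doc_id, acl in acls.items()
}

results_for_raj = visible_docs(
    "raj@example.com", ["design-doc", "private-channel"], emails_by_doc
)
```

In this sketch a document with no synced access list is visible to nobody, which seems like the safer default for cases like the accidental private channel mentioned earlier in the thread.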

nl · on Feb 22, 2024

> ~maybe it's OK when run as a SaaS for db admins to have access to ingested data, but if it's OSS then the people that run it probably shouldn't be able to read the private data that's ingested

I don't understand the distinction here. If Danswer runs a SaaS version then yes, I agree they can have a license agreement that lets their DB admins see data in some cases, which is fine. That seems orthogonal to the case where a company is running the OSS version internally, in which case presumably their administrator can see all docs (but software administrators can usually do this anyway).

Weves · on Feb 22, 2024

Yep, this is exactly correct! For our SaaS version, we do have an agreement which allows us to look at data if needed to debug issues and/or improve search performance.

For self-hosted deployments, usually a select few admins who have set up the plumbing on AWS do have access (but as nl has mentioned, these people usually have superuser access on the tools we connect to anyway, so this is a no-op).