Having the feature table pivoted (with 200 feature1, feature2, etc. columns) meant they had to run metadata queries against system.columns to discover all the feature columns, which made the query sensitive to permissioning changes (especially a database becoming visible twice and duplicating every column).
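For illustration, roughly what that metadata-query pattern looks like, sketched with the clickhouse-connect Python driver; the table name and the dedup handling are made up, not Cloudflare's actual query.

```python
# Hypothetical sketch of the metadata-query pattern described above.
# Table name is made up; this is not Cloudflare's actual query.
import clickhouse_connect  # assumes the clickhouse-connect driver

client = clickhouse_connect.get_client(host="localhost")

# A pivoted schema means the feature columns have to be discovered at runtime.
rows = client.query(
    """
    SELECT name
    FROM system.columns
    WHERE table = 'bot_features'      -- note: no database filter
    ORDER BY name
    """
).result_rows

# If a permission change suddenly makes a second database (e.g. a replica or
# staging copy of the same table) visible, every column appears twice and the
# generated config doubles in size. Filtering on database, or deduplicating,
# avoids that:
feature_columns = sorted({name for (name,) in rows})
```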
A CrowdStrike-style config update that affects all nodes but evidently wasn't tested in any QA or staged rollout beforehand (the application panicking straight away on the new file basically proves this).
Finally, an error in the bot management config file should probably disable bot management rather than crash the core proxy.
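As a sketch of that failure-isolation idea (file path and format are hypothetical, not Cloudflare's), the config loader can reject a bad file and report the feature as disabled instead of letting the error take the process down:

```python
# Minimal sketch of "fail open on a bad feature config": a malformed file
# disables the optional feature instead of crashing the proxy process.
# File path, JSON layout, and the 200-feature limit are illustrative only.
import json
import logging
from typing import Optional

def load_bot_features(path: str) -> Optional[dict]:
    try:
        with open(path) as f:
            features = json.load(f)
        if len(features) > 200:          # sanity limit on feature count
            raise ValueError("too many features")
        return features
    except (OSError, ValueError) as exc:
        # Degrade gracefully: log, keep serving traffic without bot scores,
        # and wait for the next (hopefully valid) config push.
        logging.error("bot management config rejected: %s", exc)
        return None

features = load_bot_features("/etc/proxy/bot_features.json")
# features is None -> bot management disabled, core proxy keeps running
```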
I'm curious why they even decided to name ClickHouse, as this error could have been caused by any other database. I can see, though, how the replicas updating and causing results to flip-flop would have been really frustrating for incident responders.
Right, but this is also a pretty common pattern in distributed systems that publish from databases (really, from any large central source of truth); it might be the defining problem of systems like this. When you're lucky the corner cases are obvious; in the big one we experienced last year, a new row in our database tripped an if-let/mutex deadlock, which our system dutifully (and very quickly) propagated across our entire network.
The solution to that problem wasn't better testing of database permutations or a better staging environment (though in time we did do those things). It was (1) a watchdog system in our proxies to catch arbitrary deadlocks (which caught other stuff later), (2) segmenting our global broadcast domain for changes into regional broadcast domains so prod rollouts are implicitly staged, and (3) a process for operators to quickly restore that system to a known good state in the early stages of an outage.
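For a rough idea of point (1), and not the commenter's actual implementation: a watchdog keeps a timestamp of each worker's last check-in, and a supervisor acts on anything that stops making progress.

```python
# Rough sketch of a deadlock watchdog (illustrative only): worker threads
# must "pet" the watchdog regularly; if one stops making progress, the
# supervisor can dump state, restart it, or exit cleanly.
import threading
import time

class Watchdog:
    def __init__(self, timeout_s: float = 30.0):
        self.timeout_s = timeout_s
        self._last_seen: dict[str, float] = {}
        self._lock = threading.Lock()

    def pet(self, name: str) -> None:
        """Called by each worker on every loop iteration it completes."""
        with self._lock:
            self._last_seen[name] = time.monotonic()

    def stalled(self) -> list[str]:
        now = time.monotonic()
        with self._lock:
            return [n for n, t in self._last_seen.items()
                    if now - t > self.timeout_s]

def supervise(dog: Watchdog, interval_s: float = 5.0) -> None:
    while True:
        time.sleep(interval_s)
        for name in dog.stalled():
            # In a real proxy this might dump stacks, restart the worker,
            # or exit so the process manager brings up a clean instance.
            print(f"watchdog: {name} has not checked in, taking action")
```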
(Cloudflare's responses will be different from ours; really I'm just sticking up for the idea that the changes you need don't follow obviously from the immediate facts of an outage.)
When looking at multi-vector / ColBERT-style approaches, the embedding-per-token approach can massively increase costs. You might go from a single 768-dimension vector to roughly 130 token vectors of 128 dimensions each, i.e. 128 x 130 = 16,640 values per document. Even with better results from a multi-vector model, this can make it infeasible for many use cases.
Muvera converts the multiple vectors into a single fixed-dimension (usually net smaller) vector that any ANN index can use. Because you now have a single vector, you can reuse all your existing ANN algorithms and stack other quantization techniques on top for memory savings. In my opinion it is a much better approach than PLAID because it doesn't require specific index structures or clustering assumptions, and it can achieve lower latency.
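A heavily simplified, single-repetition sketch of the fixed-dimensional-encoding idea (averaging on both sides, made-up sizes); the real Muvera algorithm differs in the details:

```python
# Toy fixed-dimensional encoding: hash each token vector into a bucket via
# random hyperplanes, aggregate per bucket, and concatenate. Not the exact
# algorithm from the Muvera paper; bucket count and dims are illustrative.
import numpy as np

rng = np.random.default_rng(0)
DIM, N_PLANES = 128, 4                     # token dim, 2**4 = 16 buckets
planes = rng.standard_normal((N_PLANES, DIM))

def encode(token_vectors: np.ndarray) -> np.ndarray:
    """Collapse (num_tokens, DIM) token embeddings into one flat vector."""
    buckets = np.zeros((2 ** N_PLANES, DIM))
    counts = np.zeros(2 ** N_PLANES)
    for v in token_vectors:
        b = int("".join("1" if v @ p > 0 else "0" for p in planes), 2)
        buckets[b] += v
        counts[b] += 1
    counts[counts == 0] = 1                # leave empty buckets at zero
    return (buckets / counts[:, None]).ravel()   # 16 * 128 = 2048 dims

doc = encode(rng.standard_normal((130, DIM)))    # one vector per document
query = encode(rng.standard_normal((32, DIM)))
score = float(doc @ query)                 # single dot product, any ANN index works
```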
As someone living in Australia, this law is ridiculous and basically outlaws a chunk of my childhood where I used forums, IRC, and blogs.
As a parent I do share concerns about short-duration video content like TikTok, Reels, YouTube Shorts, etc., but I think any sensible regulation there would be better if it applied to everyone.
Age verification has such terrible consequences for anonymity that it shouldn't be an option. From the Explanatory Memorandum it looks like the current eSafety Commissioner was involved, which unfortunately explains a lot.
I'd argue that the current business model is clear (pay a subscription for chat or the API), but the valuations are unclear.
If open models continue to keep up and the industry as a whole just keeps improving on benchmarks fairly slowly and evenly (a big assumption, but that's what has been happening over the past year), then you would assume valuations would eventually drop. That said, there is still a lot of easy value left on the table (great voice integration, etc.) and a lot of room for innovations to be incorporated into other industries.
Additionally, each foundation LLM provider has a non-zero chance of a large step change in performance. If they can capture the value of that step change it would have huge upsides, but this is also hard to reason about (what is the valuation of private companies and the share market if AGI exists - does this even matter anymore?).
Hey @rvrs, I work on Weaviate and we are working on some improvements to increase write throughput:
1. gRPC. Using gRPC to write vectors has given a really nice performance boost. It is released in Weaviate core, but there is still some work to do on the clients. Feel free to get in contact if you would like to try it out.
2. Parameter tuning. Lowering `efConstruction` can speed up imports (see the sketch below).
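A minimal sketch of that tuning with the v3-style Weaviate Python client against a local instance; the keys and values are illustrative and should be checked against the client/server version you run:

```python
# Illustrative only: v3-style Weaviate Python client, local instance.
# efConstruction / maxConnections values are examples to tune, not recommendations.
import numpy as np
import weaviate

client = weaviate.Client("http://localhost:8080")

client.schema.create_class({
    "class": "Document",
    "vectorizer": "none",                  # vectors are supplied at import time
    "vectorIndexConfig": {
        "efConstruction": 64,              # lower = faster imports, slightly lower recall
        "maxConnections": 32,
    },
})

vectors = np.random.rand(1000, 768).astype("float32")   # stand-in embeddings

client.batch.configure(batch_size=200)     # batching amortises request overhead
with client.batch as batch:
    for i, vec in enumerate(vectors):
        batch.add_data_object({"title": f"doc-{i}"}, "Document", vector=vec.tolist())
```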
I work at Weaviate; a few comments on why we implemented hybrid search [1].
- Using two separate systems for traditional BM25 and vector search and keeping them in sync is pretty difficult from an operations perspective. A combined system is much easier to manage and will have better end-to-end latency.
- For combining scores, a linear combination like this article suggests is not recommended; instead rank fusion is used (https://rodgerbenham.github.io/bc17-adcs.pdf), where you care about where each method ranks a result rather than its absolute score (see the sketch after this list).
- The point of adding both search methods is to deal with what researchers term "out of domain" data, i.e. datasets the model producing the vectors was not trained on. Research from Google (https://arxiv.org/abs/2201.10582) suggests hybrid search with rank fusion helps in this case by around 20.4%. For "in domain" data, the model (usually transformer-based) will outperform BM25.
- Using a cross encoder [2] is a good component to add to improve relevance. It will only rerank the final results, though, so if the initial search returns 100 garbage results the cross encoder won't be able to help.
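For reference, a minimal reciprocal rank fusion sketch; k=60 is the constant commonly used in the literature, and the document IDs are made up:

```python
# Minimal reciprocal rank fusion (RRF): combine BM25 and vector result
# lists by rank position rather than by raw score.
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25_hits   = ["doc3", "doc1", "doc7"]     # ranked by keyword search
vector_hits = ["doc1", "doc9", "doc3"]     # ranked by vector search
print(rrf([bm25_hits, vector_hits]))       # doc1 and doc3 rise to the top
```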
The latency of a combination of parallel systems is equal to that of the slowest component. And obviously, specialized tools will be faster than a component of a multi-tool system, because while dedicated engines can invest in optimizing specific functionality, multi-tool engines are stuck in integration hell.
A solution assembled for a specific task from highly specialized components will always be more optimal than 'one-size-fits-all' pipelines. Meilisearch solves search-as-you-type better than anyone else, so why compromise? Not to mention that the scalability pattern of BM25 and vector search is entirely different.
This, by the way, is pretty obvious from the fact that you don't publish comparative benchmarks.
The cross-encoder will only rerank the results, that's right. And you're also correct that if the initial search returns 100 garbage results, it won't be able to help. But that's true for any reranking method. Even the rank fusion you use will rerank only the results returned by keyword and vector search. So what is the advantage of it over cross-encoders?
Adding a cross-encoder to your app means including PyTorch/transformers and a model as a dependency. For people using OpenAI or Cohere embedding APIs and lightweight infrastructure, this can be a big pain.
The proposed architecture doesn't limit you to self-hosted transformers only; you can use OpenAI just as easily. And you don't need to install a "module" for that.
There are two reasons that self-hosted is expensive, both having no relation to real costs.
1. If you are self-hosting, you have now segmented yourself into the customer category with the government, military, and large corporates concerned about security. This segment will pay more and so will be charged more. In an enterprise software company the vast majority of revenue comes from the larger customers vs many small customers, so these are the users that pricing gets optimised for.
2. GitLab wants to be seen in the market as a SaaS company, as subscription revenue is far preferable to licence revenue from a churn perspective, and also from a general trend perspective where self-hosted, hard-to-maintain solutions get replaced by cloud solutions.
I'm not a fan of competitors creating benchmarks like this, as when faced with any tuning decision they will usually pick the one that makes their competitors slower. But anyway, let's take a look at how they tuned Elasticsearch.
- No index template configuration
This would cause higher disk usage than needed due to duplicate mappings. Again, a Logstash vs Beats thing. For this test, more primary shards and a larger refresh interval would also improve things (see the sketch after this list).
- Graph complaining about Elasticsearch using 60% of available memory
This is as configured; they could use less with not much impact on performance.
- Document counts do not match
This is probably due to using syslog with randomly generated data rather than creating a test dataset on disk and reading the same data into all platforms.
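As a hedged example of what that tuning could look like, using the official elasticsearch Python client (8.x-style API); the index pattern, shard count, and refresh interval are illustrative values, not settings from the benchmark:

```python
# Illustrative index template: explicit mappings avoid duplicate text+keyword
# fields, more primary shards spread indexing load, and a longer refresh
# interval trades search freshness for ingest throughput. Values are examples.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.put_index_template(
    name="logs-benchmark",
    index_patterns=["logs-*"],
    template={
        "settings": {
            "number_of_shards": 4,       # spread indexing across cores/disks
            "number_of_replicas": 0,     # no replicas for a single-node benchmark
            "refresh_interval": "30s",   # batch refreshes instead of the 1s default
        },
        "mappings": {
            "properties": {
                "message": {"type": "text"},
                "host": {"type": "keyword"},   # keyword only, no duplicate text mapping
            }
        },
    },
)
```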
Thanks for the note. Our approach for this benchmark was to use the default configs that each of the logging platforms comes with.
This is also because we are not experts in Elastic or Loki, so we wouldn't know the possible impact of tuning the configs. To be fair, we also didn't tune SigNoz for this specific data or test scenario and ran it with default settings.
> Graph complaining Elasticsearch using 60% available memory. This is as configured, they could use less with not much impact to performance.
This is something we discussed, and we have added a note in the benchmark blog as well. Pasting it here for reference:
> For this benchmark for Elasticsearch, we kept the default recommended heap size memory of 50% of available memory (as shown in Elastic docs here). This determines caching capabilities and hence the query performance.
We could have tried to tinker with different heap sizes (as a % of total memory), but that would impact query performance, so we kept the default Elastic recommendation.
Part of the issue is that Elasticsearch isn't an open-source logging platform; it's a search-oriented database. Effectively using it as a logging platform depends heavily on configuration, compared with tools that are optimized for logs out of the box.
I imagine you'd have similar issues with Postgres or any general purpose datastore without the correct configuration.
I'm not an Elastic expert either, just a developer responsible for a lot of things that can Google pretty good, and I knew those configs seemed off. I've been hearing for years that Beats is preferable over Logstash. I don't even claim to work in the logging space :-)
The benefit of this approach is that I get to use prosumer hardware but at a reasonable cost (total < $350 AUD). For the APs, I just set them up via pairing in the mobile app and use the same SSID and password, which allows for easy roaming.
I'm contemplating upgrading to a Ubiquiti Dream Machine Pro to replace the ER-X, for more ports and the ability to add video recording and security cameras, but I'm really happy with the current setup from a Wi-Fi performance and stability perspective.