
Second that. A lot of our use cases are "remote tooling", i.e. calling APIs. Implementing an MCP server to wrap APIs seems very complex - both in terms of implementation and infrastructure.

We have found GraphQL to be a great "semantic" interface for API tooling definitions, since the GraphQL schema allows for descriptions in the spec and is very human-readable. For "data-heavy" AI use cases, the flexibility of GraphQL is nice because you can expose different levels of "data depth", which is very useful for controlling the cost (i.e. context window usage) and performance of LLM apps.

In case anybody else wants to call GraphQL APIs as tools in their chatbot/agents/LLM apps, we open sourced a library for the boilerplate code: https://github.com/DataSQRL/acorn.js
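To make that concrete, here is a rough sketch of the pattern (the names are made up for illustration - this is not acorn.js's actual API): the schema descriptions become the tool's natural-language documentation, and the LLM fills in the arguments.

  // Hypothetical sketch in TypeScript - illustrative names only,
  // not acorn.js's real API.
  const schema = `
    type Query {
      "Fetch recent orders for a customer; 'limit' controls data depth and cost."
      orders(customerId: ID!, limit: Int = 10): [Order!]!
    }
    type Order { id: ID! total: Float! }`;

  // What a generated OpenAI-style tool definition could look like:
  const ordersTool = {
    type: "function",
    function: {
      name: "orders",
      description:
        "Fetch recent orders for a customer; 'limit' controls data depth and cost.",
      parameters: {
        type: "object",
        properties: {
          customerId: { type: "string" },
          limit: { type: "integer", default: 10 },
        },
        required: ["customerId"],
      },
    },
  };

When the model calls the tool, executing it is just running the corresponding query against the existing GraphQL server.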


Why does it require you to build a server to wrap APIs? If the API uses HTTP, couldn't you just expose an SSE endpoint for the MCP client to use?
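Something like this, roughly, using the official TypeScript MCP SDK with express (treat the imports and names as approximate - I haven't double-checked them):

  // Sketch with @modelcontextprotocol/sdk + express + zod.
  import express from "express";
  import { z } from "zod";
  import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
  import { SSEServerTransport } from "@modelcontextprotocol/sdk/server/sse.js";

  const server = new McpServer({ name: "api-wrapper", version: "0.1.0" });

  // Each wrapped HTTP API call becomes one MCP tool
  // (api.example.com is a placeholder).
  server.tool("getUser", { id: z.string() }, async ({ id }) => {
    const res = await fetch(`https://api.example.com/users/${id}`);
    return { content: [{ type: "text" as const, text: await res.text() }] };
  });

  const app = express();
  let transport: SSEServerTransport;

  // The MCP client connects here and receives events over SSE...
  app.get("/sse", async (_req, res) => {
    transport = new SSEServerTransport("/messages", res);
    await server.connect(transport);
  });

  // ...and posts its JSON-RPC messages back here.
  app.post("/messages", (req, res) => transport.handlePostMessage(req, res));

  app.listen(3000);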


Oh wow. Amazing. I did not think of that. I am not a fan of GraphQL, but you might be onto something here. I have not checked the code, and perhaps this is not the right channel for this, but my read is that this library allows any generic GraphQL server to be exposed this way?


Exactly, any generic GraphQL server can be turned into a set of LLM tools with minimal overhead and complexity.


I agree that being able to write one piece of code that solves your use case is a big benefit over having to cobble together a message queue, stream processor, database, query engine, etc.

We've been playing around with the idea of building such an integration layer in SQL on top of open-source technologies like Kafka, Flink, Postgres, and Iceberg, with some syntactic sugar to make timeseries processing nicer in SQL: https://github.com/DataSQRL/sqrl/

The idea is to give you the power of kdb+ with open-source technologies and SQL in an integrated package by transpiling SQL, building the computational DAG, and then running a cost-based optimizer to "cut" the DAG to the underlying data technologies.
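To give a flavor of what that looks like (SQRL syntax from memory - it may not match the current spec exactly), a single script covers ingest, processing, and serving, and the optimizer decides where each node of the DAG runs:

  // SQRL-flavored sketch, embedded here as a string for illustration.
  const script = `
    IMPORT mypackage.Orders;  -- event stream, e.g. backed by a Kafka topic

    -- Incrementally maintained aggregate; the optimizer decides whether
    -- this runs in the stream engine or as a materialized view in Postgres.
    CustomerSpend := SELECT customerid, SUM(total) AS spend
                     FROM Orders GROUP BY customerid;
  `;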


Totally agree with the motivation - it is too cumbersome to stitch all these cloud services together by hand. Another project that's similar in motivation but focused on cloud data infrastructure is https://www.datasqrl.com/


Yes, the idea of maintaining materialized views based on standing queries to make queries instantaneous is the same. In addition, DataSQRL handles the ingest (e.g. consuming events off a queue, pre-processing the data, and populating the database) and the egress (i.e. serving the data through an API) so that all your data logic can live in one place.

Another key difference to Noria is that DataSQRL is an abstraction layer on top of existing technologies like Postgres, Flink, and Kafka, and does not aim to be another datastore. That way, you can use the technologies you already trust without having to write the integration code.


This sounds wonderful! And it validates many of my own thoughts. :)

Your product would align nicely with DAG recomputation engines like Fluvio and Temporal (Seattle).

Well, Noria implements the MySQL protocol, so if your system targets MySQL, you could run DataSQRL on Noria!


Exactly, there are so many amazing dataflow engines, stream processors, and databases out there. We are not competing with those.

We are trying to "compile away" all of the data plumbing code you have to write to integrate those systems into your application, so that it becomes easier to use them.

MySQL support in DataSQRL is definitely on the short-list.


You support JDBC, so JDBC -> MySQL protocol -> Noria should work, for some definition of "work".

My one minor nit is the creation of a new language. How well does ChatGPT-4 handle reading and writing it? It is possible to teach it a new language inside the prompt, but you run out of context window.

I am not being glib, but I mapped out pretty much this exact product. The crux of your success will be in the schema discovery and in versioning your schema, data, and flows in a way that can be tractably upgraded and downgraded.


You are totally right. We did not want to create a new language, and we are trying to keep it as close to SQL as possible. The problem is that SQL lacks the streaming constructs you need for temporal joins or for creating streams from relational tables. Jennifer Widom's group at Stanford did a lot of work on this (e.g. [1]). We are adding their operators to SQL in a way that is hopefully "easy enough". The rest is just syntactic sugar.
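To make the gap concrete: a temporal join enriches each stream event with the version of a table row that was valid at the event's timestamp. Flink SQL, for example, spells it like this (generic table names, shown as an embedded query string):

  // Event-time temporal join in Flink SQL - a streaming construct that
  // plain SQL has no standard syntax for: each order is joined against
  // the exchange rate valid at the order's event time.
  const temporalJoin = `
    SELECT o.order_id, o.price * r.rate AS price_usd
    FROM Orders AS o
    JOIN CurrencyRates FOR SYSTEM_TIME AS OF o.order_time AS r
      ON o.currency = r.currency
  `;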

But we are not tied to SQRL and totally open to ideas for making the language piece less of a hurdle.

GPT4 is surprisingly good at writing SQRL scripts with few-shot learning.

You are also right on the schema piece. We are trying to track schemas like dependencies in software engineering, so you can keep them in a repo and let a package manager + compiler handle schema compatibility and synchronization. https://dev.datasqrl.com/ is an early prototype of the repository idea.

[1] Arasu, A., Babu, S., & Widom, J. (2006). The CQL continuous query language: semantic foundations and query execution. The VLDB Journal, 15, 121-142.


We'd love for you to join us in building a high-level data development language to simplify data-driven application development.


I suppose our communication around Titan has caused some confusion after the acquisition by DataStax. As one of the Titan devs, I can say that we have no plans to abandon Titan. What we were trying to say is that we will have less time to dedicate to the project, in order to encourage others in the community to step up and contribute.

That has happened. Over the last couple of months, other Titan users have actively helped out on the mailing list to get newcomers started and have contributed bugfixes and features via pull requests. This has allowed us to keep the Titan 1.0 release on its original planned date.

What we are trying to do is make the Titan project less dependent on Dan and myself and more open and inviting to other developers who wish to contribute. For instance, we have dedicated more time than before to reviewing PRs. I realize there is still more work we need to do here, but so far the increased contributions have been an encouraging sign that we are heading in the right direction. So, Titan is here to stay and - as others have pointed out - there is more momentum than ever behind the project.


Take a look at Gremlin 3 - it now supports both declarative and imperative queries. In fact, you can even mix and match the two.

You want to match a complex pattern? Use declarative Gremlin so the query optimizer can figure out the best execution strategy for you. You have a highly custom path traversal? Use imperative Gremlin which gives you full control over the execution and provides you with everything you'd expect from a pipeline language. You have both? Combine them in a single traversal.
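A rough sketch with TinkerPop's toy "modern" graph, using the JavaScript/TypeScript Gremlin language variant (imports and names from memory - treat as approximate):

  import gremlin from "gremlin";

  const { traversal } = gremlin.process.AnonymousTraversalSource;
  const { DriverRemoteConnection } = gremlin.driver;
  const __ = gremlin.process.statics;
  const P = gremlin.process.P;

  const g = traversal().withRemote(
    new DriverRemoteConnection("ws://localhost:8182/gremlin"));

  // Declarative: describe the pattern; the optimizer orders the clauses.
  const coCreators = await g.V()
    .match(
      __.as("a").out("created").as("software"),
      __.as("software").in_("created").as("b"))
    .where("a", P.neq("b"))
    .select("a", "b")
    .toList();

  // Imperative: spell out the exact traversal pipeline yourself.
  const names = await g.V().has("person", "name", "marko")
    .out("knows").out("created").values("name").toList();

And because match() is just another step, you can keep traversing imperatively from its result - that's the mix-and-match part.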

While Gremlin 2 was an imperative query language, Gremlin 3 is a new type of query language that aims to combine the best of both worlds.


Yes, it works :-) Support for multiple storage backends gives Titan a lot of deployment flexibility and allows it to inherit some great features like multi-DC support. Software component reuse is pretty standard these days. What led you to the conclusion that it is the worst of all worlds?


I think you've documented some of the issues on the Titan Limitations page. For us, we couldn't get Titan to work properly, which we documented in the Issue Tracker. That led me to the conclusion that Titan was just too complex, because from my perspective it's obvious that you guys are spreading yourselves too thin with the 7 different backends.

Why do you need both a BerkeleyDB and a PersistIt backend? At the absolute most you should have 2 or 3: single machine, AP cluster, ACID cluster.

7 backends means 7 different database products behind the same API facade. Duh, right? Well, the problem is that it constrains your API to a least-common-denominator feature set, limiting access to the unique attributes and capabilities of the underlying backend. Not to mention completely abstracting away memory/disk issues. This is a really big issue with your approach. You have some sunk costs here, but I think eventually you will see the value in tightening up your focus.


Approximately $63 per hour on Amazon EC2.


I think you are looking at a very different use case here. The systems that I think you are referring to analyze a static graph representation. The Graph500 benchmark in particular loads one big static, unlabeled, undirected, property-free graph and then runs extensive (BFS) analysis algorithms on it. The fact that the graph is not changing allows significant investment into building locality-optimizing data structures (which is essentially what space decomposition is all about).

Titan on the other hand is a transactional database system built to handle large, multi-relational (labeled) graphs with heterogeneous properties. A Titan graph is constantly evolving (as in the posted benchmark). For graphs (unlike geo-spatial domains), applying space decomposition techniques first requires a metric space embedding, which is a non-trivial and computationally expensive process. For changing graphs, this embedding will change as well, making it very difficult to use in practice. The best approaches I know of for achieving locality therefore use adaptive graph partitioning techniques instead.

However, for the types of OLTP workloads that Titan is optimized for, this would be overkill in the sense that the time spent on partitioning would likely exceed the time saved at runtime. At very large scale, it is most important for OLTP systems to focus on access path optimization based on the ACTUAL query load experienced by the system and not some perceived sense of locality based on connectedness. I published a paper a while ago suggesting one approach to do so: http://www.knowledgefrominformation.com/2010/08/01/cosi-clou... The Graph500 benchmark explicitly prohibits this optimization ("The first kernel constructs an undirected graph in a format usable by all subsequent kernels. No subsequent modifications are permitted to benefit specific kernels").


Good comment.

I was using Graph500 as a decently documented public example, more than the only example. There are other problems based on real-world data in the trillion-edge range that serve as "hello world" models for testing massively parallel graph algorithms: directed and undirected, cyclic and acyclic, with properties and property-free. Semantic databases and entity analytics are popular test cases.

In the specific case of Graph500, the graph is significantly cyclic, which creates coordination issues if you simply denormalize the data (e.g. replicating edges around a graph cut). Being able to do a massively parallel BFS from any vertex in the graph and produce the globally correct result without replicating edges means that you cannot know how to optimize the organization ahead of time. This was an intentional part of the benchmark design. Graph500 does not lend itself to optimizing for a particular set of adaptive graph cuts in any conventional sense; the algorithms used need to be general over the 64 randomized runs, and the benchmark was designed to favor non-replicated edges when using massively parallel systems (the coordination costs of edge replicas will kill your performance). However, obviously the massively parallel systems are partitioning up the graph in some sense.

In the specific case of the work I did a couple years ago, the systems can ingest tens of millions of new edges per second concurrent with parallel queries (not serializable multi-statement transactions, obviously). The ingest structure can be identical to the structure against which ad hoc queries are run without any kind of query optimization. The fact that ingest rates that high are sustained effectively precludes dynamically reorganizing the data to satisfy particular queries more optimally. In truth, it could be made more optimal for batch-y type workloads (maybe 2-3x faster versus the dynamic version?) but the point was to be able to throw massive amounts of hardware at arbitrary graph data models rather than optimizing it for a specific query.

BTW, metric space embedding is non-trivial algorithmically but can also be computationally inexpensive. The Macbook Air I am using now can do tens of millions of those embedding operations per second on a single core for moderately complex spaces and data models. Maybe an order of magnitude or two slower if dealing with complex, non-Euclidean geometries. However, I also spent a couple years of computer science R&D developing the algorithms to make that fast. :-) I have been working in this particular area for a bit over half a decade now so my perspective takes some things for granted I think. There isn't just one problem you have to solve, there are actually several if you are starting from scratch.

Like I said, I didn't want to take anything away from Titan and true OLTP-oriented systems have their own complex problems, not the least of which is that they don't scale too far beyond a couple hundred compute nodes for the current state-of-the-art. Not my specialty. I work in a world of more basic consistency guarantees.

Cheers!

