For people here experienced with graph databases, do you typically use the graph db as your primary data store or do you use it in combination with something like postgresql? If you're using both, can you talk about how that works and if it's been successful for you?
I'm curious because I've had a couple situations where I thought using neo4j (or some graph db) would be a natural fit for something I wanted to do, but otherwise I thought most of my other data fit into postgresql just fine. My instinct is that if I'm doing this in a web app then querying from two different databases is going to slow down my responses a lot.
It came down to the fact that neo4j was only useful for parts of our queries, and fitting everything into neo4j was just a hassle when most of our data was relational.
From your post: "The biggest issue was that we had data in the graph, that just didn’t feel right in the graph instead of a relational DB." -- What exactly was the problem other than managing complexity? I read through your posts and I didn't see any mention of the technical aspects of your issues with Neo4j. Was your data just so large that going the relational-table route gave you a better understanding of that complexity?
Actually you can have a graph database in postgres as well! Look into queries using "WITH RECURSIVE", and you can do pretty much anything a graph database could do. For the specific use case I had, there was actually no difference in performance between neo4j and postgres. I really enjoyed using cypher, and it was a pain to translate a query written in a graph-specific query language to a postgres equivalent with "WITH RECURSIVE", but because postgres was already part of the stack I stuck with it.
Depending on the data model there are a few ways to deal with cycles. My project involved a friendship graph, and queries like "find friends of X", or "find people who like X who are within 2 degrees of friendship-separation from Y". These are problems where it was okay to have cycles in the graph, as the traversal depth was hard-limited. You'll have more problems answering questions like "find the cheapest path from A to B", although there is certainly a way to cope with cycles there as well.
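To make the depth-limiting concrete, here's a minimal sketch of that kind of query, assuming a made-up friendships(person_id, friend_id) table and psycopg2 (not the commenter's actual schema):

    # Hypothetical schema: friendships(person_id, friend_id), one row per edge.
    # The depth column caps the recursion, so cycles in the friendship graph
    # can't loop forever.
    import psycopg2

    FRIENDS_WITHIN_2 = """
    WITH RECURSIVE reachable(friend_id, depth) AS (
        SELECT f.friend_id, 1
        FROM friendships f
        WHERE f.person_id = %(start)s
      UNION
        SELECT f.friend_id, r.depth + 1
        FROM friendships f
        JOIN reachable r ON f.person_id = r.friend_id
        WHERE r.depth < 2
    )
    SELECT DISTINCT friend_id FROM reachable WHERE friend_id <> %(start)s;
    """

    def friends_within_two_degrees(conn, person_id):
        with conn.cursor() as cur:
            cur.execute(FRIENDS_WITHIN_2, {"start": person_id})
            return [row[0] for row in cur.fetchall()]

    if __name__ == "__main__":
        conn = psycopg2.connect("dbname=social")  # placeholder connection string
        print(friends_within_two_degrees(conn, 42))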
I've got a lot of background in RDF graph stores. It depends a lot on your usage, but I think for your typical web app, you'd be better off using a Postgres install, and making use of fancier features like WITH RECURSIVE as necessary. Graph stores often miss out on features like guaranteed relational integrity and guaranteed constraints, which I find invaluable for safe application development in the face of concurrent updates.
Graph stores are typically much slower for repetitive data that fits cleanly into a relational model. This isn't to say they're not useful - for more irregular data they're a fantastic fit - it's just that very irregularly structured data isn't the common case.
Of course, you can always use two different stores - much like many sites do with a separate lucene/elasticsearch index for text search - but your graphing needs must be relatively componentised for that to work well.
Curious: What RDF triple stores have you used, and in what kind of application?
I was looking into using Stardog for a metadata repository I was building, but we ended up (probably unwisely) bastardizing Postgres into a bunch of self-join hierarchies.
The ones I've spent the most time with are Jena/TDB, Virtuoso, and 3store, along with a couple of proprietary engines. BigOWLIM is also a strong contender in the space. I've used them in the context of both object storage and semantic web data storage.
My experience is that if you don't need constraints/enforced relational integrity, RDF stores make for really simple/easy object storage. There's definitely a performance tradeoff, though - depends on what you need, really!
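To illustrate the "simple/easy object storage" point (not tied to any particular store above), here's roughly what it looks like with rdflib in Python; the EX namespace and the property names are made up:

    # RDF-as-object-store sketch: every attribute is just another triple,
    # so adding a new "field" needs no schema migration.
    from rdflib import Graph, Namespace, Literal, URIRef

    EX = Namespace("http://example.org/")
    g = Graph()

    user = URIRef(EX["user/1"])
    g.add((user, EX.name, Literal("Alice")))
    g.add((user, EX.email, Literal("alice@example.org")))
    g.add((user, EX.nickname, Literal("al")))   # "new field", no ALTER TABLE

    # SPARQL query over the in-memory graph
    for row in g.query(
        "SELECT ?p ?o WHERE { ?s ?p ?o }",
        initBindings={"s": user},
    ):
        print(row.p, row.o)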
Hmm. I guess what I'm getting at is that I typically start building a web app by writing model objects that get turned into db schemas by the ORM (in Python, this is usually the Django ORM or SQLAlchemy), and the ORM turns attribute access into joins (either eagerly or lazily).
So an ORM that usefully interpreted model subclassing etc., created self-joining tables, and could query the resulting model using WITH RECURSIVE as appropriate would be a real boon.
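I'm not aware of an ORM that does this out of the box, but SQLAlchemy gets you part of the way by hand: a self-referential model plus its recursive cte() support. A rough sketch (the Category model and column names are made up):

    # Self-referential table plus a recursive CTE to fetch all descendants.
    from sqlalchemy import Column, Integer, String, ForeignKey, create_engine, select
    from sqlalchemy.orm import declarative_base, Session

    Base = declarative_base()

    class Category(Base):
        __tablename__ = "category"
        id = Column(Integer, primary_key=True)
        name = Column(String, nullable=False)
        parent_id = Column(Integer, ForeignKey("category.id"), nullable=True)

    def descendants_of(session, root_id):
        # Anchor: the root row; recursive part: join children onto the CTE.
        cte = (
            select(Category.id, Category.name, Category.parent_id)
            .where(Category.id == root_id)
            .cte(name="subtree", recursive=True)
        )
        children = select(Category.id, Category.name, Category.parent_id).join(
            cte, Category.parent_id == cte.c.id
        )
        cte = cte.union_all(children)
        return session.execute(select(cte.c.id, cte.c.name)).all()

    if __name__ == "__main__":
        engine = create_engine("postgresql+psycopg2:///app")  # placeholder DSN
        with Session(engine) as session:
            print(descendants_of(session, root_id=1))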
I don't currently use graph databases in production, but I do have some experience.
I use both, in a similar vein to "using Elasticsearch": the graph could be your primary store, but sometimes it's more pragmatic to have two systems, with a "solid" base store.
This is not to say that it can't be done. What I'm stressing is that larger "changes" are hard to handle - which matters a lot at the start of the process, when you're deciding how to model your data, and less at the end. For instance, changing the node layout (new properties? different types? other constraints?) and mass updates are a bit cumbersome.
Usually I have more than one SQL table (naturally) since the data I've used in graph databases is mix and match (otherwise I'd just use a fixed schema and some relational DB).
-- As for "how that works", for me it's:
routinely update the graph from my base database with queries along the lines of "ID > last ID" (see the sketch below).
This has worked as expected, in terms of what data you get in and which limitations you accept (e.g. timeliness).
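Concretely, that sync loop is roughly this shape; the users table, the :User label, and the high-water-mark handling are all made up for illustration:

    # "ID > last ID" sync: pull new rows from Postgres, push them into Neo4j,
    # remember the high-water mark for the next run.
    import psycopg2
    from neo4j import GraphDatabase

    def sync_new_users(pg_conn, neo4j_driver, last_id):
        with pg_conn.cursor() as cur:
            cur.execute(
                "SELECT id, name FROM users WHERE id > %s ORDER BY id", (last_id,)
            )
            rows = cur.fetchall()

        with neo4j_driver.session() as session:
            for user_id, name in rows:
                session.run(
                    "MERGE (u:User {id: $id}) SET u.name = $name",
                    id=user_id, name=name,
                )

        return rows[-1][0] if rows else last_id  # new high-water mark

    if __name__ == "__main__":
        pg = psycopg2.connect("dbname=app")                      # placeholder
        neo = GraphDatabase.driver("bolt://localhost:7687",
                                   auth=("neo4j", "password"))   # placeholder
        last_id = sync_new_users(pg, neo, last_id=0)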
I'm currently making a shift to running all data in my graph database as I've settled on a model (which edges, which nodes, which properties).
> querying from two different databases is going to slow down my responses
True, but depending on your data (do you know one of the queries beforehand - e.g. is your postgresql query enriching whatever your graph query returns?) you might have success tying (inserting) some of the SQL data into your graph database.
A graph can do what a table can do and a lot more, but that's usually not the whole issue. In practice you need to consider things like speed, volume, scale, consistency, redundancy, computation, ad-hoc vs. planned operations, use of resources (disk, memory, CPU, GPU), etc. And as most NoSQL systems just aren't as mature as their table-based counterparts, you'll also have to factor in your tolerance for issues and general system crankiness. All that being said, some applications just cry out for graphs, particularly apps that involve items linked in pairs. Social apps (people linked by friendships), travel (places linked by flights), communications (people linked by messages), all of these can play hell with an SQL database but are naturals for graph databases.
I agree with the idea that tables are just strict graphs and as such a graph database is usually capable of substituting a relational database. I think many graph DBs lack a sophisticated enough query language to bridge that gap. At Orly (https://github.com/orlyatomics/orly) we're working on a powerful query language, and it's nice to see that Cayley is doing the same.
> querying from two different databases is going to slow down my responses
I think querying 2 different systems tends to be slower, but more importantly you lose transactionality. If you can use a single system that is at least on par with your relational system for your run-of-the-mill data and also gives you a very powerful graph, then that's a big win.
I work at a graph-focused firm, XN Logic, where we use an unhydrated graph to store and analyze the relations, an appropriate store for large volumes of information, and Datomic to store mutations to the graph for history analysis.
I first started with neo4j as a primary data store for our semantic graph but there are some limitations that are forcing us to look for alternatives.
1. Adding edges to a neo4j graph is a painfully slow process. For a large graph with a few million nodes - it'll take days.
2. Scaling neo4j on a cluster is either not possible or a painful process - I've yet to find out which.
However, the greatest advantage that neo4j offers is the ability to query a path. So far, no other graph database that I know of has this ability (including Apache Spark and Giraph).
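For context, the kind of path query meant here is Cypher's shortestPath; from Python (via the official neo4j driver) it looks roughly like this, with the :Person label and :KNOWS relationship type made up:

    # Shortest-path query in Cypher, driven from Python. Labels, relationship
    # type, and property names are illustrative only.
    from neo4j import GraphDatabase

    CYPHER = """
    MATCH (a:Person {name: $source}), (b:Person {name: $target}),
          p = shortestPath((a)-[:KNOWS*..6]-(b))
    RETURN [n IN nodes(p) | n.name] AS path
    """

    def shortest_path(driver, source, target):
        with driver.session() as session:
            record = session.run(CYPHER, {"source": source, "target": target}).single()
            return record["path"] if record else None

    if __name__ == "__main__":
        driver = GraphDatabase.driver("bolt://localhost:7687",
                                      auth=("neo4j", "password"))  # placeholder
        print(shortest_path(driver, "Alice", "Bob"))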
It's quite possible to build a directed graph database as an adjacency list in redis. We tried this and it's super fast and scalable. However, querying is very painful.
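For reference, the adjacency-list approach is roughly the following (the key naming is made up); the pain is that anything beyond a simple breadth-first walk has to be written by hand:

    # Directed graph as Redis sets: one set of successor ids per node.
    from collections import deque
    import redis

    r = redis.Redis()

    def add_edge(src, dst):
        r.sadd(f"adj:{src}", dst)

    def neighbours(node):
        return {m.decode() for m in r.smembers(f"adj:{node}")}

    def reachable_within(start, max_depth):
        # Hand-rolled breadth-first search up to max_depth hops.
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue
            for nxt in neighbours(node):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
        return seen - {start}

    if __name__ == "__main__":
        add_edge("a", "b")
        add_edge("b", "c")
        print(reachable_within("a", 2))  # {'b', 'c'}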
1. Adding nodes and relationships in Neo4j does not have to be slow. It really depends on how you are loading that data in. Neo4j provides many options for data import and a transactional endpoint over HTTP for batching transactions and decreasing disk-write overhead (a rough sketch follows after this list).
2. The reason Neo4j is the only database that allows you to query a path is the same reason that setting up clustering or sharding is difficult. If your graph is complex then the problem is "How do I split up these subgraphs into shards so that traversals don't have to traverse across shards?" -- Building a giant adjacency list and using that as a traversal index is a clever idea, I must admit. :)
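On point 1, a minimal sketch of batching writes through the transactional HTTP endpoint, assuming a 2.x-era server at the default location; the :LINKED relationship and id property are illustrative:

    # Batching many statements per POST means far fewer commits (and disk
    # flushes) than creating one transaction per edge.
    import requests

    ENDPOINT = "http://localhost:7474/db/data/transaction/commit"

    def create_edges(edges, batch_size=1000):
        for i in range(0, len(edges), batch_size):
            statements = [
                {
                    # {src}/{dst} is the 2.x parameter syntax; newer servers use $src/$dst
                    "statement": "MATCH (a {id: {src}}), (b {id: {dst}}) "
                                 "MERGE (a)-[:LINKED]->(b)",
                    "parameters": {"src": src, "dst": dst},
                }
                for src, dst in edges[i:i + batch_size]
            ]
            resp = requests.post(ENDPOINT, json={"statements": statements})
            resp.raise_for_status()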
As someone else said, it's very much dependent on the database engine. Some are faster than others, some scale better than others - it's about picking what's right for your requirements.
If you query the two databases in parallel then the response time should be roughly equal to the slower of the two database responses (not the sum of them).
But if you use two databases then you have to maintain both of them, and if they are on the same server then sharing the same resources could make them slower than just using one db.
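That is, something along these lines, where query_postgres and query_neo4j are placeholders for your real query functions:

    # Firing both queries concurrently: total latency ~= max of the two,
    # not their sum.
    from concurrent.futures import ThreadPoolExecutor

    def query_postgres(user_id):
        ...  # e.g. fetch the profile row

    def query_neo4j(user_id):
        ...  # e.g. fetch friend-of-friend recommendations

    def load_page_data(user_id):
        with ThreadPoolExecutor(max_workers=2) as pool:
            profile_f = pool.submit(query_postgres, user_id)
            graph_f = pool.submit(query_neo4j, user_id)
            return profile_f.result(), graph_f.result()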
We use an RDF datastore (OpenLink Virtuoso, clustered edition) as our primary datastore. We use it in combination with Apache Solr to provide fulltext search over various resources that we extract and pass through an Indexing pipeline to go from RDF Graph -> Search Document.
It's worth noting that Virtuoso (produced by my employer, available in free Open Source and paid Commercial variants, http://virtuoso.openlinksw.com/features-comparison-matrix/) is a hybrid Relational/Graph/XML/FreeText storage and query engine, which natively supports SQL, SPARQL, XPath, XQuery, and many other open standards. It might satisfy the OP's needs on its own.
Virtuoso's support for open standards makes it easy to use it as a complete solution covering all the bases, or, as in @philjohn's case, to plug-and-play with best-in-breed solutions along any axis where our implementation proves not to serve your needs for any reason. (We do want to know how and why we don't measure up, so we can improve that aspect!)