Like every year before, the Postgres team has blessed us with an early Christmas present. And as with every release post before, I'd like to use this opportunity to say thanks to the team for the awesome job they are doing year after year.
It's not just the database itself (and that's awesome in its own right), but it's also all the peripheral stuff: The documentation is seriously amazing and very complete, and the tools that come with the database are really good too (like psql, which I still prefer to the various UIs out there).
Code-wise, I would recommend that anybody have a look at their git repo and the way they write commit messages: They are a pleasure to read and really explain what's going on. If everybody wrote commit messages like this, we'd be in a much better place where code archaeology is concerned.
Patches from the community are always patiently reviewed and, contrary to many other projects, even new contributors are not really required to have thick skin or flame-retardant suits. The only thing required is a lot of patience, as the level of quality required for a patch to go in is very, very high.
Finally, there's #postgresql on Freenode where core developers spend their time patiently helping people in need of support. Some questions could be solved by spending 30 seconds in the (as I said: excellent) manual and some of them point to really obscure issues, but no matter what time it is: Somebody in #postgresql is there to help you.
I think there's no other free software project out there that just gets everything right: Very friendly community, awesome documentation, awesome tools, and of course an awesome product offering to begin with.
Huge thanks to everybody involved.
Also: Huge YAY for jsonb - I have many, many things in mind I can use that for and I have been looking forward to this for a year now.
> we'd be in a much better place where code archaeology is concerned.
This sounds like a great setup for a sci-fi novel. 500 years into the future, the infrastructure their distant ancestors coded has begun to fail. Now Biff Miffington, code-archaeologist, must sift through millions of forgotten messages using a mysterious tool remembered only as "git." Its interface is arcane and the remaining messages broken, tainted by the destructive influence of Mountain Dew and Cheetos. Will he unravel the mystery that's causing Candy Crush Saga MCCXXXI to kill its users?
Edit: On what OP actually said, I'd also like to say that Postgres is an awesome product.
That is, btw, an absolutely fantastic distant-future sci-fi novel. Among the best of all time. (Vinge doesn't consistently produce books that good, but man, that one is great.)
Just bought it because that premise sounds amazing. Amazon has it listed as "Zones of Thought series Book 2". Should I read book 1 before reading this?
Maybe, but I don't think you necessarily need to. I started with A Deepness in the Sky and then read the other two (and then all the other books Vinge has written).
Especially since it is actually a prequel, I don't think the order is that important.
There are things in Deepness that will have more significance if you've read A Fire upon the Deep, but nothing central to the plot. They're both standalone novels, just in the same setting/universe/background.
Testing the waters, I'd played with the idea of a story with basically this setup:
Two new developers start at a company/startup, and are brought in to do a six-month sprint to fix a stalled/broken product after the development team becomes unavailable because reasons. So, they're dropped into a codebase and are trying to pull everything together.
However, as they work through into deeper and deeper parts of the system, the commit messages and comments get more and more cryptic and unsettling, hinting at the reasons for the prior team's dissolution, the business forces that caused that to happen, and maybe something worse/better going on outside.
I like it. It reminds me of something like System Shock (2 is the one I played), where you learn about what happened on this ship before it was overrun by mutants from journals and such. Instead of a book, this may work well as a game of some sort. Perhaps even text-based?
In Vernor Vinge's A Deepness in the Sky, there really are code archaeologists, as it's often simpler to find programs than to write them from scratch.
They believe that the Unix epoch (since extended many times over) dates to the moon landing (approximately correct), since the exact historical explanation has been lost.
When I say "code archeology", I mean the equivalent of "git blame" without assigning blame - just finding out why stuff is the way it is and not who's responsible because that would be totally pointless, especially years after the incident.
Oh yeah, I knew your meaning. Doing blame and bisect to find a deep-seated bug can be a lot of fun... or frustrating and miserable. I tend to write terrible one-word commit messages while I'm developing, but I always do an interactive rebase to clean things up and make a reasonable view of history. We use Phabricator a lot too, which does a great job of letting you annotate code, commits, etc.
Heh, the twist is that he discovers the secret, which is that Candy Crush is really a giant conspiracy to cull the feeble. He decides not to fix anything.
In addition to those peripheral benefits there are also the intangibles. Mainly that worrying about database stability is just not part of my life the way it was five years ago, before I'd fully migrated to PostgreSQL.
Well said, @pilif. I've been using PostgreSQL since the early 6.x days as the back end to web sites. As time went on, I developed an Accounting & ERP software suite in C++ with a thin wrapper around libpq. We've pretty much standardized on PostgreSQL for the things you just said: "friendly community, awesome documentation, awesome tools, and of course an awesome product...."
I literally have a multi-million dollar business & product because of it. And without it, we would be stuck with Oracle or MS-SQL.
And we just recently added full text search within our product using the PostgreSQL full text add-on. My customers absolutely love the feature and they love us because of this.
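For anyone curious what that involves, a minimal sketch of the built-in full text machinery (table and column names are invented for illustration):

    ALTER TABLE products ADD COLUMN search_vec tsvector;

    UPDATE products
       SET search_vec = to_tsvector('english',
             coalesce(name, '') || ' ' || coalesce(description, ''));

    CREATE INDEX ON products USING gin (search_vec);

    -- Match rows against a plain-language query:
    SELECT name
      FROM products
     WHERE search_vec @@ plainto_tsquery('english', 'wireless keyboard');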
+1 for #postgresql help. Recently I had a really weird problem, and after a few hours with forums and manuals I went there to ask for help. They saved me I don't know how many more hours, in which I would almost surely never have found the completely unexpected answer on my own. Thanks!
But more in general, at least in my experience, understanding and optimizing index usage is the most important and most difficult task with PostgreSQL, and improving the documentation about this could really help. Usually indexes work well on their own, but a few times I was really baffled as to why pg wouldn't use an index I had created to optimize some important and slow query, even though using the index (once I found a way to "convince" pg to do so) cut the execution time by a factor of 100 or even 1000.
The easiest improvement that comes to my mind would be to add to the manual a better and more in-depth explanation about how to optimize index usage; this could include a FAQ where you could also include my edge case. This current page could be a starting point: http://www.postgresql.org/docs/9.3/interactive/indexes-exami...
An even better but much more long-term project would be to improve the EXPLAIN ANALYZE command; specifically, it would be great if it could show the different plans considered, making it easy to understand why a specific plan was discarded. Right now, the only way to nudge pg in the right direction is by trial and error. Also, making the explain output easier to understand would help, but I guess that's difficult.
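For others fighting the same battle, the kind of session I end up running looks roughly like this (table, column and query are invented for the example):

    -- See which plan the planner actually chose, with timing and buffer stats:
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM orders WHERE created_at >= now() - interval '7 days';

    -- To check whether the index really would be faster, discourage the
    -- sequential scan for this session only and compare:
    SET enable_seqscan = off;
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM orders WHERE created_at >= now() - interval '7 days';
    RESET enable_seqscan;

If the forced plan wins by a wide margin, stale statistics (ANALYZE) or cost settings like random_page_cost are usually the next things to look at.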
Anyway, thanks again for all the effort of the pg contributors!
Since you're planning on doing some conceptual work with JSONB, just a heads up on one gotcha: duplicate properties are not allowed, i.e.:

    {
      "task": "do stuff",
      "task": "do other stuff"
    }

which is sometimes useful when you have front-end data with an N number of entries but it's a form that serializes to an object instead of an array. There are other use cases too.
> [SHOULD], or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.
Not allowing duplicate keys in JSON objects is very close to the exact opposite of a gotcha. Allowing and round-tripping them would be a gotcha.
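To make it concrete, here's a quick psql check of how the two types treat a duplicated key (9.4 behaviour as I understand it):

    -- json stores the input text verbatim, duplicates and all:
    SELECT '{"task": "do stuff", "task": "do other stuff"}'::json;
    --  {"task": "do stuff", "task": "do other stuff"}

    -- jsonb decomposes the object and keeps only the last value per key:
    SELECT '{"task": "do stuff", "task": "do other stuff"}'::jsonb;
    --  {"task": "do other stuff"}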
Just want to say to the PostgreSQL and EnterpriseDB guys that it's always great to see the progress on this. My hope for 9.5/10 is that we will see PLV8 and replication baked into the actual release.
PLV8 is such a natural fit with the new JSON(B) types that it's probably going to become the most used extension with that data type... and imho it's sorely missing from the out-of-the-box experience. I'm glad that they've concentrated on getting the data structure and storage right first. Hopefully we'll see this in vNext.
As to replication, I understand that this is part of EnterpriseDB's business model; just the same, not having the basic replication pieces baked in still leaves it lacking compared to other databases. Even if the graphical tooling were commercial-only, and all the knob-frobbing via config or command line were more complex, having it in the box is a must imho. I actually really like how MongoDB handles their replica sets, and where RethinkDB is going with this as well. Though they aren't primarily transactional SQL databases, it's a must-have feature these days. Replication with automagic failover is a feature that has gone past enterprise-only.
One last piece would be built-in functions similar to the String.prototype.normalize that was added to JavaScript, so that input strings could be normalized more easily for comparison/indexing, though PLV8 support could/would bring this readily.
All the same, thanks for all of your hard work, and I look forward to the future of PostgreSQL.
Yes, replication with auto failover is far away from enterprise-only today. If you pay $5/month you can get a SQL Database in Azure running on 3 nodes with auto failover. One node is updated synchronously and the second one asynchronously. I wouldn't choose a database today unless setting up something like that is obvious and trivial.
JSONB is getting a lot of attention (and deservedly so) but logical decoding is much more exciting to me. Being able to capture postgres transactions to put into a durable log (like Kafka) for downstream processing is a fundamental tool needed to build a unified logging architecture. If you've worked with hadoop you've probably tried to approximate this by hand by taking regular snapshots of your database or something, but this is much, much more sane. Really exciting. Great work postgres team!
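For anyone who wants to poke at it, the SQL-level interface is already enough to experiment with - a sketch using the bundled test_decoding plugin, assuming wal_level = logical and a free replication slot in the config:

    -- Create a logical replication slot with the example output plugin:
    SELECT * FROM pg_create_logical_replication_slot('hn_demo_slot', 'test_decoding');

    -- ... run some ordinary transactions against your tables ...

    -- Peek at the decoded changes without consuming them:
    SELECT * FROM pg_logical_slot_peek_changes('hn_demo_slot', NULL, NULL);

    -- Pull (and consume) the changes; a real pipeline would push these into
    -- something durable like Kafka:
    SELECT * FROM pg_logical_slot_get_changes('hn_demo_slot', NULL, NULL);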
Wow, that's fantastic. My queries are littered with that type of construct. Unfortunately, we are using SQL Server and Microsoft hardly ever makes any developer-friendly enhancements to T-SQL.
True enough... I can't even begin to wrap my head around the XML wrangling functions in T-SQL, I feel like a n00b copying JS methods for jQuery when I go there. That said, it's probably one of the easiest RDBMS servers to administrate for small-medium sized deployments.
Switched over to PostgreSQL for a personal project for the json datatype. Great if you want some mongo-esque document storage without losing out on having a relational database.
That info is for PostgreSQL 9.3 (and 9.2). 9.4 comes with other enhancements if you're using json. jsonb in particular opens up whole new possibilities if you're doing heavier work with json data.
Next year I really need to switch from MSSQL to Postgres. The work that the Postgres team have done in the last 2-3 years is really amazing.
They are also clearly reaping the benefits of some very smart architectural decisions, and that gives me the confidence that they will be able to continue innovating in the coming years.
There is actually a lot of overlap in the feature set. When I moved from MSSQL to a gig that was all Postgres it was fairly smooth to do so, the main differences being OS related. Thus you don't need to switch over entirely to add a useful tool to the toolbox.
MSSQL has some advantages in parallel query execution and data warehousing enhancements - if you can afford the licensing for the latter. The other view on that is you can spin up as many Postgres production instances as you have hardware for without impacting the software budget.
That's close to what I'm doing, except running FreeBSD 10.1 in Hyper-V. It's actually pretty easy to set up since FBSD 10.1 is available for download as a VHD. Just need to specify the VHD image file as the disk to attach when creating a new HV VM.
It's somewhat more fun/challenging to configure network access for the VM. Once that's done it's very handy for developing web servers/apps. With the web server running in the VM, the client/browser on the host points to the VM just like any remote site. PostgreSQL in the VM performs quite well.
IMO this kind of development is much more satisfying in a unix environment than under Windows, and Hyper-V provides a convenient way to get there.
It looks like PostgreSQL is on track to slowly succeed MySQL as the de-facto open source database.
Microsoft tentatively seems to be settling on them as the preferred RDBMS for non-Windows platforms [1]:
> Within ASP.NET 5 our primary focus is on SQL Server, and then PostgreSQL to support the standard Mac/Linux environment.
I use EF+SQL Server and they're very much complementary and provide an excellent developer experience. NHibernate+SQL Server is woeful unless you want to use the loosely-typed Criteria stuff. NH's LINQ provider is terrible and it gets confused at the drop of a hat (call Distinct and then OrderBy? "I'm sorry Dave, I'm afraid I can't do that"). At this point I'm convinced only MS know how to write LINQ providers that won't fall over the moment you try to do something useful with them.
Microsoft writing a LINQ provider for PgSql is a great thing for running .NET code on non-Windows platforms.
> Microsoft writing a LINQ provider for PgSql is a great thing for running .NET code on non-Windows platforms.
I've heard nothing of this kind with respect to Npgsql, which is the current .NET provider for PostgreSQL. Microsoft has never spent a minute on EF support in Npgsql, so that they bother now is a first. The thing is that to support EF in Npgsql (or any other ADO.NET provider), the ADO.NET provider has to contain a command interpreter which interprets the command trees coming from EF's linq provider, and which are then used to create SQL statements. This isn't simple at all, and as the command trees change with EF7, it will be a struggle for MS to get a lot of ADO.NET providers to support EF7 at the start.
Microsoft's only great linq provider is the one in Linq to Sql: it is able to handle a tremendous amount of edge cases. The thing with linq providers is that a general linq provider only gets you so far: a tremendous number of cases are 'special cases' which have to get their own path from the expression tree to specific SQL. E.g.: ctx.A.Select(a=>a.B.Cs);. This gives a set of sets of C instances. To do this, you have to know at the materialization side which C rows belong to which set (as you have to group them by B, which isn't in the projection). Linq to Sql has a specific piece of code for this; it produces a specifically crafted SQL query which contains an extra column so the materializer can know which C rows to group together. EF doesn't: it obtains a big joined soup.
The irony is that EF7's linq provider will be built on Relinq, which is also the base of NHibernate's linq provider, and they didn't re-use the Linq to Sql linq provider, which is kind of odd, considering Linq to Sql's is pretty db-agnostic.
Writing a linq provider isn't simple btw. It took me a full year full time to write the one for LLBLGen Pro.
> At this point I'm convinced only MS know how to write LINQ providers that won't fall over the moment you try to do something useful with them.
I would argue Revenj + Postgres provide a much better developer experience. But as you said, it's not written by Microsoft, so that attitude doesn't help it.
I'm 100% sure Microsoft can't write LINQ provider which actually understands Postgres and can use it to the fullest (as Revenj can).
The problem with Revenj is that it seems to be tied to dsl-platform, which looks like an online service. People would rather have their compiler tools with them. You don't want your code investments to go 'poof' should the online service fail.
I'm really pumped about the update to GIN indexes, and the ability to concurrently refresh materialized views. Both enhancements are huge for the postgres ecosystem, and especially for productionizing postgres databases.
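For reference, a rough sketch of the matview side of that (names invented; the unique index is what makes the concurrent refresh possible):

    CREATE MATERIALIZED VIEW daily_sales AS
      SELECT date_trunc('day', created_at) AS day, sum(amount) AS total
        FROM orders
       GROUP BY 1;

    CREATE UNIQUE INDEX ON daily_sales (day);

    -- New in 9.4: rebuild the view without locking out readers for the duration:
    REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;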
Postgres is about to be the new hotness. I mentioned to our hosting provider that we were looking into moving our in-house Oracle and MySQL to Postgres (off Oracle because it's expensive, off MySQL 'cos it's shit) and he said more than a few customers were looking into this precise move.
We're just getting into PG now, and it's just really nice to set up and use. I really wish more web stuff properly supported PG and didn't pretty much require MySQL.
I'm not familiar enough with MongoDB (or Postgres 9.4) to really answer the original question. My guess is that Mongo will still be applicable to certain use-cases, but—like you mention—the majority of users will be those who really don't understand the technologies and their strengths/weaknesses.
Only someone with a poor understanding of the industry would think that. Mongo is killing it in the enterprise, and their partnerships with Teradata et al are only going to cement their position for likely decades to come.
Plus MongoDB's pluggable engine approach will definitely breathe some new life into it.
This is exactly right. I can't believe how snide and ignorant a few of the commenters are (except I can, because it's the same folks with the same FUD every time). MongoDB isn't standing still and it is getting better and better with every release. Not to mention there are hundreds of massive projects running on MongoDB with success.
Something about MongoDB really drives the Postgres community (and certain NoSQL DB fans) nuts, and I'm guessing it's that MongoDB is eating their lunch, growing faster than them, gaining popularity faster, etc. Keep in mind MongoDB is many years younger too and is maturing more quickly now. Plus, developers absolutely love working with it.
It's a tool like any other and still has issues but I'm afraid some here are dancing on its grave well before it has even shown signs of letting up on its growth.
People here have a problem with it because it is a technically poor solution, even among nosql databases. The main problems are:
1. Historically mongodb was distributed with completely unsafe defaults. It was insane to use it with any data you actually cared about. Once you toggle on the safety features most of the vaunted performance goes away.
2. It doesn't actually scale that well despite claims that it does.
The fact that there are better nosql solutions just makes people further annoyed. It's basically the cargo-cult behaviour of mongodb proponents that people don't like.
1. So then the problem they had was they didn't get the performance they expected since the benchmarks were done with different settings to juice the numbers? There is no upside to arguing this one, Mongo loses one way or the other. Can't play them both.
2. I haven't seen anyone say that Mongo scales in an unlimited fashion. You said that. I think what they might be saying is that it doesn't scale well enough to pay for the trade-offs from using it. If you aren't running a system that can be composed of somewhat-interrelated documents, you're gonna have a bad time.
I like Mongo. And until I really learned about it, I got bitten by the default a few times. Personally I use REPLICA_ACKNOWLEDGED when running in a cluster and FSYNC when writing to a single node.
2 is wrong. Mongos shits all over itself after only a few nodes and moderate traffic. Its balancing is incredibly broken and will kill a small cluster very quickly.
"This just in, Mongo is still in the hype cycle in the slow-moving enterprise space! Obviously after having handily crushed the todo list and twitter clone verticals, they want to expand their influence into mega corporations' dev shops who are tired of dealing with shitty Oracle DBAs. Back to you, Chuck."
It might go against the "no transaction" crowd, but seems useful for performance-critical needs. I'm scheduling a bit of testing time with it next week to see if it's something I'd roll out in production (Maria 10 system)
> In 9.4, Logical Decoding supplies a new API for reading, filtering and manipulating the PostgreSQL replication stream. This interface is the foundation for new replication tools, such as Bi-Directional Replication, which supports the creation of multi-master PostgreSQL clusters. Other improvements in the replication system, such as replication slots and time-delayed replicas, improve management and utility of replica servers.
I think the problem is clustering is still very much a duct-tape situation in postgresql with no real clear consensus on how to build out a cluster.
Postgres-XL looks great for scale out, but you need 4 independent types of servers. Even with all those moving parts, it doesn't provide availability. If you want fail-over, you need pacemaker for the data nodes with traditional sync replication, and something like VRRP for your balancer, and something else to failover the coordinator. Several of these pieces can be tricky to set up in a cloud provider.
BDR looks nice, but it looks like there could be lots of gotchas for consistency in there. Maybe it is a magic bullet though... I don't know much about it yet.
Contrast with something like rethinkdb, mysql-galera, cassandra, etc, you start up enough nodes for quorum, tell them about each other, and you're pretty much done. The clients can handle the balancing, or you can use a pooler/balancer.
In my perfect world, I'd install postgresql-awesome-cluster-edition on 3 nodes, add the 3 IPs (or turn on multicast discovery, if my env can support it), and away we go for read scalability and availability. I do this today for mysql-galera, and other than the fact it's mysql, it's awesome. For writes, if you add 4 or more nodes, there should be some sort of shard system like XL has.
That said, postgresql is still clearly the best SQL and even noSQL single node server out there, it's a really great piece of software.
If you honestly believe that all you have to do is stand up a bunch of instances of Mongo/Cassandra/whatever and you instantly get acceptable HA, then you need to read the [Jepsen series](https://aphyr.com/tags/jepsen)
It depends on what you consider "acceptable HA". There are many instances where I'm not trying to protect from a network partition (single data center, monitored batch data loads, etc) and don't have a requirement for that level of tolerance. However, you're right in that it's important to know that nearly every distributed system has edge cases where things might not appear as you thought. Elasticsearch has a section on their Website detailing their resiliency efforts. I wish every company was as transparent about what they're doing on that front so we can all plan and consider expectations better.
I do think (for once) PostgreSQL is addressing its core weakness and by version 10 will likely have horizontal scalability locked down. The new API is a really positive step.
Because multi-master clustering is extremely hard to do when you need transactional semantics. For those that need it, scaling up is by far the easiest path. Most of us will never exceed the growth rate of server capacity over time, so you can just keep going up. HA using stand-bys and failover is well documented, and quite sufficient I might add.
If you don't need transactional semantics and you do need globally distributed multi-master key/value storage, I would not switch from Cassandra to PG, you already have a good solution for that.
If you don't need transactional semantics and you also do not need globally distributed multi-master key/value storage, use whatever you want, it doesn't really matter.
Cassandra is one of the better ones out there, but you have to deal with its data model and weird consistency promises (which however weird you think they are, are weirder)
The correct way to cluster also changes dramatically depending on your use case. Sure there are things like RAC that promise to make it just work, but those don't scale more than a few nodes.
Mongo is kind of the worst in this - it clusters in one weird way, has bad tooling, and subtly destroys your data at scale.
The general philosophy with postgres is to do it right, or not do it. There are ways to do specific kinds of clustering, but all of them (just like mongo, oracle, etc) have a lot of nuances to them.
If you have a natural shard key, use a bunch of schemas and table inheritance, and eat the downtime during re-shards. Check out citus as well. They have their issues, but they can help you hook up what you need.
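For illustration, the inheritance flavour of that looks something like the following (shard ranges and names invented for the example):

    CREATE TABLE events (id bigserial, tenant_id int, payload jsonb);

    -- One child table per shard, distinguished by a CHECK constraint:
    CREATE TABLE events_shard_1 (CHECK (tenant_id BETWEEN 1    AND 1000)) INHERITS (events);
    CREATE TABLE events_shard_2 (CHECK (tenant_id BETWEEN 1001 AND 2000)) INHERITS (events);

    -- Inserts need a trigger (or application-side routing) to land in the
    -- right child; with constraint_exclusion enabled, queries filtering on
    -- tenant_id only touch the children whose CHECKs can match.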
I take it you haven't actually used Cassandra much in the last few years. Its data model is almost identical to a typical relational one and its consistency promises are quite clear:
And I've scaled Cassandra clusters from 1 to 100 nodes in hours with no issues. It really is quite simple. Likewise have had no issues with MongoDB replica sets. It is definitely not "really, really hard".
> postgres is to do it right, or not do it
What a pathetic cop-out. PostgreSQL has been around for decades; they've had plenty of time to have a proven, stable solution implemented.
Cassandra avoids some of the really visible issues by being AP instead of CP. Hbase hits them, but dodges a bit by only having row level consistency. They are solving very different problems.
The lack of vector clocks in Cassandra can lead to some very non-intuitive (possible wrong) behavior - check out their counter implementation for some rage on that. It's pretty well made though, and I think C*, Hbase and Postgres all have great uses (along with Redis, and a lot of others)
Mongo tends to get things subtly wrong in ways that corrupt data, or that don't scale, and it gives up both A and C.
> A rollback reverts write operations on a former primary when the member rejoins its replica set after a failover. A rollback is necessary only if the primary had accepted write operations that the secondaries had not successfully replicated before the primary stepped down. When the primary rejoins the set as a secondary, it reverts, or “rolls back,” its write operations to maintain database consistency with the other members.
Yeah, it's worse than it sounds though. If you write to 2/3 nodes, and the third gets elected, you can roll back majority confirmed writes. I think it did even worse things when near capacity, but those are harder to pin down.
So the "majority" write concern is not safe. However it seems that the number of nodes that must confirm a write can be configured http://docs.mongodb.org/manual/core/replica-set-write-concer...
From what I read there it seems that if you want to get a confirmation from all the nodes and one gets offline you'll be waiting forever (but there is wtimeout). I'm afraid that any db cluster with any technology has similar problems.
Depending on your exact requirements, you can use pg_shard [1] for real-time inserts/update/delete, or CitusDB [2] for scaling out selects, or combine them.
That is entirely wrong. It might suit your use case (having a bunch of monkeys running around fixing a hack job) but I'd rather use a tool which can expand and reduce capacity when I want, which is quite often. It saves me money on cloud hosting to only provision what I need when I need it and be able to expand and contract it at will, easily. Your solution fails at this.
This is not my solution (and I don't know much about monkeys, sorry).
This is a cost-effective way to be always maxed out (expanded, as you say) at a fixed price, and it can scale well beyond the needs of almost every business.
And, as I said, if you happen to be the next Facebook, Uber, Airbnb or whatever, you will acquire the know-how to scale.
Great news! I'd love to move over to this from MongoDB for a project that has high uptime requirements. But while I think the JSON support will really replace it, does PG have a solution for High Availability (like replica sets) in the works?
I'm newer to Postgres so am not sure. Replica Sets are the killer feature for me, more so than just storing JSON documents. I'd appreciate if someone can chime in. I've done some googling but there seem to be multiple strategies for replication.
Every time I look into tools like this they are quite far behind, for example, AlwaysOn in MSSQL. For example, with PG I have to reseed the original master if it comes back online after an outage. As far as I can tell it's not fully automatic and transparent to me as the guy responsible for managing it. With SQL Server and Elasticsearch+ZooKeeper, for example, nodes can go up and down without anyone noticing and without me having to do anything. Is this still a place where PG is behind?
Yes and no. There are differences in the data guarantees that something like Elasticsearch and PostgreSQL give, so it's not really appropriate to compare them directly.
MS SQL does have easier tooling for replication. The setup for postgresql is complex and it doesn't come with out-of-the-box tools to easily manage failover and recovery, as you mention. Progress is being made in making this easier, but it's still mostly in the low-level functionality:
Hey there. I gave a talk at PGConfNYC '14 earlier this year about our solution to HA using open source solutions. The animations and stuff don't really work too well on Slideshare, but it should point you in the right direction:
pgSQL doesn't have anything built in for fencing, failing-over, etc. by default, but by using stuff like Pacemaker, you can get the job done with a little elbow grease.
There are some open-source options, and several closed source options for this... parts are closer to baked in with 9.4, but being realistic, you need to defer to a commercial option.
EnterpriseDB pricing for this isn't too bad (about $7k/cpu-socket/year), which is a lot less than MS-SQL, DB2 or Oracle for most uses... but it's imho a feature that should be in the box.
> Allow views to be automatically updated even if they contain some non-updatable columns
> Allow control over whether INSERTs and UPDATEs can add rows to an auto-updatable view that would not appear in the view. This is controlled with the new CREATE VIEW clause WITH CHECK OPTION.
> Allow security barrier views to be automatically updatable
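A tiny sketch of what that new WITH CHECK OPTION buys you (table and view names invented for illustration):

    CREATE TABLE tickets (id serial PRIMARY KEY, status text NOT NULL);

    CREATE VIEW open_tickets AS
      SELECT id, status FROM tickets WHERE status = 'open'
      WITH CHECK OPTION;

    INSERT INTO open_tickets (status) VALUES ('open');    -- fine
    INSERT INTO open_tickets (status) VALUES ('closed');  -- rejected: the new row
                                                          -- would not be visible
                                                          -- through the view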
What's a good place to suggest a postgresql [json] improvement? My message was intercepted when posting to pg-performance. A major one at the moment is that OFFSET does the select projection on discarded rows (it's common to use OFFSET in paging). Under normal circumstances this isn't a problem, but when doing a json operation such as reading a field with ->>, this causes major performance degradation. Note that, of course, this optimization can only apply to immutable functions. There are several workarounds, but if PostgreSQL wants to win back some nosql heads it should be straightforward.
In addition, updating a json field isn't straightforward; these operations should be supported by first-class built-in functions.
It's getting close, but it's not quite a nosql killer yet if they are targeting people who didn't originally come from an rdbms background.
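In the meantime, one workaround that tends to help is to page on the raw rows first and apply the json extraction only to the rows you actually keep, e.g.:

    SELECT id, doc ->> 'title' AS title
      FROM (SELECT id, doc
              FROM documents          -- hypothetical table with a jsonb column "doc"
             ORDER BY id
            OFFSET 10000 LIMIT 50) AS page;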
Are there any good books coming out that cover Postgres 9.4? I know the docs are okay but I want something with more of a narrative structure as my history with Postgres is spotty. The only one I've found so far is O'Reilly's "PostgreSQL: Up and Running, 2nd Edition" coming out this month but would prefer a personal rec.
I have O'Reilly's "PostgreSQL: Up and Running, 2nd Edition" (you can buy the ebook already) and it's mostly about setting up and administering postgres, with a look at the tools around postgres. It's not a book that goes in depth on how to actually write SQL for postgres (although it has a section on this) or develop applications with postgres. It's more of an ops book. Anyway, I like it because coming from MySQL I was unfamiliar with how to set up and use postgres and this book set me straight.
Thanks to Safari, I'm now several chapters into the book and it's perfect for me - thanks :) It was the ops side of things that was the biggest sticking point for me, although it seems like it'll dig into the JSONB stuff later on too.
percept is winding me up ;-) Cooper Press is my company and I wrote a book called Beginning Ruby many moons ago. I wrote the first edition while learning the details of Ruby myself and I'd consider writing a Postgres book under similar circumstances if only I had the time! ;-)
Is attribute order stable? Obviously, order is not preserved, but if the order changes on subsequent accesses, this causes problems if you ever serve content directly from a jsonb field without sorting the attributes manually.
Why would this cause problems? Order is irrelevant in JSON. "An object is an unordered set of name/value pairs."[0] From a JSON standpoint there is absolutely no difference between {"name": "Patrick", "age": 24} and {"age": 24, "name": "Patrick"}.
Sometimes it matters whether serialization to a string is a deterministic function. If the JSON is the same, you want the output to be the same. A JSON parser won't care, but it's useful to do a string comparison without parsing (for example when diffing output).
But that's not really a correct thing to expect is it? If an object is an unordered set of name/value pairs then multiple serialized versions of the object may indeed be the same object data.
AFAIK the default serializer does not guarantee the order to be stable, but I haven't participated in the development of this feature that closely. So maybe I'm wrong.
Anyway, it should be possible to write a simple serializer on your own - a trivial PL/V8 function should suffice, I guess. And then you can define a CAST using that function (but generally adding casts is a bit dangerous, as it may have unexpected consequences).
In theory, correct. In practice, here are the two use cases which that approach breaks:
1. Cache control using etags. If the content changes by a single byte, even if semantically identical, the etag should change. Hence '{"foo":1,"bar":2}' is not equivalent to '{"bar":2,"foo":1}'. I can see serving such information directly, or embedding it into a larger JSON response.
2. Committed JSON files. This is an anti-pattern, but one I've seen many times. In one case, it was a translation file that was generated in a separate project. When I joined, it had been like that for more than 5 years. It was like that when I left, although I at least monkey patched REXML to emit sorted attributes in XML (also unordered in theory).
So while in practice I don't depend on a specific ordering, I do need the order (whatever it is) to be stable. For all I care, it could be sorted by the cryptographic hash of the keys, so long as it is consistent.
Just to gather what others have said here: JSONB appears to store fields sorted lexicographically for binary search purposes, and emits them in that order as well. This could be undocumented behavior, yet important for the above two use cases.
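A quick check makes that normalization visible (treat the exact ordering as an implementation detail rather than a documented guarantee):

    SELECT '{"name": "Patrick", "age": 24}'::json;
    --  {"name": "Patrick", "age": 24}         (text preserved exactly as entered)

    SELECT '{"name": "Patrick", "age": 24}'::jsonb;
    --  {"age": 24, "name": "Patrick"}         (keys come back reordered)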
For the former, you really need to sort the keys. If you ever end up passing the JSON through any other encode/decode step, your ETags will break and it won't be obvious why.
For hand-written JSON, I'm not sure there's a good solution besides modding the users' editors to sort for them.
I am pretty happy about the addition of ALTER SYSTEM. I haven't tried it yet, but I think it will make automatic failover to a standby easier to implement. Does anyone have experience with this?
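For reference, a minimal sketch of how the new command is used - settings land in postgresql.auto.conf instead of a hand-edited postgresql.conf, which is what makes it attractive for failover tooling:

    -- Written to postgresql.auto.conf, overriding postgresql.conf:
    ALTER SYSTEM SET synchronous_standby_names = 'standby1';

    -- Apply settings that don't require a restart:
    SELECT pg_reload_conf();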
Even though Rails hides it from me, I honestly don't mind directly working with the SQL interface on this db, and its language interfaces are awesome. Thanks, team.
Not trying to be mean but, why wait? If you're stuck on AWS it's not impossible (though difficult and annoying certainly) to get a decently-performant Postgres instance going. The experience setting it up will lessen the vendor-lock that Amazon has on you (convenience always has a price).
...in exchange for a number of different problems including (in no particular order): severe vendor lock-in, mediocre to acceptable performance, and relatively high cost per unit of performance.
I really don't get all this yadda yadda about AWS lock-in.
Which component on AWS doesn't have an open-source counterpart that you, given the time and knowledge (;P), could implement on your own infrastructure?
Really... It's an honest question from someone who works full time as a senior AWS architect.
Imagine you're an early stage startup. You have a handful of overworked, stressed-out engineers trying to ship v1.0. You don't have the time or resources for a dedicated Ops team, so you go with AWS (I think this is a great use case for AWS or any public cloud, btw--it just makes sense at a really early stage startup).
You ship your product, get some customers. Fast forward 18-24 months. You have grown enormously, you have lots of customers and revenue projections and expectant investors.
Your (now much larger) engineering team has gotten used to AWS conveniences and leveraged a lot of them in the development workflow. Your architecture consists of several layers of ELBs, you have painstakingly set up autoscaling and deployment via Elastic Beanstalk. Your server backups are AMIs. Your data lives in RDS (analytics in Red Shift). Your webservers use Elasticache. SQS is the backbone of your asynchronous job workflow. Customer email traffic goes through SES. Etc.
Now your AWS bill is something like $25-30k/month. You're at the point where the pricing differential between real hardware and AWS is getting big, and it will only get worse over time.
Now management has to make the hard choice:
- Build an Ops team capable of architecting, constructing and deploying new infrastructure on bare metal
- 3-6 months of work to go from design to final cut-over
- Risk some business interruption in the changeover (downtime, unexpected problems) and slower development iteration until the engineering team gets used to new tools/ways of pushing code out.
or
- Continue on AWS and eat the bill as cost of doing business.
Ultimately this is a good problem to have, no? Your startup is alive and growing, you can afford an ops team, your product proved itself. Most startups will have been dead for years at this point.
> Now your AWS bill is something like $25-30k/month. You're at the point where the pricing differential between real hardware and AWS is getting big, and it will only get worse over time.
Until you're at the point where it's on par with hiring a dedicated Ops team, $25-30K/month will really feel like a drop in the bucket.
This is why so many of us are happy to accept our AWS overlords.
We've been running mostly VMs all this time on our app (Windows, SQL Server, Linux w/ Rails), but have been using more and more ACTUAL AWS services (SQS, DynamoDB, Redshift).
I can maybe see where those latter amazon specific services are a lock-in. But I still don't see it that way, really. They aren't a lock-in from the perspective of the code if you put everything behind service interfaces with different implementations. The big guys like Azure and Google Compute all have similar services for DB as a service, blob storage, VMs, etc.
But the act of migrating to something like Azure is just generally a lot of work in ensuring 0 downtime and a smooth transition. That's hard no matter what technologies we use.
And we've talked about it. But just the act of moving is, in my opinion, a multi-week migration process with load tests etc.
And the cost savings would have to outweigh the amount of engineering costs of moving.
All this to say, the real lock-in is lack of dissatisfaction in our situation. "Amazon works well enough for us."
Different problems, but vendor lock-in isn't really one of them imho. If you are skilled enough to setup pg instead of using RDS you can always move off it to systems under your control.
If you don't have the skills, you can migrate to another provider like Heroku.
As far as I can tell from my experience, they aren't extending PG in incompatible ways.
I don't personally get the RDS value proposition, but how does it lead to vendor lock-in, much less severe vendor lock-in? I don't believe there is really any lock-in whatsoever -- backup your database and move it wherever you want, whether you're running the pgsql, Mysql, or SQL Server variants of RDS.
While of course you're correct, it's a bit like saying "you can just pack up your code and run it somewhere else". If the entire architecture and workflow are based around proprietary AWS tools (Elastic Beanstalk, Elastic load balancers, RDS, Cloud Formation, autoscaling, etc) even if your code is technically portable, you're still suffering from vendor lock-in. Arguably worse than the bad old days of the Microsoft monopoly.
RDS is just a database though. All the standard tools work with it. Unless you start dipping into more exotic services, it's pretty hard to become locked in. EC2+ELB+RDS are fairly independent. Even autoscaling doesn't lock you in.
That said, CloudFormation, SQS, Kinesis, DynamoDB, etc. are all very sticky services.
Where you see 'vendor lock-in', I see the opportunity to leverage somebody else's engineering talent to solve commodity problems that our customers don't pay for.
That's a bit of an extrapolation though, isn't it? Many people simply run VMs that connect to some arbitrary pgsql instance, which might happen to be on RDS if they want to simplify the management of that.
Those VMs can absolutely run almost anywhere, and they can connect to just about any instance of pgsql. I heavily use many services on AWS, but can quite literally move the entirety of it to my own servers, Google, Azure, or elsewhere in very short notice. People can choose to use some of the more unique services, but that is not related to RDS.
Outside observer: Man, I figure setting up a "decently performing" Postgres instance on AWS would be such a common thing to do that it would either be known to be impossible, or have cookie-cutter instructions, if not a script. How is it that it's "difficult and annoying" still?
It isn't "difficult and annoying" anymore. It used to be, before SSD-backed EBS and/or provisioned IOPS, because you had to RAID0 together a dozen or so magnetic EBS volumes to get decent disk performance and then deal with the annoyance of sorting out a way to take consistent snapshots of the RAID array for backups.
Now you can just toss a single 1TB SSD backed EBS volume on an instance and get ~3k iops, or use provisioned iops to get almost any performance level you need.
Regardless of whether they're backed by SSD, all EBS volumes on an instance sit behind a 1 Gbps pipe (except for the more exotic and expensive instance types). That's part of the reason why Amazon talks about IOPs instead of raw disk bandwidth.
Go ahead and run:
$ sudo du -hs /*
..on a vanilla m3.* instance and run iotop in a different session. You'll see bandwidth numbers that 2002 would be embarrassed about.
Bandwidth available for EBS volumes varies. For instance, with EBS-optimized volumes it can be 500Mbps, 1Gbps, 2Gbps, or 10Gbps, depending on the instance type, as shown in this chart from an Amazon presentation:
The "NA" of the 8xlarge instance types is because there's no ebs-optimization as an optional feature on that instance type; you automatically get access to 10Gbps.
The SSD EBS claims a default performance of 3 IOPS/GB with a burst of up to 3000 IOPS with a 99% consistency. One of my larger instances uses SSD EBS and so far I am happy with it. In particular, I was very impressed how much faster updating Ubuntu was when I first booted the instance. I am planning to move my Postgres RDS instance to SSD EBS next week. I think (I hope) it will fix a particular stall my application experiences sometimes.
It's not just the setup. For smaller companies with tiny ops teams (or a developer that also serves as an ops guy), RDS can save a bunch of time in maintenance/security/upgrade land. You are going to get decent to good performance for most general cases, and you can easily create failovers and scale up/down without much of any Postgres knowledge. Especially when your product is in "get traction and survive" mode, it's nice to be able to worry more about your app and less about things like running a Postgres server or cluster.
Yes, you'll wait a bit longer for new versions to come out, but that may or may not be a big deal. It isn't for us. We have the in-house knowledge to run our own cluster, but RDS does such a good job that we don't need to.
That really depends on what version you're using now, and what exactly you mean by scalability.
If you're using 9.1 or older, you may see a significant improvement in OLTP workloads on many-core machines (making it linearly scalable to >64 CPUs). This happened in 9.2 (i.e. ~2 years ago).
The main improvement in 9.4 I'm aware of is the GIN fastscan, which significantly improves performance of applications using GIN indexes (e.g. full-text).
Of course, there are many other performance improvements in various places - the principle is not to make the new version slower.
We currently use the 9.4 beta, and it's been rock solid for us. We chose it because of the jsonb data type. JSONB has been a great fit for the type of work that we needed it to do.
I just recently began using PostgreSQL, albeit an older version. Does this 9.4 mean that MongoDB is now pretty much a dud? Being able to store, manipulate, and query JSON data AND have SQL on an established wheel that has been proven reliable and polished far longer than the age of most other databases?
Are there any code examples (preferably Python) that show how to use JSONB? I'd love to see some examples on how to query every record that contains a key in a json, or order rows based on a value in a json object.
off topic: If Meteor.js implements PostgreSQL 9.4 I would seriously consider using it again. That and maybe make DDP scalable.
I'm not a big fan of MongoDB, but I don't think the introduction of JSONB in PostgreSQL 9.4 makes it a dud (which does not mean MongoDB is not a dud for other reasons).
JSONB allows you to do a lot of things that people are often doing with MongoDB (or document databases in general), but there are still some features not available in PostgreSQL. Built-in sharding, for example. There are external tools to do that with PostgreSQL, and I do have my doubts about MongoDB (partially because I only hear about the horror stories), and I expect similar features in PostgreSQL 9.5 / 9.6, but at the moment it's not there.
Not sure what you mean by Python examples - you can either fetch the data as 'text' and convert it in the application (e.g. json.loads) or just use psycopg2 with an adapter (http://initd.org/psycopg/docs/extras.html) and you'll get the data as Python dictionaries.
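On the SQL side, the queries asked about look roughly like this, assuming a hypothetical table events with a jsonb column payload:

    -- Every record whose payload contains the key 'user_id':
    SELECT * FROM events WHERE payload ? 'user_id';

    -- Records matching a key/value pair (containment):
    SELECT * FROM events WHERE payload @> '{"status": "active"}';

    -- Order rows by a value inside the object (compared as text):
    SELECT * FROM events ORDER BY payload ->> 'created_at';

    -- A GIN index on the column speeds up the ? and @> operators:
    CREATE INDEX ON events USING gin (payload);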
Just because you haven't used it doesn't put the adoption of postgres on the same level as a still-experimental programming language. It's a solid open source solution in the run-of-the-mill domain of RDBMS. It has a weird level of hype these days in certain pockets of the internet, as mysql's heyday is long gone and the whole "NoSQL" fervor is gone and now understood to have been oversold.
I've worked on projects that were using it a decade ago. To this day, at a different job, I'm still building on it: shipping it in our products to customers, building internal tools on it, and we've purchased a handful of self-hosted applications that happen to run on it.
You're generalizing from your own experience and assuming that's everyone's experience. Availability of dump-and-load tools to move from mysql doesn't really matter that much. MySQL isn't poison just because there is a new hotness. Keeping old shit on mysql and using postgres for new things is a much more common migration strategy.
Having moved from Mysql to Postgresql in 2005 and continued with it up to the present, I'd say it's a superior database for OLTP, enterprise workloads. Mysql's traditional mode with a transactional engine can work too; though, I've found Postgresql to be a solid, consistent performer for my use cases. I've also used it at home after running into trouble with Sqlite: https://paulrrogers.com/2014/10/self-hosting-owncloud-review... .
Postgresql's excellent documentation made the move much more practical than it would have been using a clearly less documented tool. (One such as MonetDB circa 2005: http://www.monetdb.org/ .)
Similarity to standard SQL and Oracle is also nice if one happens to work in a heavily regulated industry where one may have to migrate to certified platforms.
We certainly are. We're operating an analytical service on PostgreSQL - tens of TBs of data, hundreds of machines, tens of thousands of clients. Initially it was running on MySQL, but for various reasons we migrated to PostgreSQL ~3 years ago and never regretted it.
The reasons were both technical (better performance with this kind of workload, great reliability, excellent code quality, ...) and political (we have contributed numerous patches to PostgreSQL - not sure if you ever tried to do that with MySQL).
If you think there's a simple MySQL -> PostgreSQL migration tool (or a migration between arbitrary databases), you're foolish. Databases are not that interchangeable - every database has some issues with specific workarounds, and the 'good bits' are database-specific too. And those things are anchored in the application code, so if you expect a simple migration tool, you'll be disappointed.
Both Postgres and Rust are grounded in solid, desirable concepts. That's why people like them.
Of course, they are on two sides of the maturity spectrum. Postgres is used by lots of people and we're happy it works so well. Rust is not even released yet, but looks very promising.