Hi, Author here! Pull requests are our unit of work, and the queue was created t...

evfanknitram · on Nov 20, 2019

Is anyone "signing off" on the deploys or is it fully automatic? I can't really imagine it being manual 40 times per day, but just wanted to hear.

How do you handle the scenario where a developer pushes a `send_me_all_the_credit_card_details()` function to the code base that does something 'evil'? Do you rely on the reviewer "doing their work properly" to handle that?

I'm not saying formal "signing off" steps in a process handle it, but some companies do have them for that reason.

wvanbergen · on Nov 20, 2019

We generally require 2 reviewers, and no sign-off on deploys. For PCI-compliant code things work a bit differently, but we try to follow this as closely as possible.

dkoston · on Nov 20, 2019

Interesting. It seems like you have a very flexible process for launching code, which could contribute to issues with visibility and rollbacks.

I’m curious as to why you had a queue instead of a develop branch before moving to CD? Was this to allow arbitrary commits to be launched to production rather than getting them batched by time?

wvanbergen · on Nov 20, 2019

A `develop` branch has several disadvantages.

You will want to make your `develop` branch the default branch in git and on GitHub, to make sure pull requests are automatically targeted properly (not doing this would be a major UX pain). However, that means that when you `git clone` a repository you are not guaranteed to get a working version.

The `develop` branch can still be broken, which is a problem that needs to be addressed. While you can revert breaking changes (or force-push it to a previous known good sha), and you can automate this process, the pull request is already marked as merged at this point. This means that developers have to open a new PR whenever that happens.

With the queue approach, pull requests remain open until we are sure they integrate properly. Also, we have the opportunity to use multiple branches to test different permutations of PRs, so we can still progress and merge some PRs even if the "happy path" that includes all PRs does not integrate properly.
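To illustrate the "permutations" idea, here is a minimal sketch. The comment above does not say how the candidate combinations are chosen, so the largest-subset-first search, the function name `largest_green_subset`, and the `integrates` callback (a stand-in for "build a branch with these PRs and run CI") are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple


def largest_green_subset(
    prs: Sequence[str],
    integrates: Callable[[Tuple[str, ...]], bool],
) -> Tuple[str, ...]:
    """Return the first, largest combination of queued PRs that integrates.

    The full set (the "happy path") is tried first; if it fails, smaller
    combinations are tried so that some PRs can still be merged.
    """
    for size in range(len(prs), 0, -1):
        for subset in combinations(prs, size):
            if integrates(subset):  # hypothetical: test branch + CI run
                return subset
    return ()  # nothing integrates; every PR stays open


# Hypothetical queue in which PR "b" breaks the build.
queue = ["a", "b", "c"]
mergeable = largest_green_subset(queue, lambda subset: "b" not in subset)
```

An exhaustive search like this grows exponentially with queue length, so a real system would presumably test only a few candidate branches in parallel.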

dkoston · on Nov 20, 2019

Thanks, I was hoping for more of this in the blog post. Since tools are just an expression of process/policy, it’s more interesting to hear about the process and why than it is about building “yet another CD tool”. Appreciate the thoughtful and thorough response.

The major pain point I agree with on develop is changing the defaults to merge to that rather than master. It’s a shame this is not easier to do in git/github.

I’m not sure I agree with “develop can still be broken” as an issue that supports a queue. Whether it’s a queue or develop, one should run CI on each change to validate that merging it to master will not cause issues. It’s possible for both to be broken via the same scenarios just as it’s possible for master to be broken. Since CI runs before the branch is merged to develop and upon merge, a failure would “stop the world” and prevent more code from being merged unless that code fixes the failure.

I guess I’m not fully understanding how a queue prevents this. Since you don’t have a full picture of the state of master until something is merged from the queue, how do the CI checks in the queue prevent things that branch-based CI checks wouldn’t prevent in a “develop” branch? With branches and develop, pull requests remain open until they can be assured they merge properly with develop as well.

For clarity, I’m not arguing that a develop branch is the way to go, I think CD is much better.

Maybe I’m missing something big here but using multiple branches is permissible in other setups also. You can cherry pick a bunch of commits to a branch and test permutations but only certain branches get deployed to staging and production based on rules.

I’m glad that Shopify has found tools and a process that work. Honestly, I’m just having trouble comparing and contrasting this with the other tools that are out there. The article never speaks about other approaches, whether they were considered, and why you decided to go with a queue. It’s not clear to me whether this was a case of improving the existing queue system because it was already in place, or whether the queue was specifically chosen again because it was better than the alternatives (and why).

wvanbergen · on Nov 20, 2019

> I guess I’m not fully understanding how a queue prevents this. Since you don’t have a full picture of the state of master until something is merged from the queue, how do the CI checks in the queue prevent things that branch-based CI checks wouldn’t prevent in a “develop” branch? With branches and develop, pull requests remain open until they can be assured they merge properly with develop as well.

The trick of the merge queue is that it splits the "merging a branch / pull request" in two steps:

1. Create a merge commit with master and your PR branch as ancestors.

2. Update the `master` ref to point to the merge commit.

Normally when you press the "Merge Pull Request" button, it will do those two things in one go. By splitting it up in two distinct steps, we can run CI between step 1 and 2, and only fast-forward master if CI is green.

This means that master only ever gets forwarded to green commits. And because the sha doesn't change during a fast-forward, all the CI statuses are retained. Only when we fast-forward will GitHub consider the pull request merged, so we don't have to "undo" pull request merges when they fail to integrate. If the merge commit fails to build successfully, we leave a comment on the PR that merging failed, and the PR is still open.
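The two steps can be sketched with a toy in-memory model. This is not the actual merge queue code: `Repo`, `try_merge`, and the `ci_is_green` callback are hypothetical names, and shas are just readable strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Repo:
    # sha -> parent shas; a toy stand-in for git's commit graph
    parents: Dict[str, Tuple[str, ...]] = field(default_factory=dict)
    # ref name -> sha, e.g. {"master": "m0"}
    refs: Dict[str, str] = field(default_factory=dict)

    def commit(self, sha: str, *parents: str) -> str:
        self.parents[sha] = parents
        return sha


def try_merge(repo: Repo, pr_head: str, ci_is_green: Callable[[str], bool]) -> bool:
    base = repo.refs["master"]
    # Step 1: create the merge commit, with master and the PR head as parents.
    merge_sha = repo.commit(f"merge({base},{pr_head})", base, pr_head)
    # CI runs on the merge commit while master still points at `base`.
    if not ci_is_green(merge_sha):
        return False  # master untouched; the PR stays open
    # Step 2: fast-forward master to the commit that CI already tested.
    repo.refs["master"] = merge_sha
    return True


repo = Repo()
repo.refs["master"] = repo.commit("m0")
repo.commit("pr1", "m0")

failed = try_merge(repo, "pr1", lambda sha: False)  # red CI
master_after_red = repo.refs["master"]
merged = try_merge(repo, "pr1", lambda sha: True)   # green CI
master_after_green = repo.refs["master"]
```

Because step 2 only moves a ref to a commit that already exists, the sha that CI tested is exactly the sha that master ends up pointing at.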

When we have multiple PRs in the queue, we can create merge commit on top of merge commit, and run CI on those merge commits. When one of these CI runs comes back green, we can fast-forward master to it, and potentially merge multiple pull requests at once with this approach.
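A rough sketch of the stacking described above, again with hypothetical names (`stack_merges`, `furthest_green`) and string shas. Picking the deepest green commit is an assumption; the comment only says that master can be fast-forwarded to a stacked commit whose CI run comes back.

```python
from typing import Callable, List, Tuple


def stack_merges(master: str, prs: List[str]) -> List[Tuple[str, List[str]]]:
    """Build one merge commit per queued PR, each on top of the previous one.

    Returns (sha, PRs contained in that commit) pairs, oldest first.
    """
    stacked = []
    tip = master
    included: List[str] = []
    for pr in prs:
        tip = f"merge({tip},{pr})"
        included = included + [pr]
        stacked.append((tip, included))
    return stacked


def furthest_green(
    master: str, prs: List[str], ci_is_green: Callable[[str], bool]
) -> Tuple[str, List[str]]:
    """Return the deepest stacked commit with green CI and the PRs it merges."""
    best: Tuple[str, List[str]] = (master, [])
    for sha, included in stack_merges(master, prs):
        if ci_is_green(sha):
            best = (sha, included)
    return best


# Hypothetical CI: the first two stacked commits pass, the third fails.
new_master, merged_prs = furthest_green(
    "m0", ["a", "b", "c"], lambda sha: sha.count("merge(") <= 2
)
```

Every stacked commit descends from the current master, so moving master to any green one is a plain fast-forward that merges all the PRs beneath it in one go.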

dkoston · on Nov 20, 2019

I think I see where you are coming from. Being as we use different tools, we wouldn’t allow a pull to be merged if it wasn’t up-to-date with master which is similar but a different approach. You’ll have to check at merge time because getting up-to-date could take a while and master could have changed. Jenkins does this and it can be done in other CI/CD systems with a bit of custom code.

I’d imagine at 1,000 developers and with a monolithic codebase, you’re looking to minimize test runs both from a time and cost of runners perspective.

You may also want to look into Zuul or Bazel if cost of test suite runs is a factor in coming to this solution.

wvanbergen · on Nov 20, 2019

> Being as we use different tools, we wouldn’t allow a pull to be merged if it wasn’t up-to-date with master which is similar but a different approach

That wouldn't work for us due to the number of changes we need to ship. If you rebase your branch and wait for CI to come back green, chances are another PR will have merged in the meantime, which means your rebased branch is no longer up to date with master. You end up stuck in a rebase cycle.

For this reason, we have no choice but to batch PRs, which is what the merge queue tool does. Faster CI will reduce this problem and we're working on that as well, but that won't fully solve it.
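The rebase cycle can be put in rough numbers with a back-of-the-envelope model. The Poisson arrival assumption, the function names, and the example rates below are illustrative assumptions, not figures from Shopify.

```python
import math


def chance_still_up_to_date(merges_per_hour: float, ci_minutes: float) -> float:
    """Probability that no other PR merges while CI runs on a rebased branch.

    Assumes other merges arrive independently at a constant average rate
    (a Poisson process), which is a simplification.
    """
    expected_merges = merges_per_hour * ci_minutes / 60.0
    return math.exp(-expected_merges)


def expected_ci_runs(merges_per_hour: float, ci_minutes: float) -> float:
    """Average number of rebase-and-wait attempts before one lands.

    Treats each attempt as an independent trial with the success
    probability computed above.
    """
    return 1.0 / chance_still_up_to_date(merges_per_hour, ci_minutes)


# Hypothetical rates: 6 merges per hour and a 10-minute CI run.
p_success = chance_still_up_to_date(6, 10)
runs_needed = expected_ci_runs(6, 10)
```

Under this model the expected number of attempts grows exponentially with merge rate times CI duration, which fits the point that faster CI reduces the problem without removing it.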

dkoston · on Nov 21, 2019

That’s understandable. I’d imagine at some point you’ll need to decouple the monolith a bit in order to work effectively as you scale. Best of luck with the challenge.

byroot · on Nov 20, 2019

The queue is simply an automated "develop" branch.

dkoston · on Nov 20, 2019

From what I gathered in the article, that’s the case now, but previously the queue required manual merges.

byroot · on Nov 20, 2019

No, even with v1, the merges weren't manual. A bot would merge for you, but directly into master.

Now the bot merges into a temporary branch that is fast-forwarded as the new master if CI validates it.

dkoston · on Nov 20, 2019

Interesting.

Would you say this is more of a decision based around the constraints of using GitHub or more of the ideal process for Shopify’s needs?

I’m curious because the article doesn’t mention the core reasons that you chose to write your own CD tool versus the other options that exist. The workflow you describe seems readily available in most tools. Perhaps the throughput was causing other options to break?

byroot · on Nov 20, 2019

The ideal process for Shopify’s needs based on the constraints we have to work with (CI speed, deploy speed, rate of changes, etc).