dkoston · on Nov 20, 2019

It would be useful in this article to hear about what content is acceptable in a merge request. For example: can these all go straight to queue because they use feature flags? Are commits a "single piece of work", etc.

Not to sound like a downer but this is really an article about fixing a broken process because not running CI on branches before merging to master goes against best practices. Would have loved to actually hear about their work process as this whole article could be summed up as "not running CI on branches before merging their commits to master is a great way to ruin master".

latortuga · on Nov 20, 2019

Well, the problem is that master is a bottleneck. Trying to build CI on every branch before merging to master just won't work with the scale they are dealing with. At 1000 developers, the rate of PRs coming in makes it impossible to determine what current master will be when the PR is ready to merge (i.e. when the branch has a green build). It's also wasteful to build each branch against current master because what is "current" will not be when the branch is ready to merge.

Perhaps this problem is what microservices are meant to solve. When you can't coherently integrate code fast enough, attack the bottleneck (master) by splitting it (multiple services).

adrianN · on Nov 20, 2019

Microservices don't really help with this. They just force you to think about your interfaces, but you should do that in a monolith too. If your interfaces are reasonably stable, merging is unlikely to break master if the branch was green before. If your interfaces change rapidly, you get problems with microservices too, just one level higher up, where you try to integrate them into a usable product.

gav · on Nov 20, 2019

One of the things I think microservices do help with is thinking about systems as being composed of components that are developed at different velocities and with different tolerances for risk.

Imagine an e-commerce site broken into a bunch of services including search and checkout. The search team is making updates daily, trying to improve ranking and drive conversion. The checkout team (assuming that the site is mature and has hit some design equilibrium) may only be releasing changes every couple of months, and if a bug is introduced, the financial impact is a lot higher.

By not bundling the outputs of very different teams together, you can help those that want to "move fast and break things" with their "moving fast" goal, and de-risk breaking everything by reducing the surface area of changes. Microservices-based architectures are a way to reduce friction caused by the structure of your organization and are one outcome of an Inverse Conway Maneuver.

isaacaggrey · on Nov 20, 2019

They do help if a single team of 5-7 developers owns a set of microservices; it's unlikely you will have tons of PRs to merge all at once in a single repository with a smaller team. Granted, the ownership is a bit more clear when talking about a self-contained system that a team owns: https://scs-architecture.org/vs-ms.html

In the SCS literature, you would integrate via async mechanisms across SCSes, provide versioned interfaces, and enforce via consumer-driven contract testing like Pact: https://pact.io

scrollaway · on Nov 20, 2019

> Perhaps this problem is what microservices are meant to solve.

Kinda. Microservices have always been an organizational solution; they're a way to shard your company's work output. Usually that's API contracts, but whatever mechanism is bottlenecked on the work output is affected, including how many concurrent builds are running due to how many people are touching the code at the same time.

carlisle_ · on Nov 20, 2019

This paper might be of interest to you on this very subject:

https://eng.uber.com/research/keeping-master-green-at-scale/

jrockway · on Nov 20, 2019

We didn't have a merge queue at Google. You rebased if there was a merge conflict, ran through CI again, and hoped there wasn't another merge conflict. I think I ran into merge conflicts maybe once a year, if that.

I think the success of this system breaks down into several parts:

1) Yup, microservices. You could submit your proto change, which would affect all clients, before actually implementing the code that used the new feature. (Or after, in the case of renaming some field from foo to deprecated_foo and refactoring the clients to stop using that field.) That means you could wrangle that change without having to worry about it affecting your actual feature. (Typically proto changes did not cause any breakages since people were very conservative about what changes they would make. Nobody renames all the fields, invalidating dependent code, or renumbers the fields, invalidating all existing messages. You COULD do those things, but nobody ever did.)

2) Clear dependencies in the build system. The CI system only had to run a small set of tests for most changes, because it knew exactly what tests the change would affect. You had to go way out of your way to depend on code without informing the build system. This is very different from every CI system that I've seen outside of Google, which seem to default to running everything and hoping your programming language or build system magically tracks dependencies. It doesn't; Docker for example will happily use random images that it thinks haven't changed, without actually checking whether they have changed. (Consider building your app on top of golang:latest. Go is updated, and docker may or may not pull that new base image. Meanwhile, docker will happily clear its build cache if you edit README.md and no code. The result is that 50% of the time you waste 10 minutes rebuilding stuff that didn't change, and 50% of the time you get an outdated build. And nobody seems to care at all!)

3) Being careful about keeping changes small. I don't know what the average CL size is, but I would aim for 100 lines changed rather than 1000 lines changed. This is something that surprised me post-Google, people go away and work for a week and you have a 2000 line PR to review. These are tough to merge and were relatively rare in my experience at Google. It is not always possible to make every change small, but that should be the norm. Figure out how much work you can do in a day, and try to make a CL/PR that is that size. A lot can churn in a week. A lot less churns in a day. If you respected steps 1 and 2, that means your tests will run fast and it's unlikely that your merge will fail between CI and actually merging. If you have 2000 lines of code across 8 services... you'll probably never get it merged. But I am sure that I have successfully merged ginormous changes before, it's just more work.

All in all, my takeaway from this article is that Shopify is huge but I'm surprised that specialized merge tooling was necessary. I wonder what the underlying problem is; do they really have a 1000 developer monolith? Do they not use a proper build system like Bazel?

sshumaker · on Nov 20, 2019

Xoogler here. When I left in 2015 there were definitely teams that used merge queues (i.e. TAP presubmit). Generally these were teams with a more monolithic architecture, like YouTube that had a massive Python mono.

jrockway · on Nov 20, 2019

I guess TAP presubmit might be a merge queue... but it seems different from this. There was no requirement that some mechanical system checked that tests passed before you merged your CL. You could merge any code whenever it was approved. If you felt like running the tests, good for you. TAP presubmit is just that mechanical system that runs your tests before executing the merge. That seems like traditional CI to me, not a merge queue.

Jenkins with a Github plugin behaves almost exactly like this system. Every PR basically has tests run 3 times; once for the branch that the PR is on, once for your branch merged to master, and then once after you do the merge and submit it. TAP presubmit did the "once for your branch" and TAP did the post-merge CI.

TAP presubmit didn't really check that the resulting merge was sound, so you would see TAP presubmit pass, your change get merged, and then have the build break anyway because of the race condition. A merge queue would not have this race condition... so I'm not sure Shopify has one either. The more I think about it the more it sounds like they just rewrote Jenkins. (And for that, I don't blame them.)

glennpratt · on Nov 21, 2019

I have never seen anyone automate what Bors does with Jenkins and have anything approaching decent UX. The closest I've ever seen is a permanent stage branch that sometimes has automatic promotion, little integration with reviews and inevitably breaks every few weeks until some poor soul debugs it.

w-m · on Nov 20, 2019

Point 2 is very important and very hard to get right. For unit tests, there is a clear dependency on the code and you can easily just run a subset of the tests. But wouldn't you have to run any system and integration tests of the affected module, as it's not clear what effects the code change can have? This will blow up CI times again. How did Google deal with this?

lioeters · on Nov 20, 2019

Not sure if this actually answers the question, but - Bazel, the build system used at Google, creates dependency graphs (example: https://blog.bazel.build/2015/06/17/visualize-your-build.htm...), which I believe can be used to run tests on any code affected by a change.
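
To make the dependency-graph idea concrete, here is a minimal Python sketch (not Bazel itself). The target names and the `DEPS` table are hypothetical, and treating any target ending in `_test` as a test is an assumption made for the example. It inverts the graph and walks the reverse dependencies to find the tests a change could affect.

```python
from collections import defaultdict, deque

# Hypothetical build graph: each target maps to the targets it depends on.
DEPS = {
    "//search:ranking": [],
    "//search:ranking_test": ["//search:ranking"],
    "//checkout:cart": [],
    "//checkout:cart_test": ["//checkout:cart"],
    "//web:frontend": ["//search:ranking", "//checkout:cart"],
    "//web:frontend_test": ["//web:frontend"],
}


def affected_tests(deps, changed):
    """Return the test targets that transitively depend on any changed target."""
    # Invert the graph: for each target, record who depends on it.
    rdeps = defaultdict(set)
    for target, target_deps in deps.items():
        for dep in target_deps:
            rdeps[dep].add(target)

    # Breadth-first walk over the reverse dependencies.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in rdeps[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)

    # Assumption for this sketch: test targets are named "*_test".
    return sorted(t for t in seen if t.endswith("_test"))


print(affected_tests(DEPS, ["//search:ranking"]))
```

A change to the ranking code selects the ranking and frontend tests but leaves the checkout tests alone. Bazel's query language has an `rdeps` function for this kind of reverse-dependency lookup.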

jrockway · on Nov 20, 2019

Your integration test needed the system that you were integrating with, so you'd have to declare that as a dependency.

My philosophy was to always have integration tests run in the normal CI system. This basically meant creating a test binary that happened to link in the systems you were integrating with, and run tests against that. This is easier when everything is written in the same programming language, and for the cases where it wasn't, I was usually happy with "fakes". (https://testing.googleblog.com/2013/06/testing-on-toilet-fak...)

Other teams really loved the sandbox environment with live instances of everything. They would have some machinery outside the standard CI system to inject their code into this sandbox and run some tests, as well as machinery for keeping their sandbox up to date with production. (And adding test data, etc., etc., which all becomes very complex very quickly.)

Both methodologies have their downsides and upsides.

I generally prefer simplicity and speed; people should be able to run the tests on their workstation 100% of the time without having to set up any external resources. If you have an integration test binary that is built from the build system, this is possible. The downside is that config changes in production can break your system; since you are starting up your own instance of some other team's server, they could theoretically make some config change that breaks your integration. Even if you include their configuration in your in-memory version of their service, there was no guarantee that what is running in production is actually checked in yet. (Debugging in production, emergency rollback to an older prebuilt binary, etc.) These were rare and never caused me problems, however, and not having machinery to maintain a shadow environment meant it was easier to work on the code.

Having a sandbox environment was good because you could "check" (not test) big changes before putting them into production. You could try out your flag flip, database migration, mapreduce, or just load up the website in your browser and send your coworkers a link without affecting production data. And you could test your actual production binary in production-like conditions; as long as you sync'd production changes to your sandbox, your automated test probably ran against something that was very much like production. This let you check for more subtle things like performance regressions before deploying. (I worked on a system to do just that.)

The main problem I had with this method was that it was maintenance-intensive (big teams that used this had entire teams just to maintain the sandbox, and that begat sub teams that maintained the sandbox maintenance) and slow. Building and running another test during CI was relatively fast, but starting up a job in production and scaling it up was significantly slower. This meant that you needed a parallel set of tools to run some subset of this environment locally, and it was always painful. Not having your tests in the standard system meant that downstream dependencies wouldn't see test failures in your system when you made a change, so the "buildcop" would have to detect and fix that.

I found this to be too much overhead, but it is probably necessary when you are developing, say, a mobile application. You will have to write some sort of software to make it possible to try your in-progress code on your personal phone. You will probably want to be able to share links with coworkers. I generally like to push changes to production multiple times a day, and make sure that clients can handle a newer server and still work correctly. This way, as soon as a build passes tests, you can start giving it, say 0.1% of production traffic and keep an eye on the error rates, and promote that to production as quickly as possible. The biggest problem I've run into with this strategy is that 0.1% of Google's traffic is way more than enough for a good canary, but at other places I've worked... 0.1% of traffic might be one request over several days. In that case, you have to have staging and manually bug people to try it out. Sometimes I wonder if that kind of software is worth writing at all, to be perfectly honest. If you get one request a day, maybe just make it open a support ticket, and hire 2 support engineers instead of one software engineer. But I digress ;)

strbean · on Nov 21, 2019

Tangentially:

I've seen several blog posts from Google about using fakes and 'hermetic servers' for testing. We use GCP for our product, and unfortunately, Google doesn't seem to care much about making this easy. For example, I think I saw only one or two languages for which the Google Storage client libraries provided "fakes" of a Google Storage server. For PubSub (and maybe one or two other services?) there is the PubSub Emulator, which is unfortunately in Java and isn't supported by any of the CLI tools.

For all their love of fakes and hermetic servers, it would be awesome if they provided them for all the GCP services.

w-m · on Nov 21, 2019

Wow, thanks for the detailed reply. You mentioned a couple of implementations that I hadn't thought about. But I guess the short version would be, as so often: testing systems is hard, and there's no one-size-fits-all solution.

bob1029 · on Nov 20, 2019

By virtue of having a queue of PRs that need to test & merge, you could pipeline this thing out pretty substantially.

The implication here being that a queue must be processed in-order, so you will ultimately have a perfect sequence of future commits to speculate against, and can incrementally build up each hypothetical future master state for a test build on one of any number of parallel build agents. As the queue depth grows, you would see higher and higher throughput.
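
A rough Python sketch of that pipelining, under stated assumptions: `ci_passes` is a hypothetical stand-in for a real CI run, states are modelled as plain tuples of names, and the policy on failure (drop the first failing PR and re-speculate the rest) is my own choice for the example, not something described in the article. The list comprehension stands in for builds that would run in parallel on separate agents.

```python
def speculative_states(master, queue):
    """Return the hypothetical master state after each queued PR, in order."""
    states = []
    current = list(master)
    for pr in queue:
        current = current + [pr]
        states.append(tuple(current))
    return states


def process_queue(master, queue, ci_passes):
    """Merge as many queued PRs as possible; return (new_master, rejected)."""
    merged = list(master)
    pending = list(queue)
    rejected = []
    while pending:
        states = speculative_states(merged, pending)
        # In a real system each speculative state would build in parallel.
        results = [ci_passes(state) for state in states]
        if all(results):
            merged = list(states[-1])
            pending = []
        else:
            first_bad = results.index(False)
            if first_bad > 0:
                # Everything before the first failure is safe to land.
                merged = list(states[first_bad - 1])
            rejected.append(pending[first_bad])
            # Later states included the bad PR, so they are rebuilt without it.
            pending = pending[first_bad + 1:]
    return merged, rejected


# Hypothetical CI: any state containing "pr2" fails.
print(process_queue(["m0"], ["pr1", "pr2", "pr3"], lambda s: "pr2" not in s))
```

The cost of speculation shows up in the failure case: every state built on top of the bad PR is wasted work and has to be rebuilt.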

username90 · on Nov 20, 2019

> Trying to build CI on every branch before merging to master just won't work with the scale they are dealing with.

Google does it with 50 times the developer count.

> At 1000 developers, the rate of PRs coming in makes it impossible to determine what current master will be when the PR is ready to merge (i.e. when the branch has a green build).

True, it is impossible to catch all errors like this, but you can catch almost every error by building and testing it against current master and then merge it with the master 20 minutes later when the build is done. I have seen maybe one build breakage a year being introduced due to this in projects I've worked on, so it isn't a big deal.

iamweswilson · on Nov 20, 2019

For even better accuracy you can use a tool that will run tests against speculative merge states. Zuul[1] is an open source project that supports it out of the box.

[1] https://zuul-ci.org/docs/zuul/user/gating.html

greiskul · on Nov 20, 2019

> building and testing it against current master and then merge it with the master 20 minutes later when the build is done.

And I'm pretty sure that is the way Google does it too. Test a commit against current master, if tests are green commit. Then run tests against master again (and I think this stage might not run for every single commit) to see if anything broke on the rare times there was an actual conflict. If that run was red, which should be rare, then you can have the system do a bisect to find the offending commit, or just run all the ones that haven't been individually tested.
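
For illustration, the bisect step could look like the following Python sketch. It assumes the commit before the range was green, the last commit in the range is red, and the build stays red once it breaks; `is_green` is a hypothetical stand-in for running the test suite at a commit. This is not a description of Google's actual tooling.

```python
def first_bad_commit(commits, is_green):
    """Binary-search an ordered commit range for the first red commit.

    Assumes commits[-1] is red and that the build stays red after it breaks.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_green(commits[mid]):
            lo = mid + 1  # breakage was introduced after mid
        else:
            hi = mid      # mid is red, so the culprit is mid or earlier
    return commits[lo]


commits = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
# Hypothetical test oracle: everything before c6 is green.
print(first_bad_commit(commits, lambda c: commits.index(c) < 5))
```

With eight untested commits this needs three test runs instead of eight, which is why bisecting is attractive when post-merge failures are rare.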

randomidiot666 · on Nov 20, 2019

You have no idea how Google solved it. Basically everyone with a Monorepo (except Google) implements it as a cargo cult best practice. Mindlessly copying Google without understanding how Google actually does it.

tudelo · on Nov 20, 2019

Yet it seems like large companies mostly prefer monorepos, so while it takes investment to have such a monorepo, it seems the benefits are worth the investment.

anon73044 · on Nov 21, 2019

Google, Microsoft, Facebook and Twitter prefer monorepos but this is not indicative of most large orgs.

You'll notice that those listed have had to customize or create new VCSs to meet their needs.

https://news.ycombinator.com/item?id=17605371 https://news.ycombinator.com/item?id=11789182 https://medium.com/@maoberlehner/monorepos-in-the-wild-33c6e...

0xbadcafebee · on Nov 20, 2019

This is appeal to accomplishment fallacy. Because large companies have a lot of money, whatever they do must be great. But this is false - they do what they do because they are large companies, not because it is a good idea.

At scale, managing complexity can require either a lot of coordination, or a lot of careful planning. Large companies (especially tech companies) don't do either well, so they pick architectures that remove choices, and iterate on them until they are workable. And they have the money and workforce to do it.

marcosdumay · on Nov 20, 2019

This is the problem external libraries were created to solve, in a time when it was a much harder problem.

Microservices are the same kind of solution, with the same gains and costs for this specific problem.

tekstar · on Nov 20, 2019

Shopify always runs CI on branches before merging to master. Everything this article describes is in addition to that, in order to deal with the problems the article talks about at "merge to master" time, like 2 merged PRs failing or a stale PR that passed on branch but fails on master due to changes.

At this scale you need to be deploying constantly, otherwise deploys are hundreds of commits large and it's impossible to triage - what PR in the deploy broke something, is it even safe to rollback, etc. That is the primary reason to automate deploys and manage the deploy queue.

hinkley · on Nov 20, 2019

It smells like a capacity planning error.

What's the minimum residency time to reliably detect problems with my PR? Add deployment time, double to account for jitter caused by humans being humans (forgetful, lunch, meetings, etc), and there probably are not enough hours in the day for 1000 people to be deploying the same monolith.
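
As a back-of-the-envelope version of that calculation, here is a small Python sketch. Every number in it is hypothetical (10 minutes of residency, 5 minutes to deploy, one PR per developer per day); only the shape of the arithmetic comes from the paragraph above.

```python
import math


def deploys_per_day(residency_min, deploy_min, jitter_factor=2, minutes_per_day=24 * 60):
    """Number of strictly serial deploy slots that fit in one day."""
    slot = jitter_factor * (residency_min + deploy_min)
    return minutes_per_day // slot


def prs_per_deploy(prs_per_day, deploys):
    """Average batch size needed to ship every PR the same day."""
    return math.ceil(prs_per_day / deploys)


slots = deploys_per_day(residency_min=10, deploy_min=5)  # hypothetical timings
print(slots, prs_per_deploy(1000, slots))
```

Under these made-up numbers there are 48 serial slots per day, so 1000 PRs a day would mean roughly 21 PRs per deploy. One PR per deploy does not fit, which is the capacity problem in a nutshell.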

To increase residency time you can deploy separate units (You can have multiple deployment units even in a monorepo), and those also reduce the surface area of merges.

Honestly what are they doing with 1000 developers? Duplicated effort goes up considerably with a team and codebase of that size. If you forced me to hire that many people, I'd have a lot of them working on open source, trying to steward feature enhancements that help our process. Because otherwise they'd be running around writing proprietary versions of a bunch of shit that already exists and in a better more documented form.

hinkley · on Nov 20, 2019

And I'm not even a little surprised:

https://engineering.shopify.com/blogs/engineering/introducin....

Folks, when you hire enough devs, they feel empowered to rewrite the world. I have lived all sides of this phenomenon and rarely is it pretty.

Scaling is a concern that goes in both directions. Shopify has 1000 developers today. How screwed would they be if they suddenly had to drop to 600? Or even if there's a hiring freeze? What happens when the people who wrote these tools go work somewhere else?

When I do tool smithing work these days, it's always with an effort to provide the thinnest of shims around open source or commercial tools with healthy user communities, so that at the end of the day they have a larger pool of resources than what is in house. People move on. Money dries up. Mandates change.

"Being important" in a company is about how much you support new work, not how locked in people are to your old work. If you can't give your old work away then you're shackling yourself, both to your current responsibilities and to the company. I can't believe that I'm the only one who has ever stayed at a company out of guilt for how screwed they'd be if I left. But that quickly turns into resentment which is worse.

If you are important for new work, then you always get new challenges. You stay sharp and your resume looks good. If the company stops doing new work altogether, do you really want to stay there anyway? Plus you could always go back to one of your old projects.

dkoston · on Nov 20, 2019

Sorry if my comment was unclear. I consider the queue to be a “branch” as well. Many people use a “develop” branch instead of a queue in this instance. The queue appears designed to allow arbitrary selection rather than merging in order (though the new solution with CD seems generally in order)

Totally agree that CD is required with this many commits. It’s commonplace on teams with many fewer developers. Was surprised to see you folks roll your own workflows rather than using other systems.

Would also be interesting to see if you tag commits that go to master in instrumentation systems so you have visibility into production metrics and can correlate them with what code was running at the time.

wvanbergen · on Nov 20, 2019

Generally our metrics and exception reports are tagged with the sha and the deploy stage.

dkoston · on Nov 20, 2019

Good to hear, that’ll make change management less of a chore.

I think the main thing that was missing for me is the rationale behind building this system rather than building a workflow in one of the existing CI/CD tools. Was there a throughput bottleneck in existing tools? Was there something custom about your workflow that wasn’t supported elsewhere? I may be wrong but the workflow you landed upon seems pretty common so I’m curious as to why the need to build and maintain a tool in house for this?

jacktli · on Nov 20, 2019

Hi, Author here!

Pull requests are our unit of work, and the queue was created to support all pull requests. We do have feature flags as a tool, but we let our developers make the judgment call on how their changes should be rolled out.

evfanknitram · on Nov 20, 2019

Is anyone "signing off" on the deploys or is it fully automatic? I can't really imagine it being manual 40 times per day, but just wanted to hear.

How do you handle the scenario that some developer pushes a send_me_all_the_credit_card_details() function to the code base which does something 'evil'? Do you rely on the reviewer "doing their works properly" to handle that?

I'm not saying formal "signing off"-steps in processes handle it, but some companies do them for that reason.

wvanbergen · on Nov 20, 2019

We generally require 2 reviewers, and no sign-off on deploys. For PCI-compliant code things work a bit differently, but we try to follow this as closely as possible.

dkoston · on Nov 20, 2019

Interesting. It seems like you have a very flexible process of how to launch code which could contribute to issues with visibility and rollbacks.

I’m curious as to why you had a queue instead of a develop branch before moving to CD? Was this to allow arbitrary commits to be launched to production rather than getting them batched by time?

wvanbergen · on Nov 20, 2019

A `develop` branch has several disadvantages.

You will want to make your `develop` branch the default branch in git and on GitHub, to make sure pull requests automatically are targeted properly (not doing this would be a major UX pain). However, that means that when you `git clone` a repository you are not guaranteed to get a working version.

The `develop` branch can still be broken, which is a problem that needs to be addressed. While you can revert breaking changes (or force-push it to a previous known good sha), and you can automate this process, the pull request is already marked as merged at this point. This means that developers have to open a new PR whenever that happens.

With the queue approach, pull requests remain open until we are sure they integrate properly. Also, we have the opportunity to use multiple branches to test different permutations of PRs, so we can still progress and merge some PRs even if the "happy path" that includes all PRs does not integrate properly.

dkoston · on Nov 20, 2019

Thanks, I was hoping for more of this in the blog post. Since tools are just an expression of process/policy, it’s more interesting to hear about the process and why than it is about building “yet another CD tool”. Appreciate the thoughtful and thorough response.

The major pain point I agree with on develop is changing the defaults to merge to that rather than master. It’s a shame this is not easier to do in git/github.

I’m not sure I agree with “develop can still be broken” as an issue that supports a queue. Whether it’s a queue or develop, one should run CI on each change to validate that merging it to master will not cause issues. It’s possible for both to be broken via the same scenarios just as it’s possible for master to be broken. Since CI runs before the branch is merged to develop and upon merge, a failure would “stop the world” and prevent more code from being merged unless that code fixes the failure.

I guess I’m not fully understanding how a queue prevents this. Since you don’t have a full picture of the state of master until something is merged from the queue, how do the CI checks in the queue prevent things that branch-based CI checks wouldn’t prevent in a “develop” branch? With branches and develop, pull requests remain open until they can be assured they merge properly with develop as well.

For clarity, I’m not arguing that a develop branch is the way to go, I think CD is much better.

Maybe I’m missing something big here but using multiple branches is permissible in other setups also. You can cherry pick a bunch of commits to a branch and test permutations but only certain branches get deployed to staging and production based on rules.

I’m glad that Shopify has found tools and a process that works. Honestly, I’m just having trouble comparing and contrasting this to the other tools that are out there. The article never speaks about other approaches and whether or not they were considered and why you decided to go with a queue. It’s not clear to me if this was a case of improving the existing queue system because it was already in place or whether or not the queue was specifically chosen again because it was better than other alternatives (and why).

wvanbergen · on Nov 20, 2019

> I guess I’m not fully understanding how a queue prevents this. Since you don’t have a full picture of the state of master until something is merged from the queue, how do the CI checks in the queue prevent things that branch-based CI checks wouldn’t prevent in a “develop” branch? With branches and develop, pull requests remain open until they can be assured they merge properly with develop as well.

The trick of the merge queue is that it splits the "merging a branch / pull request" in two steps:

1. Create a merge commit with master and your PR branch as ancestors.

2. Update the `master` ref to point to the merge commit.

Normally when you press the "Merge Pull Request" button, it will do those two things in one go. By splitting it up in two distinct steps, we can run CI between step 1 and 2, and only fast-forward master if CI is green.

This means that master only ever gets forwarded to green commits. And because the sha doesn't change during a fast-forward, all the CI statuses are retained. Only when we fast-forward will GitHub consider the pull request merged, so we don't have to "undo" pull request merges when they fail to integrate. If the merge commit fails to build successfully, we leave a comment on the PR that merging failed, and the PR is still open.

When we have multiple PRs in the queue, we can create merge commit on top of merge commit, and run CI on those merge commits. When one of these CI runs comes back, we can fast forward master to it, and potentially merge multiple pull requests at once with this approach.
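
The two-step split can be modelled in a few lines of Python. This is a toy sketch and not Shopify's tool: `Commit`, `merge_commit`, `run_queue` and `ci_is_green` are hypothetical names, commits are plain objects rather than real git objects, and PRs are processed one at a time for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    sha: str
    parents: tuple = ()


def merge_commit(base, pr_head):
    """Step 1: create a merge commit with master and the PR head as parents.

    Nothing points at this commit yet, so master is unaffected.
    """
    return Commit(sha=f"merge({base.sha},{pr_head.sha})", parents=(base, pr_head))


def run_queue(master, pr_heads, ci_is_green):
    """Return (new_master, merged_shas, failed_shas) after draining the queue."""
    merged, failed = [], []
    for head in pr_heads:
        candidate = merge_commit(master, head)
        if ci_is_green(candidate):
            # Step 2: fast-forward the master ref to the already-tested commit.
            master = candidate
            merged.append(head.sha)
        else:
            # The PR simply stays open; master never saw the bad commit.
            failed.append(head.sha)
    return master, merged, failed


prs = [Commit("a"), Commit("b"), Commit("c")]
# Hypothetical CI: any merge that brings in PR "b" fails.
master, merged, failed = run_queue(Commit("base"), prs, lambda c: c.parents[1].sha != "b")
print(master.sha, merged, failed)
```

Because master only ever moves to a commit that CI has already built, the commit that lands is byte-for-byte the one that was tested.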

dkoston · on Nov 20, 2019

I think I see where you are coming from. Being as we use different tools, we wouldn’t allow a pull to be merged if it wasn’t up-to-date with master which is similar but a different approach. You’ll have to check at merge time because getting up-to-date could take a while and master could have changed. Jenkins does this and it can be done in other CI/CD systems with a bit of custom code.

I’d imagine at 1,000 developers and with a monolithic codebase, you’re looking to minimize test runs both from a time and cost of runners perspective.

You may also want to look into Zuul or Bazel if cost of test suite runs is a factor in coming to this solution.

wvanbergen · on Nov 20, 2019

> Being as we use different tools, we wouldn’t allow a pull to be merged if it wasn’t up-to-date with master which is similar but a different approach

That wouldn't work for us due to the amount of changes we need to ship. If you rebase your branch and wait for CI to come back green, chances are another PR will have merged in the mean time, which means your rebased branch is no longer up to date with master. You end up stuck in a rebase cycle.
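
To put a rough number on the rebase cycle, here is a small Python sketch. It assumes merges to master arrive as a Poisson process and that each rebase attempt is independent; the rate of one merge every 5 minutes and the 20 minute CI run are hypothetical figures, not Shopify's.

```python
import math


def p_still_current(merges_per_minute, ci_minutes):
    """Probability that no other PR merges while CI is running."""
    return math.exp(-merges_per_minute * ci_minutes)


def expected_attempts(merges_per_minute, ci_minutes):
    """Expected rebase-and-rerun attempts before one lands on a current master."""
    return 1.0 / p_still_current(merges_per_minute, ci_minutes)


# Hypothetical load: one merge every 5 minutes, 20 minutes of CI.
print(p_still_current(0.2, 20), expected_attempts(0.2, 20))
```

With those assumed numbers a branch is still up to date when CI finishes less than 2% of the time, so a strict "must be current with master" rule would need dozens of retries per PR on average.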

For this reason, we have no choice but to batch PRs, which is what the merge queue tool does. Faster CI will reduce this problem and we're working on that as well, but won't fully solve this.

dkoston · on Nov 21, 2019

That’s understandable. I’d imagine at some point you’ll need to decouple the monolith a bit in order to work effectively as you scale. Best of luck with the challenge.

byroot · on Nov 20, 2019

The queue is simply an automated "develop" branch.

dkoston · on Nov 20, 2019

From what I gathered in the article, that’s the case now but before the queue required manual merges.

byroot · on Nov 20, 2019

No, even with v1, the merges weren't manual. A bot would merge for you, but directly into master.

Now the bot merges into a temporary branch that is fast-forwarded as the new master if CI validates it.

dkoston · on Nov 20, 2019

Interesting.

Would you say this is more of a decision based around the constraints of using GitHub or more of the ideal process for Shopify’s needs?

I’m curious because the article doesn’t mention the core reasons that you chose to write your own CD tool versus the other options that exist. The workflow you describe seems readily available in most tools. Perhaps the throughput was causing other options to break?

byroot · on Nov 20, 2019

The ideal process for Shopify’s needs based on the constraints we have to work with (CI speed, deploy speed, rate of changes, etc).

lreeves · on Nov 20, 2019

We (Shopify) still run the full CI on each development branch as the article mentions:

> We check if Branch CI has passed and if the pull request has been approved by a reviewer before adding the pull request to the queue

0xbadcafebee · on Nov 20, 2019

> Trying to build CI on every branch before merging to master just won't work with the scale they are dealing with. At 1000 developers, the rate of PRs coming in makes it impossible to determine what current master will be when the PR is ready to merge (i.e. when the branch has a green build). It's also wasteful to build each branch against current master because what is "current" will not be when the branch is ready to merge.

I'm starting to think most CI problems are just people not looking at the problem the right way. Here is the problem re-worded:

- When a PR has a green light and someone hits 'merge', it blocks anything else from being merged to master while your PR merges. When it finishes merging and deploying, all the other waiting PRs have to rebuild to see whether they will merge with the new state of master. So hundreds of PRs rebuild every time you merge one PR, and there's constant CI churn.

Here is why that problem exists:

- The system was designed for 1000 developers to all be writing to the same code base.

Here is how you solve that:

- Don't let 1000 developers all write to the same code base. Break the code down into discrete components that different small teams manage. The only bottleneck for that code base is that small team.

This small team is often called the two-pizza team, and their discrete components are often called microservices.

robocat · on Nov 20, 2019

Google don't solve it your way: https://news.ycombinator.com/item?id=21586180

0xbadcafebee · on Nov 20, 2019

Yes, that's correct, Google invented its own proprietary distributed object store and distributed version control system and distributed Linux-only filesystem and distributed build-and-test-system to work with a single SDLC that its entire company must follow strictly to release anything, just so it could keep using a single repository.

What's your point?

robocat · on Nov 21, 2019

Clearly given those costs, Google really believe in mono-repo, and presumably they have tried to back it up with internal stats?

Although hard to get stats without control group - maybe control group could be acquisitions?

randomidiot666 · on Nov 20, 2019

You have no idea how Google solved it. Basically everyone with a Monorepo (except Google) implements it as a cargo cult best practice. Mindlessly copying Google without understanding how Google actually does it.

rb808 · on Nov 20, 2019

Literally the definition of CI is to run on master or release branches a few times a day, not on every dev branch.

https://en.wikipedia.org/wiki/Continuous_integration

judge2020 · on Nov 20, 2019

The definition is sourced from here[0], and says "Each check-in is then verified by an automated build, allowing teams to detect problems early.". It's not a hard rule to "not run on every dev branch".

"continuous integration" is often just "npm ci && npm run build", and sometimes "npm run test" (or similar for your language). For products that don't make any remote API calls (or when they use a faker service), most of this is done on the same machine and costs very little to do on every commit, making it easier to precisely define which commit broke something.

0: https://www.thoughtworks.com/continuous-integration

dkoston · on Nov 20, 2019

Rather than being pedantic about the definition, maybe you could share your experiences with why that’s superior to validating each branch? Being dogmatic about a definition rather than experimenting with what works best in production at your business seems illogical.

dkoston · on Nov 9, 2019

In my experience, you don’t. With tons of IOPS you need EBS or crazy expensive instances.

Instead you use a cloud like google cloud where you can add NVMe SSDs to whatever instance type you need and configure custom RAM and CPU instead of picking from the super expensive AWS instances with no configurable options and almost always the wrong resource allocations for your workload.

Source: testing my infrastructure, which requires 60,000 IOPS, on both Google Cloud and AWS; Google was 1/4 the cost with higher performance. Of note: this was a very high throughput streaming data application. YMMV for other applications.

dkoston · on Oct 15, 2019

I've got about the same members and attendees. This will drive the cost up about 3x but honestly, that wasn't the part we were concerned about most.

A couple of things:

1. At no point in this change did Meetup suggest they were offering new things for subscribers that would justify a 3x cost increase.

2. Asking our members to start paying would incur many logistical costs for us: we'd have to spend a bunch of time discussing with our members why they had to pay for something that was previously free. We'd also have to create email copy describing the change to them. We'd also have to deal with customer service issues where people thought we were charging them rather than Meetup, and they'd want a refund if they didn't show up. We'd also have to explain how charging met or didn't meet our core values.

3. $2 isn't enough in the part of the world we're in to "prevent people from not showing due to cost". That's less than a cup of coffee, so it's not much of a sunk cost to not show up. We could of course charge much more than $2, but that's a radical shift for something that was previously free.

4. Do we now charge our sponsors on an "on demand" basis based on how many members show up? Or do we overcharge every sponsor based on the fear of how many people will show up? What used to be a $X per event cost has now added more unknowns that make logistics harder.

#1 is a small expense but #2 is a radical shift in the values and relationship between our group and our members. The logistical costs of changing how we interact with our members are MUCH higher than the added subscription fees per year. These kinds of changes make almost no sense to us, as we can't see why Meetup needs a 3x cost increase to deliver the same product it did yesterday.

In addition, the communications about this were terrible. A total of 0 times did they explain why this was happening and what our increased costs would bring us. They also tried to frame this as "lower subscription fees" but the total cost of the platform went up.

This degraded all trust we had in the platform and kicked off a search for alternatives immediately.

FYI, I told the Meetup focus group the same thing.

dkoston · on Sept 16, 2019

1Password has a Families plan for this.

mdesq · on Sept 16, 2019

I will look into this. I do trust certain family members with access to my vault password, but the notification of access and ability to give access to a trusted third party (my lawyer) that is available with LP is very compelling.

dkoston · on Sept 16, 2019

1Password allows you to add family members with access to specific values either read-only or read/write. The system for adding access is multi-step so unless you add someone to a vault they shouldn’t see, you have the flexibility to share as little or as much as you want. Since you can name the vaults you can name them things like “Shared with M Toussant (Attorney)” or “Samir Martha and Paul” which can make it easy to determine where to store what secrets. Have been using Business for a few years with some of my companies and Family with my family and have had good experiences. You can initiate recoveries as the administrator as well which has been helpful in both cases.

dkoston · on Sept 1, 2019

There’s no misconception. Until I commented, there was no disclaimer stating that the code was not ready for production use and was for research purposes only.

Very few people are qualified to review and critique this type of codebase, so it’s dangerous to market a repository backed by multiple professors and PhDs as “it does X, Y, and Z”.

It would be great to live in a world where people did a serious review of projects before downloading them but that’s not the reality we live in.

OP didn’t provide any context that this was focused on reviewing the paper and the concepts in it either, so I think what you mean is that “you think people should focus on the paper” but “we focused on the code”. That’s not a misconception; it’s a lack of clarity by both OP and the project website about what’s important.

If the code was irrelevant, it should be removed from the website so that the audience could focus solely on the paper.

javert · on Sept 2, 2019

> it’s a lack of clarity by both OP and the project website about what’s important

As a former academic, I can immediately spot this website as an academic paper with a bit of code on the side. Academics do this kind of thing constantly because they get kudos for it in academic circles. Yes, there is a lack of clarity (if you are not an academic). That's why I pointed this out for what it is.

> If the code was irrelevant, it should be removed from the website so that the audience could focus solely on the paper.

I mean, you're not wrong. But then they wouldn't get as many kudos from their academic buddies.

dkoston · on Sept 2, 2019

I agree that this stuff is commonplace, and I definitely understand that in today’s world of noise you have to market yourself and your papers to get attention.

I was glad to see them add the disclaimer as hopefully it’ll turn away people who see “hey, here’s this secret management project made by a bunch of professors and PhDs, I should use it because they are super qualified”.

My goal was to provide feedback to the students about how people evaluate code “in the real world”. It’s not helpful for them to be applauded for “sub par” code but you raise a good point that it can seem harsh to not provide feedback on the “meat of the project” which is their paper and research.

dkoston · on Sept 1, 2019

Not sure if OP is the author but thanks for sharing. Secret management is a challenge for a lot of folks.

For the authors, here are a couple of items that made it hard for me to evaluate the project:

1. This project doesn't build for me and some of the dependencies don't exist or are private repos which prevent me from building the project (https://github.com/CHURPTeam/CHURP/blob/master/src/cmd/bb.go...)

2. It's lacking godoc documentation which means it's hard for me to quickly see how the API works. As such, some of the API methods seem less than useful.

For example, `(Optional) storeSecret(SK)`. What does "optional" mean? What's the return value? What's SK?

(Optional) retrieveSecret() -> SK: What? I can only store a single secret? Without passing params to this, it seems so.

3. Project structure is not conventional (cmd/ inside src/)

This may not seem like a big deal, but am I really going to trust my secrets to someone who didn't learn enough about Go to use godoc and a conventional project structure?

4. No tests

Again, doubtful I'm going to trust my secrets to a distributed network that's not tested.

mskd12 · on Sept 1, 2019

One of the authors of the work here.

1. Thank you for your comments. The project is an early-stage research prototype, and we are soon going to add documentation and tests to the code.

2. In fact, not all functions in the API are fully implemented; development is still in progress. At a high level, the functionality provided is to store and retrieve a secret key (SK). The function is denoted optional because, in practice, the secret might not be provided as input; instead, it might be generated randomly.

3, 4. We appreciate the comments. At this point, we’re still adding more features and improving the documentation. We plan to release the code with tests in the future.

PS: Just added a disclaimer stating that the code is under development!

dkoston · on Sept 1, 2019

Appreciate the disclaimer.

One of the challenges you are facing is that people consume software illogically based on marketing and popularity rather than code review and fit.

Making sure to market and disclaim that this is a research implementation and should not be used in production is important, especially in cryptography where most people are novices and wouldn’t be qualified to review the code either way.

dkoston · on Sept 1, 2019

For #2, it seems unlikely that people would have a single secret.

I’m not sure if the codebase is more just to prove the concept from the paper or to potentially get adoption. If you are looking for people to adopt the project, I’d suggest a way to store and retrieve multiple secrets.

Best of luck with the research and project!

mskd12 · on Sept 1, 2019

In many scenarios, I'd think that different user-specific secrets can be derived from the single secret stored, perhaps using a PRF. Why wouldn't this be enough?

Thanks!

dkoston · on Sept 2, 2019

You can use a PRF for the domain to derive additional secrets. It simply depends on the scope of the project and who you think your users are.

For example: if your user base is cryptocurrency wallet holders who have multiple secrets for each wallet, will they construct a secondary library on top of yours to manage those additional secrets? Why wouldn’t they choose to derive each secret independently? If millions of dollars are at stake, would you risk any shared state from your secret derivation function?

It would be unnecessary to derive additional secrets with this library to prove the concept in the paper so I don’t think it’s necessary if that’s the goal of the code. However, if mass adoption of the techniques you’ve created is your goal, a more user friendly API which doesn’t require each end user to “roll their own code” to manage multiple secrets should be a goal.
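To make the “roll their own code” part concrete, here is roughly what each end user would have to write to get per-purpose secrets out of a single stored secret. This is a minimal sketch that uses HMAC-SHA256 as the PRF; the function name and label scheme are hypothetical and not part of the project's API:

```python
import hashlib
import hmac

def derive_secret(master_secret: bytes, label: str) -> bytes:
    """Derive a per-purpose secret from one stored secret.

    Uses HMAC-SHA256 as the PRF, keyed by the master secret, with a
    caller-chosen label as the input. The label scheme is hypothetical.
    """
    return hmac.new(master_secret, label.encode("utf-8"), hashlib.sha256).digest()
```

Calling it with labels such as `"wallet-1"` and `"wallet-2"` gives two different 32-byte secrets, and the same label always gives the same secret. It is only a few lines, but every user would have to get the label scheme and key handling right on their own, and for real use a vetted KDF such as HKDF would be the safer choice.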

One of the first major projects I worked on was essentially a wrapper around open source software that provided “ease of use APIs and UIs”. Because the average user was not technical, the convenience wrapper became highly valuable and is used by hundreds of millions of sites today (cPanel).

One of the biggest challenges as a technologist is to understand to what degree most people are not technologists, even fellow programmers. For example, I’ve worked with skilled programmers with impressive resumes who had issues troubleshooting CORS because they never learned how headers are defined and where to look up the RFCs.

As mentioned above, providing an API for multiple secrets could be out of scope for a bunch of reasons. If you’re looking for mass adoption by developers, I’ll wager an Omakase at the sushi place of your choosing that it’ll be required.

sverhagen · on Sept 1, 2019

I know that this isn't universally accepted, but in significant parts of the industry, we consider it vaporware if there's no tests... FWIW.

brachi · on Sept 1, 2019

> No tests

yikes, that's hard to justify.

javert · on Sept 1, 2019

No, no it's not. The product here is an academic paper. The code is just kind of a sanity check for the concepts in the paper. That's how academia works.

sverhagen · on Sept 1, 2019

If you're positioning it as a product, which is how many here perceived it, that reasoning doesn't hold. Not that every product has good testing, but they should...

javert · on Sept 1, 2019

It's not a "product." It's an academic research paper.

jMyles · on Sept 1, 2019

I don't really agree that "that's how academia works". What other recent reference papers for cryptographic primitives are you thinking about when you say that?

javert · on Sept 2, 2019

I don't have any in mind, but this paper struck me as being in "distributed systems" more than in "cryptographic primitives." That is totally debatable, though.

Anyway, I have experience in the former field (and adjacent fields) but not the latter, so that's where my cynicism is coming from, if you are wondering.

You are probably right that there are higher standards for papers dealing with cryptographic primitives. That would be nice!

dkoston · on Aug 26, 2019

Most likely the API Gateway speaks HTTP so it's no different than communicating with any other HTTP API. The API Gateway has a number of roles: routing requests to the right backend(s), checking the health of one or more backends, providing a single point of contact to the outside world (reduces the number of systems you have to lock down related to access from the public), holding api documentation, and others.

Think of the API gateway mostly as a router that allows you to send requests to many different pieces of software behind it.
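The routing role can be sketched in a few lines. This is only an illustration of the idea, not any particular gateway's configuration; the prefixes, backend URLs, and the `resolve_backend` name are all made up:

```python
from typing import Dict, Optional

# Hypothetical routing table: public path prefix -> internal backend base URL.
ROUTES: Dict[str, str] = {
    "/search": "http://search.internal:8080",
    "/checkout": "http://checkout.internal:8080",
}

def resolve_backend(path: str, routes: Dict[str, str] = ROUTES) -> Optional[str]:
    """Return the backend URL for a request path, or None if nothing matches."""
    # Prefer the longest matching prefix so more specific routes win.
    for prefix in sorted(routes, key=len, reverse=True):
        if path == prefix or path.startswith(prefix + "/"):
            return routes[prefix] + path
    return None
```

A real gateway would forward the request to the resolved URL and typically answer with a 404 when nothing matches, but the client only ever talks to the gateway over plain HTTP.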

As far as patterns for HTTP Status Codes, you should take a look at the RFCs that define them: https://tools.ietf.org/html/rfc2616#section-10

dkoston · on Aug 16, 2019

This is a manufacturing defect. Contact Tesla service and they'll fix it. I had some areas of the factory tint on the moonroof that had holes in them, and they replaced the moonroof without any hassle.

rconti · on Aug 16, 2019

It was replaced without a problem during the only service visit I've had. In fact, they even had the glass in stock. (it required replacing the entire rear/roof glass to the b-pillar, since that's one whole piece).

dkoston · on Aug 16, 2019

Same here on my only service visit. That glass is huge! When I went to get the car tinted, they didn’t even carry a piece of tint that large.

dkoston · on Aug 16, 2019

Agree that this seems suspect. Have had free domains with plenty of traffic on them for years.

dkoston · on Aug 1, 2019

We’re not even close to having the tech for bulletproof clothing that’s the same weight as jeans. Most level 4 body armor is either ceramic plates or steel plates. For level 2 (9mm), there are vests as light as 5 lbs. Midweight jeans typically weigh around 1 lb and would likely weigh 10+ lbs with the lightest level 2 available and be much thicker and stiffer.

Along with the invention of new tech, the price point seems at least an order of magnitude off for the foreseeable future.