The architecture of Uber’s API gateway (uber.com)
148 points by zerop on May 22, 2021 | 43 comments



I think you will find that Uber engineering is just like any other place - a lot of silly mistakes while people learn on the company dime.

This entry, for example, mentions how they avoided goroutines as a "performance concern" without any data to back that up. It's bush league to think you can do it better yourself.

https://eng.uber.com/go-geofence-highest-query-per-second-se...


That post you linked is the one that got heavily criticised by a Bing Maps engineer for being under-engineered[1].

1. https://medium.com/@buckhx/unwinding-uber-s-most-efficient-s...


While the linked post may be right, I completely agree with the Uber engineers that it sounds too complicated. I can understand the double polygon search without thinking about it.

I also understand the double polygon search with bounding box optimization, so I’m not sure why that wasn’t used.
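For reference, a minimal sketch of that optimization in Go (my own hypothetical code, not from either post): precompute each polygon's axis-aligned bounding box and only run the full point-in-polygon test when the cheap box check passes.

    package geofence

    // Point is a longitude/latitude pair.
    type Point struct{ X, Y float64 }

    // Fence is a polygon plus its precomputed axis-aligned bounding box.
    type Fence struct {
        Poly                   []Point
        MinX, MinY, MaxX, MaxY float64
    }

    // Contains rejects points outside the bounding box first (four cheap
    // comparisons), then runs the standard ray-casting point-in-polygon
    // test only on candidates that survive.
    func (f *Fence) Contains(p Point) bool {
        if p.X < f.MinX || p.X > f.MaxX || p.Y < f.MinY || p.Y > f.MaxY {
            return false
        }
        inside := false
        for i, j := 0, len(f.Poly)-1; i < len(f.Poly); j, i = i, i+1 {
            a, b := f.Poly[i], f.Poly[j]
            if (a.Y > p.Y) != (b.Y > p.Y) &&
                p.X < (b.X-a.X)*(p.Y-a.Y)/(b.Y-a.Y)+a.X {
                inside = !inside
            }
        }
        return inside
    }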


If that’s the case, it’s weird that they would take the time to write a blog post about how great their algorithm is while effectively saying that they didn’t understand the other approaches and that they were too hard.


This is probably harsh, but the description of their API gateway sounds like the description of CORBA from the late 90s. If you're just getting started and need something simple but powerful, go with OpenResty: IMO it gives you all the benefits and it's super fast and lightweight. You can get advanced with cookie sharing/signing, or just do simple logging to statsd. It's really good if you're starting out, to get operational experience running nginx, statsd, etc.


For anybody looking at OpenResty, it's also worthwhile to take a look at Kong, which is the largest OpenResty-based application and already has the right abstractions in place for API management: https://github.com/Kong/kong


The people defining CORBA faced many of the same issues we face with microservices today.


After working with Distributed COM, I'd go as far as to say that people building DCOM software were doing microservices, way before that term was known or popular - microservices at much finer granularity (per object), and all the bells & whistles people build startups off of today - like load balancing, service discovery, strong auth, etc. - were already built into the platform and properly integrated.

Alas, it was ahead of its time; the legacy of C makes it hard to work with.


Eh, but stateful and synchronous and chatty. I don't think DCOM has too many good ideas, beyond what it shares with CORBA - mostly, an IDL. It's good to have an IDL if you're going to be serious about heterogeneous implementations.


CORBA never scaled to the level that Uber needs to. Pre-pandemic, you’re talking hundreds of thousands to millions of requests per second globally. Also you never saw a CORBA project with thousands of engineers checking in code multiple times a day. All these are factors too.

People on HN always think “all it’s doing is matching a rider with a driver, why is it so complicated? I can write it in a weekend!” Sure, but it won’t scale.


I doubt they have millions of driver-to-rider matching requests per second. A lot of people use Uber, but they don't need a ride every second.


At peak I can see it being low single digit millions per second. A single client will do more than 1 rps; there's a lot going on per client.


This GUI must have pretty advanced versioning and rollback support, otherwise I can see one user borking the whole API with a bad change and nowhere to check what happened.


It did not. The deployment system at Uber was a f*ing nightmare.

Things would fail, rollback, and then the logs would have their errors truncated or something. I wasted so many days deploying botched releases from coworkers.

And the use of Phabricator at Uber was a nightmare. LLVM uses it correctly; IDK what Uber did, but it was a PITA to do much of anything.

Pair that with the siloed off teams where "every team is its own startup" mentality and you have constant fighting, power grabs, being blocked all the time, etc.


Look at it positively. At least the rollback is working. Our version of rollback is to redeploy the old version and hope that the database migrations will work with the old one.


That was absolutely a problem at Uber, too.


Former Uber employee who worked on this system here. It maps changes onto git changes (technically Phabricator diffs), and the system opens the diff for you.


Just to add, the changes are just config changes at this point and not code changes.

On one user taking down all APIs: the system has the ability to skip unmountable APIs, and we catch bad changes during user interaction with tons of tests and validations.


You only change one endpoint at a time. Fixes are rollforward since you would be reverting changes to other endpoints if you rolled back.


Enjoying the article so far, but some things make it a bit hard to read: 1) code snippets are screenshots, and 2) some links point to an internal Google Docs page.


oops.. thanks for sharing this. we will get that corrected.


It's a decent overview, but you could easily do a whole blog series on each category of the gateway that they touch on. Is there a deeper dive for each part? Just the AuthNZ could get very complicated depending on requirements (they have multiple implementations, which causes a mini crisis of what to rely on for what, and how).


I worked at Uber for a few years.

This was a terrible layer. Everyone hated it when I was there. People ended up building logic directly into the API gateway because it was so difficult to use.

I am so glad to never have to look at RTAPI again.


This article is about the newer Edge Gateway and doesn't mention anything about RTAPI. When did you leave Uber?


RTAPI is the internal name. Why would it be mentioned here?

I left too late. The engineering in that company was abysmal.


RTAPI is mentioned in the previous blog post [1]; it sounds like the new edge gateway supersedes it.

[1] https://eng.uber.com/gatewayuberapi/


Nice find. I stand corrected.


One of the authors here, happy to answer questions.


Did you consider using existing API Gateway solutions, either OSS or commercial? Why did you decide to build your own?

Given that Uber is an engineering heavy and tech-centric organization, why did you choose to do configuration through the UI? Why not configuration and infrastructure as code?


The issues that you had with Go: "Language naming conventions like ID, HTTP, and reserved keywords in Go (but not in Thrift) created failures that exposed the internal implementation details to the end users."

How did you go about solving those?

Did y'all work with the Go team to resolve the other issues you stumbled upon?


Since the final artifact was generated, we were able to work around it by annotating the Thrift with alternate field names.
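To illustrate the problem (hypothetical names; not the exact annotation syntax): a Thrift field spelled id comes out as ID under Go naming conventions, so the external wire name has to be pinned explicitly or the Go-side rename leaks to clients. Conceptually the generated code ends up looking like:

    // Hypothetical illustration: Thrift field "id" becomes Go's "ID",
    // while the tag pins the external field name seen by end users.
    type TripRequest struct {
        ID string `json:"id"`
    }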


Just make sure to not accidentally add an extra space in the Thrift annotation, or else it will globally bring down upfront pricing


If you guys could easily start over, would you have still picked Go given the problems you encountered? Or would Java have been a more likely candidate?


Why not just use OOP Scala? It's so much cleaner and more readable.


I tried starting an "OOP Scala" project. The entire Scala ecosystem, at least the parts that seem good and active, seems to be FP-oriented.

If you simply want a better Java, I’d try Kotlin. Or Java 15.


Were existing API management solutions like Mulesoft considered?

And while on the reuse topic, will Uber open source this?


Side question: anyone know of a gateway solution able to do blue/green deployments?


Envoy-based gateways such as Istio's ingress gateway or Ambassador use traffic shadowing to achieve this. Traefik also has shadowing.
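At its core (a generic sketch with hypothetical upstream names, not any particular gateway's API), blue/green cutover reduces to weighted selection between two upstream pools:

    package router

    import "math/rand"

    // pickUpstream is a minimal sketch of weighted blue/green routing.
    // greenWeight is the fraction of traffic sent to the new ("green")
    // deployment; ramp it from 0.0 to 1.0 to cut over, or back to 0.0
    // to roll back.
    func pickUpstream(greenWeight float64) string {
        if rand.Float64() < greenWeight {
            return "http://green.svc.internal"
        }
        return "http://blue.svc.internal"
    }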


[flagged]


This has nothing at all to do with the article.


I disagree. The article is about Uber's API. My comment is about how Uber's API clearly doesn't have robust support for requests that did not fully execute. Based on my experience, it appeared as though one system had processed my request to schedule the trip, but the rest of the system did not. This caused my account to essentially be soft-locked by a trip that was not visible through the app (and by extension their API). I then added my anecdotal experience with their lackluster support.


Which technologies were used to implement it? Framework? Language?


> At the time of development of the gateway, our language choices were Go and Java. Our previous generation was in Node.js. While that was a very suitable language for building an IO-heavy gateway layer, we decided to align with the languages supported by the language platform teams at Uber. Go provided significant performance improvements. The lack of generics resulted in a significant amount of generated code during build time to a point where we were hitting the limits of the Go linker. We had to turn off the symbol table and debug information during the binary compilation. Language naming conventions like ID, HTTP, and reserved keywords in Go (but not in Thrift) created failures that exposed the internal implementation details to the end users.
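(For reference, "turn off the symbol table and debug information" maps to the standard Go linker flags -s, which omits the symbol table, and -w, which omits DWARF debug info; presumably something along the lines of:)

    go build -ldflags="-s -w" ./...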


One of the authors of this blog post here. We had to do a lot of other scaling optimizations, like generating code only for the IDL elements used rather than one big fat IDL, replacing generated ser/deser code with dynamic code generation, etc. We will share details in a further blog post.



