QuentinM's comments

Sounds about right. And not the first time it has happened either. I recall getting a few of those instant unit 3 panics over the past few years with Ubuntu, often with things not as common out there in production, like tc (which in our case we were using in production to work around conntrack race conditions). Sometimes we also got non-panicking but absolutely nerve-wracking production issues, like TCP window size calculation overflows after the window went to zero due to a temporarily slow consumer - freezing the window size to a few bytes only, instead of a prompt full-window recovery.

Not to mention we’ve also had our fair share of production triple faults from bugs in the Intel firmware patches for Spectre, which took weeks to investigate & fix between ourselves struggling to keep our exchange up & running, Intel, and AWS.

And that is why there’s value in the CoreOS/ContainerLinux-like solutions we designed & implemented nearly a decade ago now. Being able to promptly roll back any kernel/system/package upgrade at once - either manually, or automatically after a few panics have been detected in quick succession - is actually quite awesome. Not to mention the slow update rollout strategy baked into the Omaha controller.

But the reality is that the what-ifs are always the hardest to market: nearly always afterthoughts, with fast-spiking/fast-decaying traction after major events.


It really seems like there’s no good non-Red Hat (but still “production capable”) alternative to CoreOS nowadays, right? It’s pretty much Fedora / Red Hat CoreOS, or go directly to things such as k3os?


The rancher stack is pretty amazing.

Elemental is pretty close to coreos: https://github.com/rancher/elemental/

They even have a way to build arbitrary os images: https://github.com/rancher/elemental-toolkit

It's pretty great



k3os is in a dying limbo; now is the time to get some interest in using stuff like it.


Proud to see my hometown quoted in the article.

Back in 2014, my girlfriend at the time, who was Chinese, suffered from systemic lupus erythematosus (SLE). She was first diagnosed when she was 12, then prescribed by a Chinese hospital a mixture of drugs to take every day without interruption - which appeared to have rendered the disease dormant. Fast forward to her 24th birthday and the disease resurfaced dramatically. We promptly went to the University of Tours' hospital and had the opportunity to meet with some of the finest rare-disease professors.. only to be told that her continued use of the drugs she had been prescribed back then should actually have been progressively reduced/stopped as the disease came under control. Instead, the continuous prescriptions significantly damaged her kidneys, made her sick thyroid dependent on and accustomed to the drugs, and allowed the disease to come back in full force, in a much more aggressive fashion than before - one that would now be hard to combat. By her 25th birthday - and while we had been able to share unique moments of happiness that year during remission phases - flares would make sitting up in her bed and buttoning her shirt nearly impossible due to the constant joint pain/stiffness, fatigue, etc., and her immune system would often turn against her own organs.. eventually pivoting to her brain, causing severe confusion and depression. Shortly thereafter, she took her own life to put an end to her suffering.

It deeply pains me to witness the US & Chinese health care systems, such tremendous powers with staggering economies, being nowhere near the quality of the free healthcare provided by France - to French citizens and temporary residents alike.


Agreed, that definitely sounds fishy as hell.

Organizations where the big $ is in intellectual property and R&D wouldn’t be dumb enough to give the actual schematics to some random intern building an app. There are serious security controls and mNDAs around that data, for all internal parties, external parties and suppliers. Think about it: they must collaborate with dozens if not hundreds of parties to get to a finished product, and some amount of information must circulate to achieve the purpose of those relationships. If it were to leak, it’d leak to the competitors first, and it’d be a disaster for the entirety of the upcoming year(s), as the improvements tied to their new designs would now be considered available on all cars. It doesn’t take two dummies to think about that, soooo.. to avoid losing hundreds of millions, what do you do as an organization? Data classification, risk assessments, tight security controls, procedures, and finally mNDAs for the instances when data does have to be shared.

Furthermore, as was pointed out on Reddit, the model appears to be made of parts that are valid for 2020 regulations and some parts for 2021.. in other words, it’s just scrap.


Are we now to be held responsible for our policy makers' incompetence? The government should work for its people, not the other way around. Had they considered the virus real, we wouldn't have had to seclude ourselves and stop the economy for over a year.

Taiwan, mainland China and Korea have all been living a perfectly normal life for the last 8 months. I mention those countries because I've actually spent a good amount of time there while the US government was shitting itself.


Title is sensationalist, while the news really ain't a big deal IMHO. Adversarial attacks against neural networks, or any other heuristics for that matter, are widely recognized in the industry.

Frameworks/papers aiming at benchmarking and fine tuning networks against those attacks even exist (e.g. github.com/QData/TextAttack in the case of NLP).

The tools will simply be adapted, either to a/ identify the statistical anomalies/artifacts introduced by the manipulation process, or b/ ignore the manipulation through further training or further data normalization.
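As a rough illustration of option b/, here's a minimal adversarial-training sketch in PyTorch - assuming a generic image classifier, with the model and data loader as hypothetical placeholders, and FGSM standing in for whatever perturbation the attacker actually uses:

    # Option b/: harden the network by also training on perturbed inputs.
    import torch
    import torch.nn.functional as F

    def fgsm_perturb(model, x, y, eps=0.03):
        # Fast gradient sign method: nudge each pixel in the direction that
        # most increases the loss, clamped to the valid pixel range.
        x = x.clone().detach().requires_grad_(True)
        F.cross_entropy(model(x), y).backward()
        return (x + eps * x.grad.sign()).clamp(0, 1).detach()

    def adversarial_training_epoch(model, loader, optimizer, eps=0.03):
        model.train()
        for x, y in loader:
            x_adv = fgsm_perturb(model, x, y, eps)
            optimizer.zero_grad()
            # Mix clean and perturbed samples so the network learns to
            # ignore the manipulation rather than be fooled by it.
            loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
            loss.backward()
            optimizer.step()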


Head of DevOps at a major financial exchange where latency & resiliency is at the heart of our business, and yes, we pay Cloudflare millions. I see two things here:

# Just be ready

Most definitely not the first time Cloudflare has had trouble, and just like any other system, it will fail eventually. If you're complaining about the outage, ask yourself: why were you not prepared for this eventuality?

Spread your name servers, and use short-TTL weighted CNAMEs, defaulting to, say, 99% Cloudflare / 1% your internal load balancer. The minute Cloudflare seems problematic, make it 0% / 100% to bypass Cloudflare’s infrastructure completely. This should be tested periodically to ensure that your backends are able to scale & take the load without shedding due to the lack of a CDN.
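For a concrete (if hedged) sketch of that flip: weighted records are a Route 53 feature, so assuming Route 53 is one of the spread name servers, the 99/1 to 0/100 switch could look roughly like this with boto3 - zone ID, hostnames and weights are made-up placeholders:

    # Hedged sketch: flip weighted CNAMEs between the Cloudflare-fronted name and
    # the internal load balancer. Zone ID and hostnames are hypothetical.
    import boto3

    route53 = boto3.client("route53")

    def set_weights(cloudflare_weight, origin_weight):
        def record(set_id, target, weight):
            return {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",
                    "Type": "CNAME",
                    "SetIdentifier": set_id,
                    "Weight": weight,
                    "TTL": 60,  # short TTL so the flip takes effect quickly
                    "ResourceRecords": [{"Value": target}],
                },
            }
        route53.change_resource_record_sets(
            HostedZoneId="Z0EXAMPLE",
            ChangeBatch={"Changes": [
                record("via-cloudflare", "www.example.com.cdn.cloudflare.net.", cloudflare_weight),
                record("direct-origin", "origin-lb.internal.example.com.", origin_weight),
            ]},
        )

    set_weights(99, 1)     # normal state
    # set_weights(0, 100)  # emergency: bypass Cloudflare entirely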

# Management practices

Cloudflare's core business is networking. It actually embarrasses me to see that Cloudflare YOLO'd a BGP change in a Juniper terminal, without peer review and without a proper administration dashboard exposing safe(guarded) operations, a simulation engine and co. In particular, re-routing traffic / bypassing POPs must be a frequent task at scale - how can that not be automated so as to avoid human mistakes?

If you look at the power rails of serious data centers out there, you will quickly notice that those systems, although built 3x so that they remain redundant during maintenance periods, are heavily safeguarded and automated. While technicians often have to replace power elements, maintenance access is highly restricted, with unsafe functions tiered behind physical restrictions. An example of a commonly safeguarded function is the automatic denial of an input command that would shift electrical load onto lines beyond their designed capacity - which could happen by mistake if the technician made a bad assumption (e.g. a load-sharing line is up while it's down) or if the assumption became violated since the last check (e.g. the load-sharing line was up when checked, but went down at a later time - milliseconds before the input, even).
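As a toy sketch of that last kind of guard (names, capacities and the live-state reader are all hypothetical), the point is simply that the assumption is re-checked at execution time and the command is refused otherwise:

    # Toy guard: re-validate the live state right before acting, and deny any
    # shift that would push a line beyond its rated capacity.
    RATED_CAPACITY_KW = {"line-a": 800, "line-b": 800, "line-c": 800}

    def shift_load(target, amount_kw, read_live_line_load):
        live_kw = read_live_line_load(target)   # hypothetical telemetry read
        if live_kw is None:
            raise RuntimeError(f"refusing shift: {target} appears to be down")
        if live_kw + amount_kw > RATED_CAPACITY_KW[target]:
            raise RuntimeError(
                f"refusing shift: {target} would carry {live_kw + amount_kw:.0f} kW, "
                f"above its {RATED_CAPACITY_KW[target]} kW rating"
            )
        # ... only now issue the actual switching command ...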


>>> Spread your name servers, and use short-TTL weighted CNAMEs, defaulting to say, 99% Cloudflare, 1% your internal load balancer.

Which can't be done, because it invalidates the point of using CloudFlare!

CloudFlare is used to protect your site from DDoS attacks and ransoms. It has to hide the IPs of the servers otherwise attackers will DDoS the servers directly, bypassing CloudFlare.


Then you use > 1 CDN and switch traffic away from the faulty one. Also, if you serve a very large amount of data (as in many different objects - say, tens of thousands of different images), 1% of traffic is not enough to keep the other CDN's caches warm.

I know of at least one site that works this way which has allowed them to weather CDN outages.


This isn’t an uncommon setup for large corporations.


You can use cloudfront or another cloud WAF service as your alt DNS. I think Akamai has a solid one if you can afford it.


You can also use CloudFlare for caching, or as a WAF, which you might not care about during a relatively short CloudFlare outage.


> and use short-TTL weighted CNAMEs, defaulting to say, 99% Cloudflare, 1% your internal load balancer. The minute Cloudflare seems problematic, make it 0% 100% to bypass Cloudflare’s infrastructure completely.

Except if you're using CF for DNS service, this wouldn't have worked, as both CF's website & DNS servers were impacted by the outage.


That can't be possible, CF's website explicitly says that their DNS is "always available" with "unparalleled redundancy and 100% uptime"! ;)

In all seriousness, I wonder if they are going to have to change the marketing on the site now...

https://www.cloudflare.com/dns/


Somewhere, someone is saying the marketing is fine since 100 has just 1 significant figure and therefore two nines -- no, one and a half nines -- can safely be written as 100.


If they are truthful it will now probably change to 99.997% uptime or similar. I expect that's still good compared to many DNS providers.


Some DNS providers only have one name server, which is naturally bad.

However, it is a bit bad of CF that a single configuration error can bring all the slave servers down. It means that they have no redundancy in terms of BGP mistakes. Customers of CF who want to avoid this would benefit from adding an additional slave server outside the hands of CF.

Zonemaster (a DNS sanity-checking tool) actually complains about CF-hosted domain names because of the lack of AS redundancy. The outage yesterday demonstrated nicely why that is a concern and why one should care. https://zonemaster.iis.se/?resultid=7d1fab165987e195
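For anyone curious, a quick way to eyeball this for your own zone (a hedged sketch assuming dnspython 2.x; mapping each address to its AS would need an extra whois-style lookup, left out here):

    # List a domain's name servers and their addresses; if they all sit behind the
    # same provider/AS, a single configuration or BGP mistake can take them all down.
    import dns.resolver

    def list_nameservers(domain):
        for ns in dns.resolver.resolve(domain, "NS"):
            host = str(ns.target)
            addrs = [str(a) for a in dns.resolver.resolve(host, "A")]
            print(f"{host:40s} {', '.join(addrs)}")

    list_nameservers("example.com")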


Same goes for route53 too unfortunately


Yeap, that’s specifically what I also implicitly meant by "Spread your name servers" (besides having them distributed). To use this technique, you also must have a "Business" account with Cloudflare ($200/mo), so as to leverage their ability to front your websites using CNAMEs : )


Ah, I see.

This got me Googling, and best as I can tell, CF doesn't support zone transfers. (They support being a client, but not a server. So, they could function as one's secondary system, but not as the primary.)


I imagine/hope a lot of the heavily engineered ops teams don’t use zone transfers, instead using APIs. The idea of primary and secondary wouldn’t really be a strong indicator at that scale.


Sure, I could use the CF APIs… but to do what? AFAICT with a quick look over the documentation, there isn't any way to tail a log of changes being made to the zone. (You can export the entire thing in BIND format, though the example in the docs has several errors in it that make me wonder how well that would work.) (The idea with zone xfers is that it is at least semi-standardized, whereas CF's API, while useful, is not.)

Then I'm stuck with a bunch of bad questions about how often to poll, and whether CF's rate limits would support any reasonably quick poll interval.

(The big problem is that we have other tooling that relies on being able to update DNS, the big one being ACME for certificate renewal. The changes it makes to CF would need to be rapidly replicated out to the nameserver.)

(Nothing in the post really strikes me as particular to CF, either. I think I could easily replace everything I've said here with "Route 53" and end up in the same bucket, maybe plus or minus zone xfers working.)
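For what it's worth, the polling approach could look roughly like this - a sketch against Cloudflare's BIND-format export endpoint (GET /zones/:zone_id/dns_records/export); the zone ID, token and the push-to-secondary step are placeholders, and the poll interval is still bounded by whatever the rate limits allow:

    # Poll the zone export and push it to an out-of-band secondary when it changes.
    import time
    import requests

    API = "https://api.cloudflare.com/client/v4"
    ZONE_ID = "0123456789abcdef"                   # hypothetical
    HEADERS = {"Authorization": "Bearer <token>"}  # hypothetical

    def push_to_secondary(zone_file):
        ...  # stub: write the file and reload whatever runs the other name server

    def poll_zone(interval_s=60):
        last = None
        while True:
            resp = requests.get(f"{API}/zones/{ZONE_ID}/dns_records/export", headers=HEADERS)
            resp.raise_for_status()
            if resp.text != last:
                last = resp.text
                push_to_secondary(resp.text)
            time.sleep(interval_s)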


Using CF for DNS is, IMHO, a bad idea in general, especially for large sites.

We use AWS + Azure + GCP (yes, all 3) as our authoritative NS and keep them all in sync with octodns.


Ah ha, This probably explains why my internet stopped working for a bit. I had the 1.1.1.1 app from Cloudflare installed on my phone.


> It actually embarrasses me to see that Cloudflare YOLO'd a BGP change in a Cisco terminal

The output is from a Juniper router, not Cisco.


Thank you! Now who's embarrassed? ;-)


99%/1% and most failover setups hardly work smoothly in practice unless you have a lot of money to invest in teams and hardware, do DR drills constantly, and keep standby infrastructure ready to handle the full load. It may work in your industry, where the infra cost is trivial compared to the risk and the money being made. In typical SaaS apps, infra is an enormous part of the costs, and keeping a standby ready is not feasible at all.

It is also that, typically, even in large organizations with the money and the people, fire drills and DR drills go the same way: it is known there is going to be a drill and people react accordingly. Chaos Monkey-style testing/drills rarely happen.

I would say building resiliency into your architecture is the key. Just like having a single customer account for > 50% of revenue is an enormous risk for any business, relying on any single service provider is an enormous risk too. In manufacturing it is common to insist on a second source for a part; IBM did that to Intel for the PC, which is why AMD got into x86.

In this case a proper HA setup would serve better - a minimum of 2 CDN networks always sharing 50% of the load each, with the capacity to double if required. If they cannot scale that much, then distribute across 3-4 and keep traffic at no more than 25-35% per provider, such that the loss of one means only an additional 10%-20% of traffic for the rest.

Also, it is important that the two service providers be actually independent; if they both depend on the same single ISP or backbone to serve an area, the setup is not going to be effective.

The principle should apply across the entire infra: name servers, CDNs, load balancers, storage, compute, DBs, payment gateways and registrars (use multiple domains, example.com and example.io, each with its own registrar).
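To put quick numbers on the 25-35% per-provider figure above - when one of N equally weighted CDNs fails, its share is spread across the survivors:

    def surviving_share(n):
        per_provider = 1.0 / n
        return per_provider + per_provider / (n - 1)

    for n in (2, 3, 4):
        print(f"{n} providers: {1/n:.0%} each -> {surviving_share(n):.0%} each after losing one")
    # 2 providers: 50% each -> 100% each after losing one
    # 3 providers: 33% each -> 50% each after losing one
    # 4 providers: 25% each -> 33% each after losing one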


> The minute Cloudflare seems problematic, make it 0% 100% to bypass Cloudflare’s infrastructure completely. This should be tested periodically to ensure that your backends are able to scale & take the load without shedding due to the lack of CDN.

How do you justify the cost? I'm seriously asking - I have had a hard time making this pitch myself, I am curious if you have (recent!) experience with this.


Ah, that one is surprisingly easy: you justify the cost by facing facts. Did your company lose money during the downtime, and if yes, is that sum more than what it would cost to have this redundancy?

No? Then the costs are not justified, and while it would be better from a tech perspective, it makes no business sense.

Yes? Well then you spend X to save Y, with Y being greater than X, so it's an easy sell as long as you don't start with "cloudflare is never down" (which is not true).


I always assume a service (eg Cloudflare, AWS availability zone) will be completely down for a minimum of 30 minutes, once a year.

It’s worked surprisingly well.


Yeah, that's roughly 99.99% availability, which sounds reasonable for most anything you want to depend on.
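The quick arithmetic behind that:

    # "Down for 30 minutes, once a year" as an availability figure.
    MINUTES_PER_YEAR = 365 * 24 * 60       # 525,600
    availability = 1 - 30 / MINUTES_PER_YEAR
    print(f"{availability:.4%}")           # 99.9943%, i.e. roughly four nines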


Not somebody who'd need to make this decision but: I guess it'd depend on the cost to your business of a half hour outage.


> Spread your name servers, and use short-TTL weighted CNAMEs, defaulting to say, 99% Cloudflare, 1% your internal load balancer. The minute Cloudflare seems problematic, make it 0% 100% to bypass Cloudflare’s infrastructure completely. This should be tested periodically to ensure that your backends are able to scale & take the load without shedding due to the lack of CDN.

If your service does scale in the first place, then you don't need Cloudflare most of the time.


faeyanpiraat's point, but also, despite this failure, let's not dismiss the fact that Cloudflare brings unique (i.e. difficult to replicate) features (hence their success): a/ the ability to identify threats at a global scale using a massive aggregation of data, and b/ the ability to stop malicious actors close to their sources thanks to their large grid of POPs & their use of the anycast routing model.

Sure, anyone can scale my localized infrastructure for the traffic of 100,000 IP Cameras. Can anyone do it for 10,000,000 pwnd devices? Sure, but it'll likely start not being so practical without multiple POPs. Do I want to hire a dozen network & threat detection engineers to build/maintain that, complicate my processes, and pay for the infrastructure moving forward for a once-in-a-year event? Not really, no.

The way I see it, Cloudflare acts just like an insurance policy. Pay for a fraction of the actual cost, get your back covered, and profit from the expertise when it hits the fan.


I used to run a cryptocurrency website. It would get 50-100gbit+ DDoS attacks on a daily basis. This was a number of years ago.

DDoS mitigation providers wanted absolutely absurd amounts. Cloudflare took me on for $200 a month (I had confirmed beforehand), and mitigated all the attacks. All tickets were responded to within minutes by network engineers working to mitigate the attack.


Making something scale and making it scale cost-efficiently are two different things.


I would have assumed CF had a simulation of their entire network (including their peers) where changes would be applied and vetted before rolling them out..


Networking in general is a far less sophisticated world than we might like to hope. You have to deal with quirks of vendor-specific firmware, creaky protocols, and so on, and the culture of networking has been a bit behind some other areas of software in embracing testing in the way you describe.

We'll get there, but it's no surprise CF isn't doing this today; it would put them waaaay ahead of the pack if they did.


Nothing stops you from replicating your backbone network using a bunch of vMX VMs and testing your changes on it.

It would not catch weird firmware quirks in the real hardware, but it definitely would've caught this fat-finger typo.


Well, the thing that stops you is the cost of designing, implementing, maintaining, and scaling the replica testbed. On a large network, that would be pretty hard to justify to most organizations, which would see it as very costly with a tough-to-measure upside.

Have you done this before? I'd be interested to hear how those conversations went.


For an organization like CF? Yes, I would expect them to have testing and network simulations down to an art.

If I had to guess, I'd say it's because network engineers simply don't need / get this know-how at normal scale. Most SW developers, on the other hand, are not very good at networking (not good enough for CF scale). Which leads to the networking guys doing their thing the way it was always done... (hope I didn't offend anyone, just guessing)

I hope they strengthen their dev department... I know I'd love a challenge like that. :)


This is understandable for most organisations but not networking centric businesses like Cloudflare.


I have to wonder if it would have. Unless you have some kind of route visibility collecting tool, or a bunch of simulated traffic sufficient to pop the CPU on the vMX that represented atl01, it would all appear to work. I wonder if you could generate traffic and scrape SNMP counters as a proxy?

Or some kind of tool that processes the resultant routing tables to generate some kind of "route usage" for every given link and device, maybe even fed with a table of expected traffic to given destinations.
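Scraping the counters as suggested could be as simple as sampling ifHCInOctets on the simulated routers before and after a traffic replay and diffing - a hedged sketch with pysnmp's high-level API, with host, community and ifIndex as placeholders:

    from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, getCmd)

    def if_in_octets(host, if_index, community="public"):
        error_indication, error_status, _, var_binds = next(getCmd(
            SnmpEngine(),
            CommunityData(community, mpModel=1),   # SNMPv2c
            UdpTransportTarget((host, 161)),
            ContextData(),
            ObjectType(ObjectIdentity("IF-MIB", "ifHCInOctets", if_index)),
        ))
        if error_indication or error_status:
            raise RuntimeError(f"SNMP query failed: {error_indication or error_status}")
        return int(var_binds[0][1])

    # before = if_in_octets("vmx-atl01.lab", 5)
    # ... replay traffic against the simulated topology ...
    # after = if_in_octets("vmx-atl01.lab", 5)
    # print(f"link carried {after - before} bytes during the replay")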


I kind of would? If you’re running a private backbone with this number of PoPs, wouldn’t things be more sophisticated?


That's a nice theory, but the majority of power disruptions I've ever faced in data centers came from planned work on UPSs that went bad. If you want to quickly lower the reliability of a system, put it on a UPS.


>> we pay Cloudflare millions

>> just like any other system: it will fail eventually

An analogy combining these two points: one could pay a million dollars AN HOUR to the top software engineers alive on the entire planet… and at some point you will encounter failure. Technology and humans are both fallible, end of story. This is why SLAs exist with specific uptime targets to meet, and reimbursements should that SLA be broken. Anyone who believes the rare outage is unacceptable: fine; bring that layer in-house and pay engineers millions a year to do the best they can. You'll still encounter failures–and likely more of them.

The usual cry you'll hear from some business is "we lost $x million during the downtime!". Yes, and without some company like Cloudflare in front of your business, you'd probably be losing $x million multiplied by orders of magnitude you don't even want to imagine.

"You can't have your cake and eat it too."


> This should be tested periodically to ensure that your backends are able to scale & take the load without shedding due to the lack of CDN.

Are you thinking of a cloud-computing context here? Seems to me a lot hinges on this, but perhaps I'm misunderstanding you.

If so, this would answer the scale question, and would presumably translate into increased prices until the incident is over. (I'm assuming CloudFlare offer a cheaper solution than doing it yourself on a cloud.)

If not, and you own the physical capacity yourself, wouldn't you do away with CloudFlare entirely?


> If not, and you own the physical capacity yourself, wouldn't you do away with CloudFlare entirely?

Cost could be an issue. We had something similar (not in the same context) in a company I worked for before. We could shift traffic, but that would cost 2-3x more, so it was not the preferred path unless we had problems.

It surprises me that many (big) companies have not learned the lesson already. We already had a similar thing happen years ago with Dyn in 2016 (https://en.wikipedia.org/wiki/2016_Dyn_cyberattack), and it was surprising how many companies relied on a single DNS provider.


Interesting, thanks. Didn't expect a CDN to win out on price against in-house capacity.

Presumably this is a function of scale? At a certain point it's going to be worth running your own CDN.


Running your own CDN that is competitive with Cloudflare and the other top CDNs requires dozens to hundreds of edge servers distributed around the world, close to your customers. This is very expensive, and while it does make sense for the very largest companies, almost everyone else is going to do better paying for a piece of large-scale shared infrastructure.


> # Management practices

>

> Cloudflare's core business is networking. It actually embarrasses me to see that Cloudflare YOLO'd a BGP change in a Juniper terminal without peer reviews and/or without a proper administration dashboard, exposing safe(guarded) operations, a simulation engine and co.? In particular, re-routing traffic / bypassing POPs must be a frequent task at scale, how can that not be automated so to avoid human mistakes?

We don't know if this was entirely the case; based on the timeline for the initial incident that prompted the change gone awry, there very well could have been an ITIL-style CR created and processed within this time.

Judging by the edits made, this wasn't just simply taking a POP out of service entirely, but reducing the amount of (or eliminating all of the) traffic from neighboring POPs sent to compute at the ATL location. I can't imagine that this exact type of change is all that common. BGP anycast actually makes things significantly more complicated when removing edges.

As far as the mechanics go, with junos's CLI, there's not a lot of difference between what the intended command would have been, and the one that actually happened.

---

What they probably wanted

| example@MX1> configure
|
| {master}[edit]
| example@MX1# edit policy-options policy-statement 6-BBONE-OUT
|
| {master}[edit policy-options policy-statement 6-BBONE-OUT]
| example@MX1# deactivate term 6-SITE-LOCAL
|
| {master}[edit policy-options policy-statement 6-BBONE-OUT]
| example@MX1# commit

---

What might have happened

| example@MX1> configure
|
| {master}[edit]
| example@MX1# edit policy-options policy-statement 6-BBONE-OUT
|
| {master}[edit policy-options policy-statement 6-BBONE-OUT]
| example@MX1# deactivate term 6-SITE-LOCAL from prefix-list 6-SITE-LOCAL
|
| {master}[edit policy-options policy-statement 6-BBONE-OUT]
| example@MX1# commit

---

Initially, this seems like quite a bit of difference; however, Junos has a hyperactive autocomplete that triggers on spaces, so that deactivate could have been as short as "dea ter 6 fr p 6".

I'm not aware of any routing simulation product that is able to simulate complex BGP interactions and report on the effective routes of simulated traffic, as well as on CPU load predictions. The closest I am aware of is running GNS3 (or a bunch of VM routers) overnight and capturing SNMP.

On the other hand, automating these kinds of changes would seem trivial. Such a service would have to be as fault tolerant as any other project, but it is most certainly a worthwhile endeavor, especially since integration is actually relatively easy: Junos provides some nice REST and XML APIs on the management interface that can do pretty much everything the CLI can, except start a shell.
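A hedged sketch of what that automation might look like with Juniper's PyEZ library (junos-eznc) over the XML API - host, user and the policy names (borrowed from the example above) are placeholders, and it assumes set-format load accepts a deactivate line; the useful bits are the reviewable diff, the commit check, and the confirmed commit as a safety net:

    from jnpr.junos import Device
    from jnpr.junos.utils.config import Config

    CHANGE = "deactivate policy-options policy-statement 6-BBONE-OUT term 6-SITE-LOCAL"

    with Device(host="mx1.example.net", user="neteng") as dev:
        with Config(dev, mode="exclusive") as cu:
            cu.load(CHANGE, format="set")
            print(cu.diff())             # candidate diff, e.g. attached to a CR for peer review
            if not cu.commit_check():    # let the router validate the candidate first
                raise SystemExit("commit check failed, aborting")
            # Confirmed commit: rolls back automatically unless re-confirmed within
            # 5 minutes, bounding the blast radius of a fat-fingered change.
            cu.commit(comment="drain 6-SITE-LOCAL from backbone export", confirm=5)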


Thanks for the detail. There are a lot of people in here who are saying "why didn't they just test their changes before applying them?" and I don't think they really understand how hard that is and how rarely it's done.


Peer review should always be possible; perhaps CF already does it and this slipped through review. Reviews only reduce errors, they do not eliminate them.

It is difficult to write automation to cover all the tasks you would do; even if you cover the ones most commonly done, you will carry higher risk on the rest.

A linter, or a well-tested higher-level instruction set, may perhaps be a better solution. Automation, if any, should perhaps come after that?


I think the mistake could be assuming that an empty "from" statement would not match any routes, while in reality deactivating everything inside the "from" statement removes it altogether and makes the term match all routes - which is indeed somewhat unexpected.


> Cloudflare's core business is networking. It actually embarrasses me to see that Cloudflare YOLO'd a BGP change in a Juniper terminal without peer reviews and/or without a proper administration dashboard, exposing safe(guarded) operations, a simulation engine and co.? In particular, re-routing traffic / bypassing POPs must be a frequent task at scale, how can that not be automated so to avoid human mistakes?

Nailed it.


Q is one of the most important languages here at BitMEX. Complicated or not, it's on KDB that all our trades happen - up to $8B in 24h volume last summer. It's highly efficient at its task, and has never made us, or any of our customers, lose a single satoshi.


With all due respect, requests to BitMEX during times of high load have >50% probability of hitting a 503 Service Unavailable error. Meanwhile, competitors with similar volume/activity don't have this problem.


They've done studies, you know. They say 60% of the time, it works every time.


If only this would support PRs / code review with revisions ... !


We're working on something that I think is really good for PRs and code review. It's part way done, and already I think it's a big improvement in usability over the web based interface on github.com. But, in the interest of shipping Ship, we had to cut it from this 2.0 release.

(btw, I'm James, I wrote the client side stuff in Ship)


I have never heard of that. Could you please try again? If you can repro, open an issue and I'll be glad to fix it. Clair now tells you the layer in which a vulnerability has been detected, the package name and version that caused it, and how it can be fixed. It should definitely help.
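If it helps, pulling that report out programmatically looks roughly like this - assuming the Clair v1 API of that era (GET /v1/layers/<name>?features&vulnerabilities); the Clair URL and layer name are placeholders, and field names may differ slightly between versions:

    import requests

    CLAIR = "http://clair.example.internal:6060"  # hypothetical

    def report(layer_name):
        resp = requests.get(f"{CLAIR}/v1/layers/{layer_name}?features&vulnerabilities")
        resp.raise_for_status()
        for feature in resp.json()["Layer"].get("Features", []):
            for vuln in feature.get("Vulnerabilities", []):
                fixed = vuln.get("FixedBy", "no fix yet")
                print(f'{feature["Name"]} {feature["Version"]}: {vuln["Name"]} (fixed by: {fixed})')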

