> Thank you for your ongoing patience as our team continues to investigate the disruption impacting multiple Square Services. We remain committed to providing you with timely updates, and we'll have another update within the hour as we gather more information from our Engineers.
> Our engineering team are actively working to identify the issue. All hands are on deck, and we'll update you as soon as we have news. Thanks for your patience again!
Does anyone think this is the worst status page ever? I don't think this kind of corporate speak belongs on a status page - it's unprofessional. I don't want to hear anything about my patience or understanding.
Probably some slight detail about what is actually wrong, and where the engineers are in the process of fixing it. I don't think customers are looking for AWS-level incident detail or anything, but people would like to know what is happening at a very high level, rather than getting completely generic updates for 16 hours.
Genuine question - what is the purpose of that? Does hearing “we deployed a bad DNS configuration that caused cascading failures” change anything about the situation?
How can a software company with $17 billion in revenue be down for more than 8 hours? It's hard to understand how failover systems could not have been successfully spun up within a couple of hours of a total system meltdown; maybe someone else has insight into this. Square seems like it'd be an order of magnitude less complex to keep running and swap to a failover compared with, say, YouTube, Amazon.com, or other billion-dollar-grossing software products (I could be wrong).
No, of course Square is harder to fail-over than YouTube. There's no such thing as a lost monetary transaction on YouTube (a lost ad click at most).
From financial liability point of view, and also from PR point of view, it may be better to completely shut down and not accept payments than to take money from a buyer and fail to transfer it to the seller, or bill somebody twice, etc. Eventual consistency is OK, but final consistency should be guaranteed.
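To make the double-billing point concrete, here's a minimal sketch (in PHP, since the logs elsewhere in this thread come from the PHP SDK) of the idempotency-key pattern processors lean on for exactly this; Square's own payment API requires an idempotency key for the same reason. All table, column, and function names here are invented, and the `ON CONFLICT` clause assumes Postgres:

```php
<?php
// Illustrative sketch only, not Square's actual implementation. The point:
// a retried or replayed charge request must never bill the buyer twice,
// even if the first attempt's outcome is unknown because the processor
// went down mid-flight.

function chargeOnce(PDO $db, string $idempotencyKey, int $amountCents, callable $processorCharge): array
{
    // Record the attempt before talking to the processor. A UNIQUE
    // constraint on idempotency_key turns a duplicate request into a
    // no-op instead of a second charge.
    $insert = $db->prepare(
        'INSERT INTO charges (idempotency_key, amount_cents, status)
         VALUES (?, ?, ?)
         ON CONFLICT (idempotency_key) DO NOTHING'
    );
    $insert->execute([$idempotencyKey, $amountCents, 'pending']);

    if ($insert->rowCount() === 0) {
        // Seen this key before: return the stored outcome rather than
        // charging again. (Real code would also handle a stuck 'pending'.)
        $existing = $db->prepare('SELECT status FROM charges WHERE idempotency_key = ?');
        $existing->execute([$idempotencyKey]);
        return ['status' => $existing->fetchColumn(), 'duplicate' => true];
    }

    // Only now hit the processor, passing the same key so it can
    // deduplicate on its side too.
    $result = $processorCharge($idempotencyKey, $amountCents);

    $db->prepare('UPDATE charges SET status = ? WHERE idempotency_key = ?')
       ->execute([$result['status'], $idempotencyKey]);

    return $result;
}
```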
Because of that, it's harder to run stuff like chaos monkeys, DiRT-style exercises, etc., in an institution that handles relatively large and well-tracked transactions. Hence it's harder to stay prepared for various disaster scenarios.
The expense of building the (extra) failover ability for rare catastrophic events is much larger than the fallout from one such event. It can only pay for itself over a long time, and would be built (maybe it is already being built) over a long time.
Facebook had that 12+ hour outage last year. Meltdowns like this often lead to cascading failures followed by queues so full of stuff you immediately die when trying to come back.
There’s bound to be lots of very stressed engineers at Square right now
Also, the incident history (https://www.issquareup.com/history) is wild in my opinion. During August there were 31 incidents, some of which lasted for hours.
Management that refuses to fund or mandate any common infrastructure.
Jack likes running his companies like "an economy of startups", but in practice this means nobody is allowed to do things without fast returns. Teams are encouraged to use whatever programming language and infrastructure they think fits them. It's a fast track to tragedy-of-the-commons problems.
I worked at Square from 2014 to 2016 and this wasn't my experience.
In fact, Square invested a lot in infrastructure that was pretty impressive compared to other companies of a similar size. We're talking 800-1200 people and they had a single common custom deployment system that had been built before Docker was a thing, and later replaced that with a new in house system (p2) based on containers while I was there. They had a common custom CI system that all services used, and that actually worked quite well (much nicer than Jenkins which was common at the time). And Square really only had 3 languages that were acceptable to use for services (Java, Ruby, and Golang; node.js being phased out). I remember interviewing people at other larger companies and finding they didn't have things we had at Square in 2014, like a common way of configuring and registering services, or an easy way to see what's running on a particular box, or service-to-service ACLs, or a tool to coordinate online schema changes. Square even built their own S2S RPC protocol based on protobufs, and went on to participate in the design of gRPC.
Maybe it's changed since then, I don't know. But the culture was pretty solid when I was there. And the plateng team ran multiple DC failover tests of our service while I worked there.
Yeah, all that stopped being directly supported and stopped being mandated for use. There was a strong cultural rebellion against "senior engineers dictating policy"
Those are all garbage-collected languages, so I am betting they are reaching the size where they're going to need to transition to stronger iron like Rust or C++.
Well no, fortunately the real world doesn't revolve around language wars. Those GC languages are well vetted, and the system architecture, not the language, will determine whether their services scale.
Lol, I'm betting they won't. GC has pretty much nothing to do with Square's real challenges - you're confusing your personal preferences with actual business utility.
I went to buy lunch today, and the shop was like "we are only taking cash right now." I didn't have any cash, so I went to a different place across the street. That place was only taking cash because their Square machine was down as well. I mentioned to the guy that the place across the street was down. I saw him go to the place next door and ask if their Square machine was down too.
Kind of crazy how one company's outage can bring down most of the places on a block. Everyone just relies on Square for their payments.
Are you going to consider carrying cash going forward? Maybe it's where I live more than anything, but it seems bizarre to read this thread that so many people carry no cash at all anymore.
I advised my wife to always have $40 on her. When she asked why, I said what my Dad said to me 20 something years ago..."you never know when you're gonna need it."
Sure enough, at least a few times a year...whether from down POS systems/phone lines or a yard sale or some side of the road trinket seller, it does come in handy.
There are usually at least two single points of failure: the merchant processor (Square in this case, but other retailers use different processors) and the card network (Visa, MC, etc.) responsible for approving/declining a transaction.
> How does a VISA credit card actually work? (genuine question)
Full answer at the bottom. I hope this answers your question.
> Can a cyber-terrorist shut down credit cards by DDoSing VISA's servers or something?
No. There's some nuance, though[2]. You can't shut down VISA globally, but you could probably target a set of customers for a particular bank, or a set of customers in a particular suburb.
In a single card transaction there are multiple parties (none of whom trust any other party): the acquirer, the issuer, the merchant bank, the cardholder, the cardholder bank, the PoS and the transaction switch. The transaction flows through some or all of these parties.
1. The acquirer[1], the issuer, the merchant and the merchant bank all configure the terminal for when to go online and when to approve offline (using the card information and the cardholder information). Whose settings take precedence depends on the configuration (there's a rough sketch of this decision after the list).
This means, in effect, that small transactions (like your lunch) are approved offline and so even when the issuer service is offline for 16 hours, most people won't notice.
2. When the transaction is forced online it isn't going to VISA anyway. It's going to a switch (probably hosted by the merchant's bank, or a party that the bank hired to switch transactions).
3. The transaction is then switched to a bank, and the bank approves or declines the transaction. It may be the merchant's bank that approves it and reconciles with the cardholder's bank at midnight. It may (depending on the risk evaluation of that transaction) go straight to the cardholder's bank.
4. The bank verifies the transaction against a security service (frequently called the HSM - Hardware Security Module) which is provided by VISA (or, if provided by someone else, has to be certified by VISA/MC/Europay). The HSM decrypts the encrypted 38-byte payload from the card and verifies that the cardholder and card are genuine.
5. The bank hosting the HSM can then decide to either approve or decline the transaction. If the transaction is approved then the bank will reconcile with VISA at some later point (generally once a day at midnight, or similar).
6. In very rare cases, the bank might decide to ask VISA to approve the transaction. I've never heard of this happening (because I never worked on the backend side of things past the bank). If this happens, it's a relatively new thing (in the last 20 years or so).
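To make the flow above concrete, here's a loose sketch of the terminal-side decision in steps 1-3. Every name and threshold here is invented for illustration; real EMV terminal parameters (floor limits, CVM rules, issuer scripts) are set by the acquirer/issuer and are far more involved than this:

```php
<?php
// Loose illustration of steps 1-3 above; not any real terminal's logic.

function routeTransaction(array $txn, array $terminalConfig, callable $sendToSwitch): string
{
    // Step 1: below the configured floor limit the terminal approves
    // offline and the transaction never leaves the store. This is why
    // small purchases keep working even while an issuer is down.
    if ($txn['amount_cents'] <= $terminalConfig['offline_floor_limit_cents']
        && empty($txn['card_requires_online'])) {
        return 'APPROVED_OFFLINE';
    }

    // Step 2: forced online. The request goes to a transaction switch
    // (usually run by or for the merchant's bank), not straight to VISA.
    $response = $sendToSwitch($terminalConfig['switch_host'], $txn);

    // Step 3: the switch routes to a bank, which approves or declines and
    // reconciles with the card network later (e.g. in a nightly batch).
    return !empty($response['approved']) ? 'APPROVED_ONLINE' : 'DECLINED';
}
```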
So, the answer is "no", you cannot shut down VISA cards by shutting down VISA's entire network. If VISA goes down, the vast majority of online transactions will still work, and every single offline transaction will still work.
You have to DDoS every single switch at every single bank to make online-only VISA transactions fail.
Bear in mind that a single bank hosts multiple switches to handle the load, as well as spreading the transaction load across multiple third-party switches too.
The entire global card network handles billions of transactions per day, and is set up so that a peak of double the current load doesn't break it.
Even if you managed to globally pwn every single desktop computer currently powered on with internet access, there may still not be enough resources to overload the global network to such an extent that you knock out card transactions globally.
The best you can do is probably hit a single bank or a single group of banks (which still wouldn't make a difference to most merchants as most have more than one bank anyway), or take down all internet in a single area (knock out fibre, copper and cell towers all at the same time).
The system is built for resilience and poor networking infrastructure by accepting some risk of fraud. People tend to forget that, for four decades, credit card transactions were a 24x7, zero-downtime service. Well before 99.999% uptime on HA cloud providers was a thing, transactions were working just fine, just with an added cost to cover the fraud and only localised downtime (a single bank or a single suburb going down, not the entire network).
This is also why crypto-currencies bring no value to the transaction/payments space in terms of resiliency.
[1] Acquirer includes recent things like those USB/BL card-reader dongles you buy for android phones that are EMV certified with the app they come with.
[2] I may be misunderstanding, or even misremembering some of the details, and some of the details may have changed in the years since I first worked on this stuff.
"Every transaction on Square can involve up to 4000 RPC calls to as many as 2000 microservices in eight availability zones. Distributed systems are difficult to get right."
Oof, yeah - they are a bit tone deaf too when you speak to your account exec. "We have the right people on the job" - just got posted to the status page as well - looks like Full PR Unit
Boy, I'm glad I've been on the road all day and not fielding phone calls.
I've got a customer who uses Square for their main online sales. Interestingly, a few dozen calls per hour are going through. The raw logs show hundreds of lines of
`Square\Models\Error::__set_state(array(
'category' => 'API_ERROR',
'code' => 'INTERNAL_SERVER_ERROR',
'detail' => 'An internal error has occurred, and the API was unable to service your request.',
'field' => NULL
))`
This is absolutely killing me (https://quailhq.com) -- I've got hundreds of antique stores trying to get us on the phone, all wanting to know why their card readers have suddenly stopped working.
I'm not the person you're responding to, but I built POS on Square. Or more accurately, Square is the merchant processor that's integrated into a very large piece of operations software I wrote/manage for a regional retail chain. The simplest explanation is that the company was already using Square when the software was commissioned, and didn't see an advantage in changing processors. They already had terminals everywhere provided by Square. The software gives customers the ability to put their cards on file (stored by Square, when they make purchases through the online store) and gives the retail outlets the ability to charge those cards on file. Meaning most of the transactions are done through the Square API now, not the terminals. The terminals serve as a backup so they can take payment in person if our servers go down.
In any case, I built all the software to talk to an in-house, isolated Payments API that I wrote, so that every part of our code that processes payments sends requests to our API, which handles all the back-and-forth with Square and distills the results. If it were necessary, I'd just change that API to talk to some other processor. But that wouldn't be a one-day job. As it is, it would take days or weeks to even arrange a banking relationship between the company and a different processor. I've argued for a backup to be built to a second processor, but the company hasn't wanted to deal with the cost or the hassle. They would also then need to have two separate terminals at each checkout register, and we would need to store customer cards on file with two separate processors.
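For anyone curious what that isolation layer looks like, here's a minimal sketch of the kind of processor-agnostic interface described above (in PHP to match the SDK logs elsewhere in the thread). The interface and class names are hypothetical, not the commenter's actual code or the Square SDK's API; the point is that application code only ever sees PaymentGateway, so swapping processors means writing one new adapter instead of touching every call site:

```php
<?php
// Hypothetical sketch of a processor-agnostic payments layer.

interface PaymentGateway
{
    /** Charge a card on file; returns a normalized result array. */
    public function chargeCardOnFile(string $customerId, int $amountCents, string $idempotencyKey): array;
}

final class SquareGateway implements PaymentGateway
{
    public function chargeCardOnFile(string $customerId, int $amountCents, string $idempotencyKey): array
    {
        // Call the Square API here and distill its response down to the
        // few fields the rest of the system cares about.
        return ['ok' => true, 'processor' => 'square', 'reference' => 'placeholder'];
    }
}

final class BackupProcessorGateway implements PaymentGateway
{
    public function chargeCardOnFile(string $customerId, int $amountCents, string $idempotencyKey): array
    {
        // A second processor would slot in here if the business ever set
        // up that banking relationship.
        return ['ok' => false, 'processor' => 'backup', 'reference' => ''];
    }
}
```

None of that removes the days-or-weeks of banking setup mentioned above, but it keeps the code change down to one adapter.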
Even so, perhaps this week's events will change their mind.
But I'm curious what you think would be a "more robust" option, for API access + physical terminal, since aren't all merchant processors prone to the same potential issues?
I used to write stores that connected directly to VeriSign. But even if I were advising a corporate customer from scratch, afaik they don't offer free terminals to the stores, which is important for physical retailers.
Did Square provide terminals gratis? If so, that's a great deal especially in a multiple station environment. I can get free Lane 3500s all day long but touch screens and scanners usually have an upfront load.
Do you receive a commission from Square as compensation to build on their platform?
My local coffee shop gave up and closed after the 4th hour of this outage. They tried accepting cash for a while, but apparently nobody in Seattle carries cash anymore.
I wonder this as well. I am sitting here with a deadline to run payroll in 38 minutes. I cannot log in to Square to see how much in tips to pay everyone... I know I shouldn't wait until the last minute, but such is this life some days. It's going to affect when payroll is deposited in their accounts.
Not to mention we cannot sell anything; thankfully we have a second point of sale as an "oh shit" backup on that front.
With the tips calc you almost have to wait until the last minute - unless you're running payroll offset into the past. That's not a you problem; the system should be available.
And, for Square, at least split reporting from processing. Why have two critical systems down at once?
It's wild. This thing has been down for 5 hours. I cannot imagine the insane amount of data they will process when everyone comes back from offline mode, and we still have no idea if they've even identified the problem yet.
Wow, saw this happen at the coffee shop round the corner.
They had no backup, so reverted to cash only.
We’d already ordered, so we’ll pay tomorrow.
But seriously, wtf. If Square intends to be the only form of payment for a business (an enviable position), then it simply can't go down. Their product should have enough redundancy, or some form of backup/offline pay, to ensure this never happens.
Cash is a backup, and probably one of the most robust backups possible in the event of a technology failure. It is, however, unfortunate that many customers have given up carrying more than one form of payment. Paying tomorrow is also a form of backup. Which backups are chosen is a trade-off, and each introduces its own risks (in this case, the risk that the form of payment is one people increasingly don't have, or the risk that you won't be back tomorrow to settle up).
There is a backup. It's astonishing how many people don't seem to know this. If you cut internet to the terminal you can make offline payments, and they're stored for 24 hours. I do IT for a retail chain that uses Square and we did this. People can still buy stuff just fine (as long as it's under $200 in one transaction).
They’re still down and just put out a statement that they’re still looking for the root cause. It's unbelievable to me that after 9 hours they don't know what happened.
I'm kind of surprised at the statement that Terraform is bad at bootstrapping things like Kubernetes - not the statement on its own, but in the context of using Talos. Yes, for a lot of roll-your-own Kubernetes cluster distributions it isn't great at this, and implementations are somewhat badly maintained, but for Talos specifically it's actually a very nice experience. They've done a good job on the provider and made it possible to bootstrap in an idempotent way, and it helps manage the lifecycle and upgrades going forward by talking to the Talos control plane after it's bootstrapped. It's still being actively developed, but I think their approach works better than most, and in some ways feels nicer than trying to bootstrap something like EKS with Terraform. https://github.com/siderolabs/terraform-provider-talos
I hope Square has some insurance in place to compensate merchants for lost revenue. It would be nice if they deposited a percentage of average transaction revenue per business.
It happens more often than you think. Plus there are odd issues, like a glitch with their servers causing the card reader to not work. There's a ton of device verification and security checks on those readers that bounce back and forth between their servers. Square having a bad day? Everything works, but your reader won't sync, won't read cards, and you're done for the day.
Days like this aren't the "norm", but constant aggravating issues are pretty much a weekly thing.
It got so bad, we just got an old-school merchant account with a reliable card reader. Almost ZERO issues with that, ever. It's a little funkier, but the merchant account is a quarter of the cost of Square to run. The web UI from the merchant provider isn't as fancy, but eh, it has all the same connections to QuickBooks for the accountants.
Yeah, I ran into this today and realized that out of the options available in my wallet, I couldn't pay the $7 bill. Luckily someone in my party had cash, but it did show me how reliant I am on services like Square, even as a consumer.
Good news for me. Somehow the local mediterranean place here had an issue with this so I got to take from the backlog of un-fulfillable orders for free :)
(B) Unexpected single point of failure (like a critical piece of hardware) and they need to rush-order a replacement.
(C) They can't identify the cause. (Doubtful.)
or
(D) Cascading failure. When they fix one thing, something else takes them down. Like a large queue of pending jobs that floods and crashes them when they come back up.
On one hand, I'm glad I don't depend on Square for purchases.
On the other hand, who's to say that my own payment provider can't have this same kind of problem?