This is probably preaching to the choir, but hosting your own FOSS chat is nowadays a very viable way to avoid being dependent on a centralised service like Slack. Your options include:
* Riot.im / Matrix.org (decentralised global network; e2e encryption; open protocol)
* Rocket.Chat (Meteor-based; focus on UX and features)
* MatterMost.com (clone of Slack UI; open core license)
Uptime aside, you also have to consider the effect of a single point of failure when you self-host. If your in-house communication hub goes down when your site does, it's going to make firefighting that much worse and you'll pay for it in a longer outage.
Here at FB, a lot of day-to-day coordination takes place via FB products. But production and release engineering communication happens over IRC, especially during major outages. The fallback factor is critical to keeping the plane in the air.
People like to jump on one bandwagon or the other, but the real answer is: it depends.
With Slack, the application itself is probably pretty tough, but for a lot of businesses their infrastructure and connectivity TO Slack (i.e. internet/WAN) is probably not very resilient. So for a lot of smaller outfits I'd say that Slack is the better option.
But if you're a large org and your infrastructure is very resilient and diverse, then you're probably better off self-hosting - assuming you can leverage your existing infrastructure to do so.
The biggest benefit of Slack for me is their search. All messages are indexed, ready to be searched: code snippets, images, Giphy, attachments, bots. It’s a whole ecosystem, not easily replicable with IRC.
1. Self hosting doesn't have to operate at the scale of slack, so there's a whole slew of issues avoided. Pushing text messages around really isn't that difficult when you aren't serving millions of customers.
2. You can perform maintenance outside of office hours, with SaaS you don't get to decide when an upgrade (and potential outage) happens. I don't care about 99% uptime, I care about having 99% uptime while I'm working.
Such as? If you've got fewer than 1000 users then you only need an extremely basic server; a Raspberry Pi should more than suffice. Then you've just got a little bit of manual (or automated) administration, mostly software updates and backups.
I really didn't expect my post to be so controversial. Is the HN crowd really so terrified of running their own hardware?
I'm guessing that you're being downvoted because there's a lot more to consider. I agree that it doesn't take much hardware these days (most single-board computers would work perfectly well) to service <1k simultaneous chat users with efficient server-side software (e.g. UnrealIRCd or ejabberd). However, to make it as reliable as Slack (99.99% monthly uptime is their SLA) for the price they offer it ( https://www.slack.com/plans ) would likely take considerable engineering effort. Sure, you could set it up, toss it in a closet, and it might have 100% uptime for a year...until it doesn't. If chat is business-critical, there are chat companies that have profit motive to deliver a good service. If chat is a nice-to-have at a company (and you e.g. don't have to worry about data retention laws / compliance stuff), maybe it's fine to run it on an rPi / t2.micro (free) AWS instance.
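For what it's worth, a big chunk of the "until it doesn't" risk can be covered with even a crude external liveness probe. A minimal Python sketch, where the hostname, port, and alerting hook are all placeholders rather than anything specific to the setups above:

    #!/usr/bin/env python3
    """Crude liveness probe for a self-hosted chat server (IRC, XMPP, etc.)."""
    import socket
    import sys

    HOST = "chat.example.internal"  # placeholder hostname
    PORT = 6667                     # plaintext IRC; e.g. 5222 for XMPP client connections
    TIMEOUT = 5                     # seconds

    def is_up(host: str, port: int, timeout: float) -> bool:
        """Return True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        if not is_up(HOST, PORT, TIMEOUT):
            # Hook whatever alerting you like here: email, SMS, a different chat network...
            print(f"ALERT: {HOST}:{PORT} is unreachable", file=sys.stderr)
            sys.exit(1)
        print(f"OK: {HOST}:{PORT} is accepting connections")

Run it from cron on a box that isn't the Pi, otherwise the probe shares the single point of failure it's supposed to catch.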
Luckily, there are a ton of great free and paid options out there these days!
For $6670 a month (the price for 1000 users), I’m pretty sure most people here can spin up two VMs in two different colos and set up IRC servers or whatever.
99.99% uptime means it can be down for a few minutes a month, so all it needs to do is fail over properly. In practice, it will probably have many more than 4 9’s.
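To put numbers on "a few minutes a month", here's the quick downtime-budget arithmetic, assuming a 30-day month:

    # Downtime allowed per 30-day month at a few uptime targets.
    MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

    for uptime in (0.99, 0.999, 0.9999):
        allowed_down = MINUTES_PER_MONTH * (1 - uptime)
        print(f"{uptime:.2%} uptime -> {allowed_down:.1f} minutes of downtime allowed")

    # 99.00% uptime -> 432.0 minutes of downtime allowed
    # 99.90% uptime -> 43.2 minutes of downtime allowed
    # 99.99% uptime -> 4.3 minutes of downtime allowed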
I think the real reason Slack does well is ease of client + service setup, the brain-dead-simple UI, lots of feature creep that a few people care about, mobile clients, etc, etc.
I’m not a huge fan, but it could be worse. At least they didn’t leak everyone’s password like HipChat did.
In most of the world, that will buy you at least a decent mid-level developer, and in many places a great senior or two. Even if it's below market, if this were my pet OSS project I'd happily take a pay cut to get more job satisfaction.
Generally yes, given the reasons others have said. Other than that, at the very least, outages can be dealt with more proactively when you have your own setup. Third parties won't have the same priorities that your company does.
Since Slack's main business is chat, they have a pretty good incentive to get everything working again ASAP. Here's their SLA for the Plus and Enterprise plans:
"Our Plus plan Service Level Agreement (SLA) guarantees a 99.99% monthly uptime. We’ve designed our SLA to be simple and transparent, based directly on the information we make publicly available on Slack’s System Status page. If we fall short of our 99.99% uptime guarantee, we’ll refund customers on the Plus plan 100 times the amount your workspace paid during the period Slack was down."
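To make that refund clause concrete: using the ~$6670/month for 1000 users figure quoted elsewhere in this thread, and assuming "the amount your workspace paid during the period" is prorated per minute (my assumption, not Slack's published method), a half-hour outage works out roughly like this:

    # Hedged back-of-the-envelope for the 100x refund clause.
    MONTHLY_BILL = 6670.0             # USD for 1000 users (figure quoted in this thread)
    MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

    downtime_minutes = 30             # hypothetical half-hour outage
    paid_during_outage = MONTHLY_BILL / MINUTES_PER_MONTH * downtime_minutes
    credit = 100 * paid_during_outage
    print(f"~${paid_during_outage:.2f} paid during the outage -> ~${credit:.2f} credit")
    # ~$4.63 paid during the outage -> ~$463.19 credit

So the 100x multiplier sounds generous, but for a short outage the credit is still small next to the monthly bill.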
Chat is a commodity these days. For most businesses, it probably makes more sense to just let the companies in the business of offering paid chat services do their thing.
Don't see why you were voted down on this, since it's true. Slack working to get things running again doesn't mean they're prioritising your company's particular instance or region. They're likely to be making sure their own region and their own stuff is up and fixed first, so anyone away from the east coast of America is likely to get seen to after that. It would be stupid to do it any other way, since Slack employees are likely affected as well and they're the ones trying to fix it. Downvoting someone pointing that out is pretty fanboy-esque or really naive.
Pretty much, if you don't own the service, you don't get to decide where in the queue you are for a fix.
I run an XMPP server for my friends. We use Conversations [0,1] on Android and BBOS, and Zom [2] on iOS. We use OMEMO [3] for encrypting most of our conversations, and while it isn't perfect, it usually stays out of the way.
Generally, the experience with the mobile clients has been quite good. Conversations and Zom are stable, attractive, and featureful. The biggest issues are some interoperability problems with desktop clients (displaying messages that should be hidden) and some things which I believe are server-side configuration issues.
Zom hides some useful configuration features (in the name of being dead simple to use), so I'm trying to convince one of my iPhone-owning friends to try ChatSecure [4].
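If anyone wants to poke at their own server from a script, sending a test message is only a few lines with the slixmpp library. The JID, password, and recipient below are placeholders, and this sends a plain unencrypted message, so OMEMO isn't involved:

    # Minimal sketch: send one message through your own XMPP server via slixmpp.
    from slixmpp import ClientXMPP

    class OneShotSender(ClientXMPP):
        def __init__(self, jid, password, recipient, body):
            super().__init__(jid, password)
            self.recipient = recipient
            self.body = body
            self.add_event_handler("session_start", self.on_start)

        async def on_start(self, event):
            self.send_presence()
            await self.get_roster()
            self.send_message(mto=self.recipient, mbody=self.body, mtype="chat")
            self.disconnect()

    if __name__ == "__main__":
        xmpp = OneShotSender("alice@chat.example.org", "not-a-real-password",
                             "bob@chat.example.org", "ping from the test script")
        xmpp.connect()
        xmpp.process(forever=False)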
I run my own Mattermost server and the mobile version looks very much like the Slack one (I only use chat, so I don't know whether Slack has functionality that Mattermost lacks).
Pretty sure they provide an IRC interface, but almost certainly don't use IRC internally. There's almost no way they could support any of their fancier features using IRC. Reactions etc would be horrible to implement.
Yeah and the IRC interface has been getting worse recently. When they added the shared channels across teams they completely broke being able to '@' a user from the IRC gateway. Support said something along the lines of 'yup and we're not planning on fixing it'.
I'm expecting them to completely turn off the IRC gateway in the next year or two.
I've switched to using TwistApp (https://www.twistapp.com) with my team. Unlike Slack where you have channels where everyone talks about everything, TwistApp bases conversations around threads. Every problem that's being worked on has its own thread. Once it's completed, I close and archive the threads. Very effective for getting things done as every task is isolated in a separate thread and discussions don't overlap.
On the other hand, makes it sound more likely to be a routing/reverse proxy issue instead of (say) a database issue. Those sound easier to deal with via a rollback vs something like "oops we dropped a critical index on the `messages` table".
Yeah, it alternates between loading a page telling me everything is fine and just giving me an nginx 500 error. Seems like the status page should be hosted differently so it can stay up even when other things are having issues.
It returned for me when I was looking, but took ~30 seconds to do so. They should host their status page on a separate domain and put something like CloudFlare in front of it to help with sudden spikes in traffic. Another alternative is to use Twitter / Facebook as the status page and let them deal with the traffic spikes, or just serve static HTML.
I'm hoping they publish a public post-mortem. Learning from this kind of outage is some of the best experience an engineering team can get - though it's far better when only staging goes down and not prod.
Then why did the Slack Status page have so many problems at the same time? Half the time loading it would give a 500 Internal Server Error, 45% of the time you'd get broken resources (images and/or CSS), and only 5% of loads would give you the full working page.
Maybe because it's under a lot more load during an outage and they haven't upsized the status page infrastructure to handle their ever increasing user base.
And of course today's the first day we're using Slack for audience Q&A at a conference. 360 folks in a room now have to...raise their hands! So barbaric.
Slack is currently down and I've realized, for better or worse, what Slack has really done: it's created an expectation of immediacy. I thought about sending my question to someone via email but then just thought, "I'll wait for Slack to be back up, it'll be faster anyhow".
If slack being down means you lose all insight into your build process and code management, you seriously need to introduce a secondary option immediately.
OP didn't say "all insight". There's a difference between being unable to see a stream of change events and not being able to see the current state of the system. The latter is completely unacceptable, whereas the former is just annoying.
I'm sure they have fallbacks, but when their ecosystem (apparently) evolved around Slack, the fallbacks are less effective. Polling Jenkins to see when your job is done is more time consuming than receiving a Slack message.
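For anyone improvising a fallback today: polling Jenkins directly is clunky but only takes a few lines against its JSON API. The server URL, job name, and credentials below are placeholders:

    # Quick-and-dirty fallback while the chat notification bot is down:
    # poll a Jenkins job's last build until it finishes.
    import time
    import requests

    JENKINS = "https://jenkins.example.internal"   # placeholder server URL
    JOB = "my-app-deploy"                          # placeholder job name
    AUTH = ("ci-user", "api-token")                # Jenkins username + API token

    def last_build():
        url = f"{JENKINS}/job/{JOB}/lastBuild/api/json"
        data = requests.get(url, auth=AUTH, timeout=10).json()
        return data["building"], data.get("result")  # result is None while still building

    if __name__ == "__main__":
        building, result = last_build()
        while building:
            time.sleep(30)
            building, result = last_build()
        print(f"{JOB} finished: {result}")  # e.g. SUCCESS, FAILURE, ABORTED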
Reminder of why your status page should be hosted in a very different way from your regular infrastructure... so you're much less likely to end up with issues on both at the same time.
I like how statuspage.io even has metastatuspage.com in case their primary domain/DNS/TLD has issues.
Reminder that you should check things first before commenting. Slack's status page is on different infrastructure; it appears to be hosted on DigitalOcean while slack.com uses AWS.
Just a simple typo. The keys are pretty close together if your finger slips, and I imagine they have enough problems distracting them from proper spellchecking at the moment. :-)
Looks like it. When I first loaded HN 2 minutes after this outage started, the story was #2. Then I refreshed after 3 minutes and it wasn't on the front page at all. Used the search tool to find it and then upvoted.
If you see something like this and you think it's in error, you can let them know and they'll likely be able to respond more quickly. There's a contact link in the footer.
No, I'm not. In my experience the mods are quite responsive, and have explained site behavior on more than a few occasions. They've also adjusted flags and weights of submissions if they identify an issue.
Slack has a nice market share, but also many competitors, many of them near-100% ripoffs with the same features (to name a few, Atlassian HipChat and MS Teams... not to mention open source products).
Slack has been experiencing service degradation often lately, so I would not be surprised if people start switching.
In our team we already started looking for an alternative.
* Riot.im / Matrix.org (decentralised global network; e2e encryption; open protocol)
* Rocket.Chat (Meteor-based; focus on UX and features)
* MatterMost.com (clone of Slack UI; open core license)
* Zulip.org (all about threads!)
* ...or indeed IRC or XMPP.
(Disclaimer: I work on Matrix.)