Teaching Kubernetes (slashdeploy.com)
223 points by ahawkins on Feb 19, 2017 | 85 comments



The first thing I do when I look at kubernetes tutorials is look at ingress or load-balancing... and usually I find it absent.

It is not the fault of the author (and I commend them for taking the trouble), but Kubernetes is super complex for getting traffic inside the cluster. But here's a genuine appeal from someone who has suffered this before.

In Kubernetes, you have multiple types of ingresses - and you invariably need a combination of them to get anything done correctly. For example, nginx-ingress will lose source-IP info and will not do L4 proxying very well. Which is why the unspoken rule of all Kubernetes deployments is "use ELB or GLB with proxy protocol and call it a day".

I get why - K8s is trying to solve the most complex problems first and trickling it down. Which means for the time being, ingress is a very hard problem to solve in k8s.

The second place where everyone will get stuck is Calico vs Weave vs Flannel vs whatever. Most tutorials say "use any one of them and let's move on". Unfortunately, it's not that simple - choose your network plugin wisely.

After spending a lot of time in k8s, I set up a Docker Swarm cluster in about 3 hours. It does not support some of the more complex use cases, but it does three things super well - networking, ingress and secrets... and for most beginners, that's all they need.

P.S. I have the exact same stack running on Kubernetes and Docker Swarm.

Docker Compose file format 3.1 is brilliant - other than my Dockerfiles, I need a 30-line yml file to deploy my whole stack across 3 machines. Kubernetes, on the other hand, needs at least 12 yml files ("deployments" and "services"). I'm not counting another 4 yml files for statefulsets and persistentvolumeclaims, since there is nothing equivalent in Docker Swarm.


> Kubernetes, on the other hand, needs at least 12 yml files

You don't need to split your manifests into multiple files. You can use a list:

    apiVersion: v1
    kind: List
    items:
    - apiVersion: extensions/v1beta1
      kind: Deployment
      ...
    - apiVersion: v1
      kind: Service
      ...
    - apiVersion: v1
      kind: ConfigMap
      ...
Edit: Or YAML's "multiple document" syntax:

    apiVersion: extensions/v1beta1
    kind: Deployment
    ...
    ---
    apiVersion: v1
    kind: Service
    ...
    ---
    apiVersion: v1
    kind: ConfigMap
    ...
Then you just use "kubectl apply -f myapp.yml" to create or update.

Re ingress, I agree that it's probably the weakest part of Kubernetes right now. It's particularly weak when it comes to internal load balancing. When you have lots of services that should only be available internally, you'll want an internal ingress; I don't know about AWS, but Google Cloud Platform's internal load balancer doesn't support pointing at Kubernetes [1]. I haven't found a better option than to run Traefik as a DaemonSet and rely on round-robin DNS (aka poor man's HA).

[1] https://github.com/kubernetes/ingress/issues/112


Or just put --- between various stanzas in one file: https://kubernetes.io/docs/user-guide/managing-deployments/#...

(Google now has an L3 internal load balancer, but not an L7 LB. It's fair to say that this will be made much easier this year.)


I really think the community ought to adopt this as best practice and actively discourage (or even deprecate) the List type. It's just too much noise and zero benefit IMO over raw slices or serialization as arrays (or document lists for YAML).


As the author (blame me) of the List type, the primary advantage is that it needs no special logic for JSON processing. Newline-separated JSON is weird for a lot of libraries, and in the future we want to have endpoints that allow bulk creation / bulk apply.


That's good news. ILB ingress integration is forthcoming, then, I hope?


True! I keep forgetting the multi-document support in YAML.


>I haven't found a better option than to run Traefik as a DaemonSet and rely on round-robin DNS (aka poor man's HA).

you could try linkerd. It is built with precisely this use case in mind.

Also, it was not just a question of number of files. The size of my docker-compose.yml is about 30-40 lines or so.


I think Helm [1] looks promising in this regard.

Helm "charts" are templates that generate Kubernetes manifests. To install a chart, you provide values (for which there are defaults). For example, here [2] is the chart for PostgreSQL. The default values are in values.yaml, and the templates for each manifest are in the templates folder.

Conceptually, this is even cleaner than Docker Compose, because there's zero data that isn't specific to your install: everything else falls back to a pre-defined default.

So in the same way that the official PostgreSQL Docker image is general-purpose, you can define a completely general-purpose chart that can be used by anyone. You customize it by providing overrides.
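Installation then looks roughly like this (Helm 2-era syntax from memory, and the value name is just illustrative - the real ones are listed in the chart's values.yaml):

    helm install stable/postgresql \
      --name my-db \
      --set postgresPassword=supersecret \
      -f my-overrides.yaml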

The only downside is that you still end up with Kubernetes manifests. It's not an abstraction, it's an automation tool.

[1] https://helm.sh

[2] https://github.com/kubernetes/charts/tree/master/stable/post...


You are absolutely right that getting traffic inside the cluster is probably one of the most difficult tasks in Kubernetes. The standard Ingress objects are too simple when you have a complex deployment. So we wrote our own Ingress Third Party Resource to fit the needs of our deployment. We use HAProxy. We have been using this in prod for almost a year. You can find it here: https://github.com/appscode/voyager

If you find any issues or have questions, please file Github issues.


I found this last week and I'm very impressed because it solves a lot of the problems we had with existing ingresses.

I really like how you guys decided to do your own 3rd party manifests instead of using annotations.

I'm going to play more with this :-)


this is awesome!

please please submit this to kubernetes-incubator


Thanks for the kind words. The incubation process seemed too complicated (process heavy?) for small teams like us. But I will look into that again.


Author here

> The first thing I do when I look at kubernetes tutorials is look at ingress or load-balancing... and usually I find it absent.

Yup! I dedicate a few minutes to this in the grab lesson because of how important it is.

> Docker Compose file format 3.1 is brilliant - other than my Dockerfiles, I need a 30-line yml file to deploy my whole stack across 3 machines. Kubernetes, on the other hand, needs at least 12 yml files ("deployments" and "services"). I'm not counting another 4 yml files for statefulsets and persistentvolumeclaims, since there is nothing equivalent in Docker Swarm.

Interesting point. You may like helm (https://helm.sh) as a tool for deploying Kubernetes applications.


Actually, the Docker Compose file format is so awesome that there is now a top-level incubator project called kompose that is trying to provide drop-in compatibility: https://github.com/kubernetes-incubator/kompose

you should check out the #kompose channel on slack.


> Docker Compose file format 3.1 is brilliant

What's nice about ECS is that it lets you deploy using Docker Compose files (though you often make minor modifications for ECS)

    ecs-cli compose --file ecs-compose.yml create
However, there is the kompose project to bring Docker Compose to Kubernetes (it's still in Kubernetes Incubator at this point):

http://blog.kubernetes.io/2016/11/kompose-tool-go-from-docke...

https://github.com/kubernetes-incubator/kompose
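Rough usage (flags as I remember them; check the kompose docs):

    kompose convert -f docker-compose.yml   # writes out Kubernetes manifests for each Compose service
    kompose up                               # or deploy the default docker-compose.yml straight to the current cluster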


Maybe you or someone else can help me understand the use case for a load balancer. I have an application with a very high rate of ingress traffic. So much so that while a load balancer sounds ideal, a single node (the balancer) would not be able to handle the traffic. Is the load balancer only useful for applications with a low ingress rate that you're trying to spread across many servers?


How much traffic are we talking about? Google Cloud Platform's load balancers (which sit outside of Kubernetes and are replicated geographically) are designed for quite a high load of traffic.

Either way, your client has to know what pod (group of Docker containers running your app) to send traffic to. The pods run on hosts and ports that aren't fixed in time. So if you want some kind of client-resolving load balancing without involving a central bottleneck on the server, you'd have to transport that inventory of pod-host-ports to the client, and then either provide a host-based proxy or open the ports on the nodes themselves. And of course you'd have to build the balancing/failover logic into the client (which would either have to be randomly distributed, or rely on resource utilization info fetched from Kubernetes).

That may be great for some narrow use cases, but for most applications, the load balancing pattern is much simpler.


Do you know the details of that? On their site they talk about balancing in terms of queries per second on big query. What is that equivalent to in terms of Gbps?


Without a load balancer, how do you currently spread the load over multiple nodes in your app?


If you have control over the sender, you can, right? That's my use case and maybe that's why I don't see the value.


I see. A lot of us I think assume we are talking about web apps or mobile apps where you don't.


Load balancers are particularly useful when it comes to cloud applications because they let you keep your routing logic server-side, meaning that you can make adjustments quickly without needing to release a new client version.

Your load balancer definitely shouldn't be a bottleneck - you can deploy multiple LBs to handle the traffic and provide redundancy, though you need to use other methods to balance between them (DNSRR, anycast, etc).


great question!

Kubernetes pretty much mandates an LB, because it offloads the ingress job to a machine which can talk to the outside world and the VPC simultaneously. Otherwise you get into complex hostPort/nodePort/hostNetwork semantics... or use an ingress.

In the k8s world, there aren't many simpler alternatives.
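For anyone following along, the "let the LB deal with it" path is literally one field on a Service; a minimal sketch (names hypothetical):

    apiVersion: v1
    kind: Service
    metadata:
      name: my-frontend
    spec:
      type: LoadBalancer   # asks the cloud provider (ELB/GLB) to provision the external LB
      selector:
        app: my-frontend
      ports:
      - port: 80
        targetPort: 8080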


Could you provide some more detail about your statement that "nginx-ingress will lose source-ip info"?

I'm currently working towards a k8s-on-metal system using nginx for ingress and was led to believe that I could use either X-Forwarded-For or proxy_protocol to preserve the originating IP address. Is the traffic bounced through an SNAT rule on its way to the ingress controller or similar?


You should google for proxy-protocol. So here's the thing - nginx cannot INJECT proxy protocol. I'm not an expert in nginx... but at least the versions used in k8s couldn't.

It can read/pass-through proxy protocol headers. So if you have haproxy in front of nginx.. it will work fine (or ELB/GLB).

Which is why there's another beta ingress - the "keepalived-vip" repo in k8s ingress. However, not a lot of people seem to be using haproxy primarily because haproxy cannot have a zero-downtime deploy.

A couple of weeks back on k8s slack, I was discussing using Github multibinder to solve that one problem and move ahead with haproxy ingress.

protip: please join the #sig-onpremise channel on slack. There's not a lot of momentum for bare metal deploys otherwise.

https://githubengineering.com/glb-part-2-haproxy-zero-downti...


I'm familiar with proxy-protocol, and while I haven't injected it with nginx, it seems that it does support it: https://nginx.org/en/docs/stream/ngx_stream_proxy_module.htm...

For reference, the feature was added in nginx 1.9.2, and the ingress-nginx controller appears to use nginx 1.11.9: https://github.com/kubernetes/ingress/blob/0.9.1/images/ngin...

Regardless, would I be right in assuming that your issues are mostly related to the ingress of TCP streams? My use case is almost exclusively HTTP, so I might be missing the worst of it.

Thanks for the pointer to the #sig-onpremise channel, I'll be sure to check it out. Been meaning to look into the SIGs since reading https://coreos.com/blog/self-hosted-kubernetes.html


Please double check - nginx's support for proxy protocol is only to be able to read it; it cannot inject it. If nginx is the first thing your traffic hits, it will not inject proxy protocol headers. Again, I will not stake my reputation on this... but I remember going deep into this about 2 months back.

If you are using the nginx ingress, then you need to move all of your existing config into their own config file format. Since setting up full TLS internally in Kubernetes is buggy, I didn't want to terminate my traffic on the ingress. Also I really did not want to move my (fairly complex) nginx configuration to the ingress.

Which is why I'm a big believer in haproxy ingress - it was designed to interface with nginx.

However, you are already one of the "enlightened" ones. You actually do know what an ingress is. A significant amount of k8s slack traffic is "how do I set up a loadbalancer / how do I make the cluster available to the outside world".


nginx can inject proxy_protocol - I know because I've enabled it by mistake! The kubernetes nginx ingress controller doesn't support it though. I think that is because most people use HTTP headers (X-Forwarded-For) instead. But if you do want it, you should open an issue explaining the use case so we can check our assumptions!


@justinsb - you're the expert here and I will not dispute this :D

I still believe that there is NO better tool to use kubernetes than kops. I now use the 100.64 subnet that you guys figured out ("carrier grade NAT" seriously?) in docker swarm.

But genuine question - can you show me how? Because I honestly went crawling into the nginx codebase and could only find client-side handling of proxy protocol.

I found NO place where it showed how it could be injected to chain reverse proxies together (which is what it was invented for in the first place).

P.S. just a quick google reveals this link which sort of confirms my suspicion - http://blog.haproxy.com/haproxy/proxy-protocol/


It's the same proxy_protocol keyword for both, which is why this is so confusing. In the listen line it means "remove proxy protocol from the inbound connection"; as a top-level directive on the server, "proxy_protocol on" means "add proxy protocol to the outbound connection".
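A rough sketch of the two uses in the stream module (untested, backend address hypothetical, assumes nginx >= 1.9.2):

    stream {
        server {
            listen 443 proxy_protocol;   # parse PROXY protocol on the inbound connection

            proxy_protocol on;           # add PROXY protocol to the outbound connection
            proxy_pass backend.internal:8443;
        }
    }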

This commit should show the difference: https://github.com/kubernetes/ingress/commit/6fa461c2a7891b4...

This is the nginx function AFAICT: https://github.com/nginx/nginx/blob/b580770f3afaeec48a15cb8c...

Looking at that though, maybe it only works with SSL passthrough... but that is the typical use-case for using proxy protocol instead of X-Forwarded-For


What's wrong with using the X-Forwarded-For/X-Forwarded-Proto headers as usual? Or do you mean for non-HTTP protocol?


Yeah, looks like he's talking about L4 proxying. Check out http://www.haproxy.org/download/1.8/doc/proxy-protocol.txt.


Couldn't agree more with the ingress sentiments.


This is a nice and interesting overview, but it still leaves me confused about whether one should even use k8s. We currently use docker-compose to set up and run multi-container services (django, celery & workers, nginx, cron/backup), yet I'm somewhat unconvinced of why we should invest in "upgrading" to Kubernetes.

My understanding is that it will make the system deal automatically with failing containers, though docker-compose already does that for us. Load balancing is another strength, though that is currently not an issue for us. It might automate certain container-management scripts, but that is less of an issue right now.

In other words, I currently don't see an actual benefit of k8s over even docker-compose (not docker-swarm, which I understand is the direct k8s alternative). Anyone able to elaborate?!


A few improvements I enjoyed when migrating from Swarm to Kubernetes (older Swarm version, though):

* Namespaces, to group objects and prevent name clashes. Ideal when many devs run their own copy of the same distributed app, from the same manifests and with the same names.

* ConfigMap, to share common configs and project them as files into the containers (keeping them updated in the container, etc.), and to provide the configuration separately (see the sketch after this list)

* Automatic (EBS, GCS, NFS...) volume attachment, i.e. to the node hosting your database, and re-attaching elsewhere if the node fails and the database is re-scheduled on another host. Automated database HA/failover is very nice.

* Ingress manifests, to version the http routing with your app. Though this is still very young and incomplete.
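The ConfigMap projection mentioned above looks roughly like this (names hypothetical):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: app-config
    data:
      app.properties: |
        log_level=info
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: app
    spec:
      containers:
      - name: app
        image: myorg/app:latest
        volumeMounts:
        - name: config
          mountPath: /etc/app   # app.properties appears as a file here and is kept up to date
      volumes:
      - name: config
        configMap:
          name: app-config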

But yes, k8s is much more complex than Compose (and Swarm), so if the latter works for you, changing is probably overkill.


I have been thinking about this myself lately, and always wondered whether K8s is missing a chance to position itself as the "resource manager" by default, regardless of the size of your current load.

In other words, change the terminology so that K8s appears to be a good choice even if you are just starting (this is the marketing part). Granted it will seem too complex for most apps, but how about a community effort, perhaps backed by a company, that essentially wraps most stacks (say Rails and PG, Django and Mysql, perhaps a caching layer, nginx, and so on) in a K8s module so that devs can start coding and be oblivious to changing loads as their apps grow? In this scenario, K8s will just feel natural, and will be something you just have to learn as part of your journey (Git was like this for a lot of developers, but I concede the analogy is not that great).

Even if most devs reject this as being too complex, the few who pick it will be an invaluable feedback source.


Maybe you have Helm in mind?

https://github.com/kubernetes/helm


Yeah that looks very promising.


> In other words, I currently don't see an actual benefit of k8s over even docker-compose

Do you currently have more than one server? It sounds like you don't.


We do, about a handful right now. It's still manageable, though automation of server resources may be another reason to use k8s?!


But how many more? If it's just pet servers, k8s is still not worth it, despite it having StatefulSets since 1.5. K8s is mostly for cattle, and pets when you need them, imho.


Take your example of 'django, celery & workers'

What would you currently do if your celery workers couldn't keep up and you needed to add additional machines to handle the load?


The first thing I'd like to be taught is where to use Kubernetes and where to avoid it (many people tend to think it's the best practice to force everything into containers, and ideally, Kubernetes).


My rule of thumb is "keep it as simple as possible", and I say this as somebody who maintains a complex, containerized project using both Amazon's ECS and Kubernetes. For example:

1. If you can use a simple deployment system based on 'git push' (such as Heroku), you should just go ahead and do that until you outgrow it. This is easy to test, deploy and maintain.

2. If your project is tricky to build, go ahead and set up a Dockerfile with a correctly configured environment. You can still deploy this using Heroku, Google Container Engine, Amazon ECS or many other cloud providers. Choose something simple. This is harder than something like Heroku, but it's mostly a one-time learning cost.

3. If you need multiple containers, then set up a continuous integration system that automatically builds and tests your containers on every 'git push', and which allows you to deploy a service with a single click. You'll also want a separate staging environment. At this point, you're probably going to need to devote at least 10 hours/week of engineering time to keeping everything working. System-wide integration testing will get more complex.

Again, you can stay on a managed cloud provider like Amazon's ECS or Google Container Engine for a long time before you need to set up and manage your own Kubernetes cluster. But if you find that you're writing lots of scripts to manage the containers on your clusters, it's probably time to look at a standard solution like Kubernetes.

TL;dr: Stay with the easiest deployment solutions, and only add complexity when you need it.


I tend to agree with you. Often I see these stack decisions forced by decision makers who want their design to look modern.


Sometimes it's just because of experience though.

If you're simply bootstrapping a single app on a single vm, then fine, have at it.

* As soon as you go down the route of making that reproducible, it can be done as easily by a Dockerfile as by a bash or Ansible (or any other configuration management) script.

* Then you want failover, or to have it active on another server? Okay, you can just run your Ansible against it, maybe you put it in cloud-init and an autoscaling group, or maybe you have something like Kubernetes take care of that.

* You want health checking? Sure, have it configure nagios or something similar, or you can have an ELB check an endpoint, or you can have Kubernetes do it.

* Want some storage? Let me add a PVC, or I can play about with managing EBSs and other block storage myself.

Once you start digging a bit deeper, you realise that many of the things your apps will probably want either need to be built yourself or will lock you into AWS, and going to Google and clicking the GKE button doesn't seem like a terrible prospect. There are other ways to do things, but once you've learnt this way, you can reuse it almost anywhere.

Our industry is really faddish, so I don't blame you for being skeptical, but the second you start hosting 3-4 apps, I'd rather have just bootstrapped Kubernetes (or gone to a managed service provider) and have all the primitives available to me.


Which is not a bad call; looking modern helps recruiting.

It's great that technology x lets you build something in 1 month, but if it takes 6 months to find someone willing to do it, then you could have written it faster in almost any other technology.


I say this only partially in jest but, I hope when presented with such choices I always pick the one that allows me the most hours of peaceful sleep at night. This usually gets me half way there during decision making.


Why even partially in jest? Isn't the whole point of ops "sleeplessness avoidance"?


Honestly, ops is about making/saving money for the company. Just as an HR employee can choose the right thing, an ops engineer can also choose "sleeplessness avoidance", but it doesn't always work out that way.


That's true, but "sleeplessness avoidance" saves lots more money for small-scale companies. Downtime and man-hours are much more expensive than getting more reliable hosting, unless you're a very large company.


Use for: Stateless applications (mostly web applications).

Ban for: Databases.


I know that banning databases from k8s is the popular point of view now, but Google has been running databases in containers for nearly a decade. YouTube runs on top of http://vitess.io/, which is a cloud native approach to MySQL.

Certainly approach with caution, but there's no reason for a blanket dismissal of dbs in containers.


I think it's even simpler:

1. Know what it takes to run a database (including storage, backup, upgrade, lifecycle, failure modes)

2. Know how a containerized cluster manager manages process, storage, lifecycle, and failure modes

If you know both of those, running databases on Kubernetes (can't speak for swarm or mesos) is not hard or surprising, and you can get the benefits of both. If you don't know both of those in detail, you don't have any business running databases in containers.

The intersection of folks who know both is still small. And the risk of problems when you understand only one is still high.


>"If you know both of those, running databases on Kubernetes (can't speak for swarm or mesos) is not hard or surprising,"

Are you speaking from experience when you say it is not hard? Could you elaborate on what databases you are currently running on Kubernetes and how they are configured? Also, are these in production?

If I know number 1 and number 2 does that mean that I automatically understand all of the of the potential failure modes I might experience from combining 1 and 2? I certainly wouldn't think so.


I'm one of the engineers on OpenShift, and there have been different production databases (sql and nosql alike) running on OpenShift in very large companies for almost 2 years now, as well as many databases in staging and test configurations.

Your point about 1/2 is fair, I was trying to convey that Kube follows certain rules w.r.t. process termination, storage, and safety that can be relied on when you internalize them. What's lacking today is the single doc that walks people through the tradeoffs and is easily approachable (although the stateful set docs do a pretty good job of it). In addition, we've made increasing effort at ensuring that behavior is predictable (why StatefulSets exist, and the changes in 1.5 to ensure terminating pods remain even if the node goes down).

Storage continues to be the most important part of stateful apps in general. On AWS/GCE/Azure you get safe semantics for fencing storage (as long as you don't bend the rules). On metal you'll need a lot more care - the variety of NAS storage comes with lots of tradeoffs, and safe use assumes a level of sophistication that I wouldn't expect unless folks have made an investment in storage infrastructure. I expect that to continue to improve, with things like Ceph's and Gluster's direct integration, VMware storage, and NetApp / other serious NFS integration.

And it's always possible to treat nodes like pets on the cloud and leverage their local storage if you have good backups - at scale that can be fairly effective, but for one-off DBs, using RDS, Aurora and others is hard to beat.


I am not very clear on the differences between running Kubernetes via OpenShift vs on metal or a cloud provider. I even just looked at the RH page and it still wasn't that clear to me. Can you elaborate? Is there a different story for stateful things like running datastores on K8s + OpenShift?


>"I know that banning databases from k8s is the popular point of view now, but Google has been running databases in containers for nearly a decade."

But containerizing a workload isn't the same thing as handing it off to a cluster scheduler to manage. Google hasn't been running databases via K8s for nearly a decade. Who knows how Borg handles volume management internally at Google. I realize K8s has foundations in Borg, but it's still not apples to apples, I don't think.


FYI: https://en.m.wikipedia.org/wiki/Google_File_System

GFS (now colossus) is not mounted as a legacy volume, but instead is accessed via a userspace library.


Google has never run any database in Docker.

They use internal proprietary technology that doesn't have the same characteristics and flaws as Docker.


Parent didn't say Docker, fwiw.


Uber runs Cassandra in Mesos:

http://highscalability.com/blog/2016/9/28/how-uber-manages-a...

Seems to work pretty well. DCOS has lots of database options.


What about running the DB process(es) in a container mounting non-containerized storage from outside? All the "state" is then external to the container? This is very do-able with Docker and it's where my thoughts are heading for a "best of both worlds".


That's what you do. If you're on GCP or AWS, you typically mount a network volume (Persistent Disk on GCP, EBS on AWS).

Kubernetes ensures that the container always has this volume mounted, and of course only one container at a time can claim the volume for itself.

What you should avoid doing is to use a host mount and pin a pod to run on a specific node, because then that pod can only run on that node, and you have no way of migrating without manually moving the mount and unpinning the pod. With Kubernetes, you really want to avoid thinking about nodes at all. State follows pods around; pods don't follow state around.
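As a rough sketch (API versions as of ~1.5, names and sizes hypothetical), the pattern is a PersistentVolumeClaim plus a single-replica Deployment that mounts it:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pg-data
    spec:
      accessModes: ["ReadWriteOnce"]   # one node at a time, matching EBS/PD semantics
      resources:
        requests:
          storage: 100Gi
    ---
    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: postgres
    spec:
      replicas: 1
      template:
        metadata:
          labels:
            app: postgres
        spec:
          containers:
          - name: postgres
            image: postgres:9.6
            volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: pg-data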


You also have the option of using local storage and combining it with a consensus protocol to keep the data distributed. You can actually achieve better durability than mainframes.

Spanner uses cross-datacenter Paxos. Your data won't be lost even if an entire datacenter goes dark.

For Vitess (http://vitess.io), we use semi-sync replication that always ensures that at least one other machine has the data.


In K8s that's called "Stateful Sets" (beta in v1.5).

https://kubernetes.io/docs/concepts/abstractions/controllers...


You can solve this without StatefulSets; it's just a manual process, and requires that you (1) don't use a replication controller (edit: rather, you use a controller with "replicas: 1"), and (2) can ensure that a single pod claims the data volume.

StatefulSets are more geared towards apps that manage their own redundancy, such as Cassandra or Aerospike, where adding another instance is a matter of just starting it. One of the things that a StatefulSet permits is preserving the network identity of a pod. For example, if you wanted to deploy Cassandra without StatefulSets, you'd deploy each instance as a separate Deployment + Service pair, called, let's say, cassandra-1, cassandra-2 and so on. You would not be able to use Kubernetes' toolset to scale the cluster. Each instance would use a persistent volume, so effectively it would be almost exactly like a StatefulSet, except Kubernetes would not be handling the pod replication.

In the case of something like Postgres, you'd probably not get any benefit from using a StatefulSet for the master (since only one instance can run), but you can use a StatefulSet to run read-only replicas.
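For reference, a minimal StatefulSet sketch (apps/v1beta1 was the beta API in 1.5; names and sizes hypothetical). Each replica gets a stable name (cassandra-0, cassandra-1, ...) and its own volume from the claim template:

    apiVersion: apps/v1beta1
    kind: StatefulSet
    metadata:
      name: cassandra
    spec:
      serviceName: cassandra   # headless Service that provides the stable network identities
      replicas: 3
      template:
        metadata:
          labels:
            app: cassandra
        spec:
          containers:
          - name: cassandra
            image: cassandra:3.9
            volumeMounts:
            - name: data
              mountPath: /var/lib/cassandra
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi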


Without Stateful Sets, or a replication controller, doesn't it mean that if my host dies, the DB won't get started anywhere? Who's managing where that pod should spawn next (with its associated storage)?


Right, I simplified a bit there: You would use a replication controller, but it would have "replicas: 1". This way, the pod is rescheduled and the service repointed, and the volume management ensures that the database gets mounted.


Got it, thanks!


Note that replicas: 1 does not actually guarantee "at most 1". If you have block storage with locks (AWS/GCE/Ceph/Cinder), then the second replica won't start until the first is gone. If you try to use "replicas: 1" with a shared filesystem, you can have 2 pods running against that filesystem at once.

StatefulSets guarantee "at most one".


That's solved by setting "strategy.type" to "Recreate", isn't it? This will disable rolling deploys. The replication controller wouldn't then attempt to have two pods running at the same time.
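i.e. something like (sketch):

    apiVersion: extensions/v1beta1
    kind: Deployment
    spec:
      replicas: 1
      strategy:
        type: Recreate   # the old pod is terminated before the replacement is created
      template:
        ...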


Yep. It even works in Docker Swarm (1.12 and up) via volumes.


From what I can tell this is largely outdated advice. We've been running multiple 800GB postgres instances in docker for over a year now, with not a single problem. Not one.


Would be an interesting write-up. How do you handle upgrades? What is the reason for Docker in your case? Are you managing the individual containers manually or using some tool?


It is very up-to-date advice.

Just because you were lucky to not experience massive issues doesn't mean they aren't present.


In what way is it up-to-date? From what I can tell the majority of horror stories were related to running "outdated" linux kernels, which ubuntu 16.04 fixed for us (we purposefully only started using docker with the beta version of ubuntu 16.04, hence a year).

In fact, running the postgres instances in isolation has given us far more confidence than if they were run "natively". Backing up docker instances is trivially easy in comparison to native instances, as you already know what data volumes you need to back up. All our instances use exactly the same backup and restoration script. All our instances get rolled into staging using the same script on a daily basis. No failures so far. Zero.

Would be interested in actual "up-to-date" reasons, other than "docker's engineering department is not dependable", which btw I can empathise with if you were burned in the past.


You're the perfect example of the problem.

The typical dev who thinks running on a beta version of Ubuntu is the norm and calls anything else "outdated".

Yes, docker may be up to your standards.

No, docker is not up to the standards of real businesses, who use a stable OS and sometimes even pay for support for it.


One thing that I think is very much still missing from the Kubernetes Documentation space is hardening guidelines.

There are a lot of moving parts in there, and some of the defaults for common install methods like kubeadm might be a bit of a surprise to people (e.g. the kubelet port being open by default and allowing someone to take complete control of the cluster without authentication (https://raesene.github.io/blog/2016/10/08/Kubernetes-From-Co...))

Ideally something which broke out the various components and had guidelines for possible security options would be a great addition, I think.
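As a starting point, these are the kinds of kubelet flags worth checking (illustrative only; verify each flag against your version's docs):

    # reject unauthenticated requests to the kubelet API and delegate authorization to the API server;
    # also disable the unauthenticated read-only port (10255)
    kubelet --anonymous-auth=false --authorization-mode=Webhook \
      --client-ca-file=/etc/kubernetes/pki/ca.crt --read-only-port=0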



For me, the weakest point of Kubernetes at the moment is the developer story. Many shops don't have dedicated ops engineers, and would need to adapt/abstract K8s in order to make it user friendly enough for developers to use.

For example, to start:

* Where do you store the YAML manifests, and how do you maintain them? Do you put them in the same git repo as your app, and if so, how do you deal with the fact that the YAML files are going to be different for production, staging, testing/QA, etc.? (For example, ingresses will use other host names. Configs are likely to be rather different altogether.)

* Or if you centralize them in a single git repo, you have to make sure that you always pull the newest version, and that your workflow includes diffing against the currently deployed version, and so on.

* How do you protect configs/secrets, if they're in git and available to all?

* Dependency tracking: If app A needs app B, you want to encapsulate that dependency in your workflow.

* Continuous delivery: How do you take care of these concerns in relation to the CD system (e.g. Drone, Jenkins)?

* And of course, you'll want to be able to develop locally. If you run Minikube, how do you interact with it, YAML-wise?

There's Helm, but Helm is essentially just a "templates in a central repo" manager. It doesn't solve the configuration issue: You still need to provide values to the charts.

It doesn't sit right to have the YAML files in the project's git repo. Config follows environment, not project.

For our current, non-Kubernetes production setup, we have a tool called Monkey that allows people do things like "monkey deploy staging:myapp -r mybranch" to deploy a branch, or "monkey sql prod:myapp" to get a PostgreSQL shell against the production database for that app, with lots of nice commands to work with the cluster. They also work directly with the Vagrant VM that developers run locally: "monkey deploy dev:myapp --dev" actually deploys to the VM, pointing the directory to the user's own project (on the host machine), so that the files are the same. Combined with a React/Redux app, they get hot code reloading and so on without having to do anything special. This shields developers from the complexities and realities of a Linux server in a very nice, convenient way. We want to provide the same on top of Kubernetes.

Right now I'm debating whether to just go for "simple and stupid" and have a central repo, and then build a small toolset to wrap kubectl for devs. By running Minikube locally, devs would be able to use the same tool locally. The command-line tool could do things like always pull the repo to ensure you're working with the newest files. The main challenge there is organizing the YAML files in a clean way. Do we use templates to reduce boilerplate? Or do we rely on just copy-pasting stuff? It's not quite clear to me.

Helm seems nice, but it's another moving part to add to the mix. I'm leaning towards a simpler approach. I've looked into Deis Workflow, but it's actually a full-blown PaaS, and seems a bit much.


I've been running K8s for the past 6 months. So far, so good. Helm helped us a lot in our workflow. A lot of trial and error, but to answer your points, this is what we're doing so far:

> * Where do you store the YAML manifests, and how do you maintain them? Do you put them in the same git repo as your app, and if so, how do you deal with the fact that the YAML files are going to be different for production, staging, testing/QA, etc.? (For example, ingresses will use other host names. Configs are likely to be rather different altogether.)

We store the manifests in the same git repo as the app. Docker images and YAML files are all the same; what's different are the values, so we use Helm and override the values per environment.

The values are split 4 ways:

- default sensible values (values.yaml). For example, feature flags.

- datacenter specific (gcp.yaml, aws.yaml). For example, REDIS_HOST and MYSQL_HOST.

- environment values (env-production.yaml or env-staging.yaml). For example, the ingress values and its certs, and how many replicas for each service.

- secrets.yaml (not stored in git, but generated on the fly, will explain more later). For example, MYSQL_PASSWORD.

I mean, if you're doing your services right, they should all have the same docker images, only the config changes. Helm has a pretty good templating and config helpers.

> * Or if you centralize them in a single git repo, you have to make sure that you always pull the newest version, and that your workflow includes diffing against the currently deployed version, and so on.

Helm takes care of this. When you upgrade a deployment, it looks like it's doing a diff. What's cool though, is that it seems like it does a diff on everything except replicas, which is exactly what we needed.

> * How do you protect configs/secrets, if they're in git and available to all?

I don't mind having the config in git; I do when it comes to secrets. We store secrets as GitLab variables, and our CI dynamically creates a "secrets.yaml" before deploying. I'm still not really happy about this though. I think a better way would be to use Vault, but it adds some complexities that I'm not really ready to deal with yet.

> * Dependency tracking: If app A needs app B, you want to encapsulate that dependency in your workflow.

I'm not too sure I understand what you mean by this one. We, thankfully, merged all our git repos into 1 monolithic repository. I remember at first reading about the big boys (FB, MS, Google) having a single repository and thought they were crazy. Then we added more and more services and suddenly I realized that having a monorepo makes it a lot easier for developers and ops. AppA~1.0 needs AppB~2.5 - we used to have a crazy graph of dependencies like that. Not anymore. Now every service has the same version -- we're simply using git hashes. AppA~4c20d needs AppB~4c20d. Every push to the repo rebuilds the docker images and tags them like this: AppA:master-4c20d and AppB:master-4c20d. And yes, we build for all branches and all commits.

So deploying is super easy. I just deploy everything at once, every time. If I need to roll back, I can roll back to a specific version on all services. If I need to test a specific version in a real environment, all I do is add --set BRANCH=fix-bug-branch,COMMIT=4c20d

> * Continuous delivery: How do you take care of these concerns in relation to the CD system (e.g. Drone, Jenkins)?

There are no concerns here, since every code push starts our pipeline, and the pipeline rebuilds all docker images and runs the tests against each other. It's pretty dope.

> * And of course, you'll want to be able to develop locally. If you run Minikube, how do you interact with it, YAML-wise?

We have a local bare-metal server running a single-node k8s, but to be honest, we've never had a dev needing to develop on K8s specifically. Each dev that builds a service is also tasked with building the Dockerfile that goes with it, and, with assistance, they build the YAML file that goes with it. It's pretty straightforward. Our GitLab auto-deploys to dev, staging and review environments, so if there's invalid YAML, the pipeline will catch it.

> There's Helm, but Helm is essentially just a "templates in a central repo" manager. It doesn't solve the configuration issue: You still need to provide values to the charts.

I think it does but not directly. It solves it because it allows you to use templates and override config in a sane way.

Here's an extract of our .gitlab-ci file for deploying to production:

    - echo "Deploying to Kubernetes Env=production Branch=$CI_BUILD_REF_NAME Version=$CI_BUILD_REF"
    - "echo MYSQL_PASSWORD: $MYSQL_PASSWORD_PRODUCTION > tmp.yaml"
    - "echo MYSQL_USERNAME: $MYSQL_USERNAME_PRODUCTION >> tmp.yaml"
    - ... more secrets echo config here
    - helm upgrade master ./manifests/App -f ./manifests/gcp.yaml -f ./manifests/env-production.yaml -f tmp.yaml --set BRANCH=master,COMMIT=$CI_BUILD_REF

> Right now I'm debating whether to just go for "simple and stupid" and have a central repo, and then build a small toolset to wrap kubectl for devs [...]

We started with building our own toolset to wrap kubectl. It worked. Then we found Helm. I remember at first looking at it and thought it wasn't useful for us.. now it looks pretty good. We heavily use templates to reduce boilerplate.

-----

I'm curious about your tool called Monkey. Is it a frontend for Chef or Ansible? Also, I can pretty much ask all the same questions you've asked, but geared towards your tool (where do you store manifests, how do you protect config/secrets, how do you handle dependencies, continuous delivery, etc.).


Thanks for the summary. Helm does look like it might be the way forward, although it doesn't address every single issue I have.

To address some of your responses:

Monorepo: I see the benefits, but there are also some big downsides to this approach. It's problematic to have to sync code in/out of a monorepo when you have lots of open source projects. You'll also much more frequently need to pull and deal with upstream changes completely unrelated to your stuff; and git commands like "git log" now need to be invoked as "git log ." to avoid getting a swathe of unrelated commits outside the app you're working on. And so on. Not about to go down that road.

Regarding dependencies, I'm talking about a development environment where you'd want to run a subset of the apps needed to work on the stack. (Running all of our stuff at once on one machine would make for a _really_ heavy VM.) We rely on each developer having a VM controlled with Vagrant. If you want to work on app A, which depends on app B, then a developer shouldn't need to read the readme to figure that out; they should be able to just deploy A, and have B be included automatically. (This is a lot more trivial than the other challenges, and can be solved with annotations.)

Lastly, about Monkey: It is simply a small tool that controls our VMs via SSH. The VMs are already statically configured with Puppet, so Monkey knows what app should be deployed where (it can ask PuppetDB), and it just runs commands to do things like "npm install" and "go build". Some of that stuff is read from Puppet, some is declared as part of Monkey's config. But it's entirely manual. This is the system we want to move away from, of course, so it's not a template of how things should be done.

But the point about Monkey is that it's a convenience tool that glosses over the gritty details of interacting with a cluster. A developer doesn't need to know what's going on behind the scenes of a deploy command. I want to achieve the same thing with Kubernetes.

We're not at the point where we want to do continuous delivery, but I'd like to use Drone to perform the Docker build, as it can also run tests at the same time. This is iffy, since a developer would need a way to wait for Drone to push the final Docker image — unless builds are themselves manual, which is another option. Yet another option is to reserve a specific branch (e.g. "prod") for testing.


Kubernetes is really cool. The only bad thing I can say about Kubernetes is that it has "uber" in its name.



