]
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Testing our PostgreSQL pod
- We will use `kubectl exec` to get a shell in the pod
- Good to know: we need to use the `postgres` user in the pod
.lab[
- Get a shell in the pod, as the `postgres` user:
```bash
kubectl exec -ti postgres-0 -- su postgres
```
- Check that default databases have been created correctly:
```bash
psql -l
```
]
(This should show us 3 lines: postgres, template0, and template1.)
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Inserting data in PostgreSQL
- We will create a database and populate it with `pgbench`
.lab[
- Create a database named `demo`:
```bash
createdb demo
```
- Populate it with `pgbench`:
```bash
pgbench -i demo
```
]
- The `-i` flag means "create tables"
- If you want more data in the test tables, add e.g. `-s 10` (to get 10x more rows)
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Checking how much data we have now
- The `pgbench` tool inserts rows in table `pgbench_accounts`
.lab[
- Check that the `demo` database exists:
```bash
psql -l
```
- Check how many rows we have in `pgbench_accounts`:
```bash
psql demo -c "select count(*) from pgbench_accounts"
```
- Check that `pgbench_history` is currently empty:
```bash
psql demo -c "select count(*) from pgbench_history"
```
]
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Testing the load generator
- Let's use `pgbench` to generate a few transactions
.lab[
- Run `pgbench` for 10 seconds, reporting progress every second:
```bash
pgbench -P 1 -T 10 demo
```
- Check the size of the history table now:
```bash
psql demo -c "select count(*) from pgbench_history"
```
]
Note: on small cloud instances, a typical speed is about 100 transactions/second.
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Generating transactions
- Now let's use `pgbench` to generate more transactions
- While it's running, we will disrupt the database server
.lab[
- Run `pgbench` for 10 minutes, reporting progress every second:
```bash
pgbench -P 1 -T 600 demo
```
- You can use a longer time period if you need more time to run the next steps
]
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Find out which node is hosting the database
- We can find that information with `kubectl get pods -o wide`
.lab[
- Check the node running the database:
```bash
kubectl get pod postgres-0 -o wide
```
]
We are going to disrupt that node.
--
By "disrupt" we mean: "disconnect it from the network".
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Node failover
⚠️ This will partially break your cluster!
- We are going to disconnect the node running PostgreSQL from the cluster
- We will see what happens, and how to recover
- We will not reconnect the node to the cluster
- This whole lab will take at least 10-15 minutes (due to various timeouts)
⚠️ Only do this lab at the very end, when you don't want to run anything else after!
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Disconnecting the node from the cluster
.lab[
- Find out where the Pod is running, and SSH into that node:
```bash
kubectl get pod postgres-0 -o jsonpath={.spec.nodeName}
ssh nodeX
```
- Check the name of the network interface:
```bash
sudo ip route ls default
```
- The output should look like this:
```
default via 10.10.0.1 `dev ensX` proto dhcp src 10.10.0.13 metric 100
```
- Shut down the network interface:
```bash
sudo ip link set ensX down
```
]
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
class: extra-details
## Another way to disconnect the node
- We can also use `iptables` to block all traffic exiting the node
(except SSH traffic, so we can repair the node later if needed)
.lab[
- SSH to the node to disrupt:
```bash
ssh `nodeX`
```
- Allow SSH traffic leaving the node, but block all other traffic:
```bash
sudo iptables -I OUTPUT -p tcp --sport 22 -j ACCEPT
sudo iptables -I OUTPUT 2 -j DROP
```
]
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Watch what's going on
- Let's look at the status of Nodes, Pods, and Events
.lab[
- In a first pane/tab/window, check Nodes and Pods:
```bash
watch kubectl get nodes,pods -o wide
```
- In another pane/tab/window, check Events:
```bash
kubectl get events --watch
```
]
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Node Ready → NotReady
- After \~30 seconds, the control plane stops receiving heartbeats from the Node
- The Node is marked NotReady
- It is not *schedulable* anymore
(the scheduler won't place new pods there, except in some special cases)
- All Pods on that Node are also *not ready*
(they get removed from service Endpoints)
- ... But nothing else happens for now
(the control plane is waiting: maybe the Node will come back shortly?)
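Under the hood, the Node's `Ready` condition switches to `Unknown`. It looks roughly like the excerpt below (a sketch; timestamps are placeholders):
```yaml
# excerpt of `kubectl get node nodeX -o yaml` after heartbeats stop
status:
  conditions:
    - type: Ready
      status: Unknown
      reason: NodeStatusUnknown
      message: Kubelet stopped posting node status.
      lastHeartbeatTime: "2024-01-01T10:00:00Z"    # placeholder
      lastTransitionTime: "2024-01-01T10:00:40Z"   # placeholder
```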
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Pod eviction
- After \~5 minutes, the control plane will evict most Pods from the Node
- These Pods are now `Terminating`
- The Pods controlled by e.g. ReplicaSets are automatically moved
(or rather: new Pods are created to replace them)
- But nothing happens to the Pods controlled by StatefulSets at this point
(they remain `Terminating` forever)
- Why? 🤔
--
- This is to avoid *split brain scenarios*
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
class: extra-details
## Split brain 🧠⚡️🧠
- Imagine that we create a replacement pod `postgres-0` on another Node
- And 15 minutes later, the Node is reconnected and the original `postgres-0` comes back
- Which one is the "right" one?
- What if they have conflicting data?
😱
- We *cannot* let that happen!
- Kubernetes won't do it
- ... Unless we tell it to
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## The Node is gone
- One thing we can do, is tell Kubernetes "the Node won't come back"
(there are other methods; but this one is the simplest one here)
- This is done with a simple `kubectl delete node`
.lab[
- `kubectl delete` the Node that we disconnected
]
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Pod rescheduling
- Kubernetes removes the Node
- After a brief period of time (\~1 minute) the "Terminating" Pods are removed
- A replacement Pod is created on another Node
- ... But it doesn't start yet!
- Why? 🤔
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Multiple attachment
- By default, a disk can only be attached to one Node at a time
(sometimes it's a hardware or API limitation; sometimes enforced in software)
- In our Events, we should see `FailedAttachVolume` and `FailedMount` messages
- After \~5 more minutes, the disk will be force-detached from the old Node
- ... Which will allow attaching it to the new Node!
🎉
- The Pod will then be able to start
- Failover is complete!
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Check that our data is still available
- We are going to reconnect to the (new) pod and check
.lab[
- Get a shell on the pod:
```bash
kubectl exec -ti postgres-0 -- su postgres
```
- Check how many transactions are now in the `pgbench_history` table:
```bash
psql demo -c "select count(*) from pgbench_history"
```
]
If the 10-second test that we ran earlier gave e.g. 80 transactions per second,
and we failed the node after 30 seconds, we should have about 2400 rows in that table.
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
## Double-check that the pod has really moved
- Just to make sure the system is not bluffing!
.lab[
- Look at which node the pod is now running on
```bash
kubectl get pod postgres-0 -o wide
```
]
???
:EN:- Using highly available persistent volumes
:EN:- Example: deploying a database that can withstand node outages
:FR:- Utilisation de volumes à haute disponibilité
:FR:- Exemple : déployer une base de données survivant à la défaillance d'un nœud
.debug[[k8s/stateful-failover.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/stateful-failover.md)]
---
class: pic
.interstitial[]
---
name: toc-git-based-workflows-gitops
class: title
Git-based workflows (GitOps)
.nav[
[Previous part](#toc-stateful-failover)
|
[Back to table of contents](#toc-part-11)
|
[Next part](#toc-fluxcd)
]
.debug[(automatically generated title slide)]
---
# Git-based workflows (GitOps)
- Deploying with `kubectl` has downsides:
- we don't know *who* deployed *what* and *when*
- there is no audit trail (except the API server logs)
- there is no easy way to undo most operations
- there is no review/approval process (like for code reviews)
- We have all these things for *code*, though
- Can we manage cluster state like we manage our source code?
.debug[[k8s/gitworkflows.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/gitworkflows.md)]
---
## Reminder: Kubernetes is *declarative*
- All we do is create/change resources
- These resources have a perfect YAML representation
- All we do is manipulate these YAML representations
(`kubectl run` generates a YAML file that gets applied)
- We can store these YAML representations in a code repository
- We can version that code repository and maintain it with best practices
- define which branch(es) can go to qa/staging/production
- control who can push to which branches
- have formal review processes, pull requests, test gates...
.debug[[k8s/gitworkflows.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/gitworkflows.md)]
---
## Enabling git-based workflows
- There are many tools out there to help us do that, with different approaches
- "Git host centric" approach: GitHub Actions, GitLab...
*the workflows/actions are directly initiated by the git platform*
- "Kubernetes cluster centric" approach: [ArgoCD], [FluxCD]...
*controllers run on our clusters and trigger on repo updates*
- This is not an exhaustive list (see also: Jenkins)
- We're going to talk mostly about "Kubernetes cluster centric" approaches here
[ArgoCD]: https://argoproj.github.io/cd/
[FluxCD]: https://fluxcd.io/
.debug[[k8s/gitworkflows.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/gitworkflows.md)]
---
## The road to production
In no specific order, we need to at least:
- Choose a tool
- Choose a cluster / app / namespace layout
(one cluster per app, different clusters for prod/staging...)
- Choose a repository layout
(different repositories, directories, branches per app, env, cluster...)
- Choose an installation / bootstrap method
- Choose how new apps / environments / versions will be deployed
- Choose how new images will be built
.debug[[k8s/gitworkflows.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/gitworkflows.md)]
---
## Flux vs ArgoCD (1/2)
- Flux:
- fancy setup with an (optional) dedicated `flux bootstrap` command
(with support for specific git providers, repo creation...)
- deploying an app requires multiple CRDs
(Kustomization, HelmRelease, GitRepository...)
- supports Helm charts, Kustomize, raw YAML
- ArgoCD:
- simple setup (just apply YAMLs / install Helm chart)
- fewer CRDs (a basic workflow can be implemented with a single "Application" resource)
- supports Helm charts, Jsonnet, Kustomize, raw YAML, and arbitrary plugins
.debug[[k8s/gitworkflows.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/gitworkflows.md)]
---
## Flux vs ArgoCD (2/2)
- Flux:
- sync interval is configurable per app
- no web UI out of the box
- CLI relies on Kubernetes API access
- CLI can easily generate custom resource manifests (with `--export`)
- self-hosted (flux controllers are managed by flux itself by default)
- one flux instance manages a single cluster
- ArgoCD:
- sync interval is configured globally
- comes with a web UI
- CLI can use Kubernetes API or separate API and authentication system
- one ArgoCD instance can manage multiple clusters
.debug[[k8s/gitworkflows.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/gitworkflows.md)]
---
## Cluster, app, namespace layout
- One cluster per app, different namespaces for environments?
- One cluster per environment, different namespaces for apps?
- Everything on a single cluster? One cluster per combination?
- Something in between:
- prod cluster, database cluster, dev/staging/etc cluster
- prod+db cluster per app, shared dev/staging/etc cluster
- And more!
Note: this decision isn't really tied to GitOps!
.debug[[k8s/gitworkflows.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/gitworkflows.md)]
---
## Repository layout
So many different possibilities!
- Source repos
- Cluster/infra repos/branches/directories
- "Deployment" repos (with manifests, charts)
- Different repos/branches/directories for environments
🤔 How to decide?
.debug[[k8s/gitworkflows.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/gitworkflows.md)]
---
## Permissions
- Different teams/companies = different repos
- separate platform team → separate "infra" vs "apps" repos
- teams working on different apps → different repos per app
- Branches can be "protected" (`production`, `main`...)
(don't need separate repos for separate environments)
- Directories will typically have the same permissions
- Managing directories is easier than branches
- But branches are more "powerful" (cherrypicking, rebasing...)
.debug[[k8s/gitworkflows.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/gitworkflows.md)]
---
## Resource hierarchy
- Git-based deployments are managed by Kubernetes resources
(e.g. Kustomization, HelmRelease with Flux; Application with ArgoCD)
- We will call these resources "GitOps resources"
- These resources need to be managed like any other Kubernetes resource
(YAML manifests, Kustomizations, Helm charts)
- They can be managed with Git workflows too!
.debug[[k8s/gitworkflows.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/gitworkflows.md)]
---
## Cluster / infra management
- How do we provision clusters?
- Manual "one-shot" provisioning (CLI, web UI...)
- Automation with Terraform, Ansible...
- Kubernetes-driven systems (Crossplane, CAPI)
- Infrastructure can also be managed with GitOps
.debug[[k8s/gitworkflows.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/gitworkflows.md)]
---
## Example 1
- Managed with YAML/Charts:
- core components (CNI, CSI, Ingress, logging, monitoring...)
- GitOps controllers
- critical application foundations (database operator, databases)
- GitOps manifests
- Managed with GitOps:
- applications
- staging databases
.debug[[k8s/gitworkflows.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/gitworkflows.md)]
---
## Example 2
- Managed with YAML/Charts:
- essential components (CNI, CoreDNS)
- initial installation of GitOps controllers
- Managed with GitOps:
- upgrades of GitOps controllers
- core components (CSI, Ingress, logging, monitoring...)
- operators, databases
- more GitOps manifests for applications!
.debug[[k8s/gitworkflows.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/gitworkflows.md)]
---
## Concrete example
- Source code repository (not shown here)
- Infrastructure repository (shown below), single branch
```
├── charts/ <--- could also be in separate app repos
│ ├── dockercoins/
│ └── color/
├── apps/ <--- YAML manifests for GitOps resources
│ ├── dockercoins/ (might reference the "charts" above,
│ ├── blue/ and/or include environment-specific
│ ├── green/ manifests to create e.g. namespaces,
│ ├── kube-prometheus-stack/ configmaps, secrets...)
│ ├── cert-manager/
│ └── traefik/
└── clusters/ <--- per-cluster; will typically reference
├── prod/ the "apps" above, possibly extending
└── dev/ or adding configuration resources too
```
???
:EN:- GitOps
:FR:- GitOps
.debug[[k8s/gitworkflows.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/gitworkflows.md)]
---
class: pic
.interstitial[]
---
name: toc-fluxcd
class: title
FluxCD
.nav[
[Previous part](#toc-git-based-workflows-gitops)
|
[Back to table of contents](#toc-part-11)
|
[Next part](#toc-argocd)
]
.debug[(automatically generated title slide)]
---
# FluxCD
- We're going to implement a basic GitOps workflow with Flux
- Pushing to `main` will automatically deploy to the clusters
- There will be two clusters (`dev` and `prod`)
- The two clusters will have similar (but slightly different) workloads
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Repository structure
This is (approximately) what we're going to do:
```
├── charts/ <--- could also be in separate app repos
│ ├── dockercoins/
│ └── color/
├── apps/ <--- YAML manifests for GitOps resources
│ ├── dockercoins/ (might reference the "charts" above,
│ ├── blue/ and/or include environment-specific
│ ├── green/ manifests to create e.g. namespaces,
│ ├── kube-prometheus-stack/ configmaps, secrets...)
│ ├── cert-manager/
│ └── traefik/
└── clusters/ <--- per-cluster; will typically reference
├── prod/ the "apps" above, possibly extending
└── dev/ or adding configuration resources too
```
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Resource graph
flowchart TD
H/D["charts/dockercoins
(Helm chart)"]
H/C["charts/color
(Helm chart)"]
A/D["apps/dockercoins/flux.yaml
(HelmRelease)"]
A/B["apps/blue/flux.yaml
(HelmRelease)"]
A/G["apps/green/flux.yaml
(HelmRelease)"]
A/CM["apps/cert-manager/flux.yaml
(HelmRelease)"]
A/P["apps/kube-prometheus-stack/flux.yaml
(HelmRelease + Kustomization)"]
A/T["traefik/flux.yaml
(HelmRelease)"]
C/D["clusters/dev/kustomization.yaml
(Kustomization)"]
C/P["clusters/prod/kustomization.yaml
(Kustomization)"]
C/D --> A/B
C/D --> A/D
C/D --> A/G
C/P --> A/D
C/P --> A/G
C/P --> A/T
C/P --> A/CM
C/P --> A/P
A/D --> H/D
A/B --> H/C
A/G --> H/C
A/P --> CHARTS & PV["apps/kube-prometheus-stack/manifests/configmap.yaml
(Helm values)"]
A/CM --> CHARTS
A/T --> CHARTS
CHARTS["Charts on external repos"]
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Getting ready
- Let's make sure we have two clusters
- It's OK to use local clusters (kind, minikube...)
- We might run into resource limits, though
(pay attention to `Pending` pods!)
- We need to install the Flux CLI ([packages], [binaries])
- **Highly recommended:** set up CLI completion!
- Of course we'll need a Git service, too
(we're going to use GitHub here)
[packages]: https://fluxcd.io/flux/get-started/
[binaries]: https://github.com/fluxcd/flux2/releases
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## GitHub setup
- Generate a GitHub token:
https://github.com/settings/tokens/new
- Give it "repo" access
- This token will be used by the `flux bootstrap github` command later
- It will create a repository and configure it (SSH key...)
- The token can be revoked afterwards
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Flux bootstrap
.lab[
- Let's set a few variables for convenience, and create our repository:
```bash
export GITHUB_TOKEN=...
export GITHUB_USER=changeme
export GITHUB_REPO=alsochangeme
export FLUX_CLUSTER=dev
flux bootstrap github \
--owner=$GITHUB_USER \
--repository=$GITHUB_REPO \
--branch=main \
--path=./clusters/$FLUX_CLUSTER \
--personal --private=false
```
]
Problems? Check the next slide!
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## What could go wrong?
- `flux bootstrap` will create or update the repository on GitHub
- Then it will install Flux controllers to our cluster
- Then it waits for these controllers to be up and running and ready
- Check pod status in `flux-system`
- If pods are `Pending`, check that you have enough resources on your cluster
- For testing purposes, it should be fine to lower or remove Flux `requests`!
(but don't do that in production!)
- If anything goes wrong, don't worry, we can just re-run the bootstrap
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
class: extra-details
## Idempotence
- It's OK to run that same `flux bootstrap` command multiple times!
- If the repository already exists, it will re-use it
(it won't destroy or empty it)
- If the path `./clusters/$FLUX_CLUSTER` already exists, it will update it
- It's totally fine to re-run `flux bootstrap` if something fails
- It's totally fine to run it multiple times on different clusters
- Or even to run it multiple times for the *same* cluster
(to reinstall Flux on that cluster after a cluster wipe / reinstall)
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## What do we get?
- Let's look at what `flux bootstrap` installed on the cluster
.lab[
- Look inside the `flux-system` namespace:
```bash
kubectl get all --namespace flux-system
```
- Look at `kustomizations` custom resources:
```bash
kubectl get kustomizations --all-namespaces
```
- See what the `flux` CLI tells us:
```bash
flux get all
```
]
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Deploying with GitOps
- We'll need to add/edit files on the repository
- We can do it by using `git clone`, local edits, `git commit`, `git push`
- Or by editing online on the GitHub website
.lab[
- Create a manifest; for instance `clusters/dev/flux-system/blue.yaml`
- Add that manifest to `clusters/dev/kustomization.yaml`
- Commit and push both changes to the repository
]
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Waiting for reconciliation
- Compare the git hash that we pushed and the one shown with `kubectl get `
- Option 1: wait for Flux to pick up the changes in the repository
(the default interval for git repositories is 1 minute, so that's fast)
- Option 2: use `flux reconcile source git flux-system`
(this puts an annotation on the appropriate resource, triggering an immediate check)
- Option 3: set up receiver webhooks
(so that git updates trigger immediate reconciliation)
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Checking progress
- `flux logs`
- `kubectl get gitrepositories --all-namespaces`
- `kubectl get kustomizations --all-namespaces`
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Did it work?
--
- No!
--
- Why?
--
- We need to indicate the namespace where the app should be deployed
- Either in the YAML manifests
- Or in the `kustomization` custom resource
(using field `spec.targetNamespace`)
- Add the namespace to the manifest and try again!
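For example, if `blue.yaml` deploys a simple web app, specifying the namespace in the manifest could look like this (a sketch; the names and image are placeholders, and the namespace must exist):
```yaml
# hypothetical content for clusters/dev/flux-system/blue.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: blue
  namespace: blue          # <-- the namespace where the app should be deployed
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blue
  template:
    metadata:
      labels:
        app: blue
    spec:
      containers:
        - name: blue
          image: jpetazzo/color   # placeholder image for illustration
```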
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Adding an app in a reusable way
- Let's see a technique to add a whole app
(with multiple resource manifests)
- We want to minimize code repetition
(i.e. easy to add on multiple clusters with minimal changes)
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## The plan
- Add the app manifests in a directory
(e.g.: `apps/myappname/manifests`)
- Create a kustomization manifest for the app and its namespace
(e.g.: `apps/myappname/flux.yaml`)
- The kustomization manifest will refer to the app manifest
- Add the kustomization manifest to the top-level `flux-system` kustomization
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Creating the manifests
- All commands below should be executed at the root of the repository
.lab[
- Put application manifests in their directory:
```bash
mkdir -p apps/dockercoins/manifests
cp ~/container.training/k8s/dockercoins.yaml apps/dockercoins/manifests
```
- Create kustomization manifest:
```bash
flux create kustomization dockercoins \
--source=GitRepository/flux-system \
--path=./apps/dockercoins/manifests/ \
--target-namespace=dockercoins \
--prune=true --export > apps/dockercoins/flux.yaml
```
]
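The exported `flux.yaml` should look roughly like this (a sketch; the exact `apiVersion` and default `interval` depend on the Flux version):
```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: dockercoins
  namespace: flux-system
spec:
  interval: 1m0s          # default value; may differ depending on the CLI version
  path: ./apps/dockercoins/manifests/
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  targetNamespace: dockercoins
```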
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Creating the target namespace
- When deploying *helm releases*, it is possible to automatically create the namespace
- When deploying *kustomizations*, we need to create it explicitly
- Let's put the namespace with the kustomization manifest
(so that the whole app can be managed through a single manifest)
.lab[
- Add the target namespace to the kustomization manifest:
```bash
echo "---
kind: Namespace
apiVersion: v1
metadata:
name: dockercoins" >> apps/dockercoins/flux.yaml
```
]
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Linking the kustomization manifest
- Edit `clusters/dev/flux-system/kustomization.yaml`
- Add a line to reference the kustomization manifest that we created:
```yaml
- ../../../apps/dockercoins/flux.yaml
```
- `git add` our manifests, `git commit`, `git push`
(check with `git status` that we haven't forgotten anything!)
- `flux reconcile` or wait for the changes to be picked up
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Installing with Helm
- We're going to see two different workflows:
- installing a third-party chart
(e.g. something we found on the Artifact Hub)
- installing one of our own charts
(e.g. a chart we authored ourselves)
- The procedures are very similar
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Installing from a public Helm repository
- Let's install [kube-prometheus-stack][kps]
.lab[
- Create the Flux manifests:
```bash
mkdir -p apps/kube-prometheus-stack
flux create source helm kube-prometheus-stack \
--url=https://prometheus-community.github.io/helm-charts \
--export >> apps/kube-prometheus-stack/flux.yaml
flux create helmrelease kube-prometheus-stack \
--source=HelmRepository/kube-prometheus-stack \
--chart=kube-prometheus-stack --release-name=kube-prometheus-stack \
--target-namespace=kube-prometheus-stack --create-target-namespace \
--export >> apps/kube-prometheus-stack/flux.yaml
```
]
[kps]: https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Enable the app
- Just like before, link the manifest from the top-level kustomization
(`flux-system` in namespace `flux-system`)
- `git add` / `git commit` / `git push`
- We should now have a Prometheus+Grafana observability stack!
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Installing from a Helm chart in a git repo
- In this example, the chart will be in the same repo
- In the real world, it will typically be in a different repo!
.lab[
- Generate a basic Helm chart:
```bash
mkdir -p charts
helm create charts/myapp
```
]
(This generates a chart which installs NGINX. A lot of things can be customized, though.)
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Creating the Flux manifests
- The invocation is very similar to our first example
.lab[
- Generate the Flux manifest for the Helm release:
```bash
mkdir apps/myapp
flux create helmrelease myapp \
--source=GitRepository/flux-system \
--chart=charts/myapp \
--target-namespace=myapp --create-target-namespace \
--export > apps/myapp/flux.yaml
```
- Add a reference to that manifest to the top-level kustomization
- `git add` / `git commit` / `git push` the chart, manifest, and kustomization
]
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Passing values
- We can also configure our Helm releases with values
- Using an existing `myvalues.yaml` file:
`flux create helmrelease ... --values=myvalues.yaml`
- Referencing an existing ConfigMap or Secret with a `values.yaml` key:
`flux create helmrelease ... --values-from=ConfigMap/myapp`
- The ConfigMap or Secret must be in the same Namespace as the HelmRelease
(not the target namespace of that HelmRelease!)
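As an illustration, a HelmRelease using `valuesFrom` might look like this (building on the `myapp` example; the ConfigMap name, the `interval`, and the exact `apiVersion` are assumptions that depend on your setup and Flux version):
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: myapp
  namespace: flux-system        # same namespace as the referenced ConfigMap
spec:
  interval: 10m                 # illustrative value
  chart:
    spec:
      chart: charts/myapp
      sourceRef:
        kind: GitRepository
        name: flux-system
  targetNamespace: myapp
  valuesFrom:
    - kind: ConfigMap
      name: myapp               # hypothetical ConfigMap holding a values.yaml key
      valuesKey: values.yaml
```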
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Gotchas
- When creating a HelmRelease using a chart stored in a git repository, you must:
- either bump the chart version (in `Chart.yaml`) after each change,
- or set `spec.chart.spec.reconcileStrategy` to `Revision`
- Why?
- Flux installs helm releases using packaged artifacts
- Artifacts are updated only when the Helm chart version changes
- Unless `reconcileStrategy` is set to `Revision` (instead of the default `ChartVersion`)
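In the HelmRelease, that setting lives under `spec.chart.spec`; a sketch (chart path and source name are illustrative):
```yaml
spec:
  chart:
    spec:
      chart: charts/myapp
      sourceRef:
        kind: GitRepository
        name: flux-system
      reconcileStrategy: Revision   # default is ChartVersion
```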
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## More gotchas
- There is a bug in Flux that prevents using identical subcharts with aliases
- See [fluxcd/flux2#2505][flux2505] for details
[flux2505]: https://github.com/fluxcd/flux2/discussions/2505
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
## Things that we didn't talk about...
- Bucket sources
- Image automation controller
- Image reflector controller
- And more!
???
:EN:- Implementing gitops with Flux
:FR:- Workflow gitops avec Flux
.debug[[k8s/flux.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/flux.md)]
---
class: pic
.interstitial[]
---
name: toc-argocd
class: title
ArgoCD
.nav[
[Previous part](#toc-fluxcd)
|
[Back to table of contents](#toc-part-11)
|
[Next part](#toc-centralized-logging)
]
.debug[(automatically generated title slide)]
---
# ArgoCD
- We're going to implement a basic GitOps workflow with ArgoCD
- Pushing to the default branch will automatically deploy to our clusters
- There will be two clusters (`dev` and `prod`)
- The two clusters will have similar (but slightly different) workloads

.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## ArgoCD concepts
ArgoCD manages **applications** by **syncing** their **live state** with their **target state**.
- **Application**: a group of Kubernetes resources managed by ArgoCD.
Also a custom resource (`kind: Application`) managing that group of resources.
- **Application source type**: the **Tool** used to build the application (Kustomize, Helm...)
- **Target state**: the desired state of an **application**, as represented by the git repository.
- **Live state**: the current state of the application on the cluster.
- **Sync status**: whether or not the live state matches the target state.
- **Sync**: the process of making an application move to its target state.
(e.g. by applying changes to a Kubernetes cluster)
(Check [ArgoCD core concepts](https://argo-cd.readthedocs.io/en/stable/core_concepts/) for more definitions!)
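To make this more concrete, here is a minimal sketch of an Application custom resource (the repository URL, path, and namespaces are placeholders):
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kubercoins
  namespace: argocd             # where ArgoCD is installed
spec:
  project: default
  source:
    repoURL: https://github.com/changeme/kubercoins.git   # placeholder
    targetRevision: main
    path: .
  destination:
    server: https://kubernetes.default.svc
    namespace: kubercoins-prod
```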
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Getting ready
- Let's make sure we have two clusters
- It's OK to use local clusters (kind, minikube...)
- We need to install the ArgoCD CLI ([argocd-packages], [argocd-binaries])
- **Highly recommended:** set up CLI completion!
- Of course we'll need a Git service, too
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Setting up ArgoCD
- The easiest way is to use upstream YAML manifests
- There is also a [Helm chart][argocd-helmchart] if we need more customization
.lab[
- Create a namespace for ArgoCD and install it there:
```bash
kubectl create namespace argocd
kubectl apply --namespace argocd -f \
https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
```
]
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Logging in with the ArgoCD CLI
- The CLI can talk to the ArgoCD API server or to the Kubernetes API server
- For simplicity, we're going to authenticate and communicate with the Kubernetes API
.lab[
- Authenticate with the ArgoCD API (that's what the `--core` flag does):
```bash
argocd login --core
```
- Check that everything is fine:
```bash
argocd version
```
]
--
🤔 `FATA[0000] error retrieving argocd-cm: configmap "argocd-cm" not found`
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## ArgoCD CLI shortcomings
- When using "core" authentication, the ArgoCD CLI uses our current Kubernetes context
(as defined in our kubeconfig file)
- That context needs to point to the correct namespace
(the namespace where we installed ArgoCD)
- In fact, `argocd login --core` doesn't communicate at all with ArgoCD!
(it only updates a local ArgoCD configuration file)
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Trying again in the right namespace
- We will need to run all `argocd` commands in the `argocd` namespace
(this limitation only applies to "core" authentication; see [issue 14167][issue14167])
.lab[
- Switch to the `argocd` namespace:
```bash
kubectl config set-context --current --namespace argocd
```
- Check that we can communicate with the ArgoCD API now:
```bash
argocd version
```
]
- Let's have a look at ArgoCD architecture!
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
class: pic

.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## ArgoCD API Server
The API server is a gRPC/REST server which exposes the API consumed by the Web UI, CLI, and CI/CD systems. It has the following responsibilities:
- application management and status reporting
- invoking of application operations (e.g. sync, rollback, user-defined actions)
- repository and cluster credential management (stored as K8s secrets)
- authentication and auth delegation to external identity providers
- RBAC enforcement
- listener/forwarder for Git webhook events
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## ArgoCD Repository Server
The repository server is an internal service which maintains a local cache of the Git repositories holding the application manifests. It is responsible for generating and returning the Kubernetes manifests when provided the following inputs:
- repository URL
- revision (commit, tag, branch)
- application path
- template specific settings: parameters, helm values...
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## ArgoCD Application Controller
The application controller is a Kubernetes controller which continuously monitors running applications and compares the current, live state against the desired target state (as specified in the repo).
It detects *OutOfSync* application state and optionally takes corrective action.
It is responsible for invoking any user-defined hooks for lifecycle events (*PreSync, Sync, PostSync*).
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Preparing a repository for ArgoCD
- We need a repository with Kubernetes YAML manifests
- You can fork [kubercoins] or create a new, empty repository
- If you create a new, empty repository, add some manifests to it
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Add an Application
- An Application can be added to ArgoCD via the web UI or the CLI
(either way, this will create a custom resource of `kind: Application`)
- The Application should then automatically be deployed to our cluster
(the application manifests will be "applied" to the cluster)
.lab[
- Let's use the CLI to add an Application:
```bash
argocd app create kubercoins \
--repo https://github.com/`/`.git \
--path . --revision `` \
--dest-server https://kubernetes.default.svc \
--dest-namespace kubercoins-prod
```
]
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Checking progress
- We can see sync status in the web UI or with the CLI
.lab[
- Let's check app status with the CLI:
```bash
argocd app list
```
- We can also check directly with the Kubernetes CLI:
```bash
kubectl get applications
```
]
- The app is there and it is `OutOfSync`!
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Manual sync with the CLI
- By default the "sync policy" is `manual`
- It can also be set to `auto`, which would check the git repository every 3 minutes
(this interval can be [configured globally][pollinginterval])
- Manual sync can be triggered with the CLI
.lab[
- Let's force an immediate sync of our app:
```bash
argocd app sync kubercoins
```
]
🤔 We're getting errors!
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Sync failed
We should receive a failure:
`FATA[0000] Operation has completed with phase: Failed`
And in the output, we see more details:
`Message: one or more objects failed to apply,`
`reason: namespaces "kubercoins-prod" not found`
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Creating the namespace
- There are multiple ways to achieve that
- We could generate a YAML manifest for the namespace and add it to the git repository
- Or we could use "Sync Options" so that ArgoCD creates it automatically!
- ArgoCD provides many "Sync Options" to handle various edge cases
- Some [others](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-options/) are: `FailOnSharedResource`, `PruneLast`, `PrunePropagationPolicy`...
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Editing the app's sync options
- This can be done through the web UI or the CLI
.lab[
- Let's use the CLI once again:
```bash
argocd app edit kubercoins
```
- Add the following to the YAML manifest, at the root level:
```yaml
syncPolicy:
  syncOptions:
    - CreateNamespace=true
```
]
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Sync again
.lab[
- Let's retry the sync operation:
```bash
argocd app sync kubercoins
```
- And check the application status:
```bash
argocd app list
kubectl get applications
```
]
- It should show `Synced` and `Progressing`
- After a while (when all pods are running correctly) it should be `Healthy`
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Managing Applications via the Web UI
- ArgoCD is popular in large part due to its browser-based UI
- Let's see how to manage Applications in the web UI
.lab[
- Expose the web dashboard on a local port:
```bash
argocd admin dashboard
```
- This command will show the dashboard URL; open it in a browser
- Authentication should be automatic
]
Note: `argocd admin dashboard` is similar to `kubectl port-forward` or `kubectl proxy`.
(The dashboard remains available as long as `argocd admin dashboard` is running.)
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Adding a staging Application
- Let's add another Application for a staging environment
- First, create a new branch (e.g. `staging`) in our kubercoins fork
- Then, in the ArgoCD web UI, click on the "+ NEW APP" button
(on a narrow display, it might just be "+", right next to buttons looking like 🔄 and ↩️)
- See next slides for details about that form!
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Defining the Application
| Field | Value |
|------------------|--------------------------------------------|
| Application Name | `kubercoins-stg` |
| Project Name | `default` |
| Sync policy | `Manual` |
| Sync options | check `auto-create namespace` |
| Repository URL | `https://github.com//` |
| Revision | `` |
| Path | `.` |
| Cluster URL | `https://kubernetes.default.svc` |
| Namespace | `kubercoins-stg` |
Then click on the "CREATE" button (top left).
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Synchronizing the Application
- After creating the app, it should now show up in the app tiles
(with a yellow outline to indicate that it's out of sync)
- Click on the "SYNC" button on the app tile to show the sync panel
- In the sync panel, click on "SYNCHRONIZE"
- The app will start to synchronize, and should become healthy after a little while
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Making changes
- Let's make changes to our application manifests and see what happens
.lab[
- Make a change to a manifest
(for instance, change the number of replicas of a Deployment)
- Commit that change and push it to the staging branch
- Check the application sync status:
```bash
argocd app list
```
]
- After a short period of time (a few minutes max) the app should show up "out of sync"
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Automated synchronization
- We don't want to manually sync after every change
(that wouldn't be true continuous deployment!)
- We're going to enable "auto sync"
- Note that this requires much more rigorous testing and observability!
(we need to be sure that our changes won't crash our app or even our cluster)
- Argo project also provides [Argo Rollouts][rollouts]
(a controller and CRDs to provide blue-green, canary deployments...)
- Today we'll just turn on automated sync for the staging namespace
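For reference, enabling auto-sync in the Application manifest (instead of the web UI) boils down to a sync policy like this (a sketch; `prune` and `selfHeal` are optional extras shown for illustration):
```yaml
syncPolicy:
  automated:
    prune: true      # delete resources that were removed from git
    selfHeal: true   # revert manual changes made directly on the cluster
  syncOptions:
    - CreateNamespace=true
```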
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Enabling auto-sync
- In the web UI, go to *Applications* and click on *kubercoins-stg*
- Click on the "DETAILS" button (top left, might be just a "i" sign on narrow displays)
- Click on "ENABLE AUTO-SYNC" (under "SYNC POLICY")
- After a few minutes the changes should show up!
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Rolling back
- If we deploy a broken version, how do we recover?
- "The GitOps way": revert the changes in source control
(see next slide)
- Emergency rollback:
- disable auto-sync (if it was enabled)
- on the app page, click on "HISTORY AND ROLLBACK"
(with the clock-with-backward-arrow icon)
- click on the "..." button next to the button we want to roll back to
- click "Rollback" and confirm
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Rolling back with GitOps
- The correct way to roll back is rolling back the code in source control
```bash
git checkout staging
git revert HEAD
git push origin staging
```
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Working with Helm
- ArgoCD supports different tools to process Kubernetes manifests:
Kustomize, Helm, Jsonnet, and [Config Management Plugins][cmp]
- Let's see how to deploy Helm charts with ArgoCD!
- In the [kubercoins] repository, there is a branch called [helm-branch]
- It provides a generic Helm chart, in the [generic-service] directory
- There are service-specific values YAML files in the [values] directory
- Let's create one application for each of the 5 components of our app!
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Creating a Helm Application
- The example below uses "upstream" kubercoins
- Feel free to use your own fork instead!
.lab[
- Create an Application for `hasher`:
```bash
argocd app create hasher \
--repo https://github.com/jpetazzo/kubercoins.git \
--path generic-service --revision helm \
--dest-server https://kubernetes.default.svc \
--dest-namespace kubercoins-helm \
--sync-option CreateNamespace=true \
--values ../values/hasher.yaml \
--sync-policy=auto
```
]
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Deploying the rest of the application
- Option 1: repeat the previous command (updating app name and values)
- Option 2: author YAML manifests and apply them
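For option 2, each component would get a manifest along these lines (a sketch modeled on the `hasher` command above; the component name and values file are assumptions):
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: rng                      # hypothetical component name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/jpetazzo/kubercoins.git
    targetRevision: helm
    path: generic-service
    helm:
      valueFiles:
        - ../values/rng.yaml     # assumed per-component values file
  destination:
    server: https://kubernetes.default.svc
    namespace: kubercoins-helm
  syncPolicy:
    automated: {}
    syncOptions:
      - CreateNamespace=true
```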
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Additional considerations
- When running in production, ArgoCD can be integrated with an [SSO provider][sso]
- ArgoCD embeds and bundles [Dex] to delegate authentication
- it can also use an existing OIDC provider (Okta, Keycloak...)
- A single ArgoCD instance can manage multiple clusters
(but it's also fine to have one ArgoCD per cluster)
- ArgoCD can be complemented with [Argo Rollouts][rollouts] for advanced rollout control
(blue/green, canary...)
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
## Acknowledgements
Many thanks to
Anton (Ant) Weiss ([antweiss.com](https://antweiss.com), [@antweiss](https://twitter.com/antweiss))
and
Guilhem Lettron
for contributing an initial version and suggestions to this ArgoCD chapter.
All remaining typos, mistakes, or approximations are mine (Jérôme Petazzoni).
[argocd-binaries]: https://github.com/argoproj/argo-cd/releases/latest
[argocd-helmchart]: https://artifacthub.io/packages/helm/argo/argocd-apps
[argocd-packages]: https://argo-cd.readthedocs.io/en/stable/cli_installation/
[cmp]: https://argo-cd.readthedocs.io/en/stable/operator-manual/config-management-plugins/
[Dex]: https://github.com/dexidp/dex
[generic-service]: https://github.com/jpetazzo/kubercoins/tree/helm/generic-service
[helm-branch]: https://github.com/jpetazzo/kubercoins/tree/helm
[issue14167]: https://github.com/argoproj/argo-cd/issues/14167
[kubercoins]: https://github.com/jpetazzo/kubercoins
[pollinginterval]: https://argo-cd.readthedocs.io/en/stable/faq/#how-often-does-argo-cd-check-for-changes-to-my-git-or-helm-repository
[rollouts]: https://argoproj.github.io/rollouts/
[sso]: https://argo-cd.readthedocs.io/en/stable/operator-manual/user-management/#sso
[values]: https://github.com/jpetazzo/kubercoins/tree/helm/values
???
:EN:- Implementing gitops with ArgoCD
:FR:- Workflow gitops avec ArgoCD
.debug[[k8s/argocd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/argocd.md)]
---
class: pic
.interstitial[]
---
name: toc-centralized-logging
class: title
Centralized logging
.nav[
[Previous part](#toc-argocd)
|
[Back to table of contents](#toc-part-12)
|
[Next part](#toc-collecting-metrics-with-prometheus)
]
.debug[(automatically generated title slide)]
---
# Centralized logging
- Using `kubectl` or `stern` is simple, but it has drawbacks:
- when a node goes down, its logs are not available anymore
- we can only dump or stream logs; we want to search/index/count...
- We want to send all our logs to a single place
- We want to parse them (e.g. for HTTP logs) and index them
- We want a nice web dashboard
--
- We are going to deploy an EFK stack
.debug[[k8s/logs-centralized.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/logs-centralized.md)]
---
## What is EFK?
- EFK is three components:
- ElasticSearch (to store and index log entries)
- Fluentd (to get container logs, process them, and put them in ElasticSearch)
- Kibana (to view/search log entries with a nice UI)
- The only component that we need to access from outside the cluster will be Kibana
.debug[[k8s/logs-centralized.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/logs-centralized.md)]
---
## Deploying EFK on our cluster
- We are going to use a YAML file describing all the required resources
.lab[
- Load the YAML file into our cluster:
```bash
kubectl apply -f ~/container.training/k8s/efk.yaml
```
]
If we [look at the YAML file](https://github.com/jpetazzo/container.training/blob/master/k8s/efk.yaml), we see that
it creates a daemon set, two deployments, two services,
and a few roles and role bindings (to give fluentd the required permissions).
.debug[[k8s/logs-centralized.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/logs-centralized.md)]
---
## The itinerary of a log line (before Fluentd)
- A container writes a line on stdout or stderr
- Both are typically piped to the container engine (Docker or otherwise)
- The container engine reads the line, and sends it to a logging driver
- The timestamp and stream (stdout or stderr) are added to the log line
- With the default configuration for Kubernetes, the line is written to a JSON file
(`/var/log/containers/pod-name_namespace_container-id.log`)
- That file is read when we invoke `kubectl logs`; we can access it directly too
.debug[[k8s/logs-centralized.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/logs-centralized.md)]
---
## The itinerary of a log line (with Fluentd)
- Fluentd runs on each node (thanks to a daemon set)
- It bind-mounts `/var/log/containers` from the host (to access these files)
- It continuously scans this directory for new files; reads them; parses them
- Each log line becomes a JSON object, fully annotated with extra information:
container id, pod name, Kubernetes labels...
- These JSON objects are stored in ElasticSearch
- ElasticSearch indexes the JSON objects
- We can access the logs through Kibana (and perform searches, counts, etc.)
.debug[[k8s/logs-centralized.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/logs-centralized.md)]
---
## Accessing Kibana
- Kibana offers a web interface that is relatively straightforward
- Let's check it out!
.lab[
- Check which `NodePort` was allocated to Kibana:
```bash
kubectl get svc kibana
```
- With our web browser, connect to Kibana
]
.debug[[k8s/logs-centralized.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/logs-centralized.md)]
---
## Using Kibana
*Note: this is not a Kibana workshop! So this section is deliberately very terse.*
- The first time you connect to Kibana, you must "configure an index pattern"
- Just use the one that is suggested, `@timestamp`.red[*]
- Then click "Discover" (in the top-left corner)
- You should see container logs
- Advice: in the left column, select a few fields to display, e.g.:
`kubernetes.host`, `kubernetes.pod_name`, `stream`, `log`
.red[*]If you don't see `@timestamp`, it's probably because no logs exist yet.
Wait a bit, and double-check the logging pipeline!
.debug[[k8s/logs-centralized.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/logs-centralized.md)]
---
## Caveat emptor
We are using EFK because it is relatively straightforward
to deploy on Kubernetes, without having to redeploy or reconfigure
our cluster. But it doesn't mean that it will always be the best
option for your use-case. If you are running Kubernetes in the
cloud, you might consider using the cloud provider's logging
infrastructure (if it can be integrated with Kubernetes).
The deployment method that we will use here has been simplified:
there is only one ElasticSearch node. In a real deployment, you
might use a cluster, both for performance and reliability reasons.
But this is outside of the scope of this chapter.
The YAML file that we used creates all the resources in the
`default` namespace, for simplicity. In a real scenario, you will
create the resources in the `kube-system` namespace or in a dedicated namespace.
???
:EN:- Centralizing logs
:FR:- Centraliser les logs
.debug[[k8s/logs-centralized.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/logs-centralized.md)]
---
class: pic
.interstitial[]
---
name: toc-collecting-metrics-with-prometheus
class: title
Collecting metrics with Prometheus
.nav[
[Previous part](#toc-centralized-logging)
|
[Back to table of contents](#toc-part-12)
|
[Next part](#toc-prometheus-and-grafana)
]
.debug[(automatically generated title slide)]
---
# Collecting metrics with Prometheus
- Prometheus is an open-source monitoring system including:
- multiple *service discovery* backends to figure out which metrics to collect
- a *scraper* to collect these metrics
- an efficient *time series database* to store these metrics
- a specific query language (PromQL) to query these time series
- an *alert manager* to notify us according to metrics values or trends
- We are going to use it to collect and query some metrics on our Kubernetes cluster
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Why Prometheus?
- We don't endorse Prometheus more or less than any other system
- It's relatively well integrated within the cloud-native ecosystem
- It can be self-hosted (this is useful for tutorials like this)
- It can be used for deployments of varying complexity:
- one binary and 10 lines of configuration to get started
- all the way to thousands of nodes and millions of metrics
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Exposing metrics to Prometheus
- Prometheus obtains metrics and their values by querying *exporters*
- An exporter serves metrics over HTTP, in plain text
- This is what the *node exporter* looks like:
http://demo.robustperception.io:9100/metrics
- Prometheus itself exposes its own internal metrics, too:
http://demo.robustperception.io:9090/metrics
- If you want to expose custom metrics to Prometheus:
- serve a text page like these, and you're good to go
- libraries are available in various languages to help with quantiles etc.
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## How Prometheus gets these metrics
- The *Prometheus server* will *scrape* URLs like these at regular intervals
(by default: every minute; can be more/less frequent)
- The list of URLs to scrape (the *scrape targets*) is defined in configuration
.footnote[Worried about the overhead of parsing a text format?
Check this [comparison](https://github.com/RichiH/OpenMetrics/blob/master/markdown/protobuf_vs_text.md) of the text format with the (now deprecated) protobuf format!]
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Defining scrape targets
This is maybe the simplest configuration file for Prometheus:
```yaml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
```
- In this configuration, Prometheus collects its own internal metrics
- A typical configuration file will have multiple `scrape_configs`
- In this configuration, the list of targets is fixed
- A typical configuration file will use dynamic service discovery
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Service discovery
This configuration file will leverage existing DNS `A` records:
```yaml
scrape_configs:
  - ...
  - job_name: 'node'
    dns_sd_configs:
      - names: ['api-backends.dc-paris-2.enix.io']
        type: 'A'
        port: 9100
```
- In this configuration, Prometheus resolves the provided name(s)
(here, `api-backends.dc-paris-2.enix.io`)
- Each resulting IP address is added as a target on port 9100
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Dynamic service discovery
- In the DNS example, the names are re-resolved at regular intervals
- As DNS records are created/updated/removed, scrape targets change as well
- Existing data (previously collected metrics) is not deleted
- Other service discovery backends work in a similar fashion
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Other service discovery mechanisms
- Prometheus can connect to e.g. a cloud API to list instances
- Or to the Kubernetes API to list nodes, pods, services ...
- Or a service like Consul, Zookeeper, etcd, to list applications
- The resulting configurations files are *way more complex*
(but don't worry, we won't need to write them ourselves)
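For instance, the Kubernetes part of such a configuration could start like this (heavily simplified sketch; real configurations typically add relabeling rules to filter and label the targets):

```yaml
scrape_configs:
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
```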
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Time series database
- We could wonder, "why do we need a specialized database?"
- One metrics data point = metrics ID + timestamp + value
- With a classic SQL or NoSQL data store, that's at least 160 bits of data (e.g. a 64-bit timestamp, a 64-bit value, and at least a 32-bit metric ID) + indexes
- Prometheus is way more efficient, without sacrificing performance
(it will even be gentler on the I/O subsystem since it needs to write less)
- Would you like to know more? Check this video:
[Storage in Prometheus 2.0](https://www.youtube.com/watch?v=C4YV-9CrawA) by [Goutham V](https://twitter.com/putadent) at DC17EU
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Checking if Prometheus is installed
- Before trying to install Prometheus, let's check if it's already there
.lab[
- Look for services with a label `app=prometheus` across all namespaces:
```bash
kubectl get services --selector=app=prometheus --all-namespaces
```
]
If we see a `NodePort` service called `prometheus-server`, we're good!
(We can then skip to "Connecting to the Prometheus web UI".)
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Running Prometheus on our cluster
We need to:
- Run the Prometheus server in a pod
(using e.g. a Deployment to ensure that it keeps running)
- Expose the Prometheus server web UI (e.g. with a NodePort)
- Run the *node exporter* on each node (with a Daemon Set)
- Set up a Service Account so that Prometheus can query the Kubernetes API
- Configure the Prometheus server
(storing the configuration in a Config Map for easy updates)
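To give an idea, the node exporter piece alone could be a DaemonSet roughly like this (simplified sketch; the Helm chart used below generates a more complete version, with volumes, tolerations, RBAC...):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: node-exporter
        image: prom/node-exporter
        ports:
        - containerPort: 9100
```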
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Helm charts to the rescue
- To make our lives easier, we are going to use a Helm chart
- The Helm chart will take care of all the steps explained above
(including some extra features that we don't need, but won't hurt)
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Step 1: install Helm
- If we already installed Helm earlier, this command won't break anything
.lab[
- Install the Helm CLI:
```bash
curl https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get-helm-3 \
| bash
```
]
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Step 2: install Prometheus
- The following command, just like the previous ones, is idempotent
(it won't error out if Prometheus is already installed)
.lab[
- Install Prometheus on our cluster:
```bash
helm upgrade prometheus --install prometheus \
--repo https://prometheus-community.github.io/helm-charts \
--namespace prometheus --create-namespace \
--set server.service.type=NodePort \
--set server.service.nodePort=30090 \
--set server.persistentVolume.enabled=false \
--set alertmanager.enabled=false
```
]
Curious about all these flags? They're explained in the next slide.
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
class: extra-details
## Explaining all the Helm flags
- `helm upgrade prometheus` → upgrade the release named `prometheus`
(a "release" is an instance of an app deployed with Helm)
- `--install` → if it doesn't exist, install it (instead of upgrading)
- `prometheus` → use the chart named `prometheus`
- `--repo ...` → the chart is located on the following repository
- `--namespace prometheus` → put it in that specific namespace
- `--create-namespace` → create the namespace if it doesn't exist
- `--set ...` → here are some *values* to be used when rendering the chart's templates
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
class: extra-details
## Values for the Prometheus chart
Helm *values* are parameters to customize our installation.
- `server.service.type=NodePort` → expose the Prometheus server with a NodePort
- `server.service.nodePort=30090` → set the specific NodePort number to use
- `server.persistentVolume.enabled=false` → do not use a PersistentVolumeClaim
- `alertmanager.enabled=false` → disable the alert manager entirely
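Equivalently, these values could live in a file (the name below is arbitrary) passed to Helm with `-f`:

```yaml
# values.yaml (sketch, equivalent to the --set flags above)
server:
  service:
    type: NodePort
    nodePort: 30090
  persistentVolume:
    enabled: false
alertmanager:
  enabled: false
```

(Then run the same `helm upgrade --install` command with `-f values.yaml` instead of the `--set` flags.)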
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Connecting to the Prometheus web UI
- Let's connect to the web UI and see what we can do
.lab[
- Figure out the NodePort that was allocated to the Prometheus server:
```bash
kubectl get svc --all-namespaces | grep prometheus-server
```
- With your browser, connect to that port
- It should be 30090 if we just installed Prometheus with the Helm chart!
]
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Querying some metrics
- This is easy... if you are familiar with PromQL
.lab[
- Click on "Graph", and in "expression", paste the following:
```
sum by (instance) (
irate(
container_cpu_usage_seconds_total{
pod=~"worker.*"
}[5m]
)
)
```
]
- Click on the blue "Execute" button and on the "Graph" tab just below
- We see the cumulative CPU usage of worker pods for each node
(if we just deployed Prometheus, there won't be much data to see, though)
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Getting started with PromQL
- We can't learn PromQL in just 5 minutes
- But we can cover the basics to get an idea of what is possible
(and have some keywords and pointers)
- We are going to break down the query above
(building it one step at a time)
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Graphing one metric across all tags
This query will show us CPU usage across all containers:
```
container_cpu_usage_seconds_total
```
- The suffix of the metrics name tells us:
- the unit (seconds of CPU)
- that it's the total used since the container creation
- Since it's a "total," it is an increasing quantity
(we need to compute the derivative if we want e.g. CPU % over time)
- We see that the metrics retrieved have *tags* attached to them
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Selecting metrics with tags
This query will show us only metrics for worker containers:
```
container_cpu_usage_seconds_total{pod=~"worker.*"}
```
- The `=~` operator allows regex matching
- We select all the pods with a name starting with `worker`
(it would be better to use labels to select pods; more on that later)
- The result is a smaller set of containers
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Transforming counters into rates
This query will show us CPU usage % instead of total seconds used:
```
100*irate(container_cpu_usage_seconds_total{pod=~"worker.*"}[5m])
```
- The [`irate`](https://prometheus.io/docs/prometheus/latest/querying/functions/#irate) operator computes the "per-second instant rate of increase"
- `rate` is similar, but computes an average rate over the whole time window (smoother; generally preferred for alerting)
- with `irate`, if a counter goes back to zero, we don't get a negative spike
- The `[5m]` tells how far to look back if there is a gap in the data
- And we multiply with `100*` to get CPU % usage
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Aggregation operators
This query sums the CPU usage per node:
```
sum by (instance) (
irate(container_cpu_usage_seconds_total{pod=~"worker.*"}[5m])
)
```
- `instance` corresponds to the node on which the container is running
- `sum by (instance) (...)` computes the sum for each instance
- Note: all the other tags are collapsed
(in other words, the resulting graph only shows the `instance` tag)
- PromQL supports many more [aggregation operators](https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators)
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## What kind of metrics can we collect?
- Node metrics (related to physical or virtual machines)
- Container metrics (resource usage per container)
- Databases, message queues, load balancers, ...
(check out this [list of exporters](https://prometheus.io/docs/instrumenting/exporters/)!)
- Instrumentation (=deluxe `printf` for our code)
- Business metrics (customers served, revenue, ...)
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
class: extra-details
## Node metrics
- CPU, RAM, disk usage on the whole node
- Total number of processes running, and their states
- Number of open files, sockets, and their states
- I/O activity (disk, network), per operation or volume
- Physical/hardware (when applicable): temperature, fan speed...
- ...and much more!
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
class: extra-details
## Container metrics
- Similar to node metrics, but not totally identical
- RAM breakdown will be different
- active vs inactive memory
- some memory is *shared* between containers, and specially accounted for
- I/O activity is also harder to track
- async writes can cause deferred "charges"
- some page-ins are also shared between containers
For details about container metrics, see:
http://jpetazzo.github.io/2013/10/08/docker-containers-metrics/
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
class: extra-details
## Application metrics
- Arbitrary metrics related to your application and business
- System performance: request latency, error rate...
- Volume information: number of rows in database, message queue size...
- Business data: inventory, items sold, revenue...
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
class: extra-details
## Detecting scrape targets
- Prometheus can leverage Kubernetes service discovery
(with proper configuration)
- Services or pods can be annotated with:
- `prometheus.io/scrape: true` to enable scraping
- `prometheus.io/port: 9090` to indicate the port number
- `prometheus.io/path: /metrics` to indicate the URI (`/metrics` by default)
- Prometheus will detect and scrape these (without needing a restart or reload)
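For example, annotating a Service could look like this (the port and path below are made-up values; note that annotation values must be strings, hence the quotes):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```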
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## Querying labels
- What if we want to get metrics for containers belonging to a pod tagged `worker`?
- The cAdvisor exporter does not give us Kubernetes labels
- Kubernetes labels are exposed through another exporter
- We can see Kubernetes labels through the `kube_pod_labels` metric
(each pod appears as a time series with a constant value of `1`)
- Prometheus *kind of* supports "joins" between time series
- But only if the names of the tags match exactly
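Assuming the tag names line up (e.g. both series expose `namespace` and `pod`, which is the case with recent versions), a join could look like this sketch, breaking down CPU usage by the value of the `app` label:

```
sum by (label_app) (
  rate(container_cpu_usage_seconds_total[5m])
  * on (namespace, pod) group_left (label_app)
  kube_pod_labels
)
```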
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
class: extra-details
## What if the tags don't match?
- Older versions of cAdvisor exporter used tag `pod_name` for the name of a pod
- The Kubernetes service endpoints exporter uses tag `pod` instead
- See [this blog post](https://www.robustperception.io/exposing-the-software-version-to-prometheus) or [this other one](https://www.weave.works/blog/aggregating-pod-resource-cpu-memory-usage-arbitrary-labels-prometheus/) to see how to perform "joins"
- Note that Prometheus cannot "join" time series with different labels
(see [Prometheus issue #2204](https://github.com/prometheus/prometheus/issues/2204) for the rationale)
- There is a workaround involving relabeling, but it's "not cheap"
- see [this comment](https://github.com/prometheus/prometheus/issues/2204#issuecomment-261515520) for an overview
- or [this blog post](https://5pi.de/2017/11/09/use-prometheus-vector-matching-to-get-kubernetes-utilization-across-any-pod-label/) for a complete description of the process
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
## In practice
- Grafana is a beautiful (and useful) frontend to display all kinds of graphs
- Not everyone needs to know Prometheus, PromQL, Grafana, etc.
- But in a team, it is valuable to have at least one person who knows them
- That person can set up queries and dashboards for the rest of the team
- It's a little bit like knowing how to optimize SQL queries, Dockerfiles...
Don't panic if you don't know these tools!
...But make sure at least one person in your team is on it 💯
???
:EN:- Collecting metrics with Prometheus
:FR:- Collecter des métriques avec Prometheus
.debug[[k8s/prometheus.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus.md)]
---
class: pic
.interstitial[]
---
name: toc-prometheus-and-grafana
class: title
Prometheus and Grafana
.nav[
[Previous part](#toc-collecting-metrics-with-prometheus)
|
[Back to table of contents](#toc-part-12)
|
[Next part](#toc-resource-limits)
]
.debug[(automatically generated title slide)]
---
# Prometheus and Grafana
- What if we want to retain metrics, view graphs, and spot trends?
- A very popular combo is Prometheus+Grafana:
- Prometheus as the "metrics engine"
- Grafana to display comprehensive dashboards
- Prometheus also has an alert-manager component to trigger alerts
(we won't talk about that one)
.debug[[k8s/prometheus-stack.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus-stack.md)]
---
## Installing Prometheus and Grafana
- A complete metrics stack needs at least:
- the Prometheus server (collects metrics and stores them efficiently)
- a collection of *exporters* (exposing metrics to Prometheus)
- Grafana
- a collection of Grafana dashboards (building them from scratch is tedious)
- The Helm chart `kube-prometheus-stack` combines all these elements
- ... So we're going to use it to deploy our metrics stack!
.debug[[k8s/prometheus-stack.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus-stack.md)]
---
## Installing `kube-prometheus-stack`
- Let's install that stack *directly* from its repo
(without doing `helm repo add` first)
- Otherwise, keep the same naming strategy:
```bash
helm upgrade --install kube-prometheus-stack kube-prometheus-stack \
--namespace kube-prometheus-stack --create-namespace \
--repo https://prometheus-community.github.io/helm-charts
```
- This will take a minute...
- Then check what was installed:
```bash
kubectl get all --namespace kube-prometheus-stack
```
.debug[[k8s/prometheus-stack.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus-stack.md)]
---
## Exposing Grafana
- Let's create an Ingress for Grafana
```bash
kubectl create ingress --namespace kube-prometheus-stack grafana \
--rule=grafana.`cloudnative.party`/*=kube-prometheus-stack-grafana:80
```
(as usual, make sure to use *your* domain name above)
- Connect to Grafana
(remember that the DNS record might take a few minutes to come up)
.debug[[k8s/prometheus-stack.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus-stack.md)]
---
## Grafana credentials
- What could the login and password be?
- Let's look at the Secrets available in the namespace:
```bash
kubectl get secrets --namespace kube-prometheus-stack
```
- There is a `kube-prometheus-stack-grafana` that looks promising!
- Decode the Secret:
```bash
kubectl get secret --namespace kube-prometheus-stack \
kube-prometheus-stack-grafana -o json | jq '.data | map_values(@base64d)'
```
- If you don't have the `jq` tool mentioned above, don't worry...
--
- The login/password is hardcoded to `admin`/`prom-operator` 😬
.debug[[k8s/prometheus-stack.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus-stack.md)]
---
## Grafana dashboards
- Once logged in, click on the "Dashboards" icon on the left
(it's the one that looks like four squares)
- Then click on the "Manage" entry
- Then click on "Kubernetes / Compute Resources / Cluster"
- This gives us a breakdown of resource usage by Namespace
- Feel free to explore the other dashboards!
???
:EN:- Installing Prometheus and Grafana
:FR:- Installer Prometheus et Grafana
:T: Observing our cluster with Prometheus and Grafana
:Q: What's the relationship between Prometheus and Grafana?
:A: Prometheus collects and graphs metrics; Grafana sends alerts
:A: ✔️Prometheus collects metrics; Grafana displays them on dashboards
:A: Prometheus collects and graphs metrics; Grafana is its configuration interface
:A: Grafana collects and graphs metrics; Prometheus sends alerts
.debug[[k8s/prometheus-stack.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/prometheus-stack.md)]
---
class: pic
.interstitial[]
---
name: toc-resource-limits
class: title
Resource Limits
.nav[
[Previous part](#toc-prometheus-and-grafana)
|
[Back to table of contents](#toc-part-12)
|
[Next part](#toc-defining-min-max-and-default-resources)
]
.debug[(automatically generated title slide)]
---
# Resource Limits
- We can attach resource indications to our pods
(or rather: to the *containers* in our pods)
- We can specify *limits* and/or *requests*
- We can specify quantities of CPU and/or memory and/or ephemeral storage
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Requests vs limits
- *Requests* are *guaranteed reservations* of resources
- They are used for scheduling purposes
- Kubelet will use cgroups to e.g. guarantee a minimum amount of CPU time
- A container **can** use more than its requested resources
- A container using *less* than what it requested should never be killed or throttled
- A node **cannot** be overcommitted with requests
(the sum of all requests **cannot** be higher than resources available on the node)
- A small amount of resources is set aside for system components
(this explains why there is a difference between "capacity" and "allocatable")
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Requests vs limits
- *Limits* are "hard limits" (a container **cannot** exceed its limits)
- They aren't taken into account by the scheduler
- A container exceeding its memory limit is killed instantly
(by the kernel out-of-memory killer)
- A container exceeding its CPU limit is throttled
- A container exceeding its disk limit is killed
(usually with a small delay, since this is checked periodically by kubelet)
- On a given node, the sum of all limits **can** be higher than the node size
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Compressible vs incompressible resources
- CPU is a *compressible resource*
- it can be preempted immediately without adverse effect
- if we have N CPU and need 2N, we run at 50% speed
- Memory is an *incompressible resource*
- it needs to be swapped out to be reclaimed; and this is costly
- if we have N GB RAM and need 2N, we might run at... 0.1% speed!
- Disk is also an *incompressible resource*
- when the disk is full, writes will fail
- applications may or may not crash but persistent apps will be in trouble
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Running low on CPU
- Two ways for a container to "run low" on CPU:
- it's hitting its CPU limit
- all CPUs on the node are at 100% utilization
- The app in the container will run slower
(compared to running without a limit, or if CPU cycles were available)
- No other consequence
(but this could affect SLA/SLO for latency-sensitive applications!)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## CPU limits implementation details
- A container with a CPU limit will be "rationed" by the kernel
- Every `cfs_period_us`, it will receive a CPU quota, like an "allowance"
(that interval defaults to 100ms)
- Once it has used its quota, it will be stalled until the next period
- This can easily result in throttling for bursty workloads
(see details on next slide)
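Concretely, a CPU limit of `100m` maps to a quota of 10ms per 100ms period; with cgroups v1, this would show up in files like these (sketch; the actual cgroup path depends on the runtime and cgroup driver, and cgroups v2 use different file names):

```bash
# Hypothetical cgroup path for a container of a Burstable pod
cd /sys/fs/cgroup/cpu/kubepods/burstable/pod<pod-uid>/<container-id>
cat cpu.cfs_period_us   # 100000 = 100ms period
cat cpu.cfs_quota_us    # 10000  = 10ms of CPU per period (i.e. a 100m limit)
```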
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## A bursty example
- Web service receives one request per minute
- Each request takes 1 second of CPU
- Average load: 1.66%
- Let's say we set a CPU limit of 10%
- This means CPU quotas of 10ms every 100ms
- Obtaining the quota for 1 second of CPU will take 10 seconds
- Observed latency will be 10 seconds (... actually 9.9s) instead of 1 second
(real-life scenarios will of course be less extreme, but they do happen!)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## Multi-core scheduling details
- Each core gets a small share of the container's CPU quota
(this avoids locking and contention on the "global" quota for the container)
- By default, the kernel distributes that quota to CPUs in 5ms increments
(tunable with `kernel.sched_cfs_bandwidth_slice_us`)
- If a containerized process (or thread) uses up its local CPU quota:
*it gets more from the "global" container quota (if there's some left)*
- If it "yields" (e.g. sleeps for I/O) before using its local CPU quota:
*the quota is **soon** returned to the "global" container quota, **minus** 1ms*
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## Low quotas on machines with many cores
- The local CPU quota is not immediately returned to the global quota
- this reduces locking and contention on the global quota
- but this can cause starvation when many threads/processes become runnable
- That 1ms that "stays" on the local CPU quota is often useful
- if the thread/process becomes runnable, it can be scheduled immediately
- again, this reduces locking and contention on the global quota
- but if the thread/process doesn't become runnable, it is wasted!
- this can become a huge problem on machines with many cores
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## CPU limits in a nutshell
- Beware if you run small bursty workloads on machines with many cores!
("highly-threaded, user-interactive, non-cpu bound applications")
- Check the `nr_throttled` and `throttled_time` metrics in `cpu.stat`
- Possible solutions/workarounds:
- be generous with the limits
- make sure your kernel has the [appropriate patch](https://lkml.org/lkml/2019/5/17/581)
- use [static CPU manager policy](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy)
For more details, check [this blog post](https://erickhun.com/posts/kubernetes-faster-services-no-cpu-limits/) or these: ([part 1](https://engineering.indeedblog.com/blog/2019/12/unthrottled-fixing-cpu-limits-in-the-cloud/), [part 2](https://engineering.indeedblog.com/blog/2019/12/cpu-throttling-regression-fix/)).
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Running low on memory
- When the kernel runs low on memory, it starts to reclaim used memory
- Option 1: free up some buffers and caches
(fastest option; might affect performance if cache memory runs very low)
- Option 2: swap, i.e. write to disk some memory of one process to give it to another
(can have a huge negative impact on performance because disks are slow)
- Option 3: terminate a process and reclaim all its memory
(OOM or Out Of Memory Killer on Linux)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Memory limits on Kubernetes
- Kubernetes *does not support swap*
(but it may support it in the future, thanks to [KEP 2400])
- If a container exceeds its memory *limit*, it gets killed immediately
- If a node's memory usage gets too high, it will *evict* some pods
(we say that the node is "under pressure", more on that in a bit!)
[KEP 2400]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#implementation-history
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Running low on disk
- When the kubelet runs low on disk, it starts to reclaim disk space
(similarly to what the kernel does, but in different categories)
- Option 1: garbage collect dead pods and containers
(no consequence, but their logs will be deleted)
- Option 2: remove unused images
(no consequence, but these images will have to be repulled if we need them later)
- Option 3: evict pods and remove them to reclaim their disk usage
- Note: this only applies to *ephemeral storage*, not to e.g. Persistent Volumes!
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Ephemeral storage?
- This includes:
- the *read-write layer* of the container
(any file creation/modification outside of its volumes)
- `emptyDir` volumes mounted in the container
- the container logs stored on the node
- This does not include:
- the container image
- other types of volumes (e.g. Persistent Volumes, `hostPath`, or `local` volumes)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## Disk limit enforcement
- Disk usage is periodically measured by kubelet
(with something equivalent to `du`)
- There can be a small delay before pod termination when disk limit is exceeded
- It's also possible to enable filesystem *project quotas*
(e.g. with EXT4 or XFS)
- Remember that container logs are also accounted for!
(container log rotation/retention is managed by kubelet)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## `nodefs` and `imagefs`
- `nodefs` is the main filesystem of the node
(holding, notably, `emptyDir` volumes and container logs)
- Optionally, the container engine can be configured to use an `imagefs`
- `imagefs` will store container images and container writable layers
- When there is a separate `imagefs`, its disk usage is tracked independently
- If `imagefs` usage gets too high, kubelet will remove old images first
(conversely, if `nodefs` usage gets too high, kubelet won't remove old images)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## CPU and RAM reservation
- Kubernetes passes resources requests and limits to the container engine
- The container engine applies these requests and limits with specific mechanisms
- Example: on Linux, this is typically done with control groups aka cgroups
- Most systems use cgroups v1, but cgroups v2 are slowly being rolled out
(e.g. available in Ubuntu 22.04 LTS)
- Cgroups v2 have new, interesting features for memory control:
- ability to set "minimum" memory amounts (to effectively reserve memory)
- better control on the amount of swap used by a container
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## What's the deal with swap?
- With cgroups v1, it's not possible to disable swap for a cgroup
(the closest option is to [reduce "swappiness"](https://unix.stackexchange.com/questions/77939/turning-off-swapping-for-only-one-process-with-cgroups))
- It is possible with cgroups v2 (see the [kernel docs](https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html) and the [fbatx docs](https://facebookmicrosites.github.io/cgroup2/docs/memory-controller.html#using-swap))
- Cgroups v2 aren't widely deployed yet
- The architects of Kubernetes wanted to ensure that Guaranteed pods never swap
- The simplest solution was to disable swap entirely
- Kubelet will refuse to start if it detects that swap is enabled!
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Alternative point of view
- Swap enables paging¹ of anonymous² memory
- Even when swap is disabled, Linux will still page memory for:
- executables, libraries
- mapped files
- Disabling swap *will reduce performance and available resources*
- For a good time, read [kubernetes/kubernetes#53533](https://github.com/kubernetes/kubernetes/issues/53533)
- Also read this [excellent blog post about swap](https://jvns.ca/blog/2017/02/17/mystery-swap/)
¹Paging: reading/writing memory pages from/to disk to reclaim physical memory
²Anonymous memory: memory that is not backed by files or blocks
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Enabling swap anyway
- If you don't care that pods are swapping, you can enable swap
- You will need to add the flag `--fail-swap-on=false` to kubelet
(remember: it won't otherwise start if it detects that swap is enabled)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Pod quality of service
Each pod is assigned a QoS class (visible in `status.qosClass`).
- If limits = requests:
- as long as the container uses less than the limit, it won't be affected
- if all containers in a pod have *(limits=requests)*, QoS is considered "Guaranteed"
- If requests < limits:
- as long as the container uses less than the request, it won't be affected
- otherwise, it might be killed/evicted if the node gets overloaded
- if at least one container has *(requests<limits)*, QoS is considered "Burstable"
- If a pod doesn't have any request nor limit, QoS is considered "BestEffort"
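We can check the class assigned to a pod with a command like this one (`mypod` is a placeholder name):

```bash
kubectl get pod mypod -o jsonpath="{.status.qosClass}"
```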
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Quality of service impact
- When a node is overloaded, BestEffort pods are killed first
- Then, Burstable pods that exceed their requests
- Burstable and Guaranteed pods below their requests are never killed
(except if their node fails)
- If we only use Guaranteed pods, no pod should ever be killed
(as long as they stay within their limits)
(Pod QoS is also explained in [this page](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/) of the Kubernetes documentation and in [this blog post](https://medium.com/google-cloud/quality-of-service-class-qos-in-kubernetes-bb76a89eb2c6).)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Specifying resources
- Resource requests are expressed at the *container* level
- CPU is expressed in "virtual CPUs"
(corresponding to the virtual CPUs offered by some cloud providers)
- CPU can be expressed with a decimal value, or even a "milli" suffix
(so 100m = 0.1)
- Memory and ephemeral disk storage are expressed in bytes
- These can have k, M, G, T, Ki, Mi, Gi, Ti suffixes
(corresponding to 10^3, 10^6, 10^9, 10^12, 2^10, 2^20, 2^30, 2^40)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Specifying resources in practice
This is what the spec of a Pod with resources will look like:
```yaml
containers:
- name: blue
  image: jpetazzo/color
  resources:
    limits:
      cpu: "100m"
      ephemeral-storage: 10M
      memory: "100Mi"
    requests:
      cpu: "10m"
      ephemeral-storage: 10M
      memory: "100Mi"
```
This set of resources makes sure that this service won't be killed (as long as it stays below 100 MB of RAM), but allows its CPU usage to be throttled if necessary.
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Default values
- If we specify a limit without a request:
the request is set to the limit
- If we specify a request without a limit:
there will be no limit
(which means that the limit will be the size of the node)
- If we don't specify anything:
the request is zero and the limit is the size of the node
*Unless there are default values defined for our namespace!*
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## We need to specify resource values
- If we do not set resource values at all:
- the limit is "the size of the node"
- the request is zero
- This is generally *not* what we want
- a container without a limit can use up all the resources of a node
- if the request is zero, the scheduler can't make a smart placement decision
- This is fine when learning/testing, absolutely not in production!
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## How should we set resources?
- Option 1: manually, for each container
- simple, effective, but tedious
- Option 2: automatically, with the [Vertical Pod Autoscaler (VPA)][vpa]
- relatively simple, very minimal involvement beyond initial setup
- not compatible with HPAv1, can disrupt long-running workloads (see [limitations][vpa-limitations])
- Option 3: semi-automatically, with tools like [Robusta KRR][robusta]
- good compromise between manual work and automation
- Option 4: by creating LimitRanges in our Namespaces
- relatively simple, but "one-size-fits-all" approach might not always work
[robusta]: https://github.com/robusta-dev/krr
[vpa]: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
[vpa-limitations]: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler#known-limitations
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: pic
.interstitial[]
---
name: toc-defining-min-max-and-default-resources
class: title
Defining min, max, and default resources
.nav[
[Previous part](#toc-resource-limits)
|
[Back to table of contents](#toc-part-12)
|
[Next part](#toc-namespace-quotas)
]
.debug[(automatically generated title slide)]
---
# Defining min, max, and default resources
- We can create LimitRange objects to indicate any combination of:
- min and/or max resources allowed per pod
- default resource *limits*
- default resource *requests*
- maximal burst ratio (*limit/request*)
- LimitRange objects are namespaced
- They apply to their namespace only
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## LimitRange example
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: my-very-detailed-limitrange
spec:
  limits:
  - type: Container
    min:
      cpu: "100m"
    max:
      cpu: "2000m"
      memory: "1Gi"
    default:
      cpu: "500m"
      memory: "250Mi"
    defaultRequest:
      cpu: "500m"
```
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Example explanation
The YAML on the previous slide shows an example LimitRange object specifying very detailed limits on CPU usage,
and providing defaults on RAM usage.
Note the `type: Container` line: in the future,
it might also be possible to specify limits
per Pod, but it's not [officially documented yet](https://github.com/kubernetes/website/issues/9585).
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## LimitRange details
- LimitRange restrictions are enforced only when a Pod is created
(they don't apply retroactively)
- They don't prevent creation of e.g. an invalid Deployment or DaemonSet
(but the pods will not be created as long as the LimitRange is in effect)
- If there are multiple LimitRange restrictions, they all apply together
(which means that it's possible to specify conflicting LimitRanges,
preventing any Pod from being created)
- If a LimitRange specifies a `max` for a resource but no `default`,
that `max` value becomes the `default` limit too
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: pic
.interstitial[]
---
name: toc-namespace-quotas
class: title
Namespace quotas
.nav[
[Previous part](#toc-defining-min-max-and-default-resources)
|
[Back to table of contents](#toc-part-12)
|
[Next part](#toc-limiting-resources-in-practice)
]
.debug[(automatically generated title slide)]
---
# Namespace quotas
- We can also set quotas per namespace
- Quotas apply to the total usage in a namespace
(e.g. total CPU limits of all pods in a given namespace)
- Quotas can apply to resource limits and/or requests
(like the CPU and memory limits that we saw earlier)
- Quotas can also apply to other resources:
- "extended" resources (like GPUs)
- storage size
- number of objects (number of pods, services...)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Creating a quota for a namespace
- Quotas are enforced by creating a ResourceQuota object
- ResourceQuota objects are namespaced, and apply to their namespace only
- We can have multiple ResourceQuota objects in the same namespace
- The most restrictive values are used
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Limiting total CPU/memory usage
- The following YAML specifies an upper bound for *limits* and *requests*:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: a-little-bit-of-compute
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 10Gi
    limits.cpu: "20"
    limits.memory: 20Gi
```
These quotas will apply to the namespace where the ResourceQuota is created.
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Limiting number of objects
- The following YAML specifies how many objects of specific types can be created:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-for-objects
spec:
  hard:
    pods: 100
    services: 10
    secrets: 10
    configmaps: 10
    persistentvolumeclaims: 20
    services.nodeports: 0
    services.loadbalancers: 0
    count/roles.rbac.authorization.k8s.io: 10
```
(The `count/` syntax allows limiting arbitrary objects, including CRDs.)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## YAML vs CLI
- Quotas can be created with a YAML definition
- ...Or with the `kubectl create quota` command
- Example:
```bash
kubectl create quota my-resource-quota --hard=pods=300,limits.memory=300Gi
```
- With both YAML and CLI form, the values are always under the `hard` section
(there is no `soft` quota)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Viewing current usage
When a ResourceQuota is created, we can see how much of it is used:
```
kubectl describe resourcequota my-resource-quota
Name:                   my-resource-quota
Namespace:              default
Resource                Used  Hard
--------                ----  ----
pods                    12    100
services                1     5
services.loadbalancers  0     0
services.nodeports      0     0
```
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Advanced quotas and PriorityClass
- Pods can have a *priority*
- The priority is a number from 0 to 1000000000
(or even higher for system-defined priorities)
- High number = high priority = "more important" Pod
- Pods with a higher priority can *preempt* Pods with lower priority
(= low priority pods will be *evicted* if needed)
- Useful when mixing workloads in resource-constrained environments
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Setting the priority of a Pod
- Create a PriorityClass
(or use an existing one)
- When creating the Pod, set the field `spec.priorityClassName`
- If the field is not set:
- if there is a PriorityClass with `globalDefault`, it is used
- otherwise, the default priority will be zero
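Here is a sketch of what such a PriorityClass could look like (the name and value are arbitrary):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "For pods that should be scheduled (and kept) ahead of the others"
```

Then we would set `priorityClassName: high-priority` in the `spec` of the Pod (or of the Pod template).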
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## PriorityClass and ResourceQuotas
- A ResourceQuota can include a list of *scopes* or a *scope selector*
- In that case, the quota will only apply to the scoped resources
- Example: limit the resources allocated to "high priority" Pods
- In that case, make sure that the quota is created in every Namespace
(or use *admission configuration* to enforce it)
- See the [resource quotas documentation][quotadocs] for details
[quotadocs]: https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-per-priorityclass
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: pic
.interstitial[]
---
name: toc-limiting-resources-in-practice
class: title
Limiting resources in practice
.nav[
[Previous part](#toc-namespace-quotas)
|
[Back to table of contents](#toc-part-12)
|
[Next part](#toc-checking-node-and-pod-resource-usage)
]
.debug[(automatically generated title slide)]
---
# Limiting resources in practice
- We have at least three mechanisms:
- requests and limits per Pod
- LimitRange per namespace
- ResourceQuota per namespace
- Let's see one possible strategy to get started with resource limits
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Set a LimitRange
- In each namespace, create a LimitRange object
- Set a small default CPU request and CPU limit
(e.g. "100m")
- Set a default memory request and limit depending on your most common workload
- for Java, Ruby: start with "1G"
- for Go, Python, PHP, Node: start with "250M"
- Set upper bounds slightly below your expected node size
(80-90% of your node size, with at least a 500M memory buffer)
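A LimitRange implementing that strategy could look like this (sketch with made-up numbers, assuming ~4 GB nodes and mostly Go/Python/Node workloads; adjust to your own case):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: sensible-defaults
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: "100m"
      memory: "250M"
    default:
      cpu: "100m"
      memory: "250M"
    max:
      cpu: "2"
      memory: "3G"
```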
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Set a ResourceQuota
- In each namespace, create a ResourceQuota object
- Set generous CPU and memory limits
(e.g. half the cluster size if the cluster hosts multiple apps)
- Set generous objects limits
- these limits should not be here to constrain your users
- they should catch a runaway process creating many resources
- example: a custom controller creating many pods
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Observe, refine, iterate
- Observe the resource usage of your pods
(we will see how in the next chapter)
- Adjust individual pod limits
- If you see trends: adjust the LimitRange
(rather than adjusting every individual set of pod limits)
- Observe the resource usage of your namespaces
(with `kubectl describe resourcequota ...`)
- Rinse and repeat regularly
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Underutilization
- Remember: when assigning a pod to a node, the scheduler looks at *requests*
(not at current utilization on the node)
- If pods request resources but don't use them, this can lead to underutilization
(because the scheduler will consider that the node is full and can't fit new pods)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Viewing a namespace limits and quotas
- `kubectl describe namespace` will display resource limits and quotas
.lab[
- Try it out:
```bash
kubectl describe namespace default
```
- View limits and quotas for *all* namespaces:
```bash
kubectl describe namespace
```
]
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Additional resources
- [A Practical Guide to Setting Kubernetes Requests and Limits](http://blog.kubecost.com/blog/requests-and-limits/)
- explains what requests and limits are
- provides guidelines to set requests and limits
- gives PromQL expressions to compute good values
(our app needs to be running for a while)
- [Kube Resource Report](https://codeberg.org/hjacobs/kube-resource-report)
- generates web reports on resource usage
- [nsinjector](https://github.com/blakelead/nsinjector)
- controller to automatically populate a Namespace when it is created
???
:EN:- Setting compute resource limits
:EN:- Defining default policies for resource usage
:EN:- Managing cluster allocation and quotas
:EN:- Resource management in practice
:FR:- Allouer et limiter les ressources des conteneurs
:FR:- Définir des ressources par défaut
:FR:- Gérer les quotas de ressources au niveau du cluster
:FR:- Conseils pratiques
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: pic
.interstitial[]
---
name: toc-checking-node-and-pod-resource-usage
class: title
Checking Node and Pod resource usage
.nav[
[Previous part](#toc-limiting-resources-in-practice)
|
[Back to table of contents](#toc-part-12)
|
[Next part](#toc-cluster-sizing)
]
.debug[(automatically generated title slide)]
---
# Checking Node and Pod resource usage
- We've installed a few things on our cluster so far
- How much resources (CPU, RAM) are we using?
- We need metrics!
.lab[
- Let's try the following command:
```bash
kubectl top nodes
```
]
.debug[[k8s/metrics-server.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/metrics-server.md)]
---
## Is metrics-server installed?
- If we see a list of nodes, with CPU and RAM usage:
*great, metrics-server is installed!*
- If we see `error: Metrics API not available`:
*metrics-server isn't installed, so we'll install it!*
.debug[[k8s/metrics-server.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/metrics-server.md)]
---
## The resource metrics pipeline
- The `kubectl top` command relies on the Metrics API
- The Metrics API is part of the "[resource metrics pipeline]"
- The Metrics API isn't served by (i.e. built into) the Kubernetes API server
- It is made available through the [aggregation layer]
- It is usually served by a component called metrics-server
- It is optional (Kubernetes can function without it)
- It is necessary for some features (like the Horizontal Pod Autoscaler)
[resource metrics pipeline]: https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/
[aggregation layer]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/
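Under the hood, `kubectl top` queries endpoints like the ones below, which we can also hit directly once metrics-server (installed a few slides later) is running:

```bash
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods
```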
.debug[[k8s/metrics-server.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/metrics-server.md)]
---
## Other ways to get metrics
- We could use a SAAS like Datadog, New Relic...
- We could use a self-hosted solution like Prometheus
- Or we could use metrics-server
- What's special about metrics-server?
.debug[[k8s/metrics-server.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/metrics-server.md)]
---
## Pros/cons
Cons:
- no data retention (no history data, just instant numbers)
- only CPU and RAM of nodes and pods (no disk or network usage or I/O...)
Pros:
- very lightweight
- doesn't require storage
- used by Kubernetes autoscaling
.debug[[k8s/metrics-server.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/metrics-server.md)]
---
## Why metrics-server
- We may install something fancier later
(think: Prometheus with Grafana)
- But metrics-server will work in *minutes*
- It will barely use resources on our cluster
- It's required for autoscaling anyway
.debug[[k8s/metrics-server.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/metrics-server.md)]
---
## How metrics-server works
- It runs a single Pod
- That Pod will fetch metrics from all our Nodes
- It will expose them through the Kubernetes API aggregation layer
(we won't say much more about that aggregation layer; that's fairly advanced stuff!)
.debug[[k8s/metrics-server.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/metrics-server.md)]
---
## Installing metrics-server
- In a lot of places, this is done with a little bit of custom YAML
(derived from the [official installation instructions](https://github.com/kubernetes-sigs/metrics-server#installation))
- We can also use a Helm chart:
```bash
helm upgrade --install metrics-server metrics-server \
--create-namespace --namespace metrics-server \
--repo https://kubernetes-sigs.github.io/metrics-server/ \
--set args={--kubelet-insecure-tls=true}
```
- The `args` flag specified above should be sufficient on most clusters
.debug[[k8s/metrics-server.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/metrics-server.md)]
---
class: extra-details
## Kubelet insecure TLS?
- The metrics-server collects metrics by connecting to kubelet
- The connection is secured by TLS
- This requires a valid certificate
- In some cases, the certificate is self-signed
- In other cases, it might be valid, but include only the node name
(not its IP address, which is used by default by metrics-server)
.debug[[k8s/metrics-server.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/metrics-server.md)]
---
## Testing metrics-server
- After a minute or two, metrics-server should be up
- We should now be able to check Nodes resource usage:
```bash
kubectl top nodes
```
- And Pods resource usage, too:
```bash
kubectl top pods --all-namespaces
```
.debug[[k8s/metrics-server.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/metrics-server.md)]
---
## Keep some padding
- The RAM usage that we see should correspond more or less to the Resident Set Size
- Our pods also need some extra space for buffers, caches...
- Do not aim for 100% memory usage!
- Some more realistic targets:
50% (for workloads with disk I/O and leveraging caching)
90% (on very big nodes with mostly CPU-bound workloads)
75% (anywhere in between!)
.debug[[k8s/metrics-server.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/metrics-server.md)]
---
## Other tools
- kube-capacity is a great CLI tool to view resources
(https://github.com/robscott/kube-capacity)
- It can show resource requests and limits, and compare them with usage
- It can show utilization per node, or per pod
- kube-resource-report can generate HTML reports
(https://codeberg.org/hjacobs/kube-resource-report)
???
:EN:- The resource metrics pipeline
:EN:- Installing metrics-server
:FR:- Le *resource metrics pipeline*
:FR:- Installation de metrics-server
.debug[[k8s/metrics-server.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/metrics-server.md)]
---
class: pic
.interstitial[]
---
name: toc-cluster-sizing
class: title
Cluster sizing
.nav[
[Previous part](#toc-checking-node-and-pod-resource-usage)
|
[Back to table of contents](#toc-part-12)
|
[Next part](#toc-disruptions)
]
.debug[(automatically generated title slide)]
---
# Cluster sizing
- What happens when the cluster gets full?
- How can we scale up the cluster?
- Can we do it automatically?
- What are other methods to address capacity planning?
.debug[[k8s/cluster-sizing.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-sizing.md)]
---
## When are we out of resources?
- kubelet monitors node resources:
- memory
- node disk usage (typically the root filesystem of the node)
- image disk usage (where container images and RW layers are stored)
- For each resource, we can provide two thresholds:
- a hard threshold (if it's met, it provokes immediate action)
- a soft threshold (provokes action only after a grace period)
- Resource thresholds and grace periods are configurable
(by passing kubelet command-line flags)
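As an illustration, these thresholds could be set with kubelet flags like the following (the values are made-up examples; they can also be set in the kubelet configuration file):

```bash
# excerpt of kubelet flags (sketch)
--eviction-hard=memory.available<200Mi,nodefs.available<10%
--eviction-soft=memory.available<500Mi
--eviction-soft-grace-period=memory.available=1m
```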
.debug[[k8s/cluster-sizing.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-sizing.md)]
---
## What happens then?
- If disk usage is too high:
- kubelet will try to remove terminated pods
- then, it will try to *evict* pods
- If memory usage is too high:
- it will try to evict pods
- The node is marked as "under pressure"
- This temporarily prevents new pods from being scheduled on the node
.debug[[k8s/cluster-sizing.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-sizing.md)]
---
## Which pods get evicted?
- kubelet looks at the pods' QoS and PriorityClass
- First, pods with BestEffort QoS are considered
- Then, pods with Burstable QoS exceeding their *requests*
(but only if the exceeding resource is the one that is low on the node)
- Finally, pods with Guaranteed QoS, and Burstable pods within their requests
- Within each group, pods are sorted by PriorityClass
- If there are pods with the same PriorityClass, they are sorted by usage excess
(i.e. the pods whose usage exceeds their requests the most are evicted first)
.debug[[k8s/cluster-sizing.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-sizing.md)]
---
class: extra-details
## Eviction of Guaranteed pods
- *Normally*, pods with Guaranteed QoS should not be evicted
- A chunk of resources is reserved for node processes (like kubelet)
- It is expected that these processes won't use more than this reservation
- If they do use more resources anyway, all bets are off!
- If this happens, kubelet must evict Guaranteed pods to preserve node stability
(or Burstable pods that are still within their requested usage)
.debug[[k8s/cluster-sizing.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-sizing.md)]
---
## What happens to evicted pods?
- The pod is terminated
- It is marked as `Failed` at the API level
- If the pod was created by a controller, the controller will recreate it
- The pod will be recreated on another node, *if there are resources available!*
- For more details about the eviction process, see:
- [this documentation page](https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/) about resource pressure and pod eviction,
- [this other documentation page](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/) about pod priority and preemption.
.debug[[k8s/cluster-sizing.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-sizing.md)]
---
## What if there are no resources available?
- Sometimes, a pod cannot be scheduled anywhere:
- all the nodes are under pressure,
- or the pod requests more resources than are available
- The pod then remains in `Pending` state until the situation improves
.debug[[k8s/cluster-sizing.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-sizing.md)]
---
## Cluster scaling
- One way to improve the situation is to add new nodes
- This can be done automatically with the [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler)
- The autoscaler will automatically scale up:
- if there are pods that failed to be scheduled
- The autoscaler will automatically scale down:
- if nodes have a low utilization for an extended period of time
.debug[[k8s/cluster-sizing.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-sizing.md)]
---
## Restrictions, gotchas ...
- The Cluster Autoscaler only supports a few cloud infrastructures
(see the [kubernetes/autoscaler repo][kubernetes-autoscaler-repo] for a list)
- The Cluster Autoscaler cannot scale down nodes that have pods using:
- local storage
- affinity/anti-affinity rules preventing them from being rescheduled
- a restrictive PodDisruptionBudget
[kubernetes-autoscaler-repo]: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider
.debug[[k8s/cluster-sizing.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-sizing.md)]
---
## Other ways to do capacity planning
- "Running Kubernetes without nodes"
- Systems like [Virtual Kubelet](https://virtual-kubelet.io/) or [Kiyot](https://static.elotl.co/docs/latest/kiyot/kiyot.html) can run pods using on-demand resources
- Virtual Kubelet can leverage e.g. ACI or Fargate to run pods
- Kiyot runs pods in ad-hoc EC2 instances (1 instance per pod)
- Economic advantage (no wasted capacity)
- Security advantage (stronger isolation between pods)
Check [this blog post](http://jpetazzo.github.io/2019/02/13/running-kubernetes-without-nodes-with-kiyot/) for more details.
???
:EN:- What happens when the cluster is at, or over, capacity
:EN:- Cluster sizing and scaling
:FR:- Ce qui se passe quand il n'y a plus assez de ressources
:FR:- Dimensionner et redimensionner ses clusters
.debug[[k8s/cluster-sizing.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-sizing.md)]
---
class: pic
.interstitial[]
---
name: toc-disruptions
class: title
Disruptions
.nav[
[Previous part](#toc-cluster-sizing)
|
[Back to table of contents](#toc-part-12)
|
[Next part](#toc-cluster-autoscaler)
]
.debug[(automatically generated title slide)]
---
# Disruptions
In a perfect world...
- hardware never fails
- software never has bugs
- ...and never needs to be updated
- ...and uses a predictable amount of resources
- ...and these resources are infinite anyways
- network latency and packet loss are zero
- humans never make mistakes
--
😬
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Disruptions
In the real world...
- hardware will fail randomly (without advance notice)
- software has bugs
- ...and we constantly add new features
- ...and will sometimes use more resources than expected
- ...and these resources are limited
- network latency and packet loss are NOT zero
- humans make mistakes (shutting down the wrong machine, the wrong app...)
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Disruptions
- In Kubernetes, a "disruption" is something that stops the execution of a Pod
- There are **voluntary** and **involuntary** disruptions
- voluntary = directly initiated by humans (including by mistake!)
- involuntary = everything else
- In this section, we're going to see what they are and how to prevent them
(or at least, mitigate their effects)
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Node outage
- Example: hardware failure (server or network), low-level error
(includes kernel bugs, issues affecting underlying hypervisors or infrastructure...)
- **Involuntary** disruption (even if it results from human error!)
- Consequence: all workloads on that node become unresponsive
- Mitigations:
- scale workloads to at least 2 replicas (or more if quorum is needed)
- add anti-affinity scheduling constraints (to avoid having all pods on the same node)
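For instance, a *preferred* anti-affinity rule spreading replicas across nodes could look like this (a sketch; the `app: web` label is just an example):
```yaml
# In the pod template of the Deployment / StatefulSet:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: web
        topologyKey: kubernetes.io/hostname
```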
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Node outage play-by-play
- Node goes down (or disconnected from network)
- Its lease (in Namespace `kube-node-lease`) doesn't get renewed
- The controller manager detects that and marks the node as "unreachable"
(this adds both a `NoSchedule` and a `NoExecute` taint to the node)
- Eventually, the `NoExecute` taint will evict the pods running on that node
- This will trigger creation of replacement pods by owner controllers
(except for pods with a stable network identity, e.g. in a Stateful Set!)
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Node outage notes
- By default, pods will tolerate the `unreachable:NoExecute` taint for 5 minutes
(toleration automatically added by Admission controller `DefaultTolerationSeconds`)
- Pods of a Stateful Set don't recover automatically:
- as long as the Pod exists, a replacement Pod can't be created
- the Pod will exist as long as its Node exists
- deleting the Node (manually or automatically) will recover the Pod
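A possible way to check the lease and force the recovery (sketch; replace `node-3` with the actual node name):
```bash
# Check when the node's lease was last renewed
kubectl get lease node-3 --namespace kube-node-lease -o yaml

# If the node is gone for good, delete it;
# the StatefulSet controller can then recreate the Pod elsewhere
kubectl delete node node-3
```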
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Memory/disk pressure
- Example: available memory on a node goes below a specific threshold
(because a pod is using too much memory and no limit was set)
- **Involuntary** disruption
- Consequence: kubelet starts to *evict* some pods
- Mitigations:
- set *resource limits* on containers to prevent them from using too many resources
- set *resource requests* on containers to make sure they don't get evicted
(as long as they use less than what they requested)
- make sure that apps don't use more resources than what they've requested
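For example, requests and limits are set per container (illustrative numbers only):
```yaml
resources:
  requests:
    cpu: 250m
    memory: 256Mi     # reserved by the scheduler; protects against eviction
                      # (as long as actual usage stays below it)
  limits:
    cpu: 500m
    memory: 512Mi     # above this, the container gets OOM-killed
```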
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Memory/disk pressure play-by-play
- Memory leak in an application container, slowly causing very high memory usage
- Overall free memory on the node goes below the *soft* or the *hard* threshold
(default hard threshold = 100Mi; default soft threshold = none)
- When reaching the *soft* threshold:
- kubelet waits until the "eviction soft grace period" expires
- then (if resource usage is still above the threshold) it gracefully evicts pods
- When reaching the *hard* threshold:
- kubelet immediately and forcefully evicts pods
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Which pods are evicted?
- Kubelet only considers pods that are using *more* than what they requested
(and only for the resource that is under pressure, e.g. RAM or disk usage)
- First, it sorts pods by *priority¹* (as set with the `priorityClassName` in the pod spec)
- Then, by how much their resource usage exceeds their request
(again, for the resource that is under pressure)
- It evicts pods until enough resources have been freed up
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Soft (graceful) vs hard (forceful) eviction
- Soft eviction = graceful shutdown of the pod
(honors the pod's `terminationGracePeriodSeconds` timeout)
- Hard eviction = immediate shutdown of the pod
(kills all containers immediately)
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Memory/disk pressure notes
- If resource usage increases *very fast*, kubelet might not catch it fast enough
- For memory: this will trigger the kernel out-of-memory killer
- containers killed by OOM are automatically restarted (no eviction)
- eviction might happen at a later point though (if memory usage stays high)
- For disk: there is no "out-of-disk" killer, but writes will fail
- the `write` system call fails with `errno = ENOSPC` / `No space left on device`
- eviction typically happens shortly after (when kubelet catches up)
- If our workloads rely a lot on disk/memory bursts, using a `PriorityClass` might help
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Memory/disk pressure delays
- By default, no soft threshold is defined
- Defining it requires setting both the threshold and the grace period
- Grace periods can be different for the different types of resources
- When a node is under pressure, kubelet places a `NoSchedule` taint
(to avoid adding more pods while the node is under pressure)
- Once the node is no longer under pressure, kubelet clears the taint
(after waiting an extra timeout, `evictionPressureTransitionPeriod`, 5 min by default)
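These thresholds live in the kubelet configuration; a sketch of the relevant fields (example values, not recommendations):
```yaml
# Excerpt of a KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"     # forceful eviction below 100Mi free
evictionSoft:
  memory.available: "300Mi"     # graceful eviction below 300Mi free...
evictionSoftGracePeriod:
  memory.available: "1m"        # ...if it stays there for 1 minute
evictionPressureTransitionPeriod: "5m"
```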
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Accidental deletion
- Example: developer deletes the wrong Deployment, the wrong Namespace...
- **Voluntary** disruption
(from Kubernetes' perspective!)
- Consequence: application is down
- Mitigations:
- only deploy to production systems through e.g. gitops workflows
- enforce peer review of changes
- only give users limited (e.g. read-only) access to production systems
- use canary deployments (might not catch all mistakes though!)
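To illustrate the "limited access" mitigation: read-only access can be granted with the built-in `view` ClusterRole (group and namespace names below are placeholders):
```bash
# Give the "developers" group read-only access to the "production" namespace
kubectl create rolebinding devs-view-prod \
        --clusterrole=view --group=developers \
        --namespace=production
```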
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Bad code deployment
- Example: critical bug introduced, application crashes immediately or is non-functional
- **Voluntary** disruption
(again, from Kubernetes' perspective!)
- Consequence: application is down
- Mitigations:
- readiness probes can mitigate immediate crashes
(rolling update continues only when enough pods are ready)
- delayed crashes will require a rollback
(manual intervention, or automated by a canary system)
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Node shutdown
- Example: scaling down a cluster to save money
- **Voluntary** disruption
- Consequence:
- all workloads running on that node are terminated
- this might disrupt workloads that have too many replicas on that node
- or workloads that should not be interrupted at all
- Mitigations:
- terminate workloads one at a time, coordinating with users
--
🤔
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Node shutdown
- Example: scaling down a cluster to save money
- **Voluntary** disruption
- Consequence:
- all workloads running on that node are terminated
- this might disrupt workloads that have too many replicas on that node
- or workloads that should not be interrupted at all
- Mitigations:
- ~~terminate workloads one at a time, coordinating with users~~
- use Pod Disruption Budgets
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Pod Disruption Budgets
- A PDB is a kind of *contract* between:
- "admins" = folks maintaining the cluster (e.g. adding/removing/updating nodes)
- "users" = folks deploying apps and workloads on the cluster
- A PDB expresses something like:
*in that particular set of pods, do not "disrupt" more than X at a time*
- Examples:
- in that set of frontend pods, do not disrupt more than 1 at a time
- in that set of worker pods, always have at least 10 ready
(do not disrupt them if it would bring down the number of ready pods below 10)
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## PDB - user side
- Cluster users create a PDB with a manifest like this one:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: my-pdb
spec:
#minAvailable: 2
#minAvailable: 90%
maxUnavailable: 1
#maxUnavailable: 10%
selector:
matchLabels:
app: my-app
```
- The PDB must indicate either `minAvailable` or `maxUnavailable`
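Once the PDB is created, we can check how many disruptions it currently allows:
```bash
kubectl get pdb my-pdb
```
(The `ALLOWED DISRUPTIONS` column shows how many matching pods can be evicted right now.)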
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Rounding logic
- Percentages are rounded **up**
- When specifying `maxUnavailable` as a percentage, rounding up can result in a higher effective percentage
(e.g. `maxUnavailable: 50%` with 3 pods can result in 2 pods being unavailable!)
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Unmanaged pods
- Specifying `minAvailable: X` works all the time
- Specifying `minAvailable: X%` or `maxUnavailable` requires *managed pods*
(pods that belong to a controller, e.g. Replica Set, Stateful Set...)
- This is because the PDB controller needs to know the total number of pods
(given by the `replicas` field, not merely by counting pod objects)
- The PDB controller will try to resolve the controller using the pod selector
- If that fails, the PDB controller will emit warning events
(visible with `kubectl describe pdb ...`)
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Zero
- `maxUnavailable: 0` means "do not disrupt my pods"
- Same thing if `minAvailable` is greater than or equal to the number of pods
- In that case, cluster admins are supposed to get in touch with cluster users
- This will prevent fully automated operation
(and some cluster admins' automated systems might not honor that request)
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## PDB - admin side
- As a cluster admin, we need to follow certain rules
- Only shut down (or restart) a node when no pods are running on that node
(except system pods belonging to Daemon Sets)
- To remove pods running on a node, we should use the *eviction API*
(which will check PDB constraints and honor them)
- To prevent new pods from being scheduled on a node, we can use a *taint*
- These operations are streamlined by `kubectl drain`, which will:
- *cordon* the node (add a `NoSchedule` taint)
- invoke the *eviction API* to remove pods while respecting their PDBs
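A typical maintenance sequence could therefore look like this (replace `node-3` with the actual node name):
```bash
# Cordon the node and evict its pods, honoring PDBs
kubectl drain node-3 --ignore-daemonsets

# ...perform the maintenance (reboot, upgrade...)...

# Make the node schedulable again
kubectl uncordon node-3
```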
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Theory vs practice
- `kubectl drain` won't evict pods using `emptyDir` volumes
(unless the `--delete-emptydir-data` flag is passed as well)
- Make sure that `emptyDir` volumes don't hold anything important
(they shouldn't, but... who knows!)
- Kubernetes lacks a standard way for users to express:
*this `emptyDir` volume can/cannot be safely deleted*
- If a PDB forbids an eviction, this requires manual coordination
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
class: extra-details
## Unhealthy pod eviction policy
- By default, unhealthy pods can only be evicted if PDB allows it
(unhealthy = running, but not ready)
- In many cases, unhealthy pods aren't serving traffic anyway, and can safely be removed
- This behavior is enabled by setting the appropriate field in the PDB manifest:
```yaml
spec:
unhealthyPodEvictionPolicy: AlwaysAllow
```
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Node upgrade
- Example: upgrading kubelet or the Linux kernel on a node
- **Voluntary** disruption
- Consequence:
- all workloads running on that node are temporarily interrupted, and restarted
- this might disrupt these workloads
- Mitigations:
- migrate workloads off the node first (as if we were shutting it down)
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Node upgrade notes
- Is it necessary to drain a node before doing an upgrade?
- From [the documentation][node-upgrade-docs]:
*Draining nodes before upgrading kubelet ensures that pods are re-admitted and containers are re-created, which may be necessary to resolve some security issues or other important bugs.*
- It's *probably* safe to upgrade in-place for:
- kernel upgrades
- kubelet patch-level upgrades (1.X.Y → 1.X.Z)
- It's *probably* better to drain the node for minor-revision kubelet upgrades (1.X → 1.Y)
- When in doubt, test extensively in staging environments!
[node-upgrade-docs]: https://kubernetes.io/docs/tasks/administer-cluster/cluster-upgrade/#manual-deployments
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
## Manual rescheduling
- Example: moving workloads around to accommodate noisy neighbors or other issues
(e.g. pod X is doing a lot of disk I/O and this is starving other pods)
- **Voluntary** disruption
- Consequence:
- the moved workloads are temporarily interrupted
- Mitigations:
- define an appropriate number of replicas, declare PDBs
- use the [eviction API][eviction-API] to move workloads
[eviction-API]: https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/
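For reference, the eviction API can also be invoked directly; a minimal sketch (hypothetical pod name and namespace):
```bash
# Describe the Eviction object...
cat > eviction.json <<EOF
{
  "apiVersion": "policy/v1",
  "kind": "Eviction",
  "metadata": { "name": "web-12345", "namespace": "prod" }
}
EOF

# ...and POST it to the pod's "eviction" subresource
kubectl create -f eviction.json --raw /api/v1/namespaces/prod/pods/web-12345/eviction
```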
???
:EN:- Voluntary and involuntary disruptions
:EN:- Pod Disruption Budgets
:FR:- "Disruptions" volontaires et involontaires
:FR:- Pod Disruption Budgets
.debug[[k8s/disruptions.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/disruptions.md)]
---
class: pic
.interstitial[]
---
name: toc-cluster-autoscaler
class: title
Cluster autoscaler
.nav[
[Previous part](#toc-disruptions)
|
[Back to table of contents](#toc-part-12)
|
[Next part](#toc-the-horizontal-pod-autoscaler)
]
.debug[(automatically generated title slide)]
---
# Cluster autoscaler
- When the cluster is full, we need to add more nodes
- This can be done manually:
- deploy new machines and add them to the cluster
- if using managed Kubernetes, use some API/CLI/UI
- Or automatically with the cluster autoscaler:
https://github.com/kubernetes/autoscaler
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Use-cases
- Batch job processing
"once in a while, we need to execute these 1000 jobs in parallel"
"...but the rest of the time there is almost nothing running on the cluster"
- Dynamic workload
"a few hours per day or a few days per week, we have a lot of traffic"
"...but the rest of the time, the load is much lower"
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Pay for what you use
- The point of the cloud is to "pay for what you use"
- If you have a fixed number of cloud instances running at all times:
*you're doing it wrong (except if your load is always the same)*
- If you're not using some kind of autoscaling, you're wasting money
(except if you like lining the pockets of your cloud provider)
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Running the cluster autoscaler
- We must run nodes on a supported infrastructure
- Check the [GitHub repo][autoscaler-providers] for a non-exhaustive list of supported providers
- Sometimes, the cluster autoscaler is installed automatically
(or by setting a flag / checking a box when creating the cluster)
- Sometimes, it requires additional work
(which is often non-trivial and highly provider-specific)
[autoscaler-providers]: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Scaling up in theory
IF a Pod is `Pending`,
AND adding a Node would allow this Pod to be scheduled,
THEN add a Node.
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Fine print 1
*IF a Pod is `Pending`...*
- First of all, the Pod must exist
- Pod creation might be blocked by e.g. a namespace quota
- In that case, the cluster autoscaler will never trigger
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Fine print 2
*IF a Pod is `Pending`...*
- If our Pods do not have resource requests:
*they will be in the `BestEffort` class*
- Generally, Pods in the `BestEffort` class are schedulable
- except if they have anti-affinity placement constraints
- except if all Nodes already run the max number of pods (110 by default)
- Therefore, if we want to leverage cluster autoscaling:
*our Pods should have resource requests*
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Fine print 3
*AND adding a Node would allow this Pod to be scheduled...*
- The autoscaler won't act if:
- the Pod is too big to fit on a single Node
- the Pod has impossible placement constraints
- Examples:
- "run one Pod per datacenter" with 4 pods and 3 datacenters
- "use this nodeSelector" but no such Node exists
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Trying it out
- We're going to check how much capacity is available on the cluster
- Then we will create a basic deployment
- We will add resource requests to that deployment
- Then scale the deployment to exceed the available capacity
- **The following commands require a working cluster autoscaler!**
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Checking available resources
.lab[
- Check how much CPU is allocatable on the cluster:
```bash
kubectl get nodes -o jsonpath={..allocatable.cpu}
```
]
- If we see e.g. `2800m 2800m 2800m`, that means:
3 nodes with 2.8 CPUs allocatable each
- To trigger autoscaling, we will create 7 pods requesting 1 CPU each
(each node can fit 2 such pods)
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Creating our test Deployment
.lab[
- Create the Deployment:
```bash
kubectl create deployment blue --image=jpetazzo/color
```
- Add a request for 1 CPU:
```bash
kubectl patch deployment blue --patch='
spec:
template:
spec:
containers:
- name: color
resources:
requests:
cpu: 1
'
```
]
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Scaling up in practice
- This assumes that we have strictly less than 7 CPUs available
(adjust the numbers if necessary!)
.lab[
- Scale up the Deployment:
```bash
kubectl scale deployment blue --replicas=7
```
- Check that we have a new Pod, and that it's `Pending`:
```bash
kubectl get pods
```
]
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Cluster autoscaling
- After a few minutes, a new Node should appear
- When that Node becomes `Ready`, the Pod will be assigned to it
- The Pod will then be `Running`
- Reminder: the `AGE` of the Pod indicates when the Pod was *created*
(it doesn't indicate when the Pod was scheduled or started!)
- To see other state transitions, check the `status.conditions` of the Pod
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Scaling down in theory
IF a Node has less than 50% utilization for 10 minutes,
AND all its Pods can be scheduled on other Nodes,
AND all its Pods are *evictable*,
AND the Node doesn't have a "don't scale me down" annotation¹,
THEN drain the Node and shut it down.
.footnote[¹The annotation is: `cluster-autoscaler.kubernetes.io/scale-down-disabled=true`]
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## When is a Pod "evictable"?
By default, Pods are evictable, except if any of the following is true.
- They have a restrictive Pod Disruption Budget
- They are "standalone" (not controlled by a ReplicaSet/Deployment, StatefulSet, Job...)
- They are in `kube-system` and don't have a Pod Disruption Budget
- They have local storage (that includes `EmptyDir`!)
This can be overridden by setting the annotation:
`cluster-autoscaler.kubernetes.io/safe-to-evict`
(it can be set to `true` or `false`)
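For example, to mark all pods of a Deployment as safe to evict despite their `emptyDir` volume, we can annotate the pod template (sketch, assuming a Deployment named `cache`):
```bash
kubectl patch deployment cache --patch '
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
'
```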
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Pod Disruption Budget
- Special resource to configure how many Pods can be *disrupted*
(i.e. shutdown/terminated)
- Applies to Pods matching a given selector
(typically matching the selector of a Deployment)
- Only applies to *voluntary disruption*
(e.g. cluster autoscaler draining a node, planned maintenance...)
- Can express `minAvailable` or `maxUnavailable`
- See [documentation][doc-pdb] for details and examples
[doc-pdb]: https://kubernetes.io/docs/tasks/run-application/configure-pdb/
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Local storage
- If our Pods use local storage, they will prevent scaling down
- If we have e.g. an `EmptyDir` volume for caching/sharing:
make sure to set the `.../safe-to-evict` annotation to `true`!
- Even if the volume...
- ...only has a PID file or UNIX socket
- ...is empty
- ...is not mounted by any container in the Pod!
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Expensive batch jobs
- Careful if we have long-running batch jobs!
(e.g. jobs that take many hours/days to complete)
- These jobs could get evicted before they complete
(especially if they use less than 50% of the allocatable resources)
- Make sure to set the `.../safe-to-evict` annotation to `false`!
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Node groups
- Easy scenario: all nodes have the same size
- Realistic scenario: we have nodes of different sizes
- e.g. mix of CPU and GPU nodes
- e.g. small nodes for control plane, big nodes for batch jobs
- e.g. leveraging spot capacity
- The cluster autoscaler can handle it!
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
class: extra-details
## Leveraging spot capacity
- AWS, Azure, and Google Cloud are typically more expensive than their competitors
- However, they offer *spot* capacity (spot instances, spot VMs...)
- *Spot* capacity:
- has a much lower cost (see e.g. AWS [spot instance advisor][awsspot])
- has a cost that varies continuously depending on regions, instance type...
- can be preempted at all times
- To be cost-effective, it is strongly recommended to leverage spot capacity
[awsspot]: https://aws.amazon.com/ec2/spot/instance-advisor/
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Node groups in practice
- The cluster autoscaler maps nodes to *node groups*
- this is an internal, provider-dependent mechanism
- the node group is sometimes visible through a proprietary label or annotation
- Each node group is scaled independently
- The cluster autoscaler uses [expanders] to decide which node group to scale up
(the default expander is "random", i.e. pick a node group at random!)
- Of course, only acceptable node groups will be considered
(i.e. node groups that could accommodate the `Pending` Pods)
[expanders]: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-expanders
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
class: extra-details
## Scaling to zero
- *In general,* a node group needs to have at least one node at all times
(the cluster autoscaler uses that node to figure out the size, labels, taints... of the group)
- *On some providers,* there are special ways to specify labels and/or taints
(but if you want to scale to zero, check that the provider supports it!)
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Warning
- Autoscaling up is easy
- Autoscaling down is harder
- It might get stuck because Pods are not evictable
- Do at least a dry run to make sure that the cluster scales down correctly!
- Have alerts on cloud spend
- *Especially when using big/expensive nodes (e.g. with GPU!)*
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Preferred vs. Required
- Some Kubernetes mechanisms let us express "soft preferences":
- affinity (`requiredDuringSchedulingIgnoredDuringExecution` vs `preferredDuringSchedulingIgnoredDuringExecution`)
- taints (`NoSchedule`/`NoExecute` vs `PreferNoSchedule`)
- Remember that these "soft preferences" can be ignored
(and given enough time and churn on the cluster, they will!)
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Troubleshooting
- The cluster autoscaler publishes its status on a ConfigMap
.lab[
- Check the cluster autoscaler status:
```bash
kubectl describe configmap --namespace kube-system cluster-autoscaler-status
```
]
- We can also check the logs of the autoscaler
(except on managed clusters where it's running internally, not visible to us)
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Acknowledgements
Special thanks to [@s0ulshake] for their help with this section!
If you need help to run your data science workloads on Kubernetes,
they're available for consulting.
(Get in touch with them through https://www.linkedin.com/in/ajbowen/)
[@s0ulshake]: https://twitter.com/s0ulshake
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
class: pic
.interstitial[]
---
name: toc-the-horizontal-pod-autoscaler
class: title
The Horizontal Pod Autoscaler
.nav[
[Previous part](#toc-cluster-autoscaler)
|
[Back to table of contents](#toc-part-12)
|
[Next part](#toc-scaling-with-custom-metrics)
]
.debug[(automatically generated title slide)]
---
# The Horizontal Pod Autoscaler
- What is the Horizontal Pod Autoscaler, or HPA?
- It is a controller that can perform *horizontal* scaling automatically
- Horizontal scaling = changing the number of replicas
(adding/removing pods)
- Vertical scaling = changing the size of individual replicas
(increasing/reducing CPU and RAM per pod)
- Cluster scaling = changing the size of the cluster
(adding/removing nodes)
.debug[[k8s/horizontal-pod-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/horizontal-pod-autoscaler.md)]
---
## Principle of operation
- Each HPA resource (or "policy") specifies:
- which object to monitor and scale (e.g. a Deployment, ReplicaSet...)
- min/max scaling ranges (the max is a safety limit!)
- a target resource usage (e.g. the default is CPU=80%)
- The HPA continuously monitors the CPU usage for the related object
- It computes how many pods should be running:
`TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target)`
- It scales the related object up/down to this target number of pods
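For instance (made-up numbers): with a target of 50% and three pods running at 90%, 80%, and 70% of their CPU request, we get:
```
TargetNumOfPods = ceil((90 + 80 + 70) / 50) = ceil(4.8) = 5
```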
.debug[[k8s/horizontal-pod-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/horizontal-pod-autoscaler.md)]
---
## Pre-requirements
- The metrics server needs to be running
(i.e. we need to be able to see pod metrics with `kubectl top pods`)
- The pods that we want to autoscale need to have resource requests
(because the target CPU% is not absolute, but relative to the request)
- The latter actually makes a lot of sense:
- if a Pod doesn't have a CPU request, it might be using 10% of CPU...
- ...but only because there is no CPU time available!
- this makes sure that we won't add pods to nodes that are already resource-starved
.debug[[k8s/horizontal-pod-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/horizontal-pod-autoscaler.md)]
---
## Testing the HPA
- We will start a CPU-intensive web service
- We will send some traffic to that service
- We will create an HPA policy
- The HPA will automatically scale up the service for us
.debug[[k8s/horizontal-pod-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/horizontal-pod-autoscaler.md)]
---
## A CPU-intensive web service
- Let's use `jpetazzo/busyhttp`
(it is a web server that will use 1s of CPU for each HTTP request)
.lab[
- Deploy the web server:
```bash
kubectl create deployment busyhttp --image=jpetazzo/busyhttp
```
- Expose it with a ClusterIP service:
```bash
kubectl expose deployment busyhttp --port=80
```
- Get the ClusterIP allocated to the service:
```bash
kubectl get svc busyhttp
```
]
.debug[[k8s/horizontal-pod-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/horizontal-pod-autoscaler.md)]
---
## Monitor what's going on
- Let's start a bunch of commands to watch what is happening
.lab[
- Monitor pod CPU usage:
```bash
watch kubectl top pods -l app=busyhttp
```
- Monitor service latency:
```bash
httping http://`$CLUSTERIP`/
```
- Monitor cluster events:
```bash
kubectl get events -w
```
]
.debug[[k8s/horizontal-pod-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/horizontal-pod-autoscaler.md)]
---
## Send traffic to the service
- We will use `ab` (Apache Bench) to send traffic
.lab[
- Send a lot of requests to the service, with a concurrency level of 3:
```bash
ab -c 3 -n 100000 http://`$CLUSTERIP`/
```
]
The latency (reported by `httping`) should increase above 3s.
The CPU utilization should increase to 100%.
(The server is single-threaded and won't go above 100%.)
.debug[[k8s/horizontal-pod-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/horizontal-pod-autoscaler.md)]
---
## Create an HPA policy
- There is a helper command to do that for us: `kubectl autoscale`
.lab[
- Create the HPA policy for the `busyhttp` deployment:
```bash
kubectl autoscale deployment busyhttp --max=10
```
]
By default, it will assume a target of 80% CPU usage.
This can also be set with `--cpu-percent=`.
--
*The autoscaler doesn't seem to work. Why?*
.debug[[k8s/horizontal-pod-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/horizontal-pod-autoscaler.md)]
---
## What did we miss?
- The events stream gives us a hint, but to be honest, it's not very clear:
`missing request for cpu`
- We forgot to specify a resource request for our Deployment!
- The HPA target is not an absolute CPU%
- It is relative to the CPU requested by the pod
.debug[[k8s/horizontal-pod-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/horizontal-pod-autoscaler.md)]
---
## Adding a CPU request
- Let's edit the deployment and add a CPU request
- Since our server can use up to 1 core, let's request 1 core
.lab[
- Edit the Deployment definition:
```bash
kubectl edit deployment busyhttp
```
- In the `containers` list, add the following block:
```yaml
resources:
requests:
cpu: "1"
```
]
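Alternatively, the same change can be made in one command with `kubectl set resources`:
```bash
kubectl set resources deployment busyhttp --requests=cpu=1
```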
.debug[[k8s/horizontal-pod-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/horizontal-pod-autoscaler.md)]
---
## Results
- After saving and quitting, a rolling update happens
(if `ab` or `httping` exits, make sure to restart it)
- It will take a minute or two for the HPA to kick in:
- the HPA runs every 30 seconds by default
- it needs to gather metrics from the metrics server first
- If we scale further up (or down), the HPA will react after a few minutes:
- it won't scale up if it already scaled in the last 3 minutes
- it won't scale down if it already scaled in the last 5 minutes
.debug[[k8s/horizontal-pod-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/horizontal-pod-autoscaler.md)]
---
## What about other metrics?
- The HPA in API group `autoscaling/v1` only supports CPU scaling
- The HPA in API group `autoscaling/v2beta2` supports metrics from various API groups:
- metrics.k8s.io, aka metrics server (per-Pod CPU and RAM)
- custom.metrics.k8s.io, custom metrics per Pod
- external.metrics.k8s.io, external metrics (not associated to Pods)
- Kubernetes doesn't implement any of these API groups
- Using these metrics requires [registering additional APIs](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-metrics-apis)
- The metrics provided by metrics server are standard; everything else is custom
- For more details, see [this great blog post](https://medium.com/uptime-99/kubernetes-hpa-autoscaling-with-custom-and-external-metrics-da7f41ff7846) or [this talk](https://www.youtube.com/watch?v=gSiGFH4ZnS8)
.debug[[k8s/horizontal-pod-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/horizontal-pod-autoscaler.md)]
---
## Cleanup
- Since `busyhttp` uses CPU cycles, let's stop it before moving on
.lab[
- Delete the `busyhttp` Deployment:
```bash
kubectl delete deployment busyhttp
```
]
???
:EN:- Auto-scaling resources
:FR:- *Auto-scaling* (dimensionnement automatique) des ressources
.debug[[k8s/horizontal-pod-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/horizontal-pod-autoscaler.md)]
---
class: pic
.interstitial[]
---
name: toc-scaling-with-custom-metrics
class: title
Scaling with custom metrics
.nav[
[Previous part](#toc-the-horizontal-pod-autoscaler)
|
[Back to table of contents](#toc-part-12)
|
[Next part](#toc-extending-the-kubernetes-api)
]
.debug[(automatically generated title slide)]
---
# Scaling with custom metrics
- The HorizontalPodAutoscaler v1 can only scale on Pod CPU usage
- Sometimes, we need to scale using other metrics:
- memory
- requests per second
- latency
- active sessions
- items in a work queue
- ...
- The HorizontalPodAutoscaler v2 can do it!
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Requirements
⚠️ Autoscaling on custom metrics is fairly complex!
- We need some metrics system
(Prometheus is a popular option, but others are possible too)
- We need our metrics (latency, traffic...) to be fed in the system
(with Prometheus, this might require a custom exporter)
- We need to expose these metrics to Kubernetes
(Kubernetes doesn't "speak" the Prometheus API)
- Then we can set up autoscaling!
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## The plan
- We will deploy the DockerCoins demo app
(one of its components has a bottleneck; its latency will increase under load)
- We will use Prometheus to collect and store metrics
- We will deploy a tiny HTTP latency monitor (a Prometheus *exporter*)
- We will deploy the "Prometheus adapter"
(mapping Prometheus metrics to Kubernetes-compatible metrics)
- We will create an HorizontalPodAutoscaler 🎉
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Deploying DockerCoins
- That's the easy part!
.lab[
- Create a new namespace and switch to it:
```bash
kubectl create namespace customscaling
kns customscaling
```
- Deploy DockerCoins, and scale up the `worker` Deployment:
```bash
kubectl apply -f ~/container.training/k8s/dockercoins.yaml
kubectl scale deployment worker --replicas=10
```
]
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Current state of affairs
- The `rng` service is a bottleneck
(it cannot handle more than 10 requests/second)
- With enough traffic, its latency increases
(by about 100ms per `worker` Pod after the 3rd worker)
.lab[
- Check the `webui` port and open it in your browser:
```bash
kubectl get service webui
```
- Check the `rng` ClusterIP and test it with e.g. `httping`:
```bash
kubectl get service rng
```
]
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Measuring latency
- We will use a tiny custom Prometheus exporter, [httplat](https://github.com/jpetazzo/httplat)
- `httplat` exposes Prometheus metrics on port 9080 (by default)
- It monitors exactly one URL, that must be passed as a command-line argument
.lab[
- Deploy `httplat`:
```bash
kubectl create deployment httplat --image=jpetazzo/httplat -- httplat http://rng/
```
- Expose it:
```bash
kubectl expose deployment httplat --port=9080
```
]
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
class: extra-details
## Measuring latency in the real world
- We are using this tiny custom exporter for simplicity
- A more common method to collect latency is to use a service mesh
- A service mesh can usually collect latency for *all* services automatically
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Install Prometheus
- We will use the Prometheus community Helm chart
(because we can configure it dynamically with annotations)
.lab[
- If it's not installed yet on the cluster, install Prometheus:
```bash
helm upgrade --install prometheus prometheus \
--repo https://prometheus-community.github.io/helm-charts \
--namespace prometheus --create-namespace \
--set server.service.type=NodePort \
--set server.service.nodePort=30090 \
--set server.persistentVolume.enabled=false \
--set alertmanager.enabled=false
```
]
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Configure Prometheus
- We can use annotations to tell Prometheus to collect the metrics
.lab[
- Tell Prometheus to "scrape" our latency exporter:
```bash
kubectl annotate service httplat \
prometheus.io/scrape=true \
prometheus.io/port=9080 \
prometheus.io/path=/metrics
```
]
If you deployed Prometheus differently, you might have to configure it manually.
You'll need to instruct it to scrape http://httplat.customscaling.svc:9080/metrics.
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Make sure that metrics get collected
- Before moving on, confirm that Prometheus has our metrics
.lab[
- Connect to Prometheus
(if you installed it like instructed above, it is exposed as a NodePort on port 30090)
- Check that `httplat` metrics are available
- You can try to graph the following PromQL expression:
```
rate(httplat_latency_seconds_sum[2m])/rate(httplat_latency_seconds_count[2m])
```
]
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Troubleshooting
- Make sure that the exporter works:
- get the ClusterIP of the exporter with `kubectl get svc httplat`
- `curl http://$CLUSTERIP:9080/metrics`
- check that the result includes the `httplat` histogram
- Make sure that Prometheus is scraping the exporter:
- go to `Status` / `Targets` in Prometheus
- make sure that `httplat` shows up in there
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Creating the autoscaling policy
- We need custom YAML (we can't use the `kubectl autoscale` command)
- It must specify `scaleTargetRef`, the resource to scale
- any resource with a `scale` sub-resource will do
- this includes Deployment, ReplicaSet, StatefulSet...
- It must specify one or more `metrics` to look at
- if multiple metrics are given, the autoscaler will "do the math" for each one
- it will then keep the largest result
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Details about the `metrics` list
- Each item will look like this:
```yaml
- type: <TYPE>
  <type>:
    metric:
      name: <metric name>
      <...optional selector (mandatory for External metrics)...>
    target:
      type: <TARGET-TYPE>
      <target-type>: <target value>
```
`<TYPE>` can be `Resource`, `Pods`, `Object`, or `External`.
`<TARGET-TYPE>` can be `Utilization`, `Value`, or `AverageValue`.
Let's explain the 4 different `<TYPE>` values!
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## `Resource`
Use "classic" metrics served by `metrics-server` (`cpu` and `memory`).
```yaml
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 50
```
Compute average *utilization* (usage/requests) across pods.
It's also possible to specify `Value` or `AverageValue` instead of `Utilization`.
(To scale according to "raw" CPU or memory usage.)
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## `Pods`
Use custom metrics. These are still "per-Pod" metrics.
```yaml
- type: Pods
pods:
metric:
name: packets-per-second
target:
type: AverageValue
averageValue: 1k
```
`type:` *must* be `AverageValue`.
(It cannot be `Utilization`, since custom metrics don't have a corresponding value in the Pod's `requests` to compute a ratio against.)
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## `Object`
Use custom metrics. These metrics are "linked" to any arbitrary resource.
(E.g. a Deployment, Service, Ingress, ...)
```yaml
- type: Object
object:
metric:
name: requests-per-second
describedObject:
apiVersion: networking.k8s.io/v1
kind: Ingress
name: main-route
target:
      type: AverageValue
      averageValue: 100
```
`type:` can be `Value` or `AverageValue` (see next slide for details).
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## `Value` vs `AverageValue`
- `Value`
- use the value as-is
- useful to pace a client or producer
- "target a specific total load on a specific endpoint or queue"
- `AverageValue`
- divide the value by the number of pods
- useful to scale a server or consumer
- "scale our systems to meet a given SLA/SLO"
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## `External`
Use arbitrary metrics. The series to use is specified with a label selector.
```yaml
- type: External
external:
metric:
name: queue_messages_ready
selector: "queue=worker_tasks"
target:
type: AverageValue
averageValue: 30
```
The `selector` will be passed along when querying the metrics API.
Its meaning is implementation-dependent.
It may or may not correspond to Kubernetes labels.
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## One more thing ...
- We can give a `behavior` set of options
- Indicates:
- how much to scale up/down in a single step
- a *stabilization window* to avoid hysteresis effects
- The default stabilization window is 15 seconds for `scaleUp`
(we might want to change that!)
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
Putting together [k8s/hpa-v2-pa-httplat.yaml](https://github.com/jpetazzo/container.training/tree/master/k8s/hpa-v2-pa-httplat.yaml):
.small[
```yaml
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2
metadata:
name: rng
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: rng
minReplicas: 1
maxReplicas: 20
behavior:
scaleUp:
stabilizationWindowSeconds: 60
scaleDown:
stabilizationWindowSeconds: 180
metrics:
- type: Object
object:
describedObject:
apiVersion: v1
kind: Service
name: httplat
metric:
name: httplat_latency_seconds
target:
type: Value
value: 0.1
```
]
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Creating the autoscaling policy
- We will register the policy
- Of course, it won't quite work yet (we're missing the *Prometheus adapter*)
.lab[
- Create the HorizontalPodAutoscaler:
```bash
kubectl apply -f ~/container.training/k8s/hpa-v2-pa-httplat.yaml
```
- Check the logs of the `controller-manager`:
```bash
stern --namespace=kube-system --tail=10 controller-manager
```
]
After a little while we should see messages like this:
```
no custom metrics API (custom.metrics.k8s.io) registered
```
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## `custom.metrics.k8s.io`
- The HorizontalPodAutoscaler will get the metrics *from the Kubernetes API itself*
- In our specific case, it will access a resource like this one:
.small[
```
/apis/custom.metrics.k8s.io/v1beta1/namespaces/customscaling/services/httplat/httplat_latency_seconds
```
]
- By default, the Kubernetes API server doesn't implement `custom.metrics.k8s.io`
(we can have a look at `kubectl get apiservices`)
- We need to:
- start an API service implementing this API group
- register it with our API server
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## The Prometheus adapter
- The Prometheus adapter is an open source project:
https://github.com/DirectXMan12/k8s-prometheus-adapter
- It's a Kubernetes API service implementing API group `custom.metrics.k8s.io`
- It maps the requests it receives to Prometheus metrics
- Exactly what we need!
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Deploying the Prometheus adapter
- There is ~~an app~~ a Helm chart for that
.lab[
- Install the Prometheus adapter:
```bash
helm upgrade --install prometheus-adapter prometheus-adapter \
--repo https://prometheus-community.github.io/helm-charts \
--namespace=prometheus-adapter --create-namespace \
--set prometheus.url=http://prometheus-server.prometheus.svc \
--set prometheus.port=80
```
]
- It comes with some default mappings
- But we will need to add `httplat` to these mappings
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Configuring the Prometheus adapter
- The Prometheus adapter can be configured/customized through a ConfigMap
- We are going to edit that ConfigMap, then restart the adapter
- We need to add a rule that will say:
- all the metrics series named `httplat_latency_seconds_sum` ...
- ... belong to *Services* ...
- ... the name of the Service and its Namespace are indicated by the `kubernetes_name` and `kubernetes_namespace` Prometheus tags respectively ...
- ... and the exact value to use should be the following PromQL expression
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## The mapping rule
Here is the rule that we need to add to the configuration:
```yaml
- seriesQuery: 'httplat_latency_seconds_sum{namespace!="",service!=""}'
resources:
overrides:
namespace:
resource: namespace
service:
resource: service
name:
matches: "httplat_latency_seconds_sum"
as: "httplat_latency_seconds"
metricsQuery: |
rate(httplat_latency_seconds_sum{<<.LabelMatchers>>}[2m])/rate(httplat_latency_seconds_count{<<.LabelMatchers>>}[2m])
```
(I built it following the [walkthrough](https://github.com/DirectXMan12/k8s-prometheus-adapter/blob/master/docs/config-walkthrough.md) in the Prometheus adapter documentation.)
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Editing the adapter's configuration
.lab[
- Edit the adapter's ConfigMap:
```bash
kubectl edit configmap prometheus-adapter --namespace=prometheus-adapter
```
- Add the new rule in the `rules` section, at the end of the configuration file
- Save, quit
- Restart the Prometheus adapter:
```bash
kubectl rollout restart deployment --namespace=prometheus-adapter prometheus-adapter
```
]
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Witness the marvel of custom autoscaling
(Sort of)
- After a short while, the `rng` Deployment will scale up
- It should scale up until the latency drops below 100ms
(and continue to scale up a little bit more after that)
- Then, since the latency will be well below 100ms, it will scale down
- ... and back up again, etc.
(See pictures on next slides!)
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
class: pic

.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
class: pic

.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## What's going on?
- The autoscaler's information is slightly out of date
(not by much; probably between 1 and 2 minutes)
- It's enough to cause the oscillations to happen
- One possible fix is to tell the autoscaler to wait a bit after each action
- It will reduce oscillations, but will also slow down its reaction time
(and therefore, how fast it reacts to a peak of traffic)
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## What's going on? Take 2
- As soon as the measured latency is *significantly* below our target (100ms) ...
the autoscaler tries to scale down
- If the latency is measured at 20ms ...
the autoscaler will try to *divide the number of pods by five!*
- One possible solution: apply a formula to the measured latency,
so that values between e.g. 10 and 100ms get very close to 100ms.
- Another solution: instead of targeting a specific latency,
target a 95th percentile latency or something similar, using
a more advanced PromQL expression (and leveraging the fact that
we have histograms instead of raw values).
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Troubleshooting
Check that the adapter registered itself correctly:
```bash
kubectl get apiservices | grep metrics
```
Check that the adapter correctly serves metrics:
```bash
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1
```
Check that our `httplat` metrics are available:
```bash
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1\
/namespaces/customscaling/services/httplat/httplat_latency_seconds
```
Also check the logs of the `prometheus-adapter` and the `kube-controller-manager`.
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Useful links
- [Horizontal Pod Autoscaler walkthrough](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/) in the Kubernetes documentation
- [Autoscaling design proposal](https://github.com/kubernetes/community/tree/master/contributors/design-proposals/autoscaling)
- [Kubernetes custom metrics API alternative implementations](https://github.com/kubernetes/metrics/blob/master/IMPLEMENTATIONS.md)
- [Prometheus adapter configuration walkthrough](https://github.com/DirectXMan12/k8s-prometheus-adapter/blob/master/docs/config-walkthrough.md)
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
## Discussion
- This system works great if we have a single, centralized metrics system
(and the corresponding "adapter" to expose these metrics through the Kubernetes API)
- If we have metrics in multiple places, we must aggregate them
(good news: Prometheus has exporters for almost everything!)
- It is complex and has a steep learning curve
- Another approach is [KEDA](https://keda.sh/)
???
:EN:- Autoscaling with custom metrics
:FR:- Suivi de charge avancé (HPAv2)
.debug[[k8s/hpa-v2.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/hpa-v2.md)]
---
class: pic
.interstitial[]
---
name: toc-extending-the-kubernetes-api
class: title
Extending the Kubernetes API
.nav[
[Previous part](#toc-scaling-with-custom-metrics)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-api-server-internals)
]
.debug[(automatically generated title slide)]
---
# Extending the Kubernetes API
There are multiple ways to extend the Kubernetes API.
We are going to cover:
- Controllers
- Dynamic Admission Webhooks
- Custom Resource Definitions (CRDs)
- The Aggregation Layer
But first, let's re(re)visit the API server ...
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Revisiting the API server
- The Kubernetes API server is a central point of the control plane
- Everything connects to the API server:
- users (that's us, but also automation like CI/CD)
- kubelets
- network components (e.g. `kube-proxy`, pod network, NPC)
- controllers; lots of controllers
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Some controllers
- `kube-controller-manager` runs built-in controllers
(watching Deployments, Nodes, ReplicaSets, and much more)
- `kube-scheduler` runs the scheduler
(it's conceptually not different from another controller)
- `cloud-controller-manager` takes care of "cloud stuff"
(e.g. provisioning load balancers, persistent volumes...)
- Some components mentioned above are also controllers
(e.g. Network Policy Controller)
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## More controllers
- Cloud resources can also be managed by additional controllers
(e.g. the [AWS Load Balancer Controller](https://github.com/kubernetes-sigs/aws-load-balancer-controller))
- Leveraging Ingress resources requires an Ingress Controller
(many options available here; we can even install multiple ones!)
- Many add-ons (including CRDs and operators) have controllers as well
🤔 *What's even a controller ?!?*
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## What's a controller?
According to the [documentation](https://kubernetes.io/docs/concepts/architecture/controller/):
*Controllers are **control loops** that
**watch** the state of your cluster,
then make or request changes where needed.*
*Each controller tries to move the current cluster state closer to the desired state.*
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## What controllers do
- Watch resources
- Make changes:
- purely at the API level (e.g. Deployment, ReplicaSet controllers)
- and/or configure resources (e.g. `kube-proxy`)
- and/or provision resources (e.g. load balancer controller)
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Extending Kubernetes with controllers
- Random example:
- watch resources like Deployments, Services ...
- read annotations to configure monitoring
- Technically, this is not extending the API
(but it can still be very useful!)
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Other ways to extend Kubernetes
- Prevent or alter API requests before resources are committed to storage:
*Admission Control*
- Create new resource types leveraging Kubernetes storage facilities:
*Custom Resource Definitions*
- Create new resource types with different storage or different semantics:
*Aggregation Layer*
- Spoiler alert: often, we will combine multiple techniques
(and involve controllers as well!)
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Admission controllers
- Admission controllers can vet or transform API requests
- The diagram on the next slide shows the path of an API request
(courtesy of Banzai Cloud)
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
class: pic

.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Types of admission controllers
- *Validating* admission controllers can accept/reject the API call
- *Mutating* admission controllers can modify the API request payload
- Both types can also trigger additional actions
(e.g. automatically create a Namespace if it doesn't exist)
- There are a number of built-in admission controllers
(see [documentation](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#what-does-each-admission-controller-do) for a list)
- We can also dynamically define and register our own
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
class: extra-details
## Some built-in admission controllers
- ServiceAccount:
automatically adds a ServiceAccount to Pods that don't explicitly specify one
- LimitRanger:
applies resource constraints specified by LimitRange objects when Pods are created
- NamespaceAutoProvision:
automatically creates namespaces when an object is created in a non-existent namespace
*Note: #1 and #2 are enabled by default; #3 is not.*
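For example, enabling #3 as well would mean adding it to the API server's admission plugins (a sketch; how and where these flags are set depends on how the control plane was deployed, e.g. a static pod manifest on kubeadm clusters):
```bash
# Sketch: enable NamespaceAutoProvision in addition to e.g. NodeRestriction
kube-apiserver \
  --enable-admission-plugins=NodeRestriction,NamespaceAutoProvision \
  $OTHER_API_SERVER_FLAGS
```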
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Dynamic Admission Control
- We can set up *admission webhooks* to extend the behavior of the API server
- The API server will submit incoming API requests to these webhooks
- These webhooks can be *validating* or *mutating*
- Webhooks can be set up dynamically (without restarting the API server)
- To setup a dynamic admission webhook, we create a special resource:
a `ValidatingWebhookConfiguration` or a `MutatingWebhookConfiguration`
- These resources are created and managed like other resources
(i.e. `kubectl create`, `kubectl get`...)
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Webhook Configuration
- A ValidatingWebhookConfiguration or MutatingWebhookConfiguration contains:
- the address of the webhook
- the authentication information to use with the webhook
- a list of rules
- The rules indicate for which objects and actions the webhook is triggered
(to avoid e.g. triggering webhooks when setting up webhooks)
- The webhook server can be hosted in or out of the cluster
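As an illustration, here is a minimal (hypothetical) ValidatingWebhookConfiguration sketch; the names, namespace, and `caBundle` are placeholders (the real manifests used later in this section live in the container.training repository):
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example.webhook.container.training    # placeholder name
webhooks:
- name: example.webhook.container.training
  rules:                                      # when to call the webhook
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["pods"]
  clientConfig:                               # where the webhook server lives
    service:
      namespace: webhooks
      name: admission
    caBundle: "...base64-encoded CA cert..."  # placeholder
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore
```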
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Dynamic Admission Examples
- Policy control
([Kyverno](https://kyverno.io/),
[Open Policy Agent](https://www.openpolicyagent.org/docs/latest/))
- Sidecar injection
(Used by some service meshes)
- Type validation
(More on this later, in the CRD section)
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Kubernetes API types
- Almost everything in Kubernetes is materialized by a resource
- Resources have a type (or "kind")
(similar to strongly typed languages)
- We can see existing types with `kubectl api-resources`
- We can list resources of a given type with `kubectl get <type>`
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Creating new types
- We can create new types with Custom Resource Definitions (CRDs)
- CRDs are created dynamically
(without recompiling or restarting the API server)
- CRDs themselves are resources:
- we can create a new type with `kubectl create` and some YAML
- we can see all our custom types with `kubectl get crds`
- After we create a CRD, the new type works just like built-in types
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Examples
- Representing composite resources
(e.g. clusters like databases, message queues ...)
- Representing external resources
(e.g. virtual machines, object store buckets, domain names ...)
- Representing configuration for controllers and operators
(e.g. custom Ingress resources, certificate issuers, backups ...)
- Alternate representations of other objects; services and service instances
(e.g. encrypted secret, git endpoints ...)
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## The aggregation layer
- We can delegate entire parts of the Kubernetes API to external servers
- This is done by creating APIService resources
(check them with `kubectl get apiservices`!)
- The APIService resource maps a type (kind) and version to an external service
- All requests concerning that type are sent (proxied) to the external service
- This allows us to have resources similar to CRDs, but that aren't stored in etcd
- Example: `metrics-server`
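For example, the APIService registering `metrics-server` looks roughly like this (a sketch; compare with `kubectl get apiservice v1beta1.metrics.k8s.io -o yaml` on a cluster that has metrics-server):
```yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  version: v1beta1
  service:                      # requests for this group/version get proxied here
    name: metrics-server
    namespace: kube-system
  insecureSkipTLSVerify: true   # typical in default installs; a caBundle is cleaner
  groupPriorityMinimum: 100
  versionPriority: 100
```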
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Why?
- Using a CRD for live metrics would be extremely inefficient
(etcd **is not** a metrics store; write performance is way too slow)
- Instead, `metrics-server`:
- collects metrics from kubelets
- stores them in memory
- exposes them as PodMetrics and NodeMetrics (in API group metrics.k8s.io)
- is registered as an APIService
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Drawbacks
- Requires a server
- ... that implements a non-trivial API (aka the Kubernetes API semantics)
- If we need REST semantics, CRDs are probably way simpler
- *Sometimes* synchronizing external state with CRDs might do the trick
(unless we want the external state to be our single source of truth)
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
## Documentation
- [Custom Resource Definitions: when to use them](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
- [Custom Resources Definitions: how to use them](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/)
- [Built-in Admission Controllers](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
- [Dynamic Admission Controllers](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/)
- [Aggregation Layer](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/)
???
:EN:- Overview of Kubernetes API extensions
:FR:- Comment étendre l'API Kubernetes
.debug[[k8s/extending-api.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/extending-api.md)]
---
class: pic
.interstitial[]
---
name: toc-api-server-internals
class: title
API server internals
.nav[
[Previous part](#toc-extending-the-kubernetes-api)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-custom-resource-definitions)
]
.debug[(automatically generated title slide)]
---
# API server internals
- Understanding the internals of the API server is useful.red[¹]:
- when extending the Kubernetes API server (CRDs, webhooks...)
- when running Kubernetes at scale
- Let's dive into a bit of code!
.footnote[.red[¹]And by *useful*, we mean *strongly recommended or else...*]
.debug[[k8s/apiserver-deepdive.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apiserver-deepdive.md)]
---
## The main handler
- The API server parses its configuration, and builds a `GenericAPIServer`
- ... which contains an `APIServerHandler` ([src](https://github.com/kubernetes/apiserver/blob/release-1.19/pkg/server/handler.go#L37))
- ... which contains a couple of `http.Handler` fields
- Requests go through:
- `FullHandlerChain` (a series of HTTP filters, see next slide)
- `Director` (switches the request to `GoRestfulContainer` or `NonGoRestfulMux`)
- `GoRestfulContainer` is for "normal" APIs; integrates nicely with OpenAPI
- `NonGoRestfulMux` is for everything else (e.g. proxy, delegation)
.debug[[k8s/apiserver-deepdive.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apiserver-deepdive.md)]
---
## The chain of handlers
- API requests go through a complex chain of filters ([src](https://github.com/kubernetes/apiserver/blob/release-1.19/pkg/server/config.go#L671))
(note when reading that code: requests start at the bottom and go up)
- This is where authentication, authorization, and admission happen
(as well as a few other things!)
- Let's review an arbitrary selection of some of these handlers!
*In the following slides, the handlers are in chronological order.*
*Note: handlers are nested; so they can act at the beginning and end of a request.*
.debug[[k8s/apiserver-deepdive.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apiserver-deepdive.md)]
---
## `WithPanicRecovery`
- Reminder about Go: there is no exception handling in Go; instead:
- functions typically return a composite `(SomeType, error)` type
- when things go really bad, the code can call `panic()`
- `panic()` can be caught with `recover()`
(but this is almost never used like an exception handler!)
- The API server code is not supposed to `panic()`
- But just in case, we have that handler to prevent (some) crashes
.debug[[k8s/apiserver-deepdive.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apiserver-deepdive.md)]
---
## `WithRequestInfo` ([src](https://github.com/kubernetes/apiserver/blob/release-1.19/pkg/endpoints/request/requestinfo.go#L163))
- Parse out essential information:
API group, version, namespace, resource, subresource, verb ...
- Maps HTTP verbs (GET, PUT, ...) to Kubernetes verbs (list, get, watch, ...)
.debug[[k8s/apiserver-deepdive.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apiserver-deepdive.md)]
---
class: extra-details
## HTTP verb mapping
- POST → create
- PUT → update
- PATCH → patch
- DELETE
→ delete (if a resource name is specified)
→ deletecollection (otherwise)
- GET, HEAD
→ get (if a resource name is specified)
→ list (otherwise)
→ watch (if the `?watch=true` option is specified)
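We can see this mapping in action by raising kubectl's verbosity (the resource names below are placeholders; the exact log output varies between versions):
```bash
kubectl get pods -v6            # GET .../pods            → list
kubectl get pod mypod -v6       # GET .../pods/mypod      → get
kubectl get pods --watch -v6    # GET .../pods?watch=true → watch
kubectl delete pod mypod -v6    # DELETE .../pods/mypod   → delete
```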
.debug[[k8s/apiserver-deepdive.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apiserver-deepdive.md)]
---
## `WithWaitGroup`
- When the API server shuts down, it tells clients (with in-flight requests) to retry
- only for "short" requests
- for long running requests, the client needs to do more
- Long running requests include `watch` verb, `proxy` sub-resource
(See also `WithTimeoutForNonLongRunningRequests`)
.debug[[k8s/apiserver-deepdive.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apiserver-deepdive.md)]
---
## AuthN and AuthZ
- `WithAuthentication`:
the request goes through a *chain* of authenticators
([src](https://github.com/kubernetes/apiserver/blob/release-1.19/pkg/endpoints/filters/authentication.go#L38))
- WithAudit
- WithImpersonation: used for e.g. `kubectl ... --as another.user`
- WithPriorityAndFairness or WithMaxInFlightLimit
(`system:masters` can bypass these)
- WithAuthorization
.debug[[k8s/apiserver-deepdive.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apiserver-deepdive.md)]
---
## After all these handlers ...
- We get to the "director" mentioned above
- API groups get installed into the "gorestful" handler
([src](https://github.com/kubernetes/apiserver/blob/release-1.19/pkg/server/genericapiserver.go#L423))
- REST-ish resources are managed by various handlers
(in [this directory](https://github.com/kubernetes/apiserver/blob/release-1.19/pkg/endpoints/handlers/))
- These files show us the code path for each type of request
.debug[[k8s/apiserver-deepdive.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apiserver-deepdive.md)]
---
class: extra-details
## Request code path
- [create.go](https://github.com/kubernetes/apiserver/blob/release-1.19/pkg/endpoints/handlers/create.go):
decode to HubGroupVersion; admission; mutating admission; store
- [delete.go](https://github.com/kubernetes/apiserver/blob/release-1.19/pkg/endpoints/handlers/delete.go):
validating admission only; deletion
- [get.go](https://github.com/kubernetes/apiserver/blob/release-1.19/pkg/endpoints/handlers/get.go) (get, list):
directly fetch from rest storage abstraction
- [patch.go](https://github.com/kubernetes/apiserver/blob/release-1.19/pkg/endpoints/handlers/patch.go):
admission; mutating admission; patch
- [update.go](https://github.com/kubernetes/apiserver/blob/release-1.19/pkg/endpoints/handlers/update.go):
decode to HubGroupVersion; admission; mutating admission; store
- [watch.go](https://github.com/kubernetes/apiserver/blob/release-1.19/pkg/endpoints/handlers/watch.go):
similar to get.go, but with watch logic
(HubGroupVersion = in-memory, "canonical" version.)
???
:EN:- Kubernetes API server internals
:FR:- Fonctionnement interne du serveur API
.debug[[k8s/apiserver-deepdive.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apiserver-deepdive.md)]
---
class: pic
.interstitial[]
---
name: toc-custom-resource-definitions
class: title
Custom Resource Definitions
.nav[
[Previous part](#toc-api-server-internals)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-the-aggregation-layer)
]
.debug[(automatically generated title slide)]
---
# Custom Resource Definitions
- CRDs are one of the (many) ways to extend the API
- CRDs can be defined dynamically
(no need to recompile or reload the API server)
- A CRD is defined with a CustomResourceDefinition resource
(CustomResourceDefinition is conceptually similar to a *metaclass*)
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
## Creating a CRD
- We will create a CRD to represent different recipes of pizzas
- We will be able to run `kubectl get pizzas` and it will list the recipes
- Creating/deleting recipes won't do anything else
(because we won't implement a *controller*)
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
## A bit of history
Things related to Custom Resource Definitions:
- Kubernetes 1.??: `apiextensions.k8s.io/v1beta1` introduced
- Kubernetes 1.16: `apiextensions.k8s.io/v1` introduced
- Kubernetes 1.22: `apiextensions.k8s.io/v1beta1` [removed][changes-in-122]
- Kubernetes 1.25: [CEL validation rules available in beta][crd-validation-rules-beta]
- Kubernetes 1.28: [validation ratcheting][validation-ratcheting] in [alpha][feature-gates]
- Kubernetes 1.29: [CEL validation rules available in GA][cel-validation-rules]
- Kubernetes 1.30: [validation ratcheting][validation-ratcheting] in [beta][feature-gates]; enabled by default
[crd-validation-rules-beta]: https://kubernetes.io/blog/2022/09/23/crd-validation-rules-beta/
[cel-validation-rules]: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation-rules
[validation-ratcheting]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4008-crd-ratcheting
[feature-gates]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#feature-gates-for-alpha-or-beta-features
[changes-in-122]: https://kubernetes.io/blog/2021/07/14/upcoming-changes-in-kubernetes-1-22/
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
## First slice of pizza
```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: pizzas.container.training
spec:
  group: container.training
  version: v1alpha1
  scope: Namespaced
  names:
    plural: pizzas
    singular: pizza
    kind: Pizza
    shortNames:
    - piz
```
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
## The joys of API deprecation
- Unfortunately, the CRD manifest on the previous slide is deprecated!
- It is using `apiextensions.k8s.io/v1beta1`, which is dropped in Kubernetes 1.22
- We need to use `apiextensions.k8s.io/v1`, which is a little bit more complex
(a few optional things become mandatory, see [this guide](https://kubernetes.io/docs/reference/using-api/deprecation-guide/#customresourcedefinition-v122) for details)
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
## Second slice of pizza
- The next slide will show file [k8s/pizza-2.yaml](https://github.com/jpetazzo/container.training/tree/master/k8s/pizza-2.yaml)
- Note the `spec.versions` list
- we need exactly one version with `storage: true`
- we can have multiple versions with `served: true`
- `spec.versions[].schema.openAPIV3Schema` is required
(and must be a valid OpenAPI schema; here it's a trivial one)
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: pizzas.container.training
spec:
  group: container.training
  scope: Namespaced
  names:
    plural: pizzas
    singular: pizza
    kind: Pizza
    shortNames:
    - piz
  versions:
  - name: v1alpha1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
```
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
## Baking some pizza
- Let's create the Custom Resource Definition for our Pizza resource
.lab[
- Load the CRD:
```bash
kubectl apply -f ~/container.training/k8s/pizza-2.yaml
```
- Confirm that it shows up:
```bash
kubectl get crds
```
]
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
## Creating custom resources
The YAML below defines a resource using the CRD that we just created:
```yaml
kind: Pizza
apiVersion: container.training/v1alpha1
metadata:
  name: hawaiian
spec:
  toppings: [ cheese, ham, pineapple ]
```
.lab[
- Try to create a few pizza recipes:
```bash
kubectl apply -f ~/container.training/k8s/pizzas.yaml
```
]
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
## Type validation
- Recent versions of Kubernetes will issue errors about unknown fields
- We need to improve our OpenAPI schema
(to add e.g. the `spec.toppings` field used by our pizza resources)
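For reference, here is a sketch of what that improved schema could look like (only the `schema` block of the CRD version is shown, assuming `spec.toppings` is a list of strings):
```yaml
schema:
  openAPIV3Schema:
    type: object
    properties:
      spec:
        type: object
        properties:
          toppings:
            type: array
            items:
              type: string
```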
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
## Creating a bland pizza
- Let's try to create a pizza anyway!
.lab[
- Only provide the most basic YAML manifest:
```bash
kubectl create -f- <<EOF
apiVersion: container.training/v1alpha1
kind: Pizza
metadata:
  name: plain
EOF
```
]
- More advanced validation (e.g. with CEL rules or validating webhooks) can also cover things like:
- preventing some changes (e.g. major version downgrades)
- checking a key or certificate format or validity
- and much more!
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
## CRDs in the wild
- [gitkube](https://storage.googleapis.com/gitkube/gitkube-setup-stable.yaml)
- [A redis operator](https://github.com/amaizfinance/redis-operator/blob/master/deploy/crds/k8s_v1alpha1_redis_crd.yaml)
- [cert-manager](https://github.com/jetstack/cert-manager/releases/download/v1.0.4/cert-manager.yaml)
*How big are these YAML files?*
*What's the size (e.g. in lines) of each resource?*
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
## CRDs in practice
- Production-grade CRDs can be extremely verbose
(because of the openAPI schema validation)
- This can (and usually will) be managed by a framework
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
## (Ab)using the API server
- If we need to store something "safely" (as in: in etcd), we can use CRDs
- This gives us primitives to read/write/list objects (and optionally validate them)
- The Kubernetes API server can run on its own
(without the scheduler, controller manager, and kubelets)
- By loading CRDs, we can have it manage totally different objects
(unrelated to containers, clusters, etc.)
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
## What's next?
- Creating a basic CRD is relatively straightforward
- But CRDs generally require a *controller* to do anything useful
- The controller will typically *watch* our custom resources
(and take action when they are created/updated)
- Most serious use-cases will also require *validation webhooks*
- When our CRD data format evolves, we'll also need *conversion webhooks*
- Doing all that work manually is tedious; use a framework!
???
:EN:- Custom Resource Definitions (CRDs)
:FR:- Les CRDs *(Custom Resource Definitions)*
.debug[[k8s/crd.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/crd.md)]
---
class: pic
.interstitial[]
---
name: toc-the-aggregation-layer
class: title
The Aggregation Layer
.nav[
[Previous part](#toc-custom-resource-definitions)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-dynamic-admission-control)
]
.debug[(automatically generated title slide)]
---
# The Aggregation Layer
- The aggregation layer is a way to extend the Kubernetes API
- It is similar to CRDs
- it lets us define new resource types
- these resources can then be used with `kubectl` and other clients
- The implementation is very different
- CRDs are handled within the API server
- the aggregation layer offloads requests to another process
- They are designed for very different use-cases
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
## CRDs vs aggregation layer
- The Kubernetes API is a REST-ish API with a hierarchical structure
- It can be extended with Custom Resource Definitions (CRDs)
- Custom resources are managed by the Kubernetes API server
- we don't need to write code
- the API server does all the heavy lifting
- these resources are persisted in Kubernetes' "standard" database
(for most installations, that's `etcd`)
- We can also define resources that are *not* managed by the API server
(the API server merely proxies the requests to another server)
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
## Which one is best?
- For things that "map" well to objects stored in a traditional database:
*probably CRDs*
- For things that "exist" only in Kubernetes and don't represent external resources:
*probably CRDs*
- For things that are read-only, at least from Kubernetes' perspective:
*probably aggregation layer*
- For things that can't be stored in etcd because of size or access patterns:
*probably aggregation layer*
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
## How are resources organized?
- Let's have a look at the Kubernetes API hierarchical structure
- We'll ask `kubectl` to show us the exact requests that it's making
.lab[
- Check the URI for a cluster-scope, "core" resource, e.g. a Node:
```bash
kubectl -v6 get node node1
```
- Check the URI for a cluster-scope, "non-core" resource, e.g. a ClusterRole:
```bash
kubectl -v6 get clusterrole view
```
]
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
## Core vs non-core
- This is the structure of the URIs that we just checked:
```
/api/v1/nodes/node1
↑ ↑ ↑
`version` `kind` `name`
/apis/rbac.authorization.k8s.io/v1/clusterroles/view
↑ ↑ ↑ ↑
`group` `version` `kind` `name`
```
- There is no group for "core" resources
- Or, we could say that the group, `core`, is implied
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
## Group-Version-Kind
- In the API server, the Group-Version-Kind triple maps to a Go type
(look for all the "GVK" occurrences in the source code!)
- In the API server URI router, the GVK is parsed "relatively early"
(so that the server can know which resource we're talking about)
- "Well, actually ..." Things are a bit more complicated, see next slides!
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
class: extra-details
## Namespaced resources
- What about namespaced resources?
.lab[
- Check the URI for a namespaced, "core" resource, e.g. a Service:
```bash
kubectl -v6 get service kubernetes --namespace default
```
]
- Here are what namespaced resources URIs look like:
```
/api/v1/namespaces/default/services/kubernetes
↑ ↑ ↑ ↑
`version` `namespace` `kind` `name`
/apis/apps/v1/namespaces/kube-system/daemonsets/kube-proxy
↑ ↑ ↑ ↑ ↑
`group` `version` `namespace` `kind` `name`
```
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
class: extra-details
## Subresources
- Many resources have *subresources*, for instance:
- `/status` (decouples status updates from other updates)
- `/scale` (exposes a consistent interface for autoscalers)
- `/proxy` (allows access to HTTP resources)
- `/portforward` (used by `kubectl port-forward`)
- `/log` (access pod logs)
- These are added at the end of the URI
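For example, pod logs are served by the `log` subresource, which we can hit directly (the pod name below is a placeholder):
```bash
# Same data as "kubectl logs mypod", but through the raw subresource URI
kubectl get --raw /api/v1/namespaces/default/pods/mypod/log | tail
```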
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
class: extra-details
## Accessing a subresource
.lab[
- List `kube-proxy` pods:
```bash
kubectl get pods --namespace=kube-system --selector=k8s-app=kube-proxy
PODNAME=$(
kubectl get pods --namespace=kube-system --selector=k8s-app=kube-proxy \
-o json | jq -r .items[0].metadata.name)
```
- Execute a command in a pod, showing the API requests:
```bash
kubectl -v6 exec --namespace=kube-system $PODNAME -- echo hello world
```
]
--
The full request looks like:
```
POST https://.../api/v1/namespaces/kube-system/pods/kube-proxy-c7rlw/exec?
command=echo&command=hello&command=world&container=kube-proxy&stderr=true&stdout=true
```
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
## Listing what's supported on the server
- There are at least three useful commands to introspect the API server
.lab[
- List resources types, their group, kind, short names, and scope:
```bash
kubectl api-resources
```
- List API groups + versions:
```bash
kubectl api-versions
```
- List APIServices:
```bash
kubectl get apiservices
```
]
--
🤔 What's the difference between the last two?
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
## API registration
- `kubectl api-versions` shows all API groups, including `apiregistration.k8s.io`
- `kubectl get apiservices` shows the "routing table" for API requests
- The latter doesn't show `apiregistration.k8s.io`
(APIServices belong to `apiregistration.k8s.io`)
- Most API groups are `Local` (handled internally by the API server)
- If we're running the `metrics-server`, it should handle `metrics.k8s.io`
- This is an API group handled *outside* of the API server
- This is the *aggregation layer!*
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
## Finding resources
The following assumes that `metrics-server` is deployed on your cluster.
.lab[
- Check that the metrics.k8s.io is registered with `metrics-server`:
```bash
kubectl get apiservices | grep metrics.k8s.io
```
- Check the resource kinds registered in the metrics.k8s.io group:
```bash
kubectl api-resources --api-group=metrics.k8s.io
```
]
(If the output of either command is empty, install `metrics-server` first.)
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
## `nodes` vs `nodes`
- We can have multiple resources with the same name
.lab[
- Look for resources named `nodes`:
```bash
kubectl api-resources | grep -w nodes
```
- Compare the output of both commands:
```bash
kubectl get nodes
kubectl get nodes.metrics.k8s.io
```
]
--
🤔 What is that second kind of `nodes`? How can we see what's really in them?
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
## Node vs NodeMetrics
- `nodes.metrics.k8s.io` (aka NodeMetrics) don't have fancy *printer columns*
- But we can look at the raw data (with `-o json` or `-o yaml`)
.lab[
- Look at NodeMetrics objects with one of these commands:
```bash
kubectl get -o yaml nodes.metrics.k8s.io
kubectl get -o yaml NodeMetrics
```
]
--
💡 Alright, these are the live metrics (CPU, RAM) for our nodes.
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
## An easier way to consume metrics
- We might have seen these metrics before ... With an easier command!
--
.lab[
- Display node metrics:
```bash
kubectl top nodes
```
- Check which API requests happen behind the scenes:
```bash
kubectl top nodes -v6
```
]
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
## Aggregation layer in practice
- We can write an API server to handle a subset of the Kubernetes API
- Then we can register that server by creating an APIService resource
.lab[
- Check the definition used for the `metrics-server`:
```bash
kubectl describe apiservices v1beta1.metrics.k8s.io
```
]
- Group priority is used when multiple API groups provide similar kinds
(e.g. `nodes` and `nodes.metrics.k8s.io` as seen earlier)
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
## Authentication flow
- We have two Kubernetes API servers:
- "aggregator" (the main one; clients connect to it)
- "aggregated" (the one providing the extra API; aggregator connects to it)
- Aggregator deals with client authentication
- Aggregator authenticates with aggregated using mutual TLS
- Aggregator passes (/forwards/proxies/...) requests to aggregated
- Aggregated performs authorization by calling back aggregator
("can subject X perform action Y on resource Z?")
[This doc page](https://kubernetes.io/docs/tasks/extend-kubernetes/configure-aggregation-layer/#authentication-flow) has very nice swim lanes showing that flow.
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
## Discussion
- Aggregation layer is great for metrics
(fast-changing, ephemeral data, that would be outrageously bad for etcd)
- It *could* be a good fit to expose other REST APIs as a pass-thru
(but it's more common to see CRDs instead)
???
:EN:- The aggregation layer
:FR:- Étendre l'API avec le *aggregation layer*
.debug[[k8s/aggregation-layer.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/aggregation-layer.md)]
---
class: pic
.interstitial[]
---
name: toc-dynamic-admission-control
class: title
Dynamic Admission Control
.nav[
[Previous part](#toc-the-aggregation-layer)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-operators)
]
.debug[(automatically generated title slide)]
---
# Dynamic Admission Control
- This is one of the many ways to extend the Kubernetes API
- High level summary: dynamic admission control relies on webhooks that are ...
- dynamic (can be added/removed on the fly)
- running inside or outside the cluster
- *validating* (yay/nay) or *mutating* (can change objects that are created/updated)
- selective (can be configured to apply only to some kinds, some selectors...)
- mandatory or optional (should it block operations when webhook is down?)
- Used on their own (e.g. policy enforcement) or as part of operators
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Use cases
- Defaulting
*injecting image pull secrets, sidecars, environment variables...*
- Policy enforcement and best practices
*prevent: `latest` images, deprecated APIs...*
*require: PDBs, resource requests/limits, labels/annotations, local registry...*
- Problem mitigation
*block nodes with vulnerable kernels, inject log4j mitigations...*
- Extended validation for operators
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## You said *dynamic?*
- Some admission controllers are built into the API server
- They are enabled/disabled through Kubernetes API server configuration
(e.g. `--enable-admission-plugins`/`--disable-admission-plugins` flags)
- Here, we're talking about *dynamic* admission controllers
- They can be added/removed while the API server is running
(without touching the configuration files or even having access to them)
- This is done through two kinds of cluster-scope resources:
ValidatingWebhookConfiguration and MutatingWebhookConfiguration
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## You said *webhooks?*
- A ValidatingWebhookConfiguration or MutatingWebhookConfiguration contains:
- a resource filter
(e.g. "all pods", "deployments in namespace xyz", "everything"...)
- an operations filter
(e.g. CREATE, UPDATE, DELETE)
- the address of the webhook server
- Each time an operation matches the filters, it is sent to the webhook server
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## What gets sent exactly?
- The API server will `POST` a JSON object to the webhook
- That object will be a Kubernetes API message with `kind` `AdmissionReview`
- It will contain a `request` field, with, notably:
- `request.uid` (to be used when replying)
- `request.object` (the object created/deleted/changed)
- `request.oldObject` (when an object is modified)
- `request.userInfo` (who was making the request to the API in the first place)
(See [the documentation](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#request) for a detailed example showing more fields.)
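Here is a heavily trimmed sketch of such a request (values are made up):
```json
{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "request": {
    "uid": "705ab4f5-6393-11e8-b7cc-42010a800002",
    "operation": "CREATE",
    "userInfo": { "username": "kubernetes-admin" },
    "object": { "kind": "Pod", "metadata": { "name": "chroma", "labels": { "color": "pink" } } }
  }
}
```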
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## How should the webhook respond?
- By replying with another `AdmissionReview` in JSON
- It should have a `response` field, with, notably:
- `response.uid` (matching the `request.uid`)
- `response.allowed` (`true`/`false`)
- `response.status.message` (optional string; useful when denying requests)
- `response.patchType` (when a mutating webhook changes the object; must be `JSONPatch`)
- `response.patch` (the patch, encoded in base64)
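For example, a minimal approval could look like this (a denial would set `allowed` to `false`, ideally with a `status.message` explaining why):
```json
{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "705ab4f5-6393-11e8-b7cc-42010a800002",
    "allowed": true
  }
}
```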
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## What if the webhook *does not* respond?
- If "something bad" happens, the API server follows the `failurePolicy` option
- this is a per-webhook option (specified in the webhook configuration)
- it can be `Fail` (the default) or `Ignore` ("allow all, unmodified")
- What's "something bad"?
- webhook responds with something invalid
- webhook takes more than 10 seconds to respond
(this can be changed with `timeoutSeconds` field in the webhook config)
- webhook is down or has invalid certificates
(TLS! It's not just a good idea; for admission control, it's the law!)
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## What did you say about TLS?
- The webhook configuration can indicate:
- either `url` of the webhook server (has to begin with `https://`)
- or `service.name` and `service.namespace` of a Service on the cluster
- In the latter case, the Service has to accept TLS connections on port 443
- It has to use a certificate with CN `<service-name>.<namespace>.svc`
(**and** a `subjectAltName` extension with `DNS:<service-name>.<namespace>.svc`)
- The certificate needs to be valid (signed by a CA trusted by the API server)
... alternatively, we can pass a `caBundle` in the webhook configuration
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Webhook server inside or outside
- "Outside" webhook server is defined with `url` option
- convenient for external webhooks (e.g. tamper-resistant audit trail)
- also great for initial development (e.g. with ngrok)
- requires outbound connectivity (duh) and can become a SPOF
- "Inside" webhook server is defined with `service` option
- convenient when the webhook needs to be deployed and managed on the cluster
- also great for air gapped clusters
- development can be harder (but tools like [Tilt](https://tilt.dev) can help)
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Developing a simple admission webhook
- We're going to register a custom webhook!
- First, we'll just dump the `AdmissionRequest` object
(using a little Node app)
- Then, we'll implement a strict policy on a specific label
(using a little Flask app)
- Development will happen in local containers, plumbed with ngrok
- Then we will deploy to the cluster 🔥
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Running the webhook locally
- We prepared a Docker Compose file to start the whole stack
(the Node "echo" app, the Flask app, and one ngrok tunnel for each of them)
- We will need an ngrok account for the tunnels
(a free account is fine)
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
class: extra-details
## What's ngrok?
- Ngrok provides secure tunnels to access local services
- Example: run `ngrok http 1234`
- `ngrok` will display a publicly-available URL (e.g. https://xxxxyyyyzzzz.ngrok.app)
- Connections to https://xxxxyyyyzzzz.ngrok.app will terminate at `localhost:1234`
- Basic product is free; extra features (vanity domains, end-to-end TLS...) for $$$
- Perfect to develop our webhook!
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
class: extra-details
## Ngrok in production
- Ngrok was initially known for its local webhook development features
- It now supports production scenarios as well
(load balancing, WAF, authentication, circuit-breaking...)
- Including some that are very relevant to Kubernetes
(e.g. the [ngrok Ingress Controller](https://github.com/ngrok/kubernetes-ingress-controller))
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Ngrok tokens
- If you're attending a live training, you might have an ngrok token
- Look in `~/ngrok.env` and if that file exists, copy it to the stack:
.lab[
```bash
cp ~/ngrok.env ~/container.training/webhooks/admission/.env
```
]
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Starting the whole stack
.lab[
- Go to the webhook directory:
```bash
cd ~/container.training/webhooks/admission
```
- Start the webhook in Docker containers:
```bash
docker-compose up
```
]
*Note the URL in `ngrok-echo_1` looking like `url=https://xxxx.ngrok.io`.*
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Update the webhook configuration
- We have a webhook configuration in `k8s/webhook-configuration.yaml`
- We need to update the configuration with the correct `url`
.lab[
- Edit the webhook configuration manifest:
```bash
vim k8s/webhook-configuration.yaml
```
- **Uncomment** the `url:` line
- **Update** the `.ngrok.io` URL with the URL shown by Compose
- Save and quit
]
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Register the webhook configuration
- Just after we register the webhook, it will be called for each matching request
(CREATE and UPDATE on Pods in all namespaces)
- The `failurePolicy` is `Ignore`
(so if the webhook server is down, we can still create pods)
.lab[
- Register the webhook:
```bash
kubectl apply -f k8s/webhook-configuration.yaml
```
]
It is strongly recommended to tail the logs of the API server while doing that.
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Create a pod
- Let's create a pod and try to set a `color` label
.lab[
- Create a pod named `chroma`:
```bash
kubectl run --restart=Never chroma --image=nginx
```
- Add a label `color` set to `pink`:
```bash
kubectl label pod chroma color=pink
```
]
We should see the `AdmissionReview` objects in the Compose logs.
Note: the webhook doesn't do anything (other than printing the request payload).
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Use the "real" admission webhook
- We have a small Flask app implementing a particular policy on pod labels:
- if a pod sets a label `color`, it must be `blue`, `green`, `red`
- once that `color` label is set, it cannot be removed or changed
- That Flask app was started when we did `docker-compose up` earlier
- It is exposed through its own ngrok tunnel
- We are going to use that webhook instead of the other one
(by changing only the `url` field in the ValidatingWebhookConfiguration)
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Update the webhook configuration
.lab[
- First, check the ngrok URL of the tunnel for the Flask app:
```bash
docker-compose logs ngrok-flask
```
- Then, edit the webhook configuration:
```bash
kubectl edit validatingwebhookconfiguration admission.container.training
```
- Find the `url:` field with the `.ngrok.io` URL and update it
- Save and quit; the new configuration is applied immediately
]
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Verify the behavior of the webhook
- Try to create a few pods and/or change labels on existing pods
- What happens if we try to make changes to the earlier pod?
(the one that has the label `color=pink`)
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Deploying the webhook on the cluster
- Let's see what's needed to self-host the webhook server!
- The webhook needs to be reachable through a Service on our cluster
- The Service needs to accept TLS connections on port 443
- We need a proper TLS certificate:
- with the right `CN` and `subjectAltName` (`<service-name>.<namespace>.svc`)
- signed by a trusted CA
- We can either use a "real" CA, or use the `caBundle` option to specify the CA cert
(the latter makes it easy to use self-signed certs)
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## In practice
- We're going to generate a key pair and a self-signed certificate
- We will store them in a Secret
- We will run the webhook in a Deployment, exposed with a Service
- We will update the webhook configuration to use that Service
- The Service will be named `admission`, in Namespace `webhooks`
(keep in mind that the ValidatingWebhookConfiguration itself is at cluster scope)
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Let's get to work!
.lab[
- Make sure we're in the right directory:
```bash
cd ~/container.training/webhooks/admission
```
- Create the namespace:
```bash
kubectl create namespace webhooks
```
- Switch to the namespace:
```bash
kubectl config set-context --current --namespace=webhooks
```
]
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Deploying the webhook
- *Normally,* we would author an image for this
- Since our webhook is just *one* Python source file ...
... we'll store it in a ConfigMap, and install dependencies on the fly
.lab[
- Load the webhook source in a ConfigMap:
```bash
kubectl create configmap admission --from-file=flask/webhook.py
```
- Create the Deployment and Service:
```bash
kubectl apply -f k8s/webhook-server.yaml
```
]
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Generating the key pair and certificate
- Let's call OpenSSL to the rescue!
(of course, there are plenty others options; e.g. `cfssl`)
.lab[
- Generate a self-signed certificate:
```bash
NAMESPACE=webhooks
SERVICE=admission
CN=$SERVICE.$NAMESPACE.svc
openssl req -x509 -newkey rsa:4096 -nodes -keyout key.pem -out cert.pem \
-days 30 -subj /CN=$CN -addext subjectAltName=DNS:$CN
```
- Load up the key and cert in a Secret:
```bash
kubectl create secret tls admission --cert=cert.pem --key=key.pem
```
]
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Update the webhook configuration
- Let's reconfigure the webhook to use our Service instead of ngrok
.lab[
- Edit the webhook configuration manifest:
```bash
vim k8s/webhook-configuration.yaml
```
- Comment out the `url:` line
- Uncomment the `service:` section
- Save, quit
- Update the webhook configuration:
```bash
kubectl apply -f k8s/webhook-configuration.yaml
```
]
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Add our self-signed cert to the `caBundle`
- The API server won't accept our self-signed certificate
- We need to add it to the `caBundle` field in the webhook configuration
- The `caBundle` will be our `cert.pem` file, encoded in base64
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
Shell to the rescue!
.lab[
- Load up our cert and encode it in base64:
```bash
CA=$(base64 -w0 < cert.pem)
```
- Define a patch operation to update the `caBundle`:
```bash
PATCH='[{
"op": "replace",
"path": "/webhooks/0/clientConfig/caBundle",
"value":"'$CA'"
}]'
```
- Patch the webhook configuration:
```bash
kubectl patch validatingwebhookconfiguration \
admission.webhook.container.training \
--type='json' -p="$PATCH"
```
]
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Try it out!
- Keep an eye on the API server logs
- Tail the logs of the pod running the webhook server
- Create a few pods; we should see requests in the webhook server logs
- Check that the label `color` is enforced correctly
(it should only allow values of `red`, `green`, `blue`)
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
## Coming soon...
- Kubernetes Validating Admission Policies
- Integrated with the Kubernetes API server
- Lets us define policies using [CEL (Common Expression Language)][cel-spec]
- Available in beta in Kubernetes 1.28
- Check this [CNCF Blog Post][cncf-blog-vap] for more details
[cncf-blog-vap]: https://www.cncf.io/blog/2023/09/14/policy-management-in-kubernetes-is-changing/
[cel-spec]: https://github.com/google/cel-spec
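As a rough sketch (the policy name and CEL expression are made up for illustration; a ValidatingAdmissionPolicyBinding is also needed to put the policy into effect):
```yaml
apiVersion: admissionregistration.k8s.io/v1beta1   # beta API as of Kubernetes 1.28
kind: ValidatingAdmissionPolicy
metadata:
  name: require-valid-color                        # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  validations:
  - expression: >-
      !has(object.metadata.labels) ||
      !('color' in object.metadata.labels) ||
      object.metadata.labels['color'] in ['red', 'green', 'blue']
    message: "the color label must be red, green, or blue"
```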
???
:EN:- Dynamic admission control with webhooks
:FR:- Contrôle d'admission dynamique (webhooks)
.debug[[k8s/admission.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/admission.md)]
---
class: pic
.interstitial[]
---
name: toc-operators
class: title
Operators
.nav[
[Previous part](#toc-dynamic-admission-control)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-designing-an-operator)
]
.debug[(automatically generated title slide)]
---
# Operators
The Kubernetes documentation describes the [Operator pattern] as follows:
*Operators are software extensions to Kubernetes that make use of custom resources to manage applications and their components. Operators follow Kubernetes principles, notably the control loop.*
Another good definition from [CoreOS](https://coreos.com/blog/introducing-operators.html):
*An operator represents **human operational knowledge in software,**
to reliably manage an application.*
There are many different use cases spanning different domains; but the general idea is:
*Manage some resources (that reside inside or outside the cluster),
using Kubernetes manifests and tooling.*
[Operator pattern]: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
.debug[[k8s/operators.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators.md)]
---
## Some use cases
- Managing external resources ([AWS], [GCP], [KubeVirt]...)
- Setting up database replication or distributed systems
(Cassandra, Consul, CouchDB, ElasticSearch, etcd, Kafka, MongoDB, MySQL, PostgreSQL, RabbitMQ, Redis, ZooKeeper...)
- Running and configuring CI/CD
([ArgoCD], [Flux]), backups ([Velero]), policies ([Gatekeeper], [Kyverno])...
- Automating management of certificates ([cert-manager]) and secrets
([External Secrets Operator], [Sealed Secrets]...)
- Configuration of cluster components ([Istio], [Prometheus])
- etc.
[ArgoCD]: https://argoproj.github.io/cd/
[AWS]: https://aws-controllers-k8s.github.io/community/docs/community/services/
[cert-manager]: https://cert-manager.io/
[External Secrets Operator]: https://external-secrets.io/
[Flux]: https://fluxcd.io/
[Gatekeeper]: https://open-policy-agent.github.io/gatekeeper/website/docs/
[GCP]: https://github.com/paulczar/gcp-cloud-compute-operator
[Istio]: https://istio.io/latest/docs/setup/install/operator/
[KubeVirt]: https://kubevirt.io/
[Kyverno]: https://kyverno.io/
[Prometheus]: https://prometheus-operator.dev/
[Sealed Secrets]: https://github.com/bitnami-labs/sealed-secrets
[Velero]: https://velero.io/
.debug[[k8s/operators.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators.md)]
---
## What are they made from?
- Operators combine two things:
- Custom Resource Definitions
- controller code watching the corresponding resources and acting upon them
- A given operator can define one or multiple CRDs
- The controller code (control loop) typically runs within the cluster
(running as a Deployment with 1 replica is a common scenario)
- But it could also run elsewhere
(nothing mandates that the code run on the cluster, as long as it has API access)
.debug[[k8s/operators.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators.md)]
---
## Operators for e.g. replicated databases
- Kubernetes gives us Deployments, StatefulSets, Services ...
- These mechanisms give us building blocks to deploy applications
- They work great for services that are made of *N* identical containers
(like stateless ones)
- They also work great for some stateful applications like Consul, etcd ...
(with the help of highly persistent volumes)
- They're not enough for complex services:
- where different containers have different roles
- where extra steps have to be taken when scaling or replacing containers
.debug[[k8s/operators.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators.md)]
---
## How operators work
- An operator creates one or more CRDs
(i.e., it creates new "Kinds" of resources on our cluster)
- The operator also runs a *controller* that will watch its resources
- Each time we create/update/delete a resource, the controller is notified
(we could write our own cheap controller with `kubectl get --watch`)
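For illustration, a *very* naive control loop could be a shell pipeline like this (a sketch only; it assumes the `pizzas` CRD seen earlier, and a real controller would use informers, caching, and proper reconciliation):
```bash
# Print a line (and "reconcile") every time a Pizza resource changes
kubectl get pizzas --watch -o name |
while read RESOURCE; do
  echo "Change detected on $RESOURCE; reconciling..."
  # a real controller would now compare actual state with desired state
done
```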
.debug[[k8s/operators.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators.md)]
---
## Operators are not magic
- Look at this ElasticSearch resource definition:
[k8s/eck-elasticsearch.yaml](https://github.com/jpetazzo/container.training/tree/master/k8s/eck-elasticsearch.yaml)
- What should happen if we flip the TLS flag? Twice?
- What should happen if we add another group of nodes?
- What if we want different images or parameters for the different nodes?
*Operators can be very powerful.
But we need to know exactly the scenarios that they can handle.*
???
:EN:- Kubernetes operators
:FR:- Les opérateurs
.debug[[k8s/operators.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators.md)]
---
class: pic
.interstitial[]
---
name: toc-designing-an-operator
class: title
Designing an operator
.nav[
[Previous part](#toc-operators)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-writing-a-tiny-operator)
]
.debug[(automatically generated title slide)]
---
# Designing an operator
- Once we understand CRDs and operators, it's tempting to use them everywhere
- Yes, we can do (almost) everything with operators ...
- ... But *should we?*
- Very often, the answer is **“no!”**
- Operators are powerful, but significantly more complex than other solutions
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## When should we (not) use operators?
- Operators are great if our app needs to react to cluster events
(nodes or pods going down, and requiring extensive reconfiguration)
- Operators *might* be helpful to encapsulate complexity
(manipulate one single custom resource for an entire stack)
- Operators are probably overkill if a Helm chart would suffice
- That being said, if we really want to write an operator ...
Read on!
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## What does it take to write an operator?
- Writing a quick-and-dirty operator, or a POC/MVP, is easy
- Writing a robust operator is hard
- We will describe the general idea
- We will identify some of the associated challenges
- We will list a few tools that can help us
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## Top-down vs. bottom-up
- Both approaches are possible
- Let's see what they entail, and their respective pros and cons
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## Top-down approach
- Start with high-level design (see next slide)
- Pros:
- can yield cleaner design that will be more robust
- Cons:
- must be able to anticipate all the events that might happen
- design will be better only to the extent of what we anticipated
- hard to anticipate if we don't have production experience
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## High-level design
- What are we solving?
(e.g.: geographic databases backed by PostGIS with Redis caches)
- What are our use-cases, stories?
(e.g.: adding/resizing caches and read replicas; load balancing queries)
- What kind of outage do we want to address?
(e.g.: loss of individual node, pod, volume)
- What are our *non-features*, the things we don't want to address?
(e.g.: loss of datacenter/zone; differentiating between read and write queries;
cache invalidation; upgrading to newer major versions of Redis, PostGIS, PostgreSQL)
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## Low-level design
- What Custom Resource Definitions do we need?
(one, many?)
- How will we store configuration information?
(part of the CRD spec fields, annotations, other?)
- Do we need to store state? If so, where?
- state that is small and doesn't change much can be stored via the Kubernetes API
(e.g.: leader information, configuration, credentials)
- things that are big and/or change a lot should go elsewhere
(e.g.: metrics, bigger configuration files like GeoIP databases)
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
class: extra-details
## What can we store via the Kubernetes API?
- The API server stores most Kubernetes resources in etcd
- Etcd is designed for reliability, not for performance
- If our storage needs exceed what etcd can offer, we need to use something else:
- either directly
- or by extending the API server
(for instance by using the aggregation layer, like [metrics server](https://github.com/kubernetes-incubator/metrics-server) does)
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## Bottom-up approach
- Start with existing Kubernetes resources (Deployment, Stateful Set...)
- Run the system in production
- Add scripts and automation to facilitate day-to-day operations
- Turn the scripts into an operator
- Pros: simpler to get started; reflects actual use-cases
- Cons: can result in convoluted designs requiring extensive refactoring
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## General idea
- Our operator will watch its CRDs *and associated resources*
- Drawing state diagrams and finite state automata helps a lot
- It's OK if some transitions lead to a big catch-all "human intervention"
- Over time, we will learn about new failure modes and add to these diagrams
- It's OK to start with CRD creation / deletion and prevent any modification
(that's the easy POC/MVP we were talking about)
- *Presentation* and *validation* will help our users
(more on that later)
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## Challenges
- Reacting to infrastructure disruption can seem hard at first
- Kubernetes gives us a lot of primitives to help:
- Pods and Persistent Volumes will *eventually* recover
- Stateful Sets give us easy ways to "add N copies" of a thing
- The real challenges come with configuration changes
(i.e., what to do when our users update our CRDs)
- Keep in mind that [some] of the [largest] cloud [outages] haven't been caused by [natural catastrophes], or even code bugs, but by configuration changes
[some]: https://www.datacenterdynamics.com/news/gcp-outage-mainone-leaked-google-cloudflare-ip-addresses-china-telecom/
[largest]: https://aws.amazon.com/message/41926/
[outages]: https://aws.amazon.com/message/65648/
[natural catastrophes]: https://www.datacenterknowledge.com/amazon/aws-says-it-s-never-seen-whole-data-center-go-down
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## Configuration changes
- It is helpful to analyze and understand how Kubernetes controllers work:
- watch resource for modifications
- compare desired state (CRD) and current state
- issue actions to converge state
- Configuration changes will probably require *another* state diagram or FSA
- Again, it's OK to have transitions labeled as "unsupported"
(i.e. reject some modifications because we can't execute them)
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## Tools
- CoreOS / Red Hat Operator Framework
[GitHub](https://github.com/operator-framework)
|
[Blog](https://developers.redhat.com/blog/2018/12/18/introduction-to-the-kubernetes-operator-framework/)
|
[Intro talk](https://www.youtube.com/watch?v=8k_ayO1VRXE)
|
[Deep dive talk](https://www.youtube.com/watch?v=fu7ecA2rXmc)
|
[Simple example](https://medium.com/faun/writing-your-first-kubernetes-operator-8f3df4453234)
- Kubernetes Operator Pythonic Framework (KOPF)
[GitHub](https://github.com/nolar/kopf)
|
[Docs](https://kopf.readthedocs.io/)
|
[Step-by-step tutorial](https://kopf.readthedocs.io/en/stable/walkthrough/problem/)
- Mesosphere Kubernetes Universal Declarative Operator (KUDO)
[GitHub](https://github.com/kudobuilder/kudo)
|
[Blog](https://mesosphere.com/blog/announcing-maestro-a-declarative-no-code-approach-to-kubernetes-day-2-operators/)
|
[Docs](https://kudo.dev/)
|
[Zookeeper example](https://github.com/kudobuilder/frameworks/tree/master/repo/stable/zookeeper)
- Kubebuilder (Go, very close to the Kubernetes API codebase)
[GitHub](https://github.com/kubernetes-sigs/kubebuilder)
|
[Book](https://book.kubebuilder.io/)
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## Validation
- By default, a CRD is "free form"
(we can put pretty much anything we want in it)
- When creating a CRD, we can provide an OpenAPI v3 schema
([Example](https://github.com/amaizfinance/redis-operator/blob/master/deploy/crds/k8s_v1alpha1_redis_crd.yaml#L34); see also the sketch below)
- The API server will then validate resources created/edited with this schema
- If we need a stronger validation, we can use a Validating Admission Webhook:
- run an [admission webhook server](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#write-an-admission-webhook-server) to receive validation requests
- register the webhook by creating a [ValidatingWebhookConfiguration](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#configure-admission-webhooks-on-the-fly)
- each time the API server receives a request matching the configuration,
the request is sent to our server for validation
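As an illustration, here is a minimal sketch of a CRD embedding an OpenAPI v3 schema (the `IceCream` resource and its fields are made up for this example):
```bash
kubectl apply -f - <<'EOF'
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: icecreams.container.training
spec:
  group: container.training
  scope: Namespaced
  names:
    kind: IceCream
    singular: icecream
    plural: icecreams
  versions:
  - name: v1alpha1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              flavor:
                type: string
              scoops:
                type: integer
                minimum: 1
EOF
```
With that schema in place, the API server rejects e.g. an `IceCream` whose `scoops` field is a string, or lower than 1.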
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## Presentation
- By default, `kubectl get mycustomresource` won't display much information
(just the name and age of each resource)
- When creating a CRD, we can specify additional columns to print
([Example](https://github.com/amaizfinance/redis-operator/blob/master/deploy/crds/k8s_v1alpha1_redis_crd.yaml#L6),
[Docs](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#additional-printer-columns))
- By default, `kubectl describe mycustomresource` will also be generic
- `kubectl describe` can show events related to our custom resources
(for that, we need to create Event resources, and fill the `involvedObject` field)
- For scalable resources, we can define a `scale` sub-resource
- This will enable the use of `kubectl scale` and other scaling-related operations
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## About scaling
- It is possible to use the HPA (Horizontal Pod Autoscaler) with CRDs
- But it is not always desirable
- The HPA works very well for homogeneous, stateless workloads
- For other workloads, your mileage may vary
- Some systems can scale across multiple dimensions
(for instance: increase number of replicas, or number of shards?)
- If autoscaling is desired, the operator will have to take complex decisions
(example: Zalando's Elasticsearch Operator ([Video](https://www.youtube.com/watch?v=lprE0J0kAq0)))
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## Versioning
- As our operator evolves over time, we may have to change the CRD
(add, remove, change fields)
- Like every other resource in Kubernetes, [custom resources are versioned](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definition-versioning/)
- When creating a CRD, we need to specify a *list* of versions
- Versions can be marked as `stored` and/or `served`
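For instance, we can check the versions declared by a CRD, and their flags (using the hypothetical `icecreams` CRD from the validation example, or any CRD installed on the cluster):
```bash
kubectl get crd icecreams.container.training -o jsonpath=\
'{range .spec.versions[*]}{.name} served={.served} storage={.storage}{"\n"}{end}'
```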
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## Stored version
- Exactly one version has to be marked as the `stored` version
- As the name implies, it is the one that will be stored in etcd
- Resources in storage are never converted automatically
(we need to read and re-write them ourselves)
- Yes, this means that we can have different versions in etcd at any time
- Our code needs to handle all the versions that still exist in storage
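We can get an idea of which versions may still be present in etcd by looking at the CRD's `storedVersions` field (again with the hypothetical `icecreams` CRD as an example):
```bash
kubectl get crd icecreams.container.training -o jsonpath='{.status.storedVersions}{"\n"}'
```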
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## Served versions
- By default, the Kubernetes API will serve resources "as-is"
(using their stored version)
- It will assume that all versions are compatible storage-wise
(i.e. that the spec and fields are compatible between versions)
- We can provide [conversion webhooks](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definition-versioning/#webhook-conversion) to "translate" requests
(the alternative is to upgrade all stored resources and stop serving old versions)
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## Operator reliability
- Remember that the operator itself must be resilient
(e.g.: the node running it can fail)
- Our operator must be able to restart and recover gracefully
- Do not store state locally
(unless we can reconstruct that state when we restart)
- As indicated earlier, we can use the Kubernetes API to store data:
- in the custom resources themselves
- in other resources' annotations
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
## Beyond CRDs
- CRDs cannot use custom storage (e.g. for time series data)
- CRDs cannot support arbitrary subresources (like logs or exec for Pods)
- CRDs cannot support protobuf (for faster, more efficient communication)
- If we need these things, we can use the [aggregation layer](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/) instead
- The aggregation layer proxies all requests below a specific path to another server
(this is used e.g. by the metrics server)
- [This documentation page](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#choosing-a-method-for-adding-custom-resources) compares the features of CRDs and API aggregation
???
:EN:- Guidelines to design our own operators
:FR:- Comment concevoir nos propres opérateurs
.debug[[k8s/operators-design.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-design.md)]
---
class: pic
.interstitial[]
---
name: toc-writing-a-tiny-operator
class: title
Writing a tiny operator
.nav[
[Previous part](#toc-designing-an-operator)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-kubebuilder)
]
.debug[(automatically generated title slide)]
---
# Writing a tiny operator
- Let's look at a simple operator
- It does have:
- a control loop
- resource lifecycle management
- basic logging
- It doesn't have:
- CRDs (and therefore, resource versioning, conversion webhooks...)
- advanced observability (metrics, Kubernetes Events)
.debug[[k8s/operators-example.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-example.md)]
---
## Use case
*When I push code to my source control system, I want that code
to be built into a container image, and that image to be deployed
in a staging environment. I want each branch/tag/commit (depending
on my needs) to be deployed into its specific Kubernetes Namespace.*
- The last part requires the CI/CD pipeline to manage Namespaces
- ...And permissions in these Namespaces
- This requires elevated privileges for the CI/CD pipeline
(read: `cluster-admin`)
- If the CI/CD pipeline is compromised, this can lead to cluster compromise
- This can be a concern if the CI/CD pipeline is part of the repository
(which is the default modus operandi with GitHub, GitLab, Bitbucket...)
.debug[[k8s/operators-example.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-example.md)]
---
## Proposed solution
- On-demand creation of Namespaces
- Creation is triggered by creating a ConfigMap in a dedicated Namespace
- Namespaces are set up with basic permissions
- Credentials are generated for each Namespace
- Credentials only give access to their Namespace
- Credentials are exposed back to the dedicated configuration Namespace
- Operator implemented as a shell script
.debug[[k8s/operators-example.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-example.md)]
---
## An operator in shell... Really?
- About 150 lines of code
(including comments + white space)
- Performance doesn't matter
- operator work will be a tiny fraction of CI/CD pipeline work
- uses *watch* semantics to minimize control plane load
- Easy to understand, easy to audit, easy to tweak
.debug[[k8s/operators-example.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-example.md)]
---
## Show me the code!
- GitHub repository and documentation:
https://github.com/jpetazzo/nsplease
- Operator source code:
https://github.com/jpetazzo/nsplease/blob/main/nsplease.sh
.debug[[k8s/operators-example.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-example.md)]
---
## Main loop
```bash
info "Waiting for ConfigMap events in $REQUESTS_NAMESPACE..."
kubectl --namespace $REQUESTS_NAMESPACE get configmaps \
--watch --output-watch-events -o json \
| jq --unbuffered --raw-output '[.type,.object.metadata.name] | @tsv' \
| while read TYPE NAMESPACE; do
debug "Got event: $TYPE $NAMESPACE"
```
- `--watch` to avoid active-polling the control plane
- `--output-watch-events` to get the event type (so we can e.g. disregard resource deletions and modifications)
- `jq` to process JSON easily
.debug[[k8s/operators-example.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-example.md)]
---
## Resource ownership
- Check out the `kubectl patch` commands
- The created Namespace "owns" the corresponding ConfigMap and Secret
- This means that deleting the Namespace will delete the ConfigMap and Secret
- We don't need to watch for object deletion to clean up
- Clean up will be done automatically, even if the operator is not running
.debug[[k8s/operators-example.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-example.md)]
---
## Why no CRD?
- It's easier to create a ConfigMap
(e.g. a `kubectl create configmap --from-literal=` one-liner, as shown below)
- We don't need the features of CRDs
(schemas, printer columns, versioning...)
- “This CRD could have been a ConfigMap!”
(this doesn't mean *all* CRDs could be ConfigMaps, of course)
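For instance, assuming `$REQUESTS_NAMESPACE` is the Namespace watched by the operator (shown in the main loop a few slides back), requesting a Namespace named `myapp-pr-42` can boil down to:
```bash
# The name of the ConfigMap is the name of the requested Namespace
kubectl --namespace $REQUESTS_NAMESPACE create configmap myapp-pr-42
```
(Check the nsplease documentation for the exact conventions and how the credentials are exposed back.)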
.debug[[k8s/operators-example.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-example.md)]
---
## Discussion
- A lot of simple yet efficient logic can be implemented in shell scripts
- These can be used to prototype more complex operators
- Not all use-cases require CRDs
(keep in mind that correct CRDs are *a lot* of work!)
- If the algorithms are correct, shell performance won't matter at all
(but it will be difficult to keep a resource cache in shell)
- Improvement idea: this operator could generate *events*
(visible with `kubectl get events` and `kubectl describe`)
???
:EN:- How to write a simple operator with shell scripts
:FR:- Comment écrire un opérateur simple en shell script
.debug[[k8s/operators-example.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/operators-example.md)]
---
class: pic
.interstitial[]
---
name: toc-kubebuilder
class: title
Kubebuilder
.nav[
[Previous part](#toc-writing-a-tiny-operator)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-sealed-secrets)
]
.debug[(automatically generated title slide)]
---
# Kubebuilder
- Writing a quick and dirty operator is (relatively) easy
- Doing it right, however ...
--
- We need:
- proper CRD with schema validation
- controller performing a reconciliation loop
- handling of errors, retries, dependencies between resources
- maybe webhooks for admission and/or conversion
😱
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Frameworks
- There are a few frameworks available out there:
- [kubebuilder](https://github.com/kubernetes-sigs/kubebuilder)
([book](https://book.kubebuilder.io/)):
Go-centric, very close to Kubernetes' core types
- [operator-framework](https://operatorframework.io/):
higher level; also supports Ansible and Helm
- [KUDO](https://kudo.dev/):
declarative operators written in YAML
- [KOPF](https://kopf.readthedocs.io/en/latest/):
operators in Python
- ...
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Kubebuilder workflow
- Kubebuilder will create scaffolding for us
(Go stubs for types and controllers)
- Then we edit these types and controllers files
- Kubebuilder generates CRD manifests from our type definitions
(and regenerates the manifests whenever we update the types)
- It also gives us tools to quickly run the controller against a cluster
(not necessarily *on* the cluster)
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Our objective
- We're going to implement a *useless machine*
[basic example](https://www.youtube.com/watch?v=aqAUmgE3WyM)
|
[playful example](https://www.youtube.com/watch?v=kproPsch7i0)
|
[advanced example](https://www.youtube.com/watch?v=Nqk_nWAjBus)
|
[another advanced example](https://www.youtube.com/watch?v=eLtUB8ncEnA)
- A machine manifest will look like this:
```yaml
kind: Machine
apiVersion: useless.container.training/v1alpha1
metadata:
name: machine-1
spec:
# Our useless operator will change that to "down"
switchPosition: up
```
- Each time we change the `switchPosition`, the operator will move it back to `down`
(This is inspired by the
[uselessoperator](https://github.com/tilt-dev/uselessoperator)
written by
[V Körbes](https://twitter.com/veekorbes).
Highly recommend!💯)
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
class: extra-details
## Local vs remote
- Building Go code can be a little bit slow on our modest lab VMs
- It will typically be *much* faster on a local machine
- All the demos and labs in this section will run fine either way!
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Preparation
- Install Go
(on our VMs: `sudo snap install go --classic` or `sudo apk add go`)
- Install kubebuilder
([get a release](https://github.com/kubernetes-sigs/kubebuilder/releases/), untar, move the `kubebuilder` binary to the `$PATH`)
- Initialize our workspace:
```bash
mkdir useless
cd useless
go mod init container.training/useless
kubebuilder init --domain container.training
```
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Create scaffolding
- Create a type and corresponding controller:
```bash
kubebuilder create api --group useless --version v1alpha1 --kind Machine
```
- Answer `y` to both questions
- Then we need to edit the type that just got created!
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Edit type
Edit `api/v1alpha1/machine_types.go`.
Add the `switchPosition` field in the `spec` structure:
```go
// MachineSpec defines the desired state of Machine
type MachineSpec struct {
// Position of the switch on the machine, for instance up or down.
SwitchPosition string ``json:"switchPosition,omitempty"``
}
```
⚠️ The backticks above should be simple backticks, not double-backticks. Sorry.
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Go markers
We can use Go *marker comments* to give `controller-gen` extra details about how to handle our type, for instance:
```go
//+kubebuilder:object:root=true
```
→ top-level type exposed through API (as opposed to "member field of another type")
```go
//+kubebuilder:subresource:status
```
→ automatically generate a `status` subresource (very common with many types)
```go
//+kubebuilder:printcolumn:JSONPath=".spec.switchPosition",name=Position,type=string
```
(See
[marker syntax](https://book.kubebuilder.io/reference/markers.html),
[CRD generation](https://book.kubebuilder.io/reference/markers/crd.html),
[CRD validation](https://book.kubebuilder.io/reference/markers/crd-validation.html),
[Object/DeepCopy](https://master.book.kubebuilder.io/reference/markers/object.html)
)
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Installing the CRD
After making these changes, we can run `make install`.
This will build the Go code, but also:
- generate the CRD manifest
- and apply the manifest to the cluster
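We can then check that the CRD exists (with the group and domain used earlier, it should be named `machines.useless.container.training`):
```bash
kubectl get crd machines.useless.container.training
kubectl explain machine.spec
```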
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Creating a machine
Edit `config/samples/useless_v1alpha1_machine.yaml`:
```yaml
kind: Machine
apiVersion: useless.container.training/v1alpha1
metadata:
labels: # ...
name: machine-1
spec:
# Our useless operator will change that to "down"
switchPosition: up
```
... and apply it to the cluster.
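One way to do that:
```bash
kubectl apply -f config/samples/useless_v1alpha1_machine.yaml
kubectl get machines
```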
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Designing the controller
- Our controller needs to:
- notice when a `switchPosition` is not `down`
- move it to `down` when that happens
- Later, we can add fancy improvements (wait a bit before moving it, etc.)
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Reconciler logic
- Kubebuilder will call our *reconciler* when necessary
- When necessary = when changes happen ...
- on our resource
- or resources that it *watches* (related resources)
- After "doing stuff", the reconciler can return ...
- `ctrl.Result{},nil` = all is good
- `ctrl.Result{Requeue...},nil` = all is good, but call us back in a bit
- `ctrl.Result{},err` = something's wrong, try again later
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Loading an object
Open `internal/controllers/machine_controller.go`.
Add that code in the `Reconcile` method, at the `TODO(user)` location:
```go
var machine uselessv1alpha1.Machine
logger := log.FromContext(ctx)
if err := r.Get(ctx, req.NamespacedName, &machine); err != nil {
logger.Info("error getting object")
return ctrl.Result{}, err
}
logger.Info(
"reconciling",
"machine", req.NamespacedName,
"switchPosition", machine.Spec.SwitchPosition,
)
```
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Running the controller
Our controller is not done yet, but let's try what we have right now!
This will compile the controller and run it:
```
make run
```
Then:
- create a machine
- change the `switchPosition`
- delete the machine
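One possible way to run these steps, from another terminal:
```bash
kubectl apply -f config/samples/useless_v1alpha1_machine.yaml
kubectl patch machine machine-1 --type=merge -p '{"spec":{"switchPosition":"up"}}'
kubectl delete machine machine-1
```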
--
We get a bunch of errors and Go stack traces! 🤔
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## `IgnoreNotFound`
When we are called for object deletion, the object has *already* been deleted.
(Unless we're using finalizers, but that's another story.)
When we return `err`, the controller will try to access the object ...
... We need to tell it to *not* do that.
Don't just return `err`; instead, wrap it with `client.IgnoreNotFound`:
```go
return ctrl.Result{}, client.IgnoreNotFound(err)
```
Update the code, `make run` again, create/change/delete again.
--
🎉
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Updating the machine
Let's try to update the machine like this:
```go
if machine.Spec.SwitchPosition != "down" {
machine.Spec.SwitchPosition = "down"
if err := r.Update(ctx, &machine); err != nil {
logger.Info("error updating switch position")
return ctrl.Result{}, client.IgnoreNotFound(err)
}
}
```
Again - update, `make run`, test.
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Spec vs Status
- Spec = desired state
- Status = observed state
- If Status is lost, the controller should be able to reconstruct it
(maybe with degraded behavior in the meantime)
- Status will almost always be a sub-resource, so that it can be updated separately
(and potentially with different permissions)
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
class: extra-details
## Spec vs Status (in depth)
- The `/status` subresource is handled differently by the API server
- Updates to `/status` don't alter the rest of the object
- Conversely, updates to the object ignore changes in the status
(See [the docs](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#status-subresource) for the fine print.)
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## "Improving" our controller
- We want to wait a few seconds before flipping the switch
- Let's add the following line of code to the controller:
```go
time.Sleep(5 * time.Second)
```
- `make run`, create a few machines, observe what happens
--
💡 Concurrency!
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Controller logic
- Our controller shouldn't block (think "event loop")
- There is a queue of objects that need to be reconciled
- We can ask to be put back on the queue for later processing
- When we need to block (wait for something to happen), two options:
- ask for a *requeue* ("call me back later")
- yield because we know we will be notified by another resource
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## To requeue ...
`return ctrl.Result{RequeueAfter: 1 * time.Second}, nil`
- That means: "try again in 1 second, and I will check if progress was made"
- This *does not* guarantee that we will be called exactly 1 second later:
- we might be called before (if other changes happen)
- we might be called after (if the controller is busy with other objects)
- If we are waiting for another Kubernetes resource to change, there is a better way
(explained on next slide)
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## ... or not to requeue
`return ctrl.Result{}, nil`
- That means: "we're done here!"
- This is also what we should use if we are waiting for another resource
(e.g. a LoadBalancer to be provisioned, a Pod to be ready...)
- In that case, we will need to set a *watch* (more on that later)
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Keeping track of state
- If we simply requeue the object to examine it 1 second later...
- ...We'll keep examining/requeuing it forever!
- We need to "remember" that we saw it (and when)
- Option 1: keep state in controller
(e.g. an internal `map`)
- Option 2: keep state in the object
(typically in its status field)
- Tradeoffs: concurrency / failover / control plane overhead...
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## "Improving" our controller, take 2
Let's store in the machine status the moment when we saw it:
```go
type MachineStatus struct {
// Time at which the machine was noticed by our controller.
SeenAt *metav1.Time ``json:"seenAt,omitempty"``
}
```
⚠️ The backticks above should be simple backticks, not double-backticks. Sorry.
Note: `date` fields don't display timestamps in the future.
(That's why for this example it's simpler to use `seenAt` rather than `changeAt`.)
And for better visibility, add this along with the other `printcolumn` comments:
```go
//+kubebuilder:printcolumn:JSONPath=".status.seenAt",name=Seen,type=date
```
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Set `seenAt`
Let's add the following block in our reconciler:
```go
if machine.Status.SeenAt == nil {
now := metav1.Now()
machine.Status.SeenAt = &now
if err := r.Status().Update(ctx, &machine); err != nil {
logger.Info("error updating status.seenAt")
return ctrl.Result{}, client.IgnoreNotFound(err)
}
return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
}
```
(If needed, add `metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"` to our imports.)
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Use `seenAt`
Our switch-position-changing code can now become:
```go
if machine.Spec.SwitchPosition != "down" {
now := metav1.Now()
changeAt := machine.Status.SeenAt.Time.Add(5 * time.Second)
if now.Time.After(changeAt) {
machine.Spec.SwitchPosition = "down"
machine.Status.SeenAt = nil
if err := r.Update(ctx, &machine); err != nil {
logger.Info("error updating switch position")
return ctrl.Result{}, client.IgnoreNotFound(err)
}
}
}
```
`make run`, create a few machines, tweak their switches.
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Owner and dependents
- Next, let's see how to have relationships between objects!
- We will now have two kinds of objects: machines, and switches
- Machines will store the number of switches in their spec
- Machines should have *at least* one switch, possibly *multiple ones*
- Our controller will automatically create switches if needed
(a bit like the ReplicaSet controller automatically creates Pods)
- The switches will be tied to their machine through a label
(let's pick `machine=name-of-the-machine`)
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Switch state
- The position of a switch will now be stored in the switch
(not in the machine like in the first scenario)
- The machine will also expose the combined state of the switches
(through its status)
- The machine's status will be automatically updated by the controller
(each time a switch is added/changed/removed)
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Switches and machines
```
[jp@hex ~]$ kubectl get machines
NAME SWITCHES POSITIONS
machine-cz2vl 3 ddd
machine-vf4xk 1 d
[jp@hex ~]$ kubectl get switches --show-labels
NAME POSITION SEEN LABELS
switch-6wmjw down machine=machine-cz2vl
switch-b8csg down machine=machine-cz2vl
switch-fl8dq down machine=machine-cz2vl
switch-rc59l down machine=machine-vf4xk
```
(The field `status.positions` shows the first letter of the `position` of each switch.)
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Tasks
1. Create the new resource type (but don't create a controller)
2. Update `machine_types.go` and `switch_types.go`
3. Implement logic to display machine status (status of its switches)
4. Implement logic to automatically create switches
5. Implement logic to flip all switches down immediately
6. Then tweak it so that a given machine doesn't flip more than one switch every 5 seconds
*See next slides for detailed steps!*
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Creating the new type
```bash
kubebuilder create api --group useless --version v1alpha1 --kind Switch
```
Note: this time, only create a new custom resource; not a new controller.
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Updating our types
- Move the "switch position" and "seen at" to the new `Switch` type
- Update the `Machine` type to have:
- `spec.switches` (Go type: `int`, JSON type: `integer`)
- `status.positions` of type `string`
- Bonus points for adding [CRD Validation](https://book.kubebuilder.io/reference/markers/crd-validation.html) to the numbers of switches!
- Then install the new CRDs with `make install`
- Create a Machine, and a Switch linked to the Machine (by setting the `machine` label)
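For example, assuming we named the JSON fields `switches` and `position`, the Machine and its Switch could be created like this:
```bash
kubectl apply -f - <<'EOF'
kind: Machine
apiVersion: useless.container.training/v1alpha1
metadata:
  name: machine-1
spec:
  switches: 1
---
kind: Switch
apiVersion: useless.container.training/v1alpha1
metadata:
  name: switch-manual
  labels:
    machine: machine-1
spec:
  position: down
EOF
```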
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Listing switches
- Switches are associated to Machines with a label
(`kubectl label switch switch-xyz machine=machine-xyz`)
- We can retrieve associated switches like this:
```go
var switches uselessv1alpha1.SwitchList
if err := r.List(ctx, &switches,
client.InNamespace(req.Namespace),
client.MatchingLabels{"machine": req.Name},
); err != nil {
logger.Error(err, "unable to list switches of the machine")
return ctrl.Result{}, client.IgnoreNotFound(err)
}
logger.Info("Found switches", "switches", switches)
```
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Updating status
- Each time we reconcile a Machine, let's update its status:
```go
status := ""
for _, sw := range switches.Items {
status += string(sw.Spec.Position[0])
}
machine.Status.Positions = status
if err := r.Status().Update(ctx, &machine); err != nil {
...
```
- Run the controller and check that POSITIONS gets updated
- Add more switches linked to the same machine
- ...The POSITIONS don't get updated, unless we restart the controller
- We'll see later how to fix that!
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Creating objects
We can use the `Create` method to create a new object:
```go
sw := uselessv1alpha1.Switch{
TypeMeta: metav1.TypeMeta{
APIVersion: uselessv1alpha1.GroupVersion.String(),
Kind: "Switch",
},
ObjectMeta: metav1.ObjectMeta{
GenerateName: "switch-",
Namespace: machine.Namespace,
Labels: map[string]string{"machine": machine.Name},
},
Spec: uselessv1alpha1.SwitchSpec{
Position: "down",
},
}
if err := r.Create(ctx, &sw); err != nil { ...
```
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Create missing switches
- In our reconciler, if a machine doesn't have enough switches, create them!
- Option 1: directly create the number of missing switches
- Option 2: create only one switch (and rely on later requeuing)
- Note: option 2 won't quite work yet, since we haven't set up *watches* yet
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Watches
- Our controller doesn't react when switches are created/updated/deleted
- We need to tell it to watch switches
- We also need to tell it how to map a switch to its machine
(so that the correct machine gets queued and reconciled when a switch is updated)
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Mapping a switch to its machine
Define the following helper function:
```go
func (r *MachineReconciler) machineOfSwitch(ctx context.Context, obj client.Object) []ctrl.Request {
return []ctrl.Request{
ctrl.Request{
NamespacedName: types.NamespacedName{
Name: obj.GetLabels()["machine"],
Namespace: obj.GetNamespace(),
},
},
}
}
```
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Telling the controller to watch switches
Update the `SetupWithManager` method in the controller:
```go
// SetupWithManager sets up the controller with the Manager.
func (r *MachineReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&uselessv1alpha1.Machine{}).
Owns(&uselessv1alpha1.Switch{}).
Watches(
&uselessv1alpha1.Switch{},
handler.EnqueueRequestsFromMapFunc(r.machineOfSwitch),
).
Complete(r)
}
```
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## ...And a few extra imports
Import the following packages referenced by the previous code:
```go
"sigs.k8s.io/controller-runtime/pkg/handler"
"sigs.k8s.io/controller-runtime/pkg/source"
"k8s.io/apimachinery/pkg/types"
```
After this, when we update a switch, it should reflect on the machine.
(Try to change switch positions and see the machine status update!)
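For instance (replace the Switch name with one of the names generated on your cluster):
```bash
kubectl patch switch switch-xxxxx --type=merge -p '{"spec":{"position":"up"}}'
kubectl get machines
```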
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Flipping switches
- Now re-add logic to flip switches that are not in "down" position
- Re-add logic to wait a few seconds before flipping a switch
- Change the logic to toggle one switch per machine every few seconds
(i.e. don't change all the switches for a machine; move them one at a time)
- Handle "scale down" of a machine (by deleting extraneous switches)
- Automatically delete switches when a machine is deleted
(ideally, using ownership information)
- Test corner cases (e.g. changing a switch label)
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Other possible improvements
- Formalize resource ownership
(by setting `ownerReferences` in the switches)
- This can simplify the watch mechanism a bit
- Allow defining a selector
(instead of using the hard-coded `machine` label)
- And much more!
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
## Acknowledgements
- Useless Operator, by [V Körbes](https://twitter.com/veekorbes)
[code](https://github.com/tilt-dev/uselessoperator)
|
[video (EN)](https://www.youtube.com/watch?v=85dKpsFFju4)
|
[video (PT)](https://www.youtube.com/watch?v=Vt7Eg4wWNDw)
- Zero To Operator, by [Solly Ross](https://twitter.com/directxman12)
[code](https://pres.metamagical.dev/kubecon-us-2019/code)
|
[video](https://www.youtube.com/watch?v=KBTXBUVNF2I)
|
[slides](https://pres.metamagical.dev/kubecon-us-2019/)
- The [kubebuilder book](https://book.kubebuilder.io/)
???
:EN:- Implementing an operator with kubebuilder
:FR:- Implémenter un opérateur avec kubebuilder
.debug[[k8s/kubebuilder.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kubebuilder.md)]
---
class: pic
.interstitial[]
---
name: toc-sealed-secrets
class: title
Sealed Secrets
.nav[
[Previous part](#toc-kubebuilder)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-policy-management-with-kyverno)
]
.debug[(automatically generated title slide)]
---
# Sealed Secrets
- Kubernetes provides the "Secret" resource to store credentials, keys, passwords ...
- Secrets can be protected with RBAC
(e.g. "you can write secrets, but only the app's service account can read them")
- [Sealed Secrets](https://github.com/bitnami-labs/sealed-secrets) is an operator that lets us store secrets in code repositories
- It uses asymmetric cryptography:
- anyone can *encrypt* a secret
- only the cluster can *decrypt* a secret
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## Principle
- The Sealed Secrets operator uses a *public* and a *private* key
- The public key is available publicly (duh!)
- We use the public key to encrypt secrets into a SealedSecret resource
- the SealedSecret resource can be stored in a code repo (even a public one)
- The SealedSecret resource is `kubectl apply`'d to the cluster
- The Sealed Secrets controller decrypts the SealedSecret with the private key
(this creates a classic Secret resource)
- Nobody else can decrypt secrets, since only the controller has the private key
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## In action
- We will install the Sealed Secrets operator
- We will generate a Secret
- We will "seal" that Secret (generate a SealedSecret)
- We will load that SealedSecret on the cluster
- We will check that we now have a Secret
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## Installing the operator
- The official installation is done through a single YAML file
- There is also a Helm chart if you prefer that (see next slide!)
.lab[
- Install the operator:
.small[
```bash
kubectl apply -f \
https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.17.5/controller.yaml
```
]
]
Note: it installs into `kube-system` by default.
If you change that, you will also need to inform `kubeseal` later on.
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
class: extra-details
## Installing with Helm
- The Sealed Secrets controller can be installed like this:
```bash
helm install --repo https://bitnami-labs.github.io/sealed-secrets/ \
sealed-secrets-controller sealed-secrets --namespace kube-system
```
- Make sure to install in the `kube-system` Namespace
- Make sure that the release is named `sealed-secrets-controller`
(or pass a `--controller-name` option to `kubeseal` later)
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## Creating a Secret
- Let's create a normal (unencrypted) secret
.lab[
- Create a Secret with a couple of API tokens:
```bash
kubectl create secret generic awskey \
--from-literal=AWS_ACCESS_KEY_ID=AKI... \
--from-literal=AWS_SECRET_ACCESS_KEY=abc123xyz... \
--dry-run=client -o yaml > secret-aws.yaml
```
]
- Note the `--dry-run` and `-o yaml`
(we're just generating YAML, not sending the secrets to our Kubernetes cluster)
- We could also write the YAML from scratch or generate it with other tools
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## Creating a Sealed Secret
- This is done with the `kubeseal` tool
- It will obtain the public key from the cluster
.lab[
- Create the Sealed Secret:
```bash
kubeseal < secret-aws.yaml > sealed-secret-aws.json
```
]
- The file `sealed-secret-aws.json` can be committed to your public repo
(if you prefer YAML output, you can add `-o yaml`)
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## Using a Sealed Secret
- Now let's `kubectl apply` that Sealed Secret to the cluster
- The Sealed Secret controller will "unseal" it for us
.lab[
- Check that our Secret doesn't exist (yet):
```bash
kubectl get secrets
```
- Load the Sealed Secret into the cluster:
```bash
kubectl create -f sealed-secret-aws.json
```
- Check that the secret is now available:
```bash
kubectl get secrets
```
]
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## Tweaking secrets
- Let's see what happens if we try to rename the Secret
(or use it in a different namespace)
.lab[
- Delete both the Secret and the SealedSecret
- Edit `sealed-secret-aws.json`
- Change the name of the secret, or its namespace
(both in the SealedSecret metadata and in the Secret template)
- `kubectl apply -f` the new JSON file and observe the results 🤔
]
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## Sealed Secrets are *scoped*
- A SealedSecret cannot be renamed or moved to another namespace
(at least, not by default!)
- Otherwise, it would make it possible to evade RBAC rules:
- if I can view Secrets in namespace `myapp` but not in namespace `yourapp`
- I could take a SealedSecret belonging to namespace `yourapp`
- ... and deploy it in `myapp`
- ... and view the resulting decrypted Secret!
- This can be changed with `--scope namespace-wide` or `--scope cluster-wide`
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## Working offline
- We can obtain the public key from the server
(technically, as a PEM certificate)
- Then we can use that public key offline
(without contacting the server)
- Relevant commands:
`kubeseal --fetch-cert > seal.pem`
`kubeseal --cert seal.pem < secret.yaml > sealedsecret.json`
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## Key rotation
- The controller generates new keys every month by default
- The keys are kept as TLS Secrets in the `kube-system` namespace
(named `sealed-secrets-keyXXXXX`)
- When keys are "rotated", old decryption keys are kept
(otherwise we can't decrypt previously-generated SealedSecrets)
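We can see this by listing the corresponding Secrets:
```bash
kubectl --namespace kube-system get secrets | grep sealed-secrets-key
```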
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## Key compromise
- If the *sealing* key (obtained with `--fetch-cert`) is compromised:
*we don't need to do anything (it's a public key!)*
- However, if the *unsealing* key (the TLS secret in `kube-system`) is compromised ...
*we need to:*
- rotate the key
- rotate the SealedSecrets that were encrypted with that key
(as they are compromised)
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## Rotating the key
- By default, new keys are generated every 30 days
- To force the generation of a new key "right now":
- obtain an RFC1123 timestamp with `date -R`
- edit Deployment `sealed-secrets-controller` (in `kube-system`)
- add `--key-cutoff-time=TIMESTAMP` to the command-line
- *Then*, rotate the SealedSecrets that were encrypted with it
(generate new Secrets, then encrypt them with the new key)
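Here is a sketch of these steps (the timestamp below is just an example):
```bash
# Obtain an RFC1123 timestamp
date -R

# Edit the Deployment, and add e.g. --key-cutoff-time="Tue, 01 Jan 2030 00:00:00 +0000"
# to the controller's arguments
kubectl --namespace kube-system edit deployment sealed-secrets-controller
```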
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## Discussion (the good)
- The footprint of the operator is rather small:
- only one CRD
- one Deployment, one Service
- a few RBAC-related objects
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## Discussion (the less good)
- Events could be improved
- `no key to decrypt secret` when there is a name/namespace mismatch
- no event indicating that a SealedSecret was successfully unsealed
- Key rotation could be improved (how to find secrets corresponding to a key?)
- If the sealing keys are lost, it's impossible to unseal the SealedSecrets
(e.g. cluster reinstall)
- ... Which means that we need to back up the sealing keys
- ... Which means that we need to be super careful with these backups!
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
## Other approaches
- [Kamus](https://kamus.soluto.io/) ([git](https://github.com/Soluto/kamus)) offers "zero-trust" secrets
(the cluster cannot decrypt secrets; only the application can decrypt them)
- [Vault](https://learn.hashicorp.com/tutorials/vault/kubernetes-sidecar?in=vault/kubernetes) can do ... a lot
- dynamic secrets (generated on the fly for a consumer)
- certificate management
- integration outside of Kubernetes
- and much more!
???
:EN:- The Sealed Secrets Operator
:FR:- L'opérateur *Sealed Secrets*
.debug[[k8s/sealed-secrets.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/sealed-secrets.md)]
---
class: pic
.interstitial[]
---
name: toc-policy-management-with-kyverno
class: title
Policy Management with Kyverno
.nav[
[Previous part](#toc-sealed-secrets)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-an-elasticsearch-operator)
]
.debug[(automatically generated title slide)]
---
# Policy Management with Kyverno
- The Kubernetes permission management system is very flexible ...
- ... But it can't express *everything!*
- Examples:
- forbid using `:latest` image tag
- enforce that each Deployment, Service, etc. has an `owner` label
(except in e.g. `kube-system`)
- enforce that each container has at least a `readinessProbe` healthcheck
- How can we address that, and express these more complex *policies?*
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Admission control
- The Kubernetes API server provides a generic mechanism called *admission control*
- Admission controllers will examine each write request, and can:
- approve/deny it (for *validating* admission controllers)
- additionally *update* the object (for *mutating* admission controllers)
- These admission controllers can be:
- plug-ins built into the Kubernetes API server
(selectively enabled/disabled by e.g. command-line flags)
- webhooks registered dynamically with the Kubernetes API server
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## What's Kyverno?
- Policy management solution for Kubernetes
- Open source (https://github.com/kyverno/kyverno/)
- Compatible with all clusters
(doesn't require to reconfigure the control plane, enable feature gates...)
- We don't endorse / support it in a particular way, but we think it's cool
- It's not the only solution!
(see e.g. [Open Policy Agent](https://www.openpolicyagent.org/docs/v0.12.2/kubernetes-admission-control/) or [Validating Admission Policies](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/))
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## What can Kyverno do?
- *Validate* resource manifests
(accept/deny depending on whether they conform to our policies)
- *Mutate* resources when they get created or updated
(to add/remove/change fields on the fly)
- *Generate* additional resources when a resource gets created
(e.g. when a namespace is created, automatically add quotas and limits)
- *Audit* existing resources
(warn about resources that violate certain policies)
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## How does it do it?
- Kyverno is implemented as a *controller* or *operator*
- It typically runs as a Deployment on our cluster
- Policies are defined as *custom resource definitions*
- They are implemented with a set of *dynamic admission control webhooks*
--
🤔
--
- Let's unpack that!
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Custom resource definitions
- When we install Kyverno, it will register new resource types:
- Policy and ClusterPolicy (per-namespace and cluster-scope policies)
- PolicyReport and ClusterPolicyReport (used in audit mode)
- GenerateRequest (used internally when generating resources asynchronously)
- We will be able to do e.g. `kubectl get clusterpolicyreports --all-namespaces`
(to see policy violations across all namespaces)
- Policies will be defined in YAML and registered/updated with e.g. `kubectl apply`
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Dynamic admission control webhooks
- When we install Kyverno, it will register a few webhooks for its use
(by creating ValidatingWebhookConfiguration and MutatingWebhookConfiguration resources)
- All subsequent resource modifications are submitted to these webhooks
(creations, updates, deletions)
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Controller
- When we install Kyverno, it creates a Deployment (and therefore, a Pod)
- That Pod runs the server used by the webhooks
- It also runs a controller that will:
- run checks in the background (and generate PolicyReport objects)
- process GenerateRequest objects asynchronously
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Kyverno in action
- We're going to install Kyverno on our cluster
- Then, we will use it to implement a few policies
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Installing Kyverno
The recommended [installation method][install-kyverno] is to use Helm charts.
(It's also possible to install with a single YAML manifest.)
.lab[
- Install Kyverno:
```bash
helm upgrade --install --repo https://kyverno.github.io/kyverno/ \
--namespace kyverno --create-namespace kyverno kyverno
```
]
[install-kyverno]: https://kyverno.io/docs/installation/methods/
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Kyverno policies in a nutshell
- Which resources does it *select?*
- can specify resources to *match* and/or *exclude*
- can specify *kinds* and/or *selector* and/or users/roles doing the action
- Which operation should be done?
- validate, mutate, or generate
- For validation, whether it should *enforce* or *audit* failures
- Operation details (what exactly to validate, mutate, or generate)
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Painting pods
- As an example, we'll implement a policy regarding "Pod color"
- The color of a Pod is the value of the label `color`
- Example: `kubectl label pod hello color=yellow` to paint a Pod in yellow
- We want to implement the following policies:
- color is optional (i.e. the label is not required)
- if color is set, it *must* be `red`, `green`, or `blue`
- once the color has been set, it cannot be changed
- once the color has been set, it cannot be removed
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Immutable primary colors, take 1
- First, we will add a policy to block forbidden colors
(i.e. only allow `red`, `green`, or `blue`)
- One possible approach:
- *match* all pods that have a `color` label that is not `red`, `green`, or `blue`
- *deny* these pods
- We could also *match* all pods, then *deny* with a condition
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
.small[
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: pod-color-policy-1
spec:
validationFailureAction: Enforce
rules:
- name: ensure-pod-color-is-valid
match:
resources:
kinds:
- Pod
selector:
matchExpressions:
- key: color
operator: Exists
- key: color
operator: NotIn
values: [ red, green, blue ]
validate:
message: "If it exists, the label color must be red, green, or blue."
deny: {}
```
]
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Testing without the policy
- First, let's create a pod with an "invalid" label
(while we still can!)
- We will use this later
.lab[
- Create a pod:
```bash
kubectl run test-color-0 --image=nginx
```
- Apply a color label:
```bash
kubectl label pod test-color-0 color=purple
```
]
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Load and try the policy
.lab[
- Load the policy:
```bash
kubectl apply -f ~/container.training/k8s/kyverno-pod-color-1.yaml
```
- Create a pod:
```bash
kubectl run test-color-1 --image=nginx
```
- Try to apply a few color labels:
```bash
kubectl label pod test-color-1 color=purple
kubectl label pod test-color-1 color=red
kubectl label pod test-color-1 color-
```
]
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Immutable primary colors, take 2
- Next rule: once a `color` label has been added, it cannot be changed
(i.e. if `color=red`, we can't change it to `color=blue`)
- Our approach:
- *match* all pods
- add a *precondition* matching pods that have a `color` label
(both in their "before" and "after" states)
- *deny* these pods if their `color` label has changed
- Again, other approaches are possible!
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
.small[
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: pod-color-policy-2
spec:
validationFailureAction: Enforce
background: false
rules:
- name: prevent-color-change
match:
resources:
kinds:
- Pod
preconditions:
- key: "{{ request.operation }}"
operator: Equals
value: UPDATE
- key: "{{ request.oldObject.metadata.labels.color || '' }}"
operator: NotEquals
value: ""
- key: "{{ request.object.metadata.labels.color || '' }}"
operator: NotEquals
value: ""
validate:
message: "Once label color has been added, it cannot be changed."
deny:
conditions:
- key: "{{ request.object.metadata.labels.color }}"
operator: NotEquals
value: "{{ request.oldObject.metadata.labels.color }}"
```
]
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Comparing "old" and "new"
- The fields of the webhook payload are available through `{{ request }}`
- For UPDATE requests, we can access:
`{{ request.oldObject }}` → the object as it is right now (before the request)
`{{ request.object }}` → the object with the changes made by the request
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Missing labels
- We can access the `color` label through `{{ request.object.metadata.labels.color }}`
- If we reference a label (or any field) that doesn't exist, the policy fails
(with an error similar to `JMESPath query failed: Unknown key ... in path`)
- To work around that, [use an OR expression][non-existence-checks]:
`{{ request.object.metadata.labels.color || '' }}`
- Note that in older versions of Kyverno, this wasn't always necessary
(e.g. in *preconditions*, a missing label would evaluate to an empty string)
[non-existence-checks]: https://kyverno.io/docs/writing-policies/jmespath/#non-existence-checks
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Load and try the policy
.lab[
- Load the policy:
```bash
kubectl apply -f ~/container.training/k8s/kyverno-pod-color-2.yaml
```
- Create a pod:
```bash
kubectl run test-color-2 --image=nginx
```
- Try to apply a few color labels:
```bash
kubectl label pod test-color-2 color=purple
kubectl label pod test-color-2 color=red
kubectl label pod test-color-2 color=blue --overwrite
```
]
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## `background`
- What is this `background: false` option, and why do we need it?
--
- Admission controllers are only invoked when we change an object
- Existing objects are not affected
(e.g. if we have a pod with `color=pink` *before* installing our policy)
- Kyverno can also run checks in the background, and report violations
(we'll see later how they are reported)
- `background: false` disables that
--
- Alright, but ... *why* do we need it?
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Accessing `AdmissionRequest` context
- In this specific policy, we want to prevent an *update*
(as opposed to a mere *create* operation)
- We want to compare the *old* and *new* version
(to check if a specific label was removed)
- The `AdmissionRequest` object has `object` and `oldObject` fields
(the `AdmissionRequest` object is the thing that gets submitted to the webhook)
- We access the `AdmissionRequest` object through `{{ request }}`
--
- Alright, but ... what's the link with `background: false`?
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## `{{ request }}`
- The `{{ request }}` context is only available when there is an `AdmissionRequest`
- When a resource is "at rest", there is no `{{ request }}` (and no old/new)
- Therefore, a policy that uses `{{ request }}` cannot validate existing objects
(it can only be used when an object is actually created/updated/deleted)
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Immutable primary colors, take 3
- Last rule: once a `color` label has been added, it cannot be removed
- Our approach is to match all pods that:
- *had* a `color` label (in `request.oldObject`)
- *don't have* a `color` label (in `request.object`)
- And *deny* these pods
- Again, other approaches are possible!
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
.small[
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: pod-color-policy-3
spec:
validationFailureAction: Enforce
background: false
rules:
- name: prevent-color-change
match:
resources:
kinds:
- Pod
preconditions:
- key: "{{ request.operation }}"
operator: Equals
value: UPDATE
- key: "{{ request.oldObject.metadata.labels.color || '' }}"
operator: NotEquals
value: ""
- key: "{{ request.object.metadata.labels.color || '' }}"
operator: Equals
value: ""
validate:
message: "Once label color has been added, it cannot be removed."
deny:
conditions:
```
]
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Load and try the policy
.lab[
- Load the policy:
```bash
kubectl apply -f ~/container.training/k8s/kyverno-pod-color-3.yaml
```
- Create a pod:
```bash
kubectl run test-color-3 --image=nginx
```
- Try to apply a few color labels:
```bash
kubectl label pod test-color-3 color=purple
kubectl label pod test-color-3 color=red
kubectl label pod test-color-3 color-
```
]
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Background checks
- What about the `test-color-0` pod that we created initially?
(remember: we did set `color=purple`)
- We can see the infringing Pod in a PolicyReport
.lab[
- Check that the pod still has an "invalid" color:
```bash
kubectl get pods -L color
```
- List PolicyReports:
```bash
kubectl get policyreports
kubectl get polr
```
]
(Sometimes it takes a little while for the infringement to show up, though.)
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Generating objects
- When we create a Namespace, we also want to automatically create:
- a LimitRange (to set default CPU and RAM requests and limits)
- a ResourceQuota (to limit the resources used by the namespace)
- a NetworkPolicy (to isolate the namespace)
- We can do that with a Kyverno policy with a *generate* action
(it is mutually exclusive with the *validate* action)
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Overview
- The *generate* action must specify:
- the `kind` of resource to generate
- the `name` of the resource to generate
- its `namespace`, when applicable
- *either* a `data` structure, to be used to populate the resource
- *or* a `clone` reference, to copy an existing resource
Note: the `apiVersion` field appears to be optional.
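As a rough sketch (names and content are placeholders; the actual policy we'll use is shown on the next slide), a rule generating a ConfigMap from inline `data` could look like this:
```yaml
generate:
  kind: ConfigMap
  name: generated-config                        # placeholder name
  namespace: "{{request.object.metadata.name}}"
  data:                                         # body of the resource to create
    data:
      hello: world
```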
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## In practice
- We will use the policy [k8s/kyverno-namespace-setup.yaml](https://github.com/jpetazzo/container.training/tree/master/k8s/kyverno-namespace-setup.yaml)
- We need to generate 3 resources, so we have 3 rules in the policy
- Excerpt:
```yaml
generate:
kind: LimitRange
name: default-limitrange
namespace: "{{request.object.metadata.name}}"
data:
spec:
limits:
```
- Note that we have to specify the `namespace`
(and we infer it from the name of the resource being created, i.e. the Namespace)
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Lifecycle
- After generated objects have been created, we can change them
(Kyverno won't update them)
- Except if we use `clone` together with the `synchronize` flag
(in that case, Kyverno will watch the cloned resource)
- This is convenient for e.g. ConfigMaps shared between Namespaces
- Objects are generated only at *creation* (not when updating an old object)
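For example, a rule cloning a ConfigMap from the `default` Namespace into each new Namespace, and keeping it in sync, could look like this (names are placeholders):
```yaml
generate:
  kind: ConfigMap
  name: shared-config                           # placeholder name
  namespace: "{{request.object.metadata.name}}"
  synchronize: true      # propagate future changes of the source to the copies
  clone:
    namespace: default
    name: shared-config                         # placeholder source ConfigMap
```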
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
class: extra-details
## Managing `ownerReferences`
- By default, the generated object and triggering object have independent lifecycles
(deleting the triggering object doesn't affect the generated object)
- It is possible to associate the generated object with the triggering object
(so that deleting the triggering object also deletes the generated object)
- This is done by adding the triggering object information to `ownerReferences`
(in the generated object `metadata`)
- See [Linking resources with ownerReferences][ownerref] for an example
[ownerref]: https://kyverno.io/docs/writing-policies/generate/#linking-trigger-with-downstream
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Asynchronous creation
- Kyverno creates resources asynchronously
(by creating a GenerateRequest resource first)
- This is useful when the resource cannot be created
(because of permissions or dependency issues)
- Kyverno will periodically loop through the pending GenerateRequests
- Once the resource is created, the GenerateRequest is marked as Completed
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Footprint
- 8 CRDs
- 5 webhooks
- 2 Services, 1 Deployment, 2 ConfigMaps
- Internal resources (GenerateRequest) "parked" in a Namespace
- Kyverno packs a lot of features in a small footprint
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Strengths
- Kyverno is very easy to install
(it's hard to get easier than a single `kubectl apply -f`)
- The setup of the webhooks is fully automated
(including certificate generation)
- It offers both namespaced and cluster-scope policies
- The policy language leverages existing constructs
(e.g. `matchExpressions`)
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
## Caveats
- The `{{ request }}` context is powerful, but difficult to validate
(Kyverno can't know ahead of time how it will be populated)
- Advanced policies (with conditionals) have unique, exotic syntax:
```yaml
spec:
=(volumes):
=(hostPath):
path: "!/var/run/docker.sock"
```
- Writing and validating policies can be difficult
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
class: extra-details
## Pods created by controllers
- When e.g. a ReplicaSet or DaemonSet creates a pod, it "owns" it
(the ReplicaSet or DaemonSet is listed in the Pod's `.metadata.ownerReferences`)
- Kyverno treats these Pods differently
- If my understanding of the code is correct (big *if*):
- it skips validation for "owned" Pods
- instead, it validates their controllers
- this way, Kyverno can report errors on the controller instead of the pod
- This can be a bit confusing when testing policies on such pods!
???
:EN:- Policy Management with Kyverno
:FR:- Gestion de *policies* avec Kyverno
.debug[[k8s/kyverno.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/kyverno.md)]
---
class: pic
.interstitial[]
---
name: toc-an-elasticsearch-operator
class: title
An ElasticSearch Operator
.nav[
[Previous part](#toc-policy-management-with-kyverno)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-finalizers)
]
.debug[(automatically generated title slide)]
---
# An ElasticSearch Operator
- We will install [Elastic Cloud on Kubernetes](https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-quickstart.html), an ElasticSearch operator
- This operator requires PersistentVolumes
- We will install Rancher's [local path storage provisioner](https://github.com/rancher/local-path-provisioner) to automatically create these
- Then, we will create an ElasticSearch resource
- The operator will detect that resource and provision the cluster
- We will integrate that ElasticSearch cluster with other resources
(Kibana, Filebeat, Cerebro ...)
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Installing a Persistent Volume provisioner
(This step can be skipped if you already have a dynamic volume provisioner.)
- This provisioner creates Persistent Volumes backed by `hostPath`
(local directories on our nodes)
- It doesn't require anything special ...
- ... But losing a node = losing the volumes on that node!
.lab[
- Install the local path storage provisioner:
```bash
kubectl apply -f ~/container.training/k8s/local-path-storage.yaml
```
]
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Making sure we have a default StorageClass
- The ElasticSearch operator will create StatefulSets
- These StatefulSets will instantiate PersistentVolumeClaims
- These PVCs need to be explicitly associated with a StorageClass
- Or we need to tag a StorageClass to be used as the default one
.lab[
- List StorageClasses:
```bash
kubectl get storageclasses
```
]
We should see the `local-path` StorageClass.
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Setting a default StorageClass
- This is done by adding an annotation to the StorageClass:
`storageclass.kubernetes.io/is-default-class: true`
.lab[
- Tag the StorageClass so that it's the default one:
```bash
kubectl annotate storageclass local-path \
storageclass.kubernetes.io/is-default-class=true
```
- Check the result:
```bash
kubectl get storageclasses
```
]
Now, the StorageClass should have `(default)` next to its name.
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Install the ElasticSearch operator
- The operator provides:
- a few CustomResourceDefinitions
- a Namespace for its other resources
- a ValidatingWebhookConfiguration for type checking
- a StatefulSet for its controller and webhook code
- a ServiceAccount, ClusterRole, ClusterRoleBinding for permissions
- All these resources are grouped in a convenient YAML file
.lab[
- Install the operator:
```bash
kubectl apply -f ~/container.training/k8s/eck-operator.yaml
```
]
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Check our new custom resources
- Let's see which CRDs were created
.lab[
- List all CRDs:
```bash
kubectl get crds
```
]
This operator supports ElasticSearch, but also Kibana and APM. Cool!
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Create the `eck-demo` namespace
- For clarity, we will create everything in a new namespace, `eck-demo`
- This namespace is hard-coded in the YAML files that we are going to use
- We need to create that namespace
.lab[
- Create the `eck-demo` namespace:
```bash
kubectl create namespace eck-demo
```
- Switch to that namespace:
```bash
kns eck-demo
```
]
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
class: extra-details
## Can we use a different namespace?
Yes, but then we need to update all the YAML manifests that we
are going to apply in the next slides.
The `eck-demo` namespace is hard-coded in these YAML manifests.
Why?
Because when defining a ClusterRoleBinding that references a
ServiceAccount, we have to indicate in which namespace the
ServiceAccount is located.
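Here is the kind of construct that pins the namespace (a simplified, hypothetical excerpt):
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: example-binding             # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: example-cluster-role        # hypothetical name
subjects:
- kind: ServiceAccount
  name: example-service-account     # hypothetical name
  namespace: eck-demo               # ← this is why the namespace is hard-coded
```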
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Create an ElasticSearch resource
- We can now create a resource with `kind: Elasticsearch`
- The YAML for that resource will specify all the desired parameters:
- how many nodes we want
- image to use
- add-ons (kibana, cerebro, ...)
- whether to use TLS or not
- etc.
.lab[
- Create our ElasticSearch cluster:
```bash
kubectl apply -f ~/container.training/k8s/eck-elasticsearch.yaml
```
]
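For reference, a minimal manifest of that kind could look like the sketch below (the actual file in the repository may set more fields; the version number is arbitrary):
```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: demo
spec:
  version: 7.17.0                    # arbitrary version for this sketch
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false   # common setting for small/dev environments
```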
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Operator in action
- Over the next minutes, the operator will create our ES cluster
- It will report our cluster status through the CRD
.lab[
- Check the logs of the operator:
```bash
stern --namespace=elastic-system operator
```
- Watch the status of the cluster through the CRD:
```bash
kubectl get es -w
```
]
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Connecting to our cluster
- It's not easy to use the ElasticSearch API from the shell
- But let's check at least if ElasticSearch is up!
.lab[
- Get the ClusterIP of our ES instance:
```bash
kubectl get services
```
- Issue a request with `curl`:
```bash
curl http://`CLUSTERIP`:9200
```
]
We get an authentication error. Our cluster is protected!
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Obtaining the credentials
- The operator creates a user named `elastic`
- It generates a random password and stores it in a Secret
.lab[
- Extract the password:
```bash
kubectl get secret demo-es-elastic-user \
-o go-template="{{ .data.elastic | base64decode }} "
```
- Use it to connect to the API:
```bash
curl -u elastic:`PASSWORD` http://`CLUSTERIP`:9200
```
]
We should see a JSON payload with the `"You Know, for Search"` tagline.
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Sending data to the cluster
- Let's send some data to our brand new ElasticSearch cluster!
- We'll deploy a filebeat DaemonSet to collect node logs
.lab[
- Deploy filebeat:
```bash
kubectl apply -f ~/container.training/k8s/eck-filebeat.yaml
```
- Wait until some pods are up:
```bash
watch kubectl get pods -l k8s-app=filebeat
```
- Check that a filebeat index was created:
```bash
curl -u elastic:`PASSWORD` http://`CLUSTERIP`:9200/_cat/indices
```
]
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Deploying an instance of Kibana
- Kibana can visualize the logs injected by filebeat
- The ECK operator can also manage Kibana
- Let's give it a try!
.lab[
- Deploy a Kibana instance:
```bash
kubectl apply -f ~/container.training/k8s/eck-kibana.yaml
```
- Wait for it to be ready:
```bash
kubectl get kibana -w
```
]
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Connecting to Kibana
- Kibana is automatically set up to connect to ElasticSearch
(this is arranged by the YAML that we're using)
- However, it will ask for authentication
- It's using the same user/password as ElasticSearch
.lab[
- Get the NodePort allocated to Kibana:
```bash
kubectl get services
```
- Connect to it with a web browser
- Use the same user/password as before
]
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Setting up Kibana
After the Kibana UI loads, we need to click around a bit
.lab[
- Pick "explore on my own"
- Click on "Use Elasticsearch data / Connect to your Elasticsearch index"
- Enter `filebeat-*` for the index pattern and click "Next step"
- Select `@timestamp` as time filter field name
- Click on "discover" (the small icon looking like a compass on the left bar)
- Play around!
]
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Scaling up the cluster
- At this point, we have only one node
- We are going to scale up
- But first, we'll deploy Cerebro, a UI for ElasticSearch
- This will let us see the state of the cluster, how indexes are sharded, etc.
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Deploying Cerebro
- Cerebro is stateless, so it's fairly easy to deploy
(one Deployment + one Service)
- However, it needs the address and credentials for ElasticSearch
- We prepared yet another manifest for that!
.lab[
- Deploy Cerebro:
```bash
kubectl apply -f ~/container.training/k8s/eck-cerebro.yaml
```
- Lookup the NodePort number and connect to it:
```bash
kubectl get services
```
]
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
## Scaling up the cluster
- We can see on Cerebro that the cluster is "yellow"
(because our index is not replicated)
- Let's change that!
.lab[
- Edit the ElasticSearch cluster manifest:
```bash
kubectl edit es demo
```
- Find the field `count: 1` and change it to 3
- Save and quit
]
???
:EN:- Deploying ElasticSearch with ECK
:FR:- Déployer ElasticSearch avec ECK
.debug[[k8s/eck.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/eck.md)]
---
class: pic
.interstitial[]
---
name: toc-finalizers
class: title
Finalizers
.nav[
[Previous part](#toc-an-elasticsearch-operator)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-owners-and-dependents)
]
.debug[(automatically generated title slide)]
---
# Finalizers
- Sometimes, we.red[¹] want to prevent a resource from being deleted:
- perhaps it's "precious" (holds important data)
- perhaps other resources depend on it (and should be deleted first)
- perhaps we need to perform some clean up before it's deleted
- *Finalizers* are a way to do that!
.footnote[.red[¹]The "we" in that sentence generally stands for a controller.
(We can also use finalizers directly ourselves, but it's not very common.)]
.debug[[k8s/finalizers.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/finalizers.md)]
---
## Examples
- Prevent deletion of a PersistentVolumeClaim which is used by a Pod
- Prevent deletion of a PersistentVolume which is bound to a PersistentVolumeClaim
- Prevent deletion of a Namespace that still contains objects
- When a LoadBalancer Service is deleted, make sure that the corresponding external resource (e.g. NLB, GLB, etc.) gets deleted.red[¹]
- When a CRD gets deleted, make sure that all the associated resources get deleted.red[²]
.footnote[.red[¹²]Finalizers are not the only solution for these use-cases.]
.debug[[k8s/finalizers.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/finalizers.md)]
---
## How do they work?
- Each resource can have a list of `finalizers` in its `metadata`, e.g.:
```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: my-pvc
annotations: ...
finalizers:
- kubernetes.io/pvc-protection
```
- If we try to delete a resource that has at least one finalizer:
- the resource is *not* deleted
- instead, its `deletionTimestamp` is set to the current time
- we are merely *marking the resource for deletion*
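To see the mechanism from the outside, we can inspect an object's finalizers and `deletionTimestamp`; and, as a last resort, clear the finalizers ourselves (the PVC name below is a placeholder):
```bash
# Show the finalizers and the deletionTimestamp of a PVC
kubectl get pvc my-pvc -o jsonpath='{.metadata.finalizers}{"\n"}{.metadata.deletionTimestamp}{"\n"}'

# Last-resort cleanup: remove all finalizers so that deletion can proceed
# (only do this if we understand why the controller isn't removing them itself!)
kubectl patch pvc my-pvc --type=merge -p '{"metadata":{"finalizers":null}}'
```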
.debug[[k8s/finalizers.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/finalizers.md)]
---
## What happens next?
- The controller that added the finalizer is supposed to:
- watch for resources with a `deletionTimestamp`
- execute necessary clean-up actions
- then remove the finalizer
- The resource is deleted once all the finalizers have been removed
(there is no timeout, so this could take forever)
- Until then, the resource can be used normally
(but no further finalizer can be *added* to the resource)
.debug[[k8s/finalizers.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/finalizers.md)]
---
## Finalizers in review
Let's review the examples mentioned earlier.
For each of them, we'll see if there are other (perhaps better) options.
.debug[[k8s/finalizers.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/finalizers.md)]
---
## Volume finalizer
- Kubernetes applies the following finalizers:
- `kubernetes.io/pvc-protection` on PersistentVolumeClaims
- `kubernetes.io/pv-protection` on PersistentVolumes
- This prevents removing them when they are in use
- Implementation detail: the finalizer is present *even when the resource is not in use*
- When the resource is ~~deleted~~ marked for deletion, the controller will check if the finalizer can be removed
(Perhaps to avoid race conditions?)
.debug[[k8s/finalizers.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/finalizers.md)]
---
## Namespace finalizer
- Kubernetes applies a finalizer named `kubernetes`
- It prevents removing the namespace if it still contains objects
- *Can we remove the namespace anyway?*
- remove the finalizer
- delete the namespace
- force deletion
- It *seems to work* but, in fact, the objects in the namespace still exist
(and they will re-appear if we re-create the namespace)
See [this blog post](https://www.openshift.com/blog/the-hidden-dangers-of-terminating-namespaces) for more details about this.
.debug[[k8s/finalizers.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/finalizers.md)]
---
## LoadBalancer finalizer
- Scenario:
We run a custom controller to implement provisioning of LoadBalancer Services.
When a Service with type=LoadBalancer is deleted, we want to make sure
that the corresponding external resources are properly deleted.
- Rationale for using a finalizer:
Normally, we would watch and observe the deletion of the Service;
but if the Service is deleted while our controller is down,
we could "miss" the deletion and forget to clean up the external resource.
The finalizer ensures that we will "see" the deletion
and clean up the external resource.
.debug[[k8s/finalizers.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/finalizers.md)]
---
## Counterpoint
- We could also:
- Tag the external resources
(to indicate which Kubernetes Service they correspond to)
- Periodically reconcile them against Kubernetes resources
- If a Kubernetes resource no longer exists, delete the external resource
- This doesn't have to be a *pre-delete* hook
(unless we store important information in the Service, e.g. as annotations)
.debug[[k8s/finalizers.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/finalizers.md)]
---
## CRD finalizer
- Scenario:
We have a CRD that represents a PostgreSQL cluster.
It provisions StatefulSets, Deployments, Services, Secrets, ConfigMaps.
When the CRD is deleted, we want to delete all these resources.
- Rationale for using a finalizer:
Same as previously; we could observe the CRD, but if it is deleted
while the controller isn't running, we would miss the deletion,
and the other resources would keep running.
.debug[[k8s/finalizers.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/finalizers.md)]
---
## Counterpoint
- We could use the same technique as described before
(tag the resources with e.g. annotations, to associate them with the CRD)
- Even better: we could use `ownerReferences`
(this feature is *specifically* designed for that use-case!)
.debug[[k8s/finalizers.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/finalizers.md)]
---
## CRD finalizer (take two)
- Scenario:
We have a CRD that represents a PostgreSQL cluster.
It provisions StatefulSets, Deployments, Services, Secrets, ConfigMaps.
When the CRD is deleted, we want to delete all these resources.
We also want to store a final backup of the database.
We also want to update final usage metrics (e.g. for billing purposes).
- Rationale for using a finalizer:
We need to take some actions *before* the resources get deleted, not *after*.
.debug[[k8s/finalizers.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/finalizers.md)]
---
## Wrapping up
- Finalizers are a great way to:
- prevent deletion of a resource that is still in use
- have a "guaranteed" pre-delete hook
- They can also be (ab)used for other purposes
- Code spelunking exercise:
*check where finalizers are used in the Kubernetes code base and why!*
???
:EN:- Using "finalizers" to manage resource lifecycle
:FR:- Gérer le cycle de vie des ressources avec les *finalizers*
.debug[[k8s/finalizers.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/finalizers.md)]
---
class: pic
.interstitial[]
---
name: toc-owners-and-dependents
class: title
Owners and dependents
.nav[
[Previous part](#toc-finalizers)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-events)
]
.debug[(automatically generated title slide)]
---
# Owners and dependents
- Some objects are created by other objects
(example: pods created by replica sets, themselves created by deployments)
- When an *owner* object is deleted, its *dependents* are deleted
(this is the default behavior; it can be changed)
- We can delete a dependent directly if we want
(but generally, the owner will recreate another right away)
- An object can have multiple owners
.debug[[k8s/owners-and-dependents.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/owners-and-dependents.md)]
---
## Finding out the owners of an object
- The owners are recorded in the field `ownerReferences` in the `metadata` block
.lab[
- Let's create a deployment running `nginx`:
```bash
kubectl create deployment yanginx --image=nginx
```
- Scale it to a few replicas:
```bash
kubectl scale deployment yanginx --replicas=3
```
- Once it's up, check the corresponding pods:
```bash
kubectl get pods -l app=yanginx -o yaml | head -n 25
```
]
These pods are owned by a ReplicaSet named yanginx-xxxxxxxxxx.
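In that output, the relevant part of each pod's metadata looks roughly like this (the `uid` is a placeholder):
```yaml
metadata:
  name: yanginx-xxxxxxxxxx-xxxxx      # placeholder pod name
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: yanginx-xxxxxxxxxx          # placeholder ReplicaSet name
    uid: 00000000-0000-0000-0000-000000000000
    controller: true
    blockOwnerDeletion: true
```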
.debug[[k8s/owners-and-dependents.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/owners-and-dependents.md)]
---
## Listing objects with their owners
- This is a good opportunity to try the `custom-columns` output!
.lab[
- Show all pods with their owners:
```bash
kubectl get pod -o custom-columns=\
NAME:.metadata.name,\
OWNER-KIND:.metadata.ownerReferences[0].kind,\
OWNER-NAME:.metadata.ownerReferences[0].name
```
]
Note: the `custom-columns` option should be one long option (without spaces),
so the lines should not be indented (otherwise the indentation will insert spaces).
.debug[[k8s/owners-and-dependents.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/owners-and-dependents.md)]
---
## Deletion policy
- When deleting an object through the API, three policies are available:
- foreground (API call returns after all dependents are deleted)
- background (API call returns immediately; dependents are scheduled for deletion)
- orphan (the dependents are not deleted)
- When deleting an object with `kubectl`, this is selected with `--cascade`:
- `--cascade=true` deletes all dependent objects (default)
- `--cascade=false` orphans dependent objects
.debug[[k8s/owners-and-dependents.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/owners-and-dependents.md)]
---
## What happens when an object is deleted
- It is removed from the list of owners of its dependents
- If, for one of these dependents, the list of owners becomes empty ...
- if the policy is "orphan", the object stays
- otherwise, the object is deleted
.debug[[k8s/owners-and-dependents.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/owners-and-dependents.md)]
---
## Orphaning pods
- We are going to delete the Deployment and Replica Set that we created
- ... without deleting the corresponding pods!
.lab[
- Delete the Deployment:
```bash
kubectl delete deployment -l app=yanginx --cascade=false
```
- Delete the Replica Set:
```bash
kubectl delete replicaset -l app=yanginx --cascade=false
```
- Check that the pods are still here:
```bash
kubectl get pods
```
]
.debug[[k8s/owners-and-dependents.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/owners-and-dependents.md)]
---
class: extra-details
## When and why would we have orphans?
- If we remove an owner and explicitly instruct the API to orphan dependents
(like on the previous slide)
- If we change the labels on a dependent, so that it's not selected anymore
(e.g. change the `app: yanginx` in the pods of the previous example)
- If a deployment tool that we're using does these things for us
- If there is a serious problem within API machinery or other components
(i.e. "this should not happen")
.debug[[k8s/owners-and-dependents.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/owners-and-dependents.md)]
---
## Finding orphan objects
- We're going to output all pods in JSON format
- Then we will use `jq` to keep only the ones *without* an owner
- And we will display their name
.lab[
- List all pods that *do not* have an owner:
```bash
kubectl get pod -o json | jq -r "
.items[]
| select(.metadata.ownerReferences|not)
| .metadata.name"
```
]
.debug[[k8s/owners-and-dependents.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/owners-and-dependents.md)]
---
## Deleting orphan pods
- Now that we can list orphan pods, deleting them is easy
.lab[
- Add `| xargs kubectl delete pod` to the previous command:
```bash
kubectl get pod -o json | jq -r "
.items[]
| select(.metadata.ownerReferences|not)
| .metadata.name" | xargs kubectl delete pod
```
]
As always, the [documentation](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/) has useful extra information and pointers.
???
:EN:- Owners and dependents
:FR:- Liens de parenté entre les ressources
.debug[[k8s/owners-and-dependents.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/owners-and-dependents.md)]
---
class: pic
.interstitial[]
---
name: toc-events
class: title
Events
.nav[
[Previous part](#toc-owners-and-dependents)
|
[Back to table of contents](#toc-part-13)
|
[Next part](#toc-building-our-own-cluster-easy)
]
.debug[(automatically generated title slide)]
---
# Events
- Kubernetes has an internal structured log of *events*
- These events are ordinary resources:
- we can view them with `kubectl get events`
- they can be viewed and created through the Kubernetes API
- they are stored in Kubernetes default database (e.g. etcd)
- Most components will generate events to let us know what's going on
- Events can be *related* to other resources
.debug[[k8s/events.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/events.md)]
---
## Reading events
- `kubectl get events` (or `kubectl get ev`)
- Can use `--watch`
⚠️ Looks like `tail -f`, but events aren't necessarily sorted!
- Can use `--all-namespaces`
- Cluster events (e.g. related to nodes) are in the `default` namespace
- Viewing all "non-normal" events:
```bash
kubectl get ev -A --field-selector=type!=Normal
```
(as of Kubernetes 1.19, `type` can be either `Normal` or `Warning`)
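Since the output isn't necessarily sorted, it often helps to sort events explicitly, e.g.:
```bash
# Sort events by creation time
kubectl get events --sort-by=.metadata.creationTimestamp

# Or sort by the last time each event was observed
kubectl get events --sort-by=.lastTimestamp
```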
.debug[[k8s/events.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/events.md)]
---
## Reading events (take 2)
- When we use `kubectl describe` on an object, `kubectl` retrieves the associated events
.lab[
- See the API requests happening when we use `kubectl describe`:
```bash
kubectl describe service kubernetes --namespace=default -v6 >/dev/null
```
]
.debug[[k8s/events.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/events.md)]
---
## Generating events
- This is rarely (if ever) done manually
(i.e. by crafting some YAML)
- But controllers (e.g. operators) need this!
- It's not mandatory, but it helps with *operability*
(e.g. when we `kubectl describe` a CRD, we will see associated events)
.debug[[k8s/events.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/events.md)]
---
## ⚠️ Work in progress
- "Events" can be :
- "old-style" events (in core API group, aka `v1`)
- "new-style" events (in API group `events.k8s.io`)
- See [KEP 383](https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md) in particular this [comparison between old and new APIs](https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md#comparison-between-old-and-new-apis)
.debug[[k8s/events.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/events.md)]
---
## Experimenting with events
- Let's create an event related to a Node, based on [k8s/event-node.yaml](https://github.com/jpetazzo/container.training/tree/master/k8s/event-node.yaml)
.lab[
- Edit `k8s/event-node.yaml`
- Update the `name` and `uid` of the `involvedObject`
- Create the event with `kubectl create -f`
- Look at the Node with `kubectl describe`
]
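We won't reproduce the content of that file here, but a minimal event manifest could look like the sketch below (all values are placeholders; the `name` and `uid` under `involvedObject` must match a real Node):
```yaml
apiVersion: v1
kind: Event
metadata:
  name: hello-node-event            # placeholder name
type: Normal
reason: Hello
message: "Hello from a manually created event!"
involvedObject:
  apiVersion: v1
  kind: Node
  name: node1                       # replace with a real node name
  uid: 00000000-0000-0000-0000-000000000000   # replace with that node's uid
```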
.debug[[k8s/events.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/events.md)]
---
## Experimenting with events
- Let's create an event related to a Pod, based on [k8s/event-pod.yaml](https://github.com/jpetazzo/container.training/tree/master/k8s/event-pod.yaml)
.lab[
- Create a pod
- Edit `k8s/event-pod.yaml`
- Edit the `involvedObject` section (don't forget the `uid`)
- Create the event with `kubectl create -f`
- Look at the Pod with `kubectl describe`
]
.debug[[k8s/events.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/events.md)]
---
## Generating events in practice
- In Go, use an `EventRecorder` provided by the `kubernetes/client-go` library
- [EventRecorder interface](https://github.com/kubernetes/client-go/blob/release-1.19/tools/record/event.go#L87)
- [kubebuilder book example](https://book-v1.book.kubebuilder.io/beyond_basics/creating_events.html)
- It will take care of formatting / aggregating events
- To get an idea of what to put in the `reason` field, check [kubelet events](
https://github.com/kubernetes/kubernetes/blob/release-1.19/pkg/kubelet/events/event.go)
.debug[[k8s/events.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/events.md)]
---
## Cluster operator perspective
- Events are kept 1 hour by default
- This can be changed with the `--event-ttl` flag on the API server
- On very busy clusters, events can be kept on a separate etcd cluster
- This is done with the `--etcd-servers-overrides` flag on the API server
- Example:
```
--etcd-servers-overrides=/events#http://127.0.0.1:12379
```
???
:EN:- Consuming and generating cluster events
:FR:- Suivre l'activité du cluster avec les *events*
.debug[[k8s/events.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/events.md)]
---
class: pic
.interstitial[]
---
name: toc-building-our-own-cluster-easy
class: title
Building our own cluster (easy)
.nav[
[Previous part](#toc-events)
|
[Back to table of contents](#toc-part-14)
|
[Next part](#toc-building-our-own-cluster-medium)
]
.debug[(automatically generated title slide)]
---
# Building our own cluster (easy)
- Let's build our own cluster!
*Perfection is attained not when there is nothing left to add, but when there is nothing left to take away. (Antoine de Saint-Exupery)*
- Our goal is to build a minimal cluster allowing us to:
- create a Deployment (with `kubectl create deployment`)
- expose it with a Service
- connect to that service
- "Minimal" here means:
- smaller number of components
- smaller number of command-line flags
- smaller number of configuration files
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Non-goals
- For now, we don't care about security
- For now, we don't care about scalability
- For now, we don't care about high availability
- All we care about is *simplicity*
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Our environment
- We will use the machine indicated as `monokube1`
- This machine:
- runs Ubuntu LTS
- has Kubernetes, Docker, and etcd binaries installed
- but nothing is running
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## The fine print
- We're going to use a *very old* version of Kubernetes
(specifically, 1.19)
- Why?
- It's much easier to set up than recent versions
- it's compatible with Docker (no need to set up CNI)
- it doesn't require a ServiceAccount keypair
- it can be exposed over plain HTTP (insecure but easier)
- We'll do that, and later, move to recent versions of Kubernetes!
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Checking our environment
- Let's make sure we have everything we need first
.lab[
- Log into the `monokube1` machine
- Get root:
```bash
sudo -i
```
- Check available versions:
```bash
etcd -version
kube-apiserver --version
dockerd --version
```
]
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## The plan
1. Start API server
2. Interact with it (create Deployment and Service)
3. See what's broken
4. Fix it and go back to step 2 until it works!
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Dealing with multiple processes
- We are going to start many processes
- Depending on what you're comfortable with, you can:
- open multiple windows and multiple SSH connections
- use a terminal multiplexer like screen or tmux
- put processes in the background with `&`
(warning: log output might get confusing to read!)
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Starting API server
.lab[
- Try to start the API server:
```bash
kube-apiserver
# It will fail with "--etcd-servers must be specified"
```
]
Since the API server stores everything in etcd,
it cannot start without it.
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Starting etcd
.lab[
- Try to start etcd:
```bash
etcd
```
]
Success!
Note the last line of output:
```
serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
```
*Sure, that's discouraged. But thanks for telling us the address!*
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Starting API server (for real)
- Try again, passing the `--etcd-servers` argument
- That argument should be a comma-separated list of URLs
.lab[
- Start API server:
```bash
kube-apiserver --etcd-servers http://127.0.0.1:2379
```
]
Success!
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Interacting with API server
- Let's try a few "classic" commands
.lab[
- List nodes:
```bash
kubectl get nodes
```
- List services:
```bash
kubectl get services
```
]
We should get `No resources found.` and the `kubernetes` service, respectively.
Note: the API server automatically created the `kubernetes` service entry.
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
class: extra-details
## What about `kubeconfig`?
- We didn't need to create a `kubeconfig` file
- By default, the API server is listening on `localhost:8080`
(without requiring authentication)
- By default, `kubectl` connects to `localhost:8080`
(without providing authentication)
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Creating a Deployment
- Let's run a web server!
.lab[
- Create a Deployment with NGINX:
```bash
kubectl create deployment web --image=nginx
```
]
Success?
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Checking our Deployment status
.lab[
- Look at pods, deployments, etc.:
```bash
kubectl get all
```
]
Our Deployment is in bad shape:
```
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/web 0/1 0 0 2m26s
```
And, there is no ReplicaSet, and no Pod.
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## What's going on?
- We stored the definition of our Deployment in etcd
(through the API server)
- But there is no *controller* to do the rest of the work
- We need to start the *controller manager*
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Starting the controller manager
.lab[
- Try to start the controller manager:
```bash
kube-controller-manager
```
]
The final error message is:
```
invalid configuration: no configuration has been provided
```
But the logs include another useful piece of information:
```
Neither --kubeconfig nor --master was specified.
Using the inClusterConfig. This might not work.
```
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Reminder: everyone talks to API server
- The controller manager needs to connect to the API server
- It *does not* have a convenient `localhost:8080` default
- We can pass the connection information in two ways:
- `--master` and a host:port combination (easy)
- `--kubeconfig` and a `kubeconfig` file
- For simplicity, we'll use the first option
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Starting the controller manager (for real)
.lab[
- Start the controller manager:
```bash
kube-controller-manager --master http://localhost:8080
```
]
Success!
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Checking our Deployment status
.lab[
- Check all our resources again:
```bash
kubectl get all
```
]
We now have a ReplicaSet.
But we still don't have a Pod.
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## What's going on?
In the controller manager logs, we should see something like this:
```
E0404 15:46:25.753376 22847 replica_set.go:450] Sync "default/web-5bc9bd5b8d"
failed with `No API token found for service account "default"`, retry after the
token is automatically created and added to the service account
```
- The service account `default` was automatically added to our Deployment
(and to its pods)
- The service account `default` exists
- But it doesn't have an associated token
(the token is a secret; creating it requires a signature, and therefore a CA)
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Solving the missing token issue
There are many ways to solve that issue.
We are going to list a few (to get an idea of what's happening behind the scenes).
Of course, we don't need to perform *all* the solutions mentioned here.
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Option 1: disable service accounts
- Restart the API server with
`--disable-admission-plugins=ServiceAccount`
- The API server will no longer add a service account automatically
- Our pods will be created without a service account
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Option 2: do not mount the (missing) token
- Add `automountServiceAccountToken: false` to the Deployment spec
*or*
- Add `automountServiceAccountToken: false` to the default ServiceAccount
- The ReplicaSet controller will no longer create pods referencing the (missing) token
.lab[
- Programmatically change the `default` ServiceAccount:
```bash
kubectl patch sa default -p "automountServiceAccountToken: false"
```
]
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Option 3: set up service accounts properly
- This is the most complex option!
- Generate a key pair
- Pass the private key to the controller manager
(to generate and sign tokens)
- Pass the public key to the API server
(to verify these tokens)
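A minimal sketch of that setup, assuming we keep using `--master` for the controller manager (file names are arbitrary):
```bash
# Generate an RSA key pair for signing ServiceAccount tokens
openssl genrsa -out sa.key 2048
openssl rsa -in sa.key -pubout -out sa.pub

# The controller manager signs tokens with the private key
kube-controller-manager --master http://localhost:8080 \
    --service-account-private-key-file=sa.key

# The API server verifies tokens with the public key
kube-apiserver --etcd-servers http://127.0.0.1:2379 \
    --service-account-key-file=sa.pub
```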
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Continuing without service account token
- Once we patch the default service account, the ReplicaSet can create a Pod
.lab[
- Check that we now have a pod:
```bash
kubectl get all
```
]
Note: we might have to wait a bit for the ReplicaSet controller to retry.
If we're impatient, we can restart the controller manager.
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## What's next?
- Our pod exists, but it is in `Pending` state
- Remember, we don't have a node so far
(`kubectl get nodes` shows an empty list)
- We need to:
- start a container engine
- start kubelet
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Starting a container engine
- We're going to use Docker (because it's the default option)
.lab[
- Start the Docker Engine:
```bash
dockerd
```
]
Success!
Feel free to check that it actually works with e.g.:
```bash
docker run alpine echo hello world
```
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Starting kubelet
- If we start kubelet without arguments, it *will* start
- But it will not join the cluster!
- It will start in *standalone* mode
- Just like with the controller manager, we need to tell kubelet where the API server is
- Alas, kubelet doesn't have a simple `--master` option
- We have to use `--kubeconfig`
- We need to write a `kubeconfig` file for kubelet
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Writing a kubeconfig file
- We can copy/paste a bunch of YAML
- Or we can generate the file with `kubectl`
.lab[
- Create the file `~/.kube/config` with `kubectl`:
```bash
kubectl config \
set-cluster localhost --server http://localhost:8080
kubectl config \
set-context localhost --cluster localhost
kubectl config \
use-context localhost
```
]
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Our `~/.kube/config` file
The file that we generated looks like the one below.
That one has been slightly simplified (removing extraneous fields), but it is still valid.
```yaml
apiVersion: v1
kind: Config
current-context: localhost
contexts:
- name: localhost
context:
cluster: localhost
clusters:
- name: localhost
cluster:
server: http://localhost:8080
```
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Starting kubelet
.lab[
- Start kubelet with that kubeconfig file:
```bash
kubelet --kubeconfig ~/.kube/config
```
]
If it works: great!
If it complains about a "cgroup driver", check the next slide.
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Cgroup drivers
- Cgroups ("control groups") are a Linux kernel feature
- They're used to account and limit resources
(e.g.: memory, CPU, block I/O...)
- There are multiple ways to manipulate cgroups, including:
- through a pseudo-filesystem (typically mounted in /sys/fs/cgroup)
- through systemd
- Kubelet and the container engine need to agree on which method to use
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Setting the cgroup driver
- If kubelet refused to start, mentioning a cgroup driver issue, try:
```bash
kubelet --kubeconfig ~/.kube/config --cgroup-driver=systemd
```
- That *should* do the trick!
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Looking at our 1-node cluster
- Let's check that our node registered correctly
.lab[
- List the nodes in our cluster:
```bash
kubectl get nodes
```
]
Our node should show up.
Its name will be its hostname (it should be `monokube1`).
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Are we there yet?
- Let's check if our pod is running
.lab[
- List all resources:
```bash
kubectl get all
```
]
--
Our pod is still `Pending`. 🤔
--
Which is normal: it needs to be *scheduled*.
(i.e., something needs to decide which node it should go on.)
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Scheduling our pod
- Why do we need a scheduling decision, since we have only one node?
- The node might be full, unavailable; the pod might have constraints ...
- The easiest way to schedule our pod is to start the scheduler
(we could also schedule it manually)
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Starting the scheduler
- The scheduler also needs to know how to connect to the API server
- Just like for controller manager, we can use `--kubeconfig` or `--master`
.lab[
- Start the scheduler:
```bash
kube-scheduler --master http://localhost:8080
```
]
- Our pod should now start correctly
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
## Checking the status of our pod
- Our pod will go through a short `ContainerCreating` phase
- Then it will be `Running`
.lab[
- Check pod status:
```bash
kubectl get pods
```
]
Success!
.debug[[k8s/dmuc-easy.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-easy.md)]
---
class: extra-details
## Scheduling a pod manually
- We can schedule a pod in `Pending` state by creating a Binding object ourselves
(this is what the scheduler itself does behind the scenes)
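Here is a rough sketch of what such a Binding could look like (the pod and node names are placeholders):
```bash
kubectl create -f- <<EOF
apiVersion: v1
kind: Binding
metadata:
  name: name-of-the-pending-pod    # placeholder: the pod to schedule
target:
  apiVersion: v1
  kind: Node
  name: monokube1                  # placeholder: the node to bind it to
EOF
```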
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Starting API server
.lab[
- Try to start the API server:
```bash
kube-apiserver
# It will complain about permission to /var/run/kubernetes
sudo kube-apiserver
# Now it will complain about a bunch of missing flags, including:
# --etcd-servers
# --service-account-issuer
# --service-account-signing-key-file
```
]
Just like before, we'll need to start etcd.
But we'll also need some TLS keys!
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Generating TLS keys
- There are many ways to generate TLS keys (and certificates)
- A very popular and modern tool to do that is [cfssl]
- We're going to use the old-fashioned [openssl] CLI
- Feel free to use cfssl or any other tool if you prefer!
[cfssl]: https://github.com/cloudflare/cfssl#using-the-command-line-tool
[openssl]: https://www.openssl.org/docs/man3.0/man1/
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## How many keys do we need?
At the very least, we need the following two keys:
- ServiceAccount key pair
- API client key pair, aka "CA key"
(technically, we will need a *certificate* for that key pair)
But if we wanted to tighten the cluster security, we'd need many more...
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## The other keys
These keys are not strictly necessary at this point:
- etcd key pair
*without that key, communication with etcd will be insecure*
- API server endpoint key pair
*the API server will generate this one automatically if we don't*
- kubelet key pair (used by API server to connect to kubelets)
*without that key, commands like kubectl logs/exec will be insecure*
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Would you like some auth with that?
If we want to enable authentication and authorization, we also need various API client key pairs signed by the "CA key" mentioned earlier. That would include (non-exhaustive list):
- controller manager key pair
- scheduler key pair
- in most cases: kube-proxy (or equivalent) key pair
- in most cases: key pairs for the nodes joining the cluster
(these might be generated through TLS bootstrap tokens)
- key pairs for users that will interact with the clusters
(unless another authentication mechanism like OIDC is used)
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Generating our keys and certificates
.lab[
- Generate the ServiceAccount key pair:
```bash
openssl genrsa -out sa.key 2048
```
- Generate the CA key pair:
```bash
openssl genrsa -out ca.key 2048
```
- Generate a self-signed certificate for the CA key:
```bash
openssl x509 -new -key ca.key -out ca.cert -subj /CN=kubernetes/
```
]
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Starting etcd
- This one is easy!
.lab[
- Start etcd:
```bash
etcd
```
]
Note: if you want a bit of extra challenge, you can try
to generate the etcd key pair and use it.
(You will need to pass it to etcd and to the API server.)
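If you try it, here is a hedged sketch of the flags involved (file names are arbitrary; the certificate needs `localhost` in its subject alternative names):
```bash
# Sketch only: etcd serving TLS on its client port...
etcd --cert-file=etcd.crt --key-file=etcd.key \
     --listen-client-urls=https://localhost:2379 \
     --advertise-client-urls=https://localhost:2379
# ...and the API server would then need:
#   --etcd-servers=https://localhost:2379 --etcd-cafile=ca.cert
```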
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Starting API server
- We need to use the keys and certificate that we just generated
.lab[
- Start the API server:
```bash
sudo kube-apiserver \
--etcd-servers=http://localhost:2379 \
--service-account-signing-key-file=sa.key \
--service-account-issuer=https://kubernetes \
--service-account-key-file=sa.key \
--client-ca-file=ca.cert
```
]
The API server should now start.
But can we really use it? 🤔
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Trying `kubectl`
- Let's try some simple `kubectl` command
.lab[
- Try to list Namespaces:
```bash
kubectl get namespaces
```
]
We're getting an error message like this one:
```
The connection to the server localhost:8080 was refused -
did you specify the right host or port?
```
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## What's going on?
- Recent versions of Kubernetes don't support unauthenticated API access
- The API server doesn't support listening on plain HTTP anymore
- `kubectl` still tries to connect to `localhost:8080` by default
- But there is nothing listening there
- Our API server listens on port 6443, using TLS
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Trying to access the API server
- Let's use `curl` first to confirm that everything works correctly
(and then we will move to `kubectl`)
.lab[
- Try to connect with `curl`:
```bash
curl https://localhost:6443
# This will fail because the API server certificate is unknown.
```
- Try again, skipping certificate verification:
```bash
curl --insecure https://localhost:6443
```
]
We should now see an `Unauthorized` Kubernetes API error message.
We need to authenticate with our key and certificate.
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Authenticating with the API server
- For the time being, we can use the CA key and cert directly
- In a real world scenario, we would *never* do that!
(because we don't want the CA key to be out there in the wild)
.lab[
- Try again, skipping cert verification, and using the CA key and cert:
```bash
curl --insecure --key ca.key --cert ca.cert https://localhost:6443
```
]
We should see a list of API routes.
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
class: extra-details
## Doing it right
In the future, instead of using the CA key and certificate,
we should generate a new key, and a certificate for that key,
signed by the CA key.
Then we can use that new key and certificate to authenticate.
Example:
```
### Generate a key pair
openssl genrsa -out user.key
### Extract the public key
openssl pkey -in user.key -out user.pub -pubout
### Generate a certificate signed by the CA key
openssl x509 -new -key ca.key -force_pubkey user.pub -out user.cert \
-subj /CN=kubernetes-user/
```
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Writing a kubeconfig file
- We now want to use `kubectl` instead of `curl`
- We'll need to write a kubeconfig file for `kubectl`
- There are many ways to do that; here, we're going to use `kubectl config`
- We'll need to:
- set the "cluster" (API server endpoint)
- set the "credentials" (the key and certficate)
- set the "context" (referencing the cluster and credentials)
- use that context (make it the default that `kubectl` will use)
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Set the cluster
The "cluster" section holds the API server endpoint.
.lab[
- Set the API server endpoint:
```bash
kubectl config set-cluster polykube --server=https://localhost:6443
```
- Don't verify the API server certificate:
```bash
kubectl config set-cluster polykube --insecure-skip-tls-verify
```
]
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Set the credentials
The "credentials" section can hold a TLS key and certificate, or a token, or configuration information for a plugin (for instance, when using AWS EKS or GCP GKE, they use a plugin).
.lab[
- Set the client key and certificate:
```bash
kubectl config set-credentials polykube \
--client-key ca.key \
--client-certificate ca.cert
```
]
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Set and use the context
The "context" section references the "cluster" and "credentials" that we defined earlier.
(It can also optionally reference a Namespace.)
.lab[
- Set the "context":
```bash
kubectl config set-context polykube --cluster polykube --user polykube
```
- Set that context to be the default context:
```bash
kubectl config use-context polykube
```
]
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Review the kubeconfig file
The kubeconfig file should look like this:
.small[
```yaml
apiVersion: v1
clusters:
- cluster:
    insecure-skip-tls-verify: true
    server: https://localhost:6443
  name: polykube
contexts:
- context:
    cluster: polykube
    user: polykube
  name: polykube
current-context: polykube
kind: Config
preferences: {}
users:
- name: polykube
  user:
    client-certificate: /root/ca.cert
    client-key: /root/ca.key
```
]
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Trying the kubeconfig file
- We should now be able to access our cluster's API!
.lab[
- Try to list Namespaces:
```bash
kubectl get namespaces
```
]
This should show the classic `default`, `kube-system`, etc.
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
class: extra-details
## Do we need `--client-ca-file` ?
Technically, we didn't need to specify the `--client-ca-file` flag!
But without that flag, no client can be authenticated.
Which means that we wouldn't be able to issue any API request!
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Running pods
- We can now try to create a Deployment
.lab[
- Create a Deployment:
```bash
kubectl create deployment blue --image=jpetazzo/color
```
- Check the results:
```bash
kubectl get deployments,replicasets,pods
```
]
Our Deployment exists, but not the ReplicaSet or Pod.
We need to run the controller manager.
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Running the controller manager
- Previously, we used the `--master` flag to pass the API server address
- Now, we need to authenticate properly
- The simplest way at this point is probably to use the same kubeconfig file!
.lab[
- Start the controller manager:
```bash
kube-controller-manager --kubeconfig .kube/config
```
- Check the results:
```bash
kubectl get deployments,replicasets,pods
```
]
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## What's next?
- Normally, the last commands showed us a Pod in `Pending` state
- We need two things to continue:
- the scheduler (to assign the Pod to a Node)
- a Node!
- We're going to run `kubelet` to register the Node with the cluster
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Running `kubelet`
- Let's try to run `kubelet` and see what happens!
.lab[
- Start `kubelet`:
```bash
sudo kubelet
```
]
We should see an error about connecting to `containerd.sock`.
We need to run a container engine!
(For instance, `containerd`.)
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Running `containerd`
- We need to install and start `containerd`
- You could try another engine if you wanted
(but there might be complications!)
.lab[
- Install `containerd`:
```bash
sudo apt-get install containerd
```
- Start `containerd`:
```bash
sudo containerd
```
]
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
class: extra-details
## Configuring `containerd`
Depending on how we install `containerd`, it might need a bit of extra configuration.
Watch for the following symptoms:
- `containerd` refuses to start
(rare, unless there is an *invalid* configuration)
- `containerd` starts but `kubelet` can't connect
(could be the case if the configuration disables the CRI socket)
- `containerd` starts and things work but Pods keep being killed
(may happen if there is a mismatch in the cgroups driver)
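For the cgroups driver mismatch, one common fix (hedged sketch, assuming the stock configuration layout) is to regenerate a default configuration and enable the systemd cgroup driver; whichever driver you pick, the kubelet must be configured to use the same one:
```bash
# Regenerate a default config, then switch runc to the systemd cgroup driver
containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
# ...then restart containerd (and kubelet) to pick up the change
```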
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Starting `kubelet` for good
- Now that `containerd` is running, `kubelet` should start!
.lab[
- Try to start `kubelet`:
```bash
sudo kubelet
```
- In another terminal, check if our Node is now visible:
```bash
sudo kubectl get nodes
```
]
`kubelet` should now start, but our Node doesn't show up in `kubectl get nodes`!
This is because without a kubeconfig file, `kubelet` runs in standalone mode:
it will not connect to a Kubernetes API server, and will only start *static pods*.
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Passing the kubeconfig file
- Let's start `kubelet` again, with our kubeconfig file
.lab[
- Stop `kubelet` (e.g. with `Ctrl-C`)
- Restart it with the kubeconfig file:
```bash
sudo kubelet --kubeconfig .kube/config
```
- Check our list of Nodes:
```bash
kubectl get nodes
```
]
This time, our Node should show up!
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Node readiness
- However, our Node shows up as `NotReady`
- If we wait a few minutes, the `kubelet` logs will tell us why:
*we're missing a CNI configuration!*
- As a result, the containers can't be connected to the network
- `kubelet` detects that and doesn't become `Ready` until this is fixed
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## CNI configuration
- We need to provide a CNI configuration
- This is a file in `/etc/cni/net.d`
(the name of the file doesn't matter; the first file in lexicographic order will be used)
- Usually, when installing a "CNI plugin¹", this file gets installed automatically
- Here, we are going to write that file manually
.footnote[¹Technically, a *pod network*; typically running as a DaemonSet, which will install the file with a `hostPath` volume.]
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Our CNI configuration
Create the following file in e.g. `/etc/cni/net.d/kube.conf`:
```json
{
  "cniVersion": "0.3.1",
  "name": "kube",
  "type": "bridge",
  "bridge": "cni0",
  "isDefaultGateway": true,
  "ipMasq": true,
  "hairpinMode": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.1.1.0/24"
  }
}
```
That's all we need - `kubelet` will detect and validate the file automatically!
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Checking our Node again
- After a short time (typically about 10 seconds) the Node should be `Ready`
.lab[
- Wait until the Node is `Ready`:
```bash
kubectl get nodes
```
]
If the Node doesn't show up as `Ready`, check the `kubelet` logs.
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## What's next?
- At this point, we have a `Pending` Pod and a `Ready` Node
- All we need is the scheduler to bind the former to the latter
.lab[
- Run the scheduler:
```bash
kube-scheduler --kubeconfig .kube/config
```
- Check that the Pod gets assigned to the Node and becomes `Running`:
```bash
kubectl get pods
```
]
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Check network access
- Let's check that we can connect to our Pod, and that the Pod can connect outside
.lab[
- Get the Pod's IP address:
```bash
kubectl get pods -o wide
```
- Connect to the Pod (make sure to update the IP address):
```bash
curl `10.1.1.2`
```
- Check that the Pod has external connectivity too:
```bash
kubectl exec `blue-xxxxxxxxxx-yyyyy` -- ping -c3 1.1
```
]
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Expose our Deployment
- We can now try to expose the Deployment and connect to the ClusterIP
.lab[
- Expose the Deployment:
```bash
kubectl expose deployment blue --port=80
```
- Retrieve the ClusterIP:
```bash
kubectl get services
```
- Try to connect to the ClusterIP:
```bash
curl `10.0.0.42`
```
]
At this point, it won't work - we need to run `kube-proxy`!
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## Running `kube-proxy`
- We need to run `kube-proxy`
(also passing it our kubeconfig file)
.lab[
- Run `kube-proxy`:
```bash
sudo kube-proxy --kubeconfig .kube/config
```
- Try again to connect to the ClusterIP:
```bash
curl `10.0.0.42`
```
]
This time, it should work.
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
## What's next?
- Scale up the Deployment, and check that load balancing works properly
- Enable RBAC, and generate individual certificates for each controller
(check the [certificate paths][certpath] section in the Kubernetes documentation
for a detailed list of all the certificates and keys that are used by the
control plane, and which flags are used by which components to configure them!)
- Add more nodes to the cluster
*Feel free to try these if you want to get additional hands-on experience!*
[certpath]: https://kubernetes.io/docs/setup/best-practices/certificates/#certificate-paths
???
:EN:- Setting up control plane certificates
:EN:- Implementing a basic CNI configuration
:FR:- Mettre en place les certificats du plan de contrôle
:FR:- Réaliser une configuration CNI basique
.debug[[k8s/dmuc-medium.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-medium.md)]
---
class: pic
.interstitial[]
---
name: toc-building-our-own-cluster-hard
class: title
Building our own cluster (hard)
.nav[
[Previous part](#toc-building-our-own-cluster-medium)
|
[Back to table of contents](#toc-part-14)
|
[Next part](#toc-cni-internals)
]
.debug[(automatically generated title slide)]
---
# Building our own cluster (hard)
- This section assumes that you already went through
*“Building our own cluster (medium)”*
- In that previous section, we built a cluster with a single node
- In this new section, we're going to add more nodes to the cluster
- Note: we will need the lab environment of that previous section
- If you haven't done it yet, you should go through that section first
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## Our environment
- On `polykube1`, we should have our Kubernetes control plane
- We're also assuming that we have the kubeconfig file created earlier
(in `~/.kube/config`)
- We're going to work on `polykube2` and add it to the cluster
- This machine has exactly the same setup as `polykube1`
(Ubuntu LTS with CNI, etcd, and Kubernetes binaries installed)
- Note that we won't need the etcd binaries here
(the control plane will run solely on `polykube1`)
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## Checklist
We need to:
- generate the kubeconfig file for `polykube2`
- install a container engine
- generate a CNI configuration file
- start kubelet
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## Generating the kubeconfig file
- Ideally, we should generate a key pair and certificate for `polykube2`...
- ...and generate a kubeconfig file using these
- At the moment, for simplicity, we'll use the same key pair and certificate as earlier
- We have a couple of options:
- copy the required files (kubeconfig, key pair, certificate)
- "flatten" the kubeconfig file (embed the key and certificate within)
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
class: extra-details
## To flatten or not to flatten?
- "Flattening" the kubeconfig file can seem easier
(because it means we'll only have one file to move around)
- But it's easier to rotate the key or renew the certificate when they're in separate files
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## Flatten and copy the kubeconfig file
- We'll flatten the file and copy it over
.lab[
- On `polykube1`, flatten the kubeconfig file:
```bash
kubectl config view --flatten > kubeconfig
```
- Then copy it to `polykube2`:
```bash
scp kubeconfig polykube2:
```
]
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## Generate CNI configuration
Back on `polykube2`, put the following in `/etc/cni/net.d/kube.conf`:
```json
{
  "cniVersion": "0.3.1",
  "name": "kube",
  "type": "bridge",
  "bridge": "cni0",
  "isDefaultGateway": true,
  "ipMasq": true,
  "hairpinMode": true,
  "ipam": {
    "type": "host-local",
    "subnet": `"10.1.2.0/24"`
  }
}
```
Note how we changed the subnet!
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## Install container engine and start `kubelet`
.lab[
- Install `containerd`:
```bash
sudo apt-get install containerd -y
```
- Start `containerd`:
```bash
sudo systemctl start containerd
```
- Start `kubelet`:
```bash
sudo kubelet --kubeconfig kubeconfig
```
]
We're getting errors looking like:
```
"Post \"https://localhost:6443/api/v1/nodes\": ... connect: connection refused"
```
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## Updating the kubeconfig file
- Our kubeconfig file still references `localhost:6443`
- This was fine on `polykube1`
(where `kubelet` was connecting to the control plane running locally)
- On `polykube2`, we need to change that and put the address of the API server
(i.e. the address of `polykube1`)
.lab[
- Update the `kubeconfig` file:
```bash
sed -i s/localhost:6443/polykube1:6443/ kubeconfig
```
]
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## Starting `kubelet`
- `kubelet` should now start correctly (hopefully!)
.lab[
- On `polykube2`, start `kubelet`:
```bash
sudo kubelet --kubeconfig kubeconfig
```
- On `polykube1`, check that `polykube2` shows up and is `Ready`:
```bash
kubectl get nodes
```
]
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## Testing connectivity
- From `polykube1`, can we connect to Pods running on `polykube2`? 🤔
.lab[
- Scale the test Deployment:
```bash
kubectl scale deployment blue --replicas=5
```
- Get the IP addresses of the Pods:
```bash
kubectl get pods -o wide
```
- Pick a Pod on `polykube2` and try to connect to it:
```bash
curl `10.1.2.2`
```
]
--
At that point, it doesn't work.
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## Refresher on the *pod network*
- The *pod network* (or *pod-to-pod network*) has a few responsibilities:
- allocating and managing Pod IP addresses
- connecting Pods and Nodes
- connecting Pods together on a given node
- *connecting Pods together across nodes*
- That last part is the one that's not functioning in our cluster
- It typically requires some combination of routing, tunneling, bridging...
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## Connecting networks together
- We can add manual routes between our nodes
- This requires adding `N x (N-1)` routes
(on each node, add a route to every other node)
- This will work on home labs where nodes are directly connected
(e.g. on an Ethernet switch, or same WiFi network, or a bridge between local VMs)
- ...Or on clouds where IP address filtering has been disabled
(by default, most cloud providers will discard packets going to unknown IP addresses)
- If IP address filtering is enabled, you'll have to use e.g. tunneling or overlay networks
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## Important warning
- The technique that we are about to use doesn't work everywhere
- It only works if:
- all the nodes are directly connected to each other (at layer 2)
- the underlying network allows the IP addresses of our pods
- If we are on physical machines connected by a switch: OK
- If we are on virtual machines in a public cloud: NOT OK
- on AWS, we need to disable "source and destination checks" on our instances
- on OpenStack, we need to disable "port security" on our network ports
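For reference, hedged examples of those cloud-specific tweaks (all identifiers are placeholders):
```bash
# AWS: disable source/destination checks on an instance
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --no-source-dest-check
# OpenStack: clear security groups, then disable port security on the port
openstack port set --no-security-group --disable-port-security PORT-ID
```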
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## Routing basics
- We need to tell *each* node:
"The subnet 10.1.N.0/24 is located on node N" (for all values of N)
- This is how we add a route on Linux:
```bash
ip route add 10.1.N.0/24 via W.X.Y.Z
```
(where `W.X.Y.Z` is the internal IP address of node N)
- We can see the internal IP addresses of our nodes with:
```bash
kubectl get nodes -o wide
```
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## Adding our route
- Let's add a route from `polykube1` to `polykube2`
.lab[
- Check the internal address of `polykube2`:
```bash
kubectl get node polykube2 -o wide
```
- Now, on `polykube1`, add the route to the Pods running on `polykube2`:
```bash
sudo ip route add 10.1.2.0/24 via `A.B.C.D`
```
- Finally, check that we can now connect to a Pod running on `polykube2`:
```bash
curl 10.1.2.2
```
]
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
## What's next?
- The network configuration feels very manual:
- we had to generate the CNI configuration file (in `/etc/cni/net.d`)
- we had to manually update the nodes' routing tables
- Can we automate that?
**YES!**
- We could install something like [kube-router](https://www.kube-router.io/)
(which specifically takes care of the CNI configuration file and populates routing tables)
- Or we could also go with e.g. [Cilium](https://cilium.io/)
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
class: extra-details
## If you want to try Cilium...
- Add the `--root-ca-file` flag to the controller manager:
- use the certificate automatically generated by the API server
(it should be in `/var/run/kubernetes/apiserver.crt`)
- or generate a key pair and certificate for the API server and point to
that certificate
- without that, you'll get certificate validation errors
(because in our Pods, the `ca.crt` file used to validate the API server will be empty)
- Check the Cilium [without kube-proxy][ciliumwithoutkubeproxy] instructions
(make sure to pass the API server IP address and port!)
- Other pod-to-pod network implementations might also require additional steps
[ciliumwithoutkubeproxy]: https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/#kubeproxy-free
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
class: extra-details
## About the API server certificate...
- In the previous sections, we've skipped API server certificate verification
- To generate a proper certificate, we need to include a `subjectAltName` extension
- And make sure that the CA includes the extension in the certificate
```bash
openssl genrsa -out apiserver.key 4096
openssl req -new -key apiserver.key -subj /CN=kubernetes/ \
-addext "subjectAltName = DNS:kubernetes.default.svc, \
DNS:kubernetes.default, DNS:kubernetes, \
DNS:localhost, DNS:polykube1" -out apiserver.csr
openssl x509 -req -in apiserver.csr -CAkey ca.key -CA ca.cert \
-out apiserver.crt -copy_extensions copy
```
???
:EN:- Connecting nodes and pods
:FR:- Interconnecter les nœuds et les pods
.debug[[k8s/dmuc-hard.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/dmuc-hard.md)]
---
class: pic
.interstitial[]
---
name: toc-cni-internals
class: title
CNI internals
.nav[
[Previous part](#toc-building-our-own-cluster-hard)
|
[Back to table of contents](#toc-part-14)
|
[Next part](#toc-api-server-availability)
]
.debug[(automatically generated title slide)]
---
# CNI internals
- Kubelet looks for a CNI configuration file
(by default, in `/etc/cni/net.d`)
- Note: if we have multiple files, the first one will be used
(in lexicographic order)
- If no configuration can be found, kubelet holds off on creating containers
(except if they are using `hostNetwork`)
- Let's see how exactly plugins are invoked!
.debug[[k8s/cni-internals.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cni-internals.md)]
---
## General principle
- A plugin is an executable program
- It is invoked by kubelet to set up / tear down networking for a container
- It doesn't take any command-line argument
- However, it uses environment variables to know what to do, which container, etc.
- It reads JSON on stdin, and writes back JSON on stdout
- There will generally be multiple plugins invoked in a row
(at least IPAM + network setup; possibly more)
.debug[[k8s/cni-internals.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cni-internals.md)]
---
## Environment variables
- `CNI_COMMAND`: `ADD`, `DEL`, `CHECK`, or `VERSION`
- `CNI_CONTAINERID`: opaque identifier
(container ID of the "sandbox", i.e. the container running the `pause` image)
- `CNI_NETNS`: path to network namespace pseudo-file
(e.g. `/var/run/netns/cni-0376f625-29b5-7a21-6c45-6a973b3224e5`)
- `CNI_IFNAME`: interface name, usually `eth0`
- `CNI_PATH`: path(s) with plugin executables (e.g. `/opt/cni/bin`)
- `CNI_ARGS`: "extra arguments" (see next slide)
.debug[[k8s/cni-internals.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cni-internals.md)]
---
## `CNI_ARGS`
- Extra key/value pair arguments passed by "the user"
- "The user", here, is "kubelet" (or in an abstract way, "Kubernetes")
- This is used to pass the pod name and namespace to the CNI plugin
- Example:
```
IgnoreUnknown=1
K8S_POD_NAMESPACE=default
K8S_POD_NAME=web-96d5df5c8-jcn72
K8S_POD_INFRA_CONTAINER_ID=016493dbff152641d334d9828dab6136c1ff...
```
Note that technically, it's a `;`-separated list, so it really looks like this:
```
CNI_ARGS=IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=web-96d...
```
.debug[[k8s/cni-internals.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cni-internals.md)]
---
## JSON in, JSON out
- The plugin reads its configuration on stdin
- It writes back results in JSON
(e.g. allocated address, routes, DNS...)
⚠️ "Plugin configuration" is not always the same as "CNI configuration"!
.debug[[k8s/cni-internals.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cni-internals.md)]
---
## Conf vs Conflist
- The CNI configuration can be a single plugin configuration
- it will then contain a `type` field in the top-most structure
- it will be passed "as-is"
- It can also be a "conflist", containing a chain of plugins
(it will then contain a `plugins` field in the top-most structure)
- Plugins are then invoked in order (reverse order for `DEL` action)
- In that case, the input of the plugin is not the whole configuration
(see details on next slide)
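For illustration, a hedged sketch of a conflist chaining the `bridge` plugin with the `portmap` plugin (file name, network name, and subnet are just examples; to take effect, the file would go in `/etc/cni/net.d`, replacing our single-plugin configuration):
```bash
cat <<'EOF' > example.conflist
{
  "cniVersion": "0.3.1",
  "name": "example",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isDefaultGateway": true,
      "ipMasq": true,
      "ipam": { "type": "host-local", "subnet": "10.1.1.0/24" }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}
EOF
```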
.debug[[k8s/cni-internals.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cni-internals.md)]
---
## List of plugins
- When invoking a plugin in a list, the JSON input will be:
- the configuration of the plugin
- augmented with `name` (matching the conf list `name`)
- augmented with `prevResult` (which will be the output of the previous plugin)
- Conceptually, a plugin (generally the first one) will do the "main setup"
- Other plugins can do tuning / refinement (firewalling, traffic shaping...)
.debug[[k8s/cni-internals.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cni-internals.md)]
---
## Analyzing plugins
- Let's see what goes in and out of our CNI plugins!
- We will create a fake plugin that:
- saves its environment and input
- executes the real plugin with the saved input
- saves the plugin output
- relays the saved output (and the plugin's exit status) back to the caller
.debug[[k8s/cni-internals.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cni-internals.md)]
---
## Our fake plugin
```bash
#!/bin/sh
PLUGIN=$(basename $0)
cat > /tmp/cni.$$.$PLUGIN.in
env | sort > /tmp/cni.$$.$PLUGIN.env
echo "PPID=$PPID, $(readlink /proc/$PPID/exe)" > /tmp/cni.$$.$PLUGIN.parent
$0.real < /tmp/cni.$$.$PLUGIN.in > /tmp/cni.$$.$PLUGIN.out
EXITSTATUS=$?
cat /tmp/cni.$$.$PLUGIN.out
exit $EXITSTATUS
```
Save this script as `/opt/cni/bin/debug` and make it executable.
.debug[[k8s/cni-internals.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cni-internals.md)]
---
## Substituting the fake plugin
- For each plugin that we want to instrument:
- rename the plugin from e.g. `ptp` to `ptp.real`
- symlink `ptp` to our `debug` plugin
- There is no need to change the CNI configuration or restart kubelet
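For example, to instrument the two plugins used by the configuration we wrote earlier (`bridge` and `host-local`), assuming they live in `/opt/cni/bin`:
```bash
cd /opt/cni/bin
for PLUGIN in bridge host-local; do
  sudo mv $PLUGIN $PLUGIN.real
  sudo ln -s debug $PLUGIN
done
```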
.debug[[k8s/cni-internals.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cni-internals.md)]
---
## Create some pods and look at the results
- Create a pod
- For each instrumented plugin, there will be files in `/tmp`:
`cni.PID.pluginname.in` (JSON input)
`cni.PID.pluginname.env` (environment variables)
`cni.PID.pluginname.parent` (parent process information)
`cni.PID.pluginname.out` (JSON output)
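A quick, hedged way to try it (the pod name is arbitrary, and this assumes the `bridge` plugin was instrumented):
```bash
kubectl run cni-test --image=nginx
ls -lt /tmp/cni.*            # newest capture files first
jq . /tmp/cni.*.bridge.in    # JSON input received by the bridge plugin
cat /tmp/cni.*.parent        # shows which process invoked the plugin
```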
❓️ What is calling our plugins?
???
:EN:- Deep dive into CNI internals
:FR:- La Container Network Interface (CNI) en détails
.debug[[k8s/cni-internals.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cni-internals.md)]
---
class: pic
.interstitial[]
---
name: toc-api-server-availability
class: title
API server availability
.nav[
[Previous part](#toc-cni-internals)
|
[Back to table of contents](#toc-part-14)
|
[Next part](#toc-static-pods)
]
.debug[(automatically generated title slide)]
---
# API server availability
- When we set up a node, we need the address of the API server:
- for kubelet
- for kube-proxy
- sometimes for the pod network system (like kube-router)
- How do we ensure the availability of that endpoint?
(what if the node running the API server goes down?)
.debug[[k8s/apilb.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apilb.md)]
---
## Option 1: external load balancer
- Set up an external load balancer
- Point kubelet (and other components) to that load balancer
- Put the node(s) running the API server behind that load balancer
- Update the load balancer if/when an API server node needs to be replaced
- On cloud infrastructures, some mechanisms provide automation for this
(e.g. on AWS, an Elastic Load Balancer + Auto Scaling Group)
- [Example in Kubernetes The Hard Way](https://github.com/kelseyhightower/kubernetes-the-hard-way/blob/master/docs/08-bootstrapping-kubernetes-controllers.md#the-kubernetes-frontend-load-balancer)
.debug[[k8s/apilb.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apilb.md)]
---
## Option 2: local load balancer
- Set up a load balancer (like NGINX, HAProxy...) on *each* node
- Configure that load balancer to send traffic to the API server node(s)
- Point kubelet (and other components) to `localhost`
- Update the load balancer configuration when API server nodes are updated
.debug[[k8s/apilb.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apilb.md)]
---
## Updating the local load balancer config
- Distribute the updated configuration (push)
- Or regularly check for updates (pull)
- The latter requires an external, highly available store
(it could be an object store, an HTTP server, or even DNS...)
- Updates can be facilitated by a DaemonSet
(but remember that it can't be used when installing a new node!)
.debug[[k8s/apilb.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apilb.md)]
---
## Option 3: DNS records
- Put all the API server nodes behind a round-robin DNS
- Point kubelet (and other components) to that name
- Update the records when needed
- Note: this option is not officially supported
(but since kubelet supports reconnection anyway, it *should* work)
.debug[[k8s/apilb.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apilb.md)]
---
## Option 4: ....................
- Many managed clusters expose a high-availability API endpoint
(and you don't have to worry about it)
- You can also use HA mechanisms that you're familiar with
(e.g. virtual IPs)
- Tunnels are also fine
(e.g. [k3s](https://k3s.io/) uses a tunnel to allow each node to contact the API server)
???
:EN:- Ensuring API server availability
:FR:- Assurer la disponibilité du serveur API
.debug[[k8s/apilb.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/apilb.md)]
---
class: pic
.interstitial[]
---
name: toc-static-pods
class: title
Static pods
.nav[
[Previous part](#toc-api-server-availability)
|
[Back to table of contents](#toc-part-14)
|
[Next part](#toc-upgrading-clusters)
]
.debug[(automatically generated title slide)]
---
# Static pods
- Hosting the Kubernetes control plane on Kubernetes has advantages:
- we can use Kubernetes' replication and scaling features for the control plane
- we can leverage rolling updates to upgrade the control plane
- However, there is a catch:
- deploying on Kubernetes requires the API to be available
- the API won't be available until the control plane is deployed
- How can we get out of that chicken-and-egg problem?
.debug[[k8s/staticpods.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/staticpods.md)]
---
## A possible approach
- Since each component of the control plane can be replicated...
- We could set up the control plane outside of the cluster
- Then, once the cluster is fully operational, create replicas running on the cluster
- Finally, remove the replicas that are running outside of the cluster
*What could possibly go wrong?*
.debug[[k8s/staticpods.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/staticpods.md)]
---
## Sawing off the branch you're sitting on
- What if anything goes wrong?
(During the setup or at a later point)
- Worst case scenario, we might need to:
- set up a new control plane (outside of the cluster)
- restore a backup from the old control plane
- move the new control plane to the cluster (again)
- This doesn't sound like a great experience
.debug[[k8s/staticpods.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/staticpods.md)]
---
## Static pods to the rescue
- Pods are started by kubelet (an agent running on every node)
- To know which pods it should run, the kubelet queries the API server
- The kubelet can also get a list of *static pods* from:
- a directory containing one (or multiple) *manifests*, and/or
- a URL (serving a *manifest*)
- These "manifests" are basically YAML definitions
(As produced by `kubectl get pod my-little-pod -o yaml`)
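For reference, a hedged sketch of how kubelet gets pointed at these manifests (flag and field names as of recent versions):
```bash
# Either with a command-line flag...
sudo kubelet --pod-manifest-path /etc/kubernetes/manifests
# ...or with the kubelet configuration file (KubeletConfiguration):
#   staticPodPath: /etc/kubernetes/manifests
```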
.debug[[k8s/staticpods.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/staticpods.md)]
---
## Static pods are dynamic
- Kubelet will periodically reload the manifests
- It will start/stop pods accordingly
(i.e. it is not necessary to restart the kubelet after updating the manifests)
- When connected to the Kubernetes API, the kubelet will create *mirror pods*
- Mirror pods are copies of the static pods
(so they can be seen with e.g. `kubectl get pods`)
.debug[[k8s/staticpods.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/staticpods.md)]
---
## Bootstrapping a cluster with static pods
- We can run control plane components with these static pods
- They can start without requiring access to the API server
- Once they are up and running, the API becomes available
- These pods are then visible through the API
(We cannot upgrade them from the API, though)
*This is how kubeadm has initialized our clusters.*
.debug[[k8s/staticpods.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/staticpods.md)]
---
## Static pods vs normal pods
- The API only gives us read-only access to static pods
- We can `kubectl delete` a static pod...
...But the kubelet will re-mirror it immediately
- Static pods can be selected just like other pods
(So they can receive service traffic)
- A service can select a mixture of static and other pods
.debug[[k8s/staticpods.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/staticpods.md)]
---
## From static pods to normal pods
- Once the control plane is up and running, it can be used to create normal pods
- We can then set up a copy of the control plane in normal pods
- Then the static pods can be removed
- The scheduler and the controller manager use leader election
(Only one is active at a time; removing an instance is seamless)
- Each instance of the API server adds itself to the `kubernetes` service
- Etcd will typically require more work!
.debug[[k8s/staticpods.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/staticpods.md)]
---
## From normal pods back to static pods
- Alright, but what if the control plane is down and we need to fix it?
- We restart it using static pods!
- This can be done automatically with a “pod checkpointer”
- The pod checkpointer automatically generates manifests of running pods
- The manifests are used to restart these pods if API contact is lost
- This pattern is implemented in [openshift/pod-checkpointer-operator] and [bootkube checkpointer]
- Unfortunately, as of 2021, both seem abandoned / unmaintained 😢
[openshift/pod-checkpointer-operator]: https://github.com/openshift/pod-checkpointer-operator
[bootkube checkpointer]: https://github.com/kubernetes-retired/bootkube/blob/master/cmd/checkpoint/README.md
.debug[[k8s/staticpods.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/staticpods.md)]
---
## Where should the control plane run?
*Is it better to run the control plane in static pods, or normal pods?*
- If I'm a *user* of the cluster: I don't care, it makes no difference to me
- What if I'm an *admin*, i.e. the person who installs, upgrades, repairs... the cluster?
- If I'm using a managed Kubernetes cluster (AKS, EKS, GKE...) it's not my problem
(I'm not the one setting up and managing the control plane)
- If I already picked a tool (kubeadm, kops...) to set up my cluster, the tool decides for me
- What if I haven't picked a tool yet, or if I'm installing from scratch?
- static pods = easier to set up, easier to troubleshoot, less risk of outage
- normal pods = easier to upgrade, easier to move (if nodes need to be shut down)
.debug[[k8s/staticpods.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/staticpods.md)]
---
## Static pods in action
- On our clusters, the `staticPodPath` is `/etc/kubernetes/manifests`
.lab[
- Have a look at this directory:
```bash
ls -l /etc/kubernetes/manifests
```
]
We should see YAML files corresponding to the pods of the control plane.
.debug[[k8s/staticpods.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/staticpods.md)]
---
class: static-pods-exercise
## Running a static pod
- We are going to add a pod manifest to the directory, and kubelet will run it
.lab[
- Copy a manifest to the directory:
```bash
sudo cp ~/container.training/k8s/just-a-pod.yaml /etc/kubernetes/manifests
```
- Check that it's running:
```bash
kubectl get pods
```
]
The output should include a pod named `hello-node1`.
.debug[[k8s/staticpods.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/staticpods.md)]
---
class: static-pods-exercise
## Remarks
In the manifest, the pod was named `hello`.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello
  namespace: default
spec:
  containers:
  - name: hello
    image: nginx
```
The `-node1` suffix was added automatically by kubelet.
If we delete the pod (with `kubectl delete`), it will be recreated immediately.
To delete the pod, we need to delete (or move) the manifest file.
???
:EN:- Static pods
:FR:- Les *static pods*
.debug[[k8s/staticpods.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/staticpods.md)]
---
class: pic
.interstitial[]
---
name: toc-upgrading-clusters
class: title
Upgrading clusters
.nav[
[Previous part](#toc-static-pods)
|
[Back to table of contents](#toc-part-15)
|
[Next part](#toc-backing-up-clusters)
]
.debug[(automatically generated title slide)]
---
# Upgrading clusters
- It's *recommended* to run consistent versions across a cluster
(mostly to have feature parity and latest security updates)
- It's not *mandatory*
(otherwise, cluster upgrades would be a nightmare!)
- Components can be upgraded one at a time without problems
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Checking what we're running
- It's easy to check the version for the API server
.lab[
- Log into node `oldversion1`
- Check the version of kubectl and of the API server:
```bash
kubectl version
```
]
- In an HA setup with multiple API servers, they can have different versions
- Running the command above multiple times can return different values
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Node versions
- It's also easy to check the version of kubelet
.lab[
- Check node versions (includes kubelet, kernel, container engine):
```bash
kubectl get nodes -o wide
```
]
- Different nodes can run different kubelet versions
- Different nodes can run different kernel versions
- Different nodes can run different container engines
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Control plane versions
- If the control plane is self-hosted (running in pods), we can check it
.lab[
- Show image versions for all pods in `kube-system` namespace:
```bash
kubectl --namespace=kube-system get pods -o json \
| jq -r '
.items[]
| [.spec.nodeName, .metadata.name]
+
(.spec.containers[].image | split(":"))
| @tsv
' \
| column -t
```
]
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## What version are we running anyway?
- When I say, "I'm running Kubernetes 1.28", is that the version of:
- kubectl
- API server
- kubelet
- controller manager
- something else?
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Other versions that are important
- etcd
- kube-dns or CoreDNS
- CNI plugin(s)
- Network controller, network policy controller
- Container engine
- Linux kernel
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Important questions
- Should we upgrade the control plane before or after the kubelets?
- Within the control plane, should we upgrade the API server first or last?
- How often should we upgrade?
- How long are versions maintained?
- All the answers are in [the documentation about version skew policy](https://kubernetes.io/docs/setup/release/version-skew-policy/)!
- Let's review the key elements together ...
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Kubernetes uses semantic versioning
- Kubernetes versions look like MAJOR.MINOR.PATCH; e.g. in 1.28.9:
- MAJOR = 1
- MINOR = 28
- PATCH = 9
- It's always possible to mix and match different PATCH releases
(e.g. 1.28.9 and 1.28.13 are compatible)
- It is recommended to run the latest PATCH release
(but it's mandatory only when there is a security advisory)
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Version skew
- API server must be more recent than its clients (kubelet and control plane)
- ... Which means it must always be upgraded first
- All components support a difference of one¹ MINOR version
- This allows live upgrades (since we can mix e.g. 1.28 and 1.29)
- It also means that going from 1.28 to 1.30 requires going through 1.29
.footnote[¹Except kubelet, which can be up to two MINOR behind API server,
and kubectl, which can be one MINOR ahead or behind API server.]
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Release cycle
- There is a new PATCH release whenever necessary
(every few weeks, or "ASAP" when there is a security vulnerability)
- There is a new MINOR release every 3 months (approximately)
- At any given time, three MINOR releases are maintained
- ... Which means that MINOR releases are maintained approximately 9 months
- We should expect to upgrade at least every 3 months (on average)
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## General guidelines
- To update a component, use whatever was used to install it
- If it's a distro package, update that distro package
- If it's a container or pod, update that container or pod
- If you used configuration management, update with that
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Know where your binaries come from
- Sometimes, we need to upgrade *quickly*
(when a vulnerability is announced and patched)
- If we are using an installer, we should:
- make sure it's using upstream packages
- or make sure that whatever packages it uses are current
- make sure we can tell it to pin specific component versions
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## In practice
- We are going to update a few cluster components
- We will change the kubelet version on one node
- We will change the version of the API server
- We will work with cluster `oldversion` (nodes `oldversion1`, `oldversion2`, `oldversion3`)
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Updating the API server
- This cluster has been deployed with kubeadm
- The control plane runs in *static pods*
- These pods are started automatically by kubelet
(even when kubelet can't contact the API server)
- They are defined in YAML files in `/etc/kubernetes/manifests`
(this path is set by a kubelet command-line flag)
- kubelet automatically updates the pods when the files are changed
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Changing the API server version
- We will edit the YAML file to use a different image version
.lab[
- Log into node `oldversion1`
- Check API server version:
```bash
kubectl version
```
- Edit the API server pod manifest:
```bash
sudo vim /etc/kubernetes/manifests/kube-apiserver.yaml
```
- Look for the `image:` line, and update it to e.g. `v1.30.1`
]
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Checking what we've done
- The API server will be briefly unavailable while kubelet restarts it
.lab[
- Check the API server version:
```bash
kubectl version
```
]
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Was that a good idea?
--
**No!**
--
- Remember the guideline we gave earlier:
*To update a component, use whatever was used to install it.*
- This control plane was deployed with kubeadm
- We should use kubeadm to upgrade it!
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Updating the whole control plane
- Let's make it right, and use kubeadm to upgrade the entire control plane
(note: this is possible only because the cluster was installed with kubeadm)
.lab[
- Check what will be upgraded:
```bash
sudo kubeadm upgrade plan
```
]
Note 1: kubeadm thinks that our cluster is running 1.30.1.
It is confused by our manual upgrade of the API server!
Note 2: kubeadm itself is still version 1.28.X.
It doesn't know how to upgrade to 1.29.X.
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Upgrading kubeadm
- First things first: we need to upgrade kubeadm
- The Kubernetes package repositories are now split by minor versions
(i.e. there is one repository for 1.28, another for 1.29, etc.)
- This avoids accidentally upgrading from one minor version to another
(e.g. with unattended upgrades or if packages haven't been held/pinned)
- We'll need to add the new package repository and unpin packages!
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Installing the new packages
- Edit `/etc/apt/sources.list.d/kubernetes.list`
(or copy it to e.g. `kubernetes-1.29.list` and edit that)
- `apt-get update`
- Now edit (or remove) `/etc/apt/preferences.d/kubernetes`
- `apt-get install kubeadm` should now upgrade `kubeadm` correctly! 🎉
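In practice, that could look like this (hedged sketch; it mirrors what we'll script for the other nodes later):
```bash
sudo sed -i s/1.28/1.29/ /etc/apt/sources.list.d/kubernetes.list
sudo rm /etc/apt/preferences.d/kubernetes
sudo apt-get update
sudo apt-get install kubeadm
```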
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Reverting our manual API server upgrade
- First, we should revert our `image:` change
(so that kubeadm executes the right migration steps)
.lab[
- Edit the API server pod manifest:
```bash
sudo vim /etc/kubernetes/manifests/kube-apiserver.yaml
```
- Look for the `image:` line, and restore it to the original value
(e.g. `v1.28.9`)
- Wait for the control plane to come back up
]
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Upgrading the cluster with kubeadm
- Now we can let kubeadm do its job!
.lab[
- Check the upgrade plan:
```bash
sudo kubeadm upgrade plan
```
- Perform the upgrade:
```bash
sudo kubeadm upgrade apply v1.29.0
```
]
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Updating kubelet
- These nodes have been installed using the official Kubernetes packages
- We can therefore use `apt` or `apt-get`
.lab[
- Log into node `oldversion2`
- Update package lists and APT pins like we did before
- Then upgrade kubelet
]
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Checking what we've done
.lab[
- Log into node `oldversion1`
- Check node versions:
```bash
kubectl get nodes -o wide
```
- Create a deployment and scale it to make sure that the node still works
]
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Was that a good idea?
--
**Almost!**
--
- Yes, kubelet was installed with distribution packages
- However, kubeadm took care of configuring kubelet
(when doing `kubeadm join ...`)
- We were supposed to run a special command *before* upgrading kubelet!
- That command should be executed on each node
- It will download the kubelet configuration generated by kubeadm
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Upgrading kubelet the right way
- We need to upgrade kubeadm, upgrade kubelet config, then upgrade kubelet
(after upgrading the control plane)
.lab[
- Execute the whole upgrade procedure on each node:
```bash
for N in 1 2 3; do
ssh oldversion$N "
sudo sed -i s/1.28/1.29/ /etc/apt/sources.list.d/kubernetes.list &&
sudo rm /etc/apt/preferences.d/kubernetes &&
sudo apt update &&
sudo apt install kubeadm -y &&
sudo kubeadm upgrade node &&
sudo apt install kubelet -y"
done
```
]
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Checking what we've done
- All our nodes should now be updated to version 1.29
.lab[
- Check node versions:
```bash
kubectl get nodes -o wide
```
]
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## And now, was that a good idea?
--
**Almost!**
--
- The official recommendation is to *drain* a node before performing node maintenance
(migrate all workloads off the node before upgrading it)
- How do we do that?
- Is it really necessary?
- Let's see!
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Draining a node
- This can be achieved with the `kubectl drain` command, which will:
- *cordon* the node (prevent new pods from being scheduled there)
- *evict* all the pods running on the node (delete them gracefully)
- the evicted pods will automatically be recreated somewhere else
- evictions might be blocked in some cases (Pod Disruption Budgets, `emptyDir` volumes)
- Once the node is drained, it can safely be upgraded, restarted...
- Once it's ready, it can be put back in commission with `kubectl uncordon`
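For example (the flags shown are the usual ones to get past DaemonSet pods and `emptyDir` volumes):
```bash
kubectl drain oldversion2 --ignore-daemonsets --delete-emptydir-data
# ...upgrade / reboot the node...
kubectl uncordon oldversion2
```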
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Is it necessary?
- When upgrading kubelet from one patch-level version to another:
- it's *probably fine*
- When upgrading system packages:
- it's *probably fine*
- except [when it's not][datadog-systemd-outage]
- When upgrading the kernel:
- it's *probably fine*
- ...as long as we can tolerate a restart of the containers on the node
- ...and that they will be unavailable for a few minutes (during the reboot)
[datadog-systemd-outage]: https://www.datadoghq.com/blog/engineering/2023-03-08-deep-dive-into-platform-level-impact/
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Is it necessary?
- When upgrading kubelet from one minor version to another:
- it *may or may not be fine*
- in some cases (e.g. migrating from Docker to containerd) it *will not*
- Here's what [the documentation][node-upgrade-docs] says:
*Draining nodes before upgrading kubelet ensures that pods are re-admitted and containers are re-created, which may be necessary to resolve some security issues or other important bugs.*
- Do it at your own risk, and if you do, test extensively in staging environments!
[node-upgrade-docs]: https://kubernetes.io/docs/tasks/administer-cluster/cluster-upgrade/#manual-deployments
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
## Database operators to the rescue
- Moving stateful pods (e.g.: database server) can cause downtime
- Database replication can help:
- if a node contains database servers, we make sure these servers aren't primaries
- if they are primaries, we execute a *switch over*
- Some database operators (e.g. [CNPG]) will do that switch over automatically
(when they detect that a node has been *cordoned*)
[CNPG]: https://cloudnative-pg.io/
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
class: extra-details
## Skipping versions
- This example worked because we went from 1.28 to 1.29
- If you are upgrading from e.g. 1.26, you will have to go through 1.27 first
- This means upgrading kubeadm to 1.27.X, then using it to upgrade the cluster
- Then upgrading kubeadm to 1.28.X, etc.
- **Make sure to read the release notes before upgrading!**
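For instance, going from 1.26 would mean repeating the same procedure once per minor version (a sketch based on the apt commands used earlier in this section):
```bash
# Point the package repository at the next minor version
sudo sed -i s/1.26/1.27/ /etc/apt/sources.list.d/kubernetes.list
sudo apt update && sudo apt install kubeadm -y
# Upgrade the control plane (use an actual 1.27 patch release here)
sudo kubeadm upgrade apply v1.27.x
# Then upgrade kubelet on every node, and repeat all of this for 1.28, 1.29...
```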
???
:EN:- Best practices for cluster upgrades
:EN:- Example: upgrading a kubeadm cluster
:FR:- Bonnes pratiques pour la mise à jour des clusters
:FR:- Exemple : mettre à jour un cluster kubeadm
.debug[[k8s/cluster-upgrade.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-upgrade.md)]
---
class: pic
.interstitial[]
---
name: toc-backing-up-clusters
class: title
Backing up clusters
.nav[
[Previous part](#toc-upgrading-clusters)
|
[Back to table of contents](#toc-part-15)
|
[Next part](#toc-the-cloud-controller-manager)
]
.debug[(automatically generated title slide)]
---
# Backing up clusters
- Backups can have multiple purposes:
- disaster recovery (servers or storage are destroyed or unreachable)
- error recovery (human or process has altered or corrupted data)
- cloning environments (for testing, validation...)
- Let's see the strategies and tools available with Kubernetes!
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Important
- Kubernetes helps us with disaster recovery
(it gives us replication primitives)
- Kubernetes helps us clone / replicate environments
(all resources can be described with manifests)
- Kubernetes *does not* help us with error recovery
- We still need to back up/snapshot our data:
- with database backups (mysqldump, pg_dump, etc.)
- and/or snapshots at the storage layer
- and/or traditional full disk backups
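For example, a minimal ad-hoc backup of a database running in a pod could look like this (assuming a pod named `postgres-0` and a database named `demo`; in production, we'd rather run this from a CronJob and ship the dump to durable external storage):
```bash
# Dump one database to a local file through kubectl exec
kubectl exec postgres-0 -- pg_dump -U postgres demo > demo-backup.sql
```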
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## In a perfect world ...
- The deployment of our Kubernetes clusters is automated
(recreating a cluster takes less than a minute of human time)
- All the resources (Deployments, Services...) on our clusters are under version control
(never use `kubectl run`; always apply YAML files coming from a repository)
- Stateful components are either:
- stored on systems with regular snapshots
- backed up regularly to an external, durable storage
- outside of Kubernetes
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Kubernetes cluster deployment
- If our deployment system isn't fully automated, it should at least be documented
- Litmus test: how long does it take to deploy a cluster...
- for a senior engineer?
- for a new hire?
- Does it require external intervention?
(e.g. provisioning servers, signing TLS certs...)
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Plan B
- Full machine backups of the control plane can help
- If the control plane is in pods (or containers), pay attention to storage drivers
(if the backup mechanism is not container-aware, the backups can take way more resources than they should, or even be unusable!)
- If the previous sentence worries you:
**automate the deployment of your clusters!**
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Managing our Kubernetes resources
- Ideal scenario:
- never create a resource directly on a cluster
- push to a code repository
- a special branch (`production` or even `master`) gets automatically deployed
- Some folks call this "GitOps"
(it's the logical evolution of configuration management and infrastructure as code)
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## GitOps in theory
- What do we keep in version control?
- For very simple scenarios: source code, Dockerfiles, scripts
- For real applications: add resources (as YAML files)
- For applications deployed multiple times: Helm, Kustomize...
(staging and production count as "multiple times")
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## GitOps tooling
- Various tools exist (Weave Flux, GitKube...)
- These tools are still very young
- You still need to write YAML for all your resources
- There is no tool to:
- list *all* resources in a namespace
- get resource YAML in a canonical form
- diff YAML descriptions with current state
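We can approximate parts of this with plain kubectl, but it's clunky and the output is not canonical (a rough sketch, not a complete solution):
```bash
# Dump every namespaced object in a namespace as YAML
kubectl api-resources --verbs=list --namespaced -o name |
  xargs -n1 kubectl get -n default -o yaml --ignore-not-found

# Compare manifests on disk with the live cluster state
kubectl diff -f ./manifests/
```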
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## GitOps in practice
- Start describing your resources with YAML
- Leverage a tool like Kustomize or Helm
- Make sure that you can easily deploy to a new namespace
(or even better: to a new cluster)
- When tooling matures, you will be ready
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Plan B
- What if we can't describe everything with YAML?
- What if we manually create resources and forget to commit them to source control?
- What about global resources, that don't live in a namespace?
- How can we be sure that we saved *everything*?
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Backing up etcd
- All objects are saved in etcd
- etcd data should be relatively small
(and therefore, quick and easy to back up)
- Two options to back up etcd:
- snapshot the data directory
- use `etcdctl snapshot`
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Making an etcd snapshot
- The basic command is simple:
```bash
etcdctl snapshot save <snapshot-file>
```
- But we also need to specify:
- the address of the server to back up
- the path to the key, certificate, and CA certificate
(if our etcd uses TLS certificates)
- an environment variable to specify that we want etcdctl v3
(not necessary anymore with recent versions of etcd)
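Putting that together, a direct invocation on a control plane node could look like this (a sketch assuming kubeadm's default certificate paths, which we'll reuse in the next slides):
```bash
# ETCDCTL_API=3 is only required with older etcdctl releases
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  snapshot save /tmp/snapshot
```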
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Snapshotting etcd on kubeadm
- Here is a strategy that works on clusters deployed with kubeadm
(and maybe others)
- We're going to:
- identify a node running the control plane
- identify the etcd image
- execute `etcdctl snapshot` in a *debug container*
- transfer the resulting snapshot with another *debug container*
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Finding an etcd node and image
These commands let us retrieve the node and image automatically.
.lab[
- Get the name of a control plane node:
```bash
NODE=$(kubectl get nodes \
--selector=node-role.kubernetes.io/control-plane \
-o jsonpath={.items[0].metadata.name})
```
- Get the etcd image:
```bash
IMAGE=$(kubectl get pods --namespace kube-system etcd-$NODE \
-o jsonpath={..containers[].image})
```
]
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Making a snapshot
This relies on the fact that in a `node` debug pod:
- the host filesystem is mounted in `/host`,
- the debug pod is using the host's network.
.lab[
- Execute `etcdctl` in a debug pod:
```bash
kubectl debug --attach --profile=general \
node/$NODE --image $IMAGE -- \
etcdctl --endpoints=https://[127.0.0.1]:2379 \
--cacert=/host/etc/kubernetes/pki/etcd/ca.crt \
--cert=/host/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/host/etc/kubernetes/pki/etcd/healthcheck-client.key \
snapshot save /host/tmp/snapshot
```
]
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Transfer the snapshot
We're going to use base64 encoding to ensure that the snapshot
doesn't get corrupted in transit.
.lab[
- Retrieve the snapshot:
```bash
kubectl debug --attach --profile=general --quiet \
node/$NODE --image $IMAGE -- \
base64 /host/tmp/snapshot | base64 -d > snapshot
```
]
We can now work with the `snapshot` file in the current directory!
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Restoring an etcd snapshot
- ~~Execute exactly the same command, but replacing `save` with `restore`~~
(Believe it or not, doing that will *not* do anything useful!)
- The `restore` command does *not* load a snapshot into a running etcd server
- The `restore` command creates a new data directory from the snapshot
(it's an offline operation; it doesn't interact with an etcd server)
- It will create a new data directory in a temporary container
(leaving the running etcd node untouched)
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## When using kubeadm
1. Create a new data directory from the snapshot:
```bash
sudo rm -rf /var/lib/etcd
docker run --rm -v /var/lib:/var/lib -v $PWD:/vol $IMAGE \
etcdctl snapshot restore /vol/snapshot --data-dir=/var/lib/etcd
```
2. Provision the control plane, using that data directory:
```bash
sudo kubeadm init \
--ignore-preflight-errors=DirAvailable--var-lib-etcd
```
3. Rejoin the other nodes
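Step 3 typically boils down to generating a fresh join command on the control plane, then running it on each of the other nodes (a sketch):
```bash
# On the control plane: print a join command with a fresh bootstrap token
kubeadm token create --print-join-command

# On each other node: clear the old state, then paste and run the printed command
sudo kubeadm reset -f
# sudo kubeadm join ...   (the output of the command above)
```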
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## The fine print
- This only saves etcd state
- It **does not** save persistent volumes and local node data
- Some critical components (like the pod network) might need to be reset
- As a result, our pods might have to be recreated, too
- If we have proper liveness checks, this should happen automatically
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Accessing etcd directly
- Data in etcd is encoded in a binary format
- We can retrieve the data with etcdctl, but it's hard to read
- There is a tool to decode that data: `auger`
- Check the [use cases][auger-use-cases] for an example of how to retrieve and modify etcd data!
[auger-use-cases]: https://github.com/etcd-io/auger?tab=readme-ov-file#use-cases
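For example, assuming `auger` and `etcdctl` are installed and the usual endpoint/TLS options are set, decoding a single object could look like this (the object key is hypothetical):
```bash
# Fetch the raw (binary) value of an object and decode it to YAML
etcdctl get /registry/deployments/default/worker --print-value-only | auger decode
```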
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## More information about etcd backups
- [Kubernetes documentation](https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/#built-in-snapshot) about etcd backups
- [etcd documentation](https://coreos.com/etcd/docs/latest/op-guide/recovery.html#snapshotting-the-keyspace) about snapshots and restore
- [A good blog post by elastisys](https://elastisys.com/2018/12/10/backup-kubernetes-how-and-why/) explaining how to restore a snapshot
- [Another good blog post by consol labs](https://labs.consol.de/kubernetes/2018/05/25/kubeadm-backup.html) on the same topic
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Don't forget ...
- Also back up the TLS information
(at the very least: CA key and cert; API server key and cert)
- With clusters provisioned by kubeadm, this is in `/etc/kubernetes/pki`
- If you don't:
- you will still be able to restore etcd state and bring everything back up
- you will need to redistribute user certificates
.warning[**TLS information is highly sensitive!
Anyone who has it has full access to your cluster!**]
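A minimal way to do that on a kubeadm control plane node (a sketch; store the archive somewhere safe, ideally encrypted):
```bash
# Archive the cluster PKI (it contains private keys: handle with care!)
sudo tar czf kubernetes-pki-backup.tar.gz /etc/kubernetes/pki
```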
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Stateful services
- It's totally fine to keep your production databases outside of Kubernetes
*Especially if you have only one database server!*
- Feel free to put development and staging databases on Kubernetes
(as long as they don't hold important data)
- Using Kubernetes for stateful services makes sense if you have *many*
(because then you can leverage Kubernetes automation)
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## Snapshotting persistent volumes
- Option 1: snapshot volumes out of band
(with the API/CLI/GUI of our SAN/cloud/...)
- Option 2: storage system integration
(e.g. [Portworx](https://docs.portworx.com/portworx-install-with-kubernetes/storage-operations/create-snapshots/) can [create snapshots through annotations](https://docs.portworx.com/portworx-install-with-kubernetes/storage-operations/create-snapshots/snaps-annotations/#taking-periodic-snapshots-on-a-running-pod))
- Option 3: [snapshots through Kubernetes API](https://kubernetes.io/docs/concepts/storage/volume-snapshots/)
(Generally available since Kubernetes 1.20 for a number of [CSI](https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/) volume plugins: GCE, OpenSDS, Ceph, Portworx, etc.)
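With option 3, a snapshot is just another Kubernetes resource; a minimal sketch (the VolumeSnapshotClass and PVC names are assumptions and depend on your CSI driver):
```bash
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: postgres-data
EOF
```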
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
## More backup tools
- [Stash](https://appscode.com/products/stash/)
back up Kubernetes persistent volumes
- [ReShifter](https://github.com/mhausenblas/reshifter)
cluster state management
- ~~Heptio Ark~~ [Velero](https://github.com/heptio/velero)
full cluster backup
- [kube-backup](https://github.com/pieterlange/kube-backup)
simple scripts to save resource YAML to a git repository
- [bivac](https://github.com/camptocamp/bivac)
Backup Interface for Volumes Attached to Containers
???
:EN:- Backing up clusters
:FR:- Politiques de sauvegarde
.debug[[k8s/cluster-backup.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-backup.md)]
---
class: pic
.interstitial[]
---
name: toc-the-cloud-controller-manager
class: title
The Cloud Controller Manager
.nav[
[Previous part](#toc-backing-up-clusters)
|
[Back to table of contents](#toc-part-15)
|
[Next part](#toc-last-words)
]
.debug[(automatically generated title slide)]
---
# The Cloud Controller Manager
- Kubernetes has many features that are cloud-specific
(e.g. providing cloud load balancers when a Service of type LoadBalancer is created)
- These features were initially implemented in API server and controller manager
- Since Kubernetes 1.6, these features are available through a separate process:
the *Cloud Controller Manager*
- The CCM is optional, but if we run in a cloud, we probably want it!
.debug[[k8s/cloud-controller-manager.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cloud-controller-manager.md)]
---
## Cloud Controller Manager duties
- Creating and updating cloud load balancers
- Configuring routing tables in the cloud network (specific to GCE)
- Updating node labels to indicate region, zone, instance type...
- Obtaining node names and internal/external addresses from the cloud metadata service
- Deleting nodes from Kubernetes when they're deleted in the cloud
- Managing *some* volumes (e.g. EBS volumes, Azure Disks...)
(Eventually, volumes will be managed by the Container Storage Interface)
.debug[[k8s/cloud-controller-manager.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cloud-controller-manager.md)]
---
## In-tree vs. out-of-tree
- A number of cloud providers are supported "in-tree"
(in the main kubernetes/kubernetes repository on GitHub)
- More cloud providers are supported "out-of-tree"
(with code in different repositories)
- There is an [ongoing effort](https://github.com/kubernetes/kubernetes/tree/master/pkg/cloudprovider) to move everything to out-of-tree providers
.debug[[k8s/cloud-controller-manager.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cloud-controller-manager.md)]
---
## In-tree providers
The following providers are actively maintained:
- Amazon Web Services
- Azure
- Google Compute Engine
- IBM Cloud
- OpenStack
- VMware vSphere
These ones are less actively maintained:
- Apache CloudStack
- oVirt
- VMware Photon
.debug[[k8s/cloud-controller-manager.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cloud-controller-manager.md)]
---
## Out-of-tree providers
The list includes the following providers:
- DigitalOcean
- keepalived (not exactly a cloud; provides VIPs for load balancers)
- Linode
- Oracle Cloud Infrastructure
(And possibly others; there is no central registry for these.)
.debug[[k8s/cloud-controller-manager.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cloud-controller-manager.md)]
---
## Audience questions
- What kind of clouds are you using/planning to use?
- What kind of details would you like to see in this section?
- Would you appreciate details on clouds that you don't / won't use?
.debug[[k8s/cloud-controller-manager.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cloud-controller-manager.md)]
---
## Cloud Controller Manager in practice
- Write a configuration file
(typically `/etc/kubernetes/cloud.conf`)
- Run the CCM process
(on self-hosted clusters, this can be a DaemonSet selecting the control plane nodes)
- Start kubelet with `--cloud-provider=external`
- When using managed clusters, this is done automatically
- There is very little documentation on writing the configuration file
(except for OpenStack)
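On kubeadm nodes, the kubelet flag is usually set through the standard drop-in file (a sketch; the exact mechanism depends on how kubelet is installed and managed):
```bash
# /etc/default/kubelet on Debian/Ubuntu (or /etc/sysconfig/kubelet on RHEL-like systems)
echo 'KUBELET_EXTRA_ARGS=--cloud-provider=external' | sudo tee /etc/default/kubelet
sudo systemctl restart kubelet
```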
.debug[[k8s/cloud-controller-manager.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cloud-controller-manager.md)]
---
## Bootstrapping challenges
- When a node joins the cluster, it needs to obtain a signed TLS certificate
- That certificate must contain the node's addresses
- These addresses are provided by the Cloud Controller Manager
(at least the external address)
- To get these addresses, the node needs to communicate with the control plane
- ...Which means joining the cluster
(The problem didn't occur when cloud-specific code was running in kubelet: kubelet could obtain the required information directly from the cloud provider's metadata service.)
.debug[[k8s/cloud-controller-manager.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cloud-controller-manager.md)]
---
## More information about CCM
- CCM configuration and operation is highly specific to each cloud provider
(which is why this section remains very generic)
- The Kubernetes documentation has *some* information:
- [architecture and diagrams](https://kubernetes.io/docs/concepts/architecture/cloud-controller/)
- [configuration](https://kubernetes.io/docs/concepts/cluster-administration/cloud-providers/) (mainly for OpenStack)
- [deployment](https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/)
???
:EN:- The Cloud Controller Manager
:FR:- Le *Cloud Controller Manager*
.debug[[k8s/cloud-controller-manager.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cloud-controller-manager.md)]
---
class: pic
.interstitial[]
---
name: toc-last-words
class: title
Last words
.nav[
[Previous part](#toc-the-cloud-controller-manager)
|
[Back to table of contents](#toc-part-16)
|
[Next part](#toc-links-and-resources)
]
.debug[(automatically generated title slide)]
---
# Last words
- Congratulations!
- We learned a lot about Kubernetes, its internals, its advanced concepts
--
- That was just the easy part
- The hard challenges will revolve around *culture* and *people*
--
- ... What does that mean?
.debug[[k8s/lastwords.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/lastwords.md)]
---
## Running an app involves many steps
- Write the app
- Tests, QA ...
- Ship *something* (more on that later)
- Provision resources (e.g. VMs, clusters)
- Deploy the *something* on the resources
- Manage, maintain, monitor the resources
- Manage, maintain, monitor the app
- And much more
.debug[[k8s/lastwords.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/lastwords.md)]
---
## Who does what?
- The old "devs vs ops" division has changed
- In some organizations, "ops" are now called "SRE" or "platform" teams
(and they have very different sets of skills)
- Do you know which team is responsible for each item on the list on the previous page?
- Acknowledge that a lot of tasks are outsourced
(e.g. if we add "buy/rack/provision machines" in that list)
.debug[[k8s/lastwords.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/lastwords.md)]
---
## What do we ship?
- Some organizations embrace "you build it, you run it"
- When "build" and "run" are owned by different teams, where's the line?
- What does the "build" team ship to the "run" team?
- Let's see a few options, and what they imply
.debug[[k8s/lastwords.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/lastwords.md)]
---
## Shipping code
- Team "build" ships code
(hopefully in a repository, identified by a commit hash)
- Team "run" containerizes that code
✔️ no extra work for developers
❌ very little advantage of using containers
.debug[[k8s/lastwords.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/lastwords.md)]
---
## Shipping container images
- Team "build" ships container images
(hopefully built automatically from a source repository)
- Team "run" uses theses images to create e.g. Kubernetes resources
✔️ universal artefact (support all languages uniformly)
✔️ easy to start a single component (good for monoliths)
❌ complex applications will require a lot of extra work
❌ adding/removing components in the stack also requires extra work
❌ complex applications will run very differently between dev and prod
.debug[[k8s/lastwords.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/lastwords.md)]
---
## Shipping Compose files
(Or another kind of dev-centric manifest)
- Team "build" ships a manifest that works on a single node
(as well as images, or ways to build them)
- Team "run" adapts that manifest to work on a cluster
✔️ all teams can start the stack in a reliable, deterministic manner
❌ adding/removing components still requires *some* work (but less than before)
❌ there will be *some* differences between dev and prod
.debug[[k8s/lastwords.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/lastwords.md)]
---
## Shipping Kubernetes manifests
- Team "build" ships ready-to-run manifests
(YAML, Helm charts, Kustomize ...)
- Team "run" adjusts some parameters and monitors the application
✔️ parity between dev and prod environments
✔️ "run" team can focus on SLAs, SLOs, and overall quality
❌ requires *a lot* of extra work (and new skills) from the "build" team
❌ Kubernetes is not a very convenient development platform (at least, not yet)
.debug[[k8s/lastwords.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/lastwords.md)]
---
## What's the right answer?
- It depends on our teams
- existing skills (do they know how to do it?)
- availability (do they have the time to do it?)
- potential skills (can they learn to do it?)
- It depends on our culture
- owning "run" often implies being on call
- do we reward on-call duty without encouraging hero syndrome?
- do we give people resources (time, money) to learn?
.debug[[k8s/lastwords.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/lastwords.md)]
---
class: extra-details
## Tools to develop on Kubernetes
*If we decide to make Kubernetes the primary development platform, here
are a few tools that can help us.*
- Docker Desktop
- Draft
- Minikube
- Skaffold
- Tilt
- ...
.debug[[k8s/lastwords.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/lastwords.md)]
---
## Where do we run?
- Managed vs. self-hosted
- Cloud vs. on-premises
- If cloud: public vs. private
- Which vendor/distribution to pick?
- Which versions/features to enable?
.debug[[k8s/lastwords.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/lastwords.md)]
---
## Developer experience
*These questions constitute a quick "smoke test" for our strategy:*
- How do we on-board a new developer?
- What do they need to install to get a dev stack?
- How does a code change make it from dev to prod?
- How does someone add a component to a stack?
.debug[[k8s/lastwords.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/lastwords.md)]
---
## Some guidelines
- Start small
- Outsource what we don't know
- Start simple, and stay simple as long as possible
(try to stay away from complex features that we don't need)
- Automate
(regularly check that we can successfully redeploy by following scripts)
- Transfer knowledge
(make sure everyone is on the same page/level)
- Iterate!
.debug[[k8s/lastwords.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/lastwords.md)]
---
class: pic
.interstitial[]
---
name: toc-links-and-resources
class: title
Links and resources
.nav[
[Previous part](#toc-last-words)
|
[Back to table of contents](#toc-part-16)
|
[Next part](#toc-)
]
.debug[(automatically generated title slide)]
---
# Links and resources
All things Kubernetes:
- [Kubernetes Community](https://kubernetes.io/community/) - Slack, Google Groups, meetups
- [Kubernetes on StackOverflow](https://stackoverflow.com/questions/tagged/kubernetes)
- [Play With Kubernetes Hands-On Labs](https://medium.com/@marcosnils/introducing-pwk-play-with-k8s-159fcfeb787b)
All things Docker:
- [Docker documentation](http://docs.docker.com/)
- [Docker Hub](https://hub.docker.com)
- [Docker on StackOverflow](https://stackoverflow.com/questions/tagged/docker)
- [Play With Docker Hands-On Labs](http://training.play-with-docker.com/)
Everything else:
- [Local meetups](https://www.meetup.com/)
.footnote[These slides (and future updates) are on → http://container.training/]
.debug[[k8s/links.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/links.md)]
---
class: title, self-paced
Thank you!
.debug[[shared/thankyou.md](https://github.com/jpetazzo/container.training/tree/main/slides/shared/thankyou.md)]
---
class: title, in-person
That's all, folks!
Questions?

.debug[[shared/thankyou.md](https://github.com/jpetazzo/container.training/tree/main/slides/shared/thankyou.md)]