[key=value] [key=value] [key:=value]
```
- The `key=value` pairs get turned into a JSON object
- `key:=value` indicates a parameter to be sent "as-is"
(ideal for e.g. booleans or numbers)
.debug[[k8s/ollama-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-intro.md)]
---
## Sending some load
We're going to use `hey`:
```bash
kubectl run hey --rm -it --image nixery.dev/hey -- \
hey -c 10 -n 10 -t 60 -m POST \
-d '{"model": "qwen2:1.5b", "prompt": "vi or emacs?"}' \
http://ollama:11434/api/generate
```
Some explanations:
- `nixery.dev` = a registry that automatically generates images with [Nixery]
- `-c` = concurrent requests
- `-n` = total number of requests
- `-t` = timeout in seconds
This is probably going to take (literally) a minute.
[Nixery]: https://nixery.dev/
.debug[[k8s/ollama-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-intro.md)]
---
## Performance analysis
- Let's start an interactive container with `hey`
(e.g., use the `alpine` image, then `apk add hey`)
- Try 10 requests, with a concurrency of 1/2/4
- Meanwhile, check the logs of the `ollama` pod
- Some results (your results may vary depending on CPU, random seed...):
- 1 = 0.08 reqs/s, average latency: 12s
- 2 = 0.10 reqs/s, average latency: 18s
- 4 = 0.12 reqs/s, average latency: 28s
- Higher concurrency = slightly higher throughput, much higher latency
🤔 We need metrics!
.debug[[k8s/ollama-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-intro.md)]
---
class: pic
.interstitial[]
---
name: toc-adding-metrics
class: title
Adding metrics
.nav[
[Previous part](#toc-ollama-on-kubernetes)
|
[Back to table of contents](#toc-part-2)
|
[Next part](#toc-message-queue-architecture)
]
.debug[(automatically generated title slide)]
---
# Adding metrics
We want multiple kinds of metrics:
- instantaneous pod and node resource usage
- historical resource usage (=graphs)
- request duration
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
## 1️⃣ Instantaneous resource usage
- We're going to use metrics-server
- Check if it's already installed:
```bash
kubectl top nodes
```
- If we see a list of nodes, with CPU and RAM usage:
*great, metrics-server is installed!*
- If we see `error: Metrics API not available`:
*metrics-server isn't installed, so we'll install it!*
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
## Installing metrics-server
- In a lot of places, this is done with a little bit of custom YAML
(derived from the [official installation instructions](https://github.com/kubernetes-sigs/metrics-server#installation))
- We can also use a Helm chart:
```bash
helm upgrade --install metrics-server metrics-server \
--create-namespace --namespace metrics-server \
--repo https://kubernetes-sigs.github.io/metrics-server/ \
--set args={--kubelet-insecure-tls=true}
```
- The `args` flag specified above should be sufficient on most clusters
- After a minute, `kubectl top nodes` should show resource usage
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
## 2️⃣ Historical resource usage
- We're going to use Prometheus (specifically: kube-prometheus-stack)
- This is a Helm chart bundling:
- Prometheus
- multiple exporters (node, kube-state-metrics...)
- Grafana
- a handful of Grafana dashboards
- Open Source
- Commercial alternatives: Datadog, New Relic...
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
## Installing kube-prometheus-stack
We're going to expose both Prometheus and Grafana with a NodePort:
```bash
helm upgrade --install --repo https://prometheus-community.github.io/helm-charts \
promstack kube-prometheus-stack \
--namespace prom-system --create-namespace \
--set prometheus.service.type=NodePort \
--set grafana.service.type=NodePort \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
#
```
This chart installation can take a while (up to a couple of minutes).
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
class: extra-details
## `...NilUsesHelmValues=false` ???
- kube-prometheus-stack uses the "Prometheus Operator"
- To configure "scrape targets", we create PodMonitor or ServiceMonitor resources
- By default, the Prometheus Operator will only look at \*Monitors with the right labels
- Our extra options mean "use all the Monitors that you find!"
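For example, here is a hedged sketch of a PodMonitor telling Prometheus to scrape a port named `metrics` on pods labeled `app=ollama` (label and port names are illustrative, not from the chart):
```yaml
# Hypothetical PodMonitor: scrape the port named "metrics"
# on pods labeled app=ollama (names are illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ollama-haproxy
spec:
  selector:
    matchLabels:
      app: ollama
  podMetricsEndpoints:
  - port: metrics
```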
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
## Connecting to Grafana
Check the NodePort allocated to Grafana:
```bash
kubectl get service promstack-grafana --namespace prom-system
```
Get the public address of one of our nodes:
```bash
kubectl get nodes -o wide
```
In a browser, connect to the public address of any node, on the node port.
The default login and password are `admin` / `prom-operator`.
Check the dashboard "Kubernetes / Compute Resources / Namespace (Pods)".
Select a namespace and see the CPU and RAM usage for the pods in that namespace.
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
## 3️⃣ Request duration
- Unfortunately, as of November 2024, ollama doesn't expose metrics
(there is ongoing discussion about it: [issue 3144][3144], [PR 6537][6537])
- There are some [garbage AI-generated blog posts claiming otherwise][garbage]
(but it's AI-generated, so it bears no connection to truth whatsoever)
- So, what can we do?
[3144]: https://github.com/ollama/ollama/issues/3144#issuecomment-2153184254
[6537]: https://github.com/ollama/ollama/pull/6537
[garbage]: https://www.arsturn.com/blog/setting-up-ollama-prometheus-metrics
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
## HAProxy to the rescue
- HAProxy is a proxy that can handle TCP, HTTP, and more
- It can expose detailed Prometheus metrics about HTTP requests
- The plan: add a sidecar HAProxy to each Ollama container
- For that, we need to give up on the Ollama Helm chart
(and go back to basic manifests)
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
## 🙋 Choose your own adventure
Do we want to...
- write all the corresponding manifests?
- look at pre-written manifests and explain how they work?
- apply the manifests and carry on?
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
## 🏗️ Let's build something!
- If you have created Deployments / Services: clean them up first!
- Deploy Ollama with a sidecar HAProxy (sample configuration on next slide)
- Run a short benchmark campaign
(e.g. scale to 4 pods, try 4/8/16 parallel requests, 2 minutes each)
- Check live resource usage with `kubectl top nodes` / `kubectl top pods`
- Check historical usage with the Grafana dashboards
(for HAProxy metrics, you can use [Grafana dashboard 12693, HAProxy 2 Full][grafana-12693])
- If you don't want to write the manifests, you can use [these][ollama-yaml]
[grafana-12693]: https://grafana.com/grafana/dashboards/12693-haproxy-2-full/
[ollama-yaml]: https://github.com/jpetazzo/beyond-load-balancers/tree/main/ollama
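If you want to write the manifests yourself, here is a minimal, hedged sketch of the Deployment (image tags, labels, and ports are illustrative; it assumes a ConfigMap named `haproxy-config` with a `haproxy.cfg` key holding the configuration shown on the next slide):
```yaml
# Hedged sketch of an Ollama Deployment with an HAProxy sidecar.
# Assumes a ConfigMap "haproxy-config" containing the haproxy.cfg
# shown on the next slide; names and tags are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama
      - name: haproxy
        image: haproxy:3.0
        ports:
        - name: http
          containerPort: 8000
        - name: metrics
          containerPort: 9000
        volumeMounts:
        - name: config
          mountPath: /usr/local/etc/haproxy
      volumes:
      - name: config
        configMap:
          name: haproxy-config
```
Remember to point the `ollama` Service at port 8000 (the HAProxy frontend) instead of 11434, so that requests actually go through the sidecar.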
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
```
global
#log stdout format raw local0
#daemon
maxconn 32
defaults
#log global
timeout client 1h
timeout connect 1h
timeout server 1h
mode http
`option abortonclose`
frontend metrics
bind :9000
http-request use-service prometheus-exporter
frontend ollama_frontend
bind :8000
default_backend ollama_backend
`maxconn 16`
backend ollama_backend
server ollama_server localhost:11434 check
```
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
class: extra-details
## ⚠️ Connection queues
- HAProxy will happily queue *many* connections
- If a client sends a request, then disconnects:
- the request stays in the queue
- the request gets processed by the backend
- eventually, when the backend starts sending the reply, the connection is closed
- This can result in a backlog of queries that takes a long time to clear
- To avoid that: `option abortonclose` (see [HAProxy docs for details][abortonclose])
- Note that the issue is less severe when replies are streamed
[abortonclose]: https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#4-option%20abortonclose
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
class: extra-details
## Ad-hoc HAProxy dashboard
- To consolidate all frontend and backend queues on a single graph:
- query: `haproxy_frontend_current_sessions`
- legend: `{{namespace}}/{{pod}}/{{proxy}}`
- options, "Color scheme", select "Classic palette (by series name)"
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
## What do we see?
- Imperfect load balancing
- Some backends receive more requests than others
- Sometimes, some backends are idle while others are busy
- However, CPU utilization on the node is maxed out
- This is because our node is oversubscribed
- This is because we haven't specified resource requests/limits (yet)
(we'll do that later!)
.debug[[k8s/ollama-metrics.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-metrics.md)]
---
class: pic
.interstitial[]
---
name: toc-message-queue-architecture
class: title
Message Queue Architecture
.nav[
[Previous part](#toc-adding-metrics)
|
[Back to table of contents](#toc-part-2)
|
[Next part](#toc-getting-started-with-bento)
]
.debug[(automatically generated title slide)]
---
# Message Queue Architecture
There are (at least) three ways to distribute load:
- load balancers
- batch jobs
- message queues
Let's do a quick review of their pros/cons!
.debug[[k8s/queue-architecture.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/queue-architecture.md)]
---
## 1️⃣ Load balancers
flowchart TD
Client["Client"] ---> LB["Load balancer"]
LB ---> B1["Backend"] & B2["Backend"] & B3["Backend"]
.debug[[k8s/queue-architecture.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/queue-architecture.md)]
---
## Load balancers
- Latency: ~milliseconds (network latency)
- Overhead: very low (one extra network hop, one log message?)
- Great for short requests (a few milliseconds to a minute)
- Supported out of the box by the Kubernetes Service Proxy
(by default, this is `kube-proxy`)
- Suboptimal resource utilization due to imperfect balancing
(especially when there are multiple load balancers)
.debug[[k8s/queue-architecture.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/queue-architecture.md)]
---
## 2️⃣ Batch jobs
flowchart TD
subgraph K["Kubernetes Control Plane"]
J1["Job"]@{ shape: card}
J2["Job"]@{ shape: card}
J3["..."]@{ shape: text}
J4["Job"]@{ shape: card}
end
C["Client"] ---> K
K <---> N1["Node"] & N2["Node"] & N3["Node"]
.debug[[k8s/queue-architecture.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/queue-architecture.md)]
---
## Batch jobs
- Latency: a few seconds (many Kubernetes controllers involved)
- Overhead: significant due to all the moving pieces involved
(job controller, scheduler, kubelet; many writes to etcd and logs)
- Great for long requests (a few minutes to a few days)
- Supported out of the box by Kubernetes
(`kubectl create job hello --image alpine -- sleep 60`)
- Asynchronous processing requires some refactoring
(we don't get the response immediately)
.debug[[k8s/queue-architecture.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/queue-architecture.md)]
---
## 3️⃣ Message queues
flowchart TD
subgraph Q["Message queue"]
M1["Message"]@{ shape: card}
M2["Message"]@{ shape: card}
M3["..."]@{ shape: text}
M4["Message"]@{ shape: card}
end
C["Client"] ---> Q
Q <---> W1["Worker"] & W2["Worker"] & W3["Worker"]
.debug[[k8s/queue-architecture.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/queue-architecture.md)]
---
## Message queues
- Latency: a few milliseconds to a few seconds
- Overhead: intermediate
(very low with e.g. Redis, higher with e.g. Kafka)
- Great for all except very short requests
- Requires additional setup
- Asynchronous processing requires some refactoring
.debug[[k8s/queue-architecture.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/queue-architecture.md)]
---
## Dealing with errors
- Load balancers
- errors reported immediately (client must retry)
- some load balancers can retry automatically
- Batch jobs
- Kubernetes retries automatically
- after `backoffLimit` retries, Job is marked as failed
- Message queues
- some queues have a concept of "acknowledgement"
- some queues have a concept of "dead letter queue"
- some extra work is required
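As a hedged illustration of the Job retry behavior mentioned in the batch jobs item above, this Job will be retried up to 3 times before being marked as failed (names and values are illustrative):
```yaml
# Hypothetical Job: the command always fails, so Kubernetes retries
# it up to backoffLimit times, then marks the Job as failed
apiVersion: batch/v1
kind: Job
metadata:
  name: flaky-job
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: flaky
        image: alpine
        command: ["sh", "-c", "echo processing... && exit 1"]
```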
.debug[[k8s/queue-architecture.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/queue-architecture.md)]
---
## Some queue brokers
- Redis (with e.g. RPUSH, BLPOP)
*light, fast, easy to set up... no durability guarantee, no acknowledgement, no dead letter queue*
- Kafka
*heavy, complex to set up... strong delivery guarantees, full-featured*
- RabbitMQ
*somewhat in-between Redis and Kafka*
- SQL databases
*often requires polling, which adds extra latency; not as scalable as a "true" broker*
.debug[[k8s/queue-architecture.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/queue-architecture.md)]
---
## More queue brokers
Many cloud providers offer hosted message queues (e.g.: Amazon SQS).
These are usually great options, with some drawbacks:
- vendor lock-in
- setting up extra environments (testing, staging...) can be more complex
(Setting up a singleton environment is usually very easy, thanks to the web UI, CLI, etc.; setting up extra environments and assigning the right permissions with e.g. IaC is usually significantly more complex.)
.debug[[k8s/queue-architecture.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/queue-architecture.md)]
---
## Implementing a message queue
1. Pick a broker
2. Deploy the broker
3. Set up the queue
4. Refactor our code
.debug[[k8s/queue-architecture.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/queue-architecture.md)]
---
## Code refactoring (client)
Before:
```python
response = http.POST("http://api", payload=Request(...))
```
After:
```python
client = queue.connect(...)
client.publish(message=Request(...))
```
Note: we don't get the response right away (if at all)!
.debug[[k8s/queue-architecture.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/queue-architecture.md)]
---
## Code refactoring (server)
Before:
```python
server = http.server(request_handler=handler)
server.listen("80")
server.run()
```
After:
```python
client = queue.connect(...)
while True:
message = client.consume()
response = handler(message)
# Write the response somewhere
```
.debug[[k8s/queue-architecture.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/queue-architecture.md)]
---
class: pic
.interstitial[]
---
name: toc-getting-started-with-bento
class: title
Getting started with Bento
.nav[
[Previous part](#toc-message-queue-architecture)
|
[Back to table of contents](#toc-part-2)
|
[Next part](#toc-resource-limits)
]
.debug[(automatically generated title slide)]
---
# Getting started with Bento
How can we move to a message queue architecture...
*...without rewriting a bunch of code?*
🤔
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Bento
https://bento.dev/
"Fancy stream processing made operationally mundane"
"Written in Go, deployed as a static binary, declarative configuration. Open source and cloud native as utter heck."
With ✨ amazing ✨ documentation 😍
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
class: extra-details
## Tiny bit of history
- Original project: Benthos
- May 30, 2024: [Redpanda acquires Benthos][redpanda-acquires-benthos]
- Benthos is now Redpanda Connect
- some parts have been relicensed as commercial products
- May 31, 2024: [Warpstream forks Benthos][warpstream-forks-benthos]
- that fork is named "Bento"
- it's fully open source
- We're going to use Bento here, but Redpanda Connect should work fine too!
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Bento concepts
- Message stream processor
- Each pipeline is configured by a YAML configuration that defines:
- input (where do we get the messages?)
- pipeline (optional: how do we transform the messages?)
- output (where do we put the messages afterwards?)
- Once Bento is started, it runs the pipelines forever
(except for pipelines that have a logical end, e.g. reading from a file)
- Embedded language (Bloblang) to manipulate/transform messages
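As a hedged illustration, a complete Bento configuration can be as small as this (the `stdin`/`stdout` components and the `mapping` processor are just examples; check the docs for your version):
```yaml
# Minimal pipeline sketch: read lines from stdin, wrap each one in a
# JSON object with a timestamp, write the result to stdout
input:
  stdin: {}
pipeline:
  processors:
    - mapping: |
        root.line = content().string()
        root.processed_at = now()
output:
  stdout: {}
```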
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Messages
- Typically JSON objects
(but raw strings are also possible)
- Nesting, arrays, etc. are OK
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Getting started with Bento
We're going to:
1. Import a bunch of cities from a CSV file into a Redis queue.
2. Read back these cities using a web server.
3. Use an "enrichment workflow" to query our LLM for each city.
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## 1️⃣ Importing cities
Let's break down the work:
- download the data set
- create the Bento configuration
- deploy Redis
- start Bento
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Downloading the data set
- Example database:
https://www.kaggle.com/datasets/juanmah/world-cities
- Let's download and uncompress the data set:
```bash
curl -fsSL https://www.kaggle.com/api/v1/datasets/download/juanmah/world-cities |
funzip > cities.csv
```
(Ignore the "length error", it's harmless!)
- Check the structure of the data set:
```bash
head cities.csv
```
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Creating the Bento configuration
- We need to find which `input` and `output` to use
- Check the list with `bento list` or the [documentation][bento-inputs]
- Then run `bento create INPUTNAME/PIPELINENAME/OUTPUTNAME`
- Generate a configuration file:
```bash
bento create csv//redis_list > csv2redis.yaml
```
- Edit that configuration file; look for the `(required)` parameters
(Everything else can go away!)
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Resulting configuration
If we trim all the default values, here is the result:
```yaml
input:
csv:
paths: ["cities.csv"]
output:
redis_list:
url: redis://redis:6379 # No default (required)
key: cities
```
We'll call that file `csv2redis.yaml`.
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Deploying Redis
- Create a Deployment:
```bash
kubectl create deployment redis --image redis
```
- Expose it:
```bash
kubectl expose deployment redis --port 6379
```
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Starting Bento
Option 1: run it manually in a pod, to see what's going on.
```bash
bento --config csv2redis.yaml
```
Option 2: run it with e.g. the Bento Helm chart.
*We're not going to do that yet, since this particular pipeline has a logical end.*
*(The Helm chart is best suited to pipelines that run forever.)*
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Expected output
.small[
```
INFO Running main config from specified file @service=bento bento_version="" path=csv2redis.yaml
INFO Launching a Bento instance, use CTRL+C to close @service=bento
INFO Listening for HTTP requests at: http://0.0.0.0:4195 @service=bento
INFO Input type csv is now active @service=bento label="" path=root.input
INFO Output type redis_list is now active @service=bento label="" path=root.output
INFO Pipeline has terminated. Shutting down the service @service=bento
```
]
The pipeline should complete in just a few seconds.
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Checking what's in Redis
- Connect to our Redis instance:
```bash
redis-cli -h redis
```
- List keys:
```redis
KEYS *
```
- Check that the `cities` list has approx. 47000 elements:
```redis
LLEN cities
```
- Get the first element of the list:
```redis
LINDEX cities 0
```
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Fun with Bloblang
- Let's add a filter to keep only cities with a population above 10,000,000
- Add the following block to the Bento configuration:
```yaml
pipeline:
processors:
- switch:
- check: this.population == ""
processors:
- mapping: root = deleted()
- check: this.population.int64() < 10000000
processors:
- mapping: root = deleted()
```
(See the [docs][bento-switch] for details about the `switch` processor.)
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Testing our processor
- First, delete the existing `cities` list:
```bash
redis-cli -h redis DEL cities
```
- Then, run the Bento pipeline again:
```bash
bento --config csv2redis.yaml
```
(It should complain about a few cities where the population has a decimal point.)
- Check how many cities were loaded:
```bash
redis-cli -h redis LLEN cities
```
(There should be 47.)
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## 2️⃣ Consume the queue over HTTP
- We want to "get the next city" in the queue with a simple `curl`
- Our input will be `redis_list`
- Our output will be `http_server`
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Generate the Bento configuration
Option 1: `bento create redis_list//http_server`
Option 2: [read the docs][output-http-server]
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## 🙋 Choose your own adventure
Do you want to try to write that configuration?
Or shall we see it right away?
--
⚠️ Spoilers on next slide!
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## `redis2http.yaml`
```yaml
input:
redis_list:
url: redis://redis:`6379`
key: cities
output:
http_server:
path: /nextcity
```
This will set up an HTTP route to fetch *one* city.
It's also possible to batch, stream...
⚠️ As of November 2024, `bento create` uses port 6397 instead of 6379 for Redis!
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Trying it out
- Run Bento with this configuration:
```bash
bento --config redis2http.yaml &
```
- Retrieve one city:
```bash
curl http://localhost:4195/nextcity
```
- Check what happens after we retrieve *all* the cities!
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## 3️⃣ Query our LLM for each city
- We want to ask our LLM who's the mayor of each of these cities
- We'll use a prompt that will usually ensure a short answer
(so that it's faster; we don't want to wait 30 seconds per city!)
- We'll test the prompt with the Ollama CLI
- Then we'll craft a proper HTTP API query
- Finally, we'll configure an [enrichment workflow][enrichment] in Bento
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Test our prompt
Assuming that our earlier Ollama Deployment is still running:
```bash
kubectl exec deployment/ollama -- \
ollama run qwen2:1.5b "
Who is the mayor of San Francisco?
Just give the name by itself on a single line.
If you don't know, don't say anything.
"
```
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Turn the prompt into an HTTP API query
Note: to install `http` in an Alpine container, run `apk add httpie`.
```bash
http http://ollama.default:11434/api/generate \
model=qwen2:1.5b stream:=false prompt="
Who is the mayor of Paris?
Just give the name by itself on a single line.
If you don't know, don't say anything.
"
```
We get a JSON payload, and we want to use the `response` field.
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Configure an enrichment workflow
The [Bento documentation][enrichment] is really good!
We need to set up:
- a `branch` processor
- a `request_map` to transform the city into an Ollama request
- an `http` processor to submit the request to Ollama
- a `result_map` to transform the Ollama response
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Without the `branch` processor
flowchart LR
CITY["
city: Paris
country: France
population: 1106000
iso2: FR
...
"]
REQ["
model: qwen2:1.5b
stream: false
prompt: Who is the mayor of Paris?
"]
REP["
response: Anne Hidalgo
eval_count: ...
prompt_eval_count: ...
(other ollama fields)
"]
CITY@{ shape: card}
REQ@{ shape: card}
REP@{ shape: card}
style CITY text-align: left
style REQ text-align: left
style REP text-align: left
mapping@{ shape: diam }
http["http processor"]@{ shape: diam }
CITY --> mapping --> REQ --> http --> REP
- We transform the `city` into an Ollama request
- The `http` processor submits the request to Ollama
- The final output is the Ollama response
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## With the `branch` processor
flowchart LR
CITY["
city: Paris
country: France
population: 1106000
iso2: FR
...
"]
REQ["
model: qwen2:1.5b
stream: false
prompt: Who is the mayor of Paris?
"]
REP["
response: Anne Hidalgo
eval_count: ...
prompt_eval_count: ...
(other ollama fields)
"]
OUT["
city: Paris
country: France
population: 1106000
iso2: FR
...
mayor: Anne Hidalgo
"]
CITY@{ shape: card}
REQ@{ shape: card}
REP@{ shape: card}
OUT@{ shape: card}
style CITY text-align: left
style REQ text-align: left
style REP text-align: left
style OUT text-align: left
branch@{ shape: diam }
request_map@{ shape: diam }
result_map@{ shape: diam }
http["http processor"]@{ shape: diam }
CITY --> branch
branch --> result_map
branch --> request_map
request_map --> REQ
REQ --> http
http --> REP
REP --> result_map
result_map --> OUT
- The `branch` processor allows doing the processing "on the side"
- `request_map` and `result_map` transform the message before/after processing
- Then, the result is combined with the original message (the `city`)
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
```yaml
input:
csv:
paths: ["cities.csv"]
pipeline:
processors:
- branch:
request_map: |
root.model = "qwen2:1.5b"
root.stream = false
root.prompt = (
"Who is the mayor of %s? ".format(this.city) +
"Just give the name by itself on a single line. " +
"If you don't know, don't say anything."
)
processors:
- http:
url: http://ollama:11434/api/generate
verb: POST
result_map: |
root.mayor = this.response
```
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## Trying it out
- Save the YAML on the previous page into a configuration file
- Run Bento with that configuration file
- What happens?
--
🤔 We're seeing errors due to timeouts
```
ERRO HTTP request to 'http://ollama...' failed: http://ollama...:
Post "http://ollama...": context deadline exceeded
(Client.Timeout exceeded while awaiting headers)
```
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## 🙋 Choose your own adventure
How should we address errors?
- Option 1: increase the timeout in the [http][bento-http] processor
- Option 2: use a [retry][bento-retry] processor in the pipeline
- Option 3: use a [reject_errored][bento-reject] output
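Here is a hedged sketch combining options 1 and 2: a longer timeout on the `http` processor, wrapped in a `retry` processor (field names follow the linked docs; double-check them against your Bento version, and treat the values as illustrative):
```yaml
# Sketch: retry the whole enrichment branch on error, and give
# the http processor a more generous timeout
pipeline:
  processors:
    - retry:
        processors:
          - branch:
              request_map: |
                root.model = "qwen2:1.5b"
                root.stream = false
                root.prompt = "Who is the mayor of %s?".format(this.city)
              processors:
                - http:
                    url: http://ollama:11434/api/generate
                    verb: POST
                    timeout: 2m
              result_map: |
                root.mayor = this.response
```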
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## 🏗️ Let's build something!
- We want to process 1000 cities with our LLM
(guessing who the mayor is, or something similar)
- Store the output wherever we want
(Redis, CSV file, JSONL files...)
- Deal correctly with errors
(we'll check that there are, indeed, 1000 cities in the output)
- Scale out to process faster
(scale ollama to e.g. 10 replicas, enable parallelism in Bento)
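Here is a hedged sketch of what the scaled-out pipeline could look like (the `threads` and `file` output fields follow the Bento docs; replace the placeholder `mapping` with the enrichment `branch` processor from earlier):
```yaml
# Sketch: consume cities from Redis with several pipeline threads,
# append one JSON document per city to a JSONL file
input:
  redis_list:
    url: redis://redis:6379
    key: cities
pipeline:
  threads: 10
  processors:
    - mapping: root = this   # placeholder: put the branch/http enrichment here
output:
  file:
    path: mayors.jsonl
    codec: lines
```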
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
class: title
🍱 Lunch time! 🍱
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
## What happened?
- If your Ollama pods have *resource requests*:
→ your cluster may have auto-scaled
- If your Ollama pods don't have *resource requests*:
→ you probably have a bunch of container restarts, due to out-of-memory errors
🤔 What's that about?
[bento-http]: https://warpstreamlabs.github.io/bento/docs/components/processors/http/
[bento-inputs]: https://warpstreamlabs.github.io/bento/docs/components/inputs/about/
[bento-reject]: https://warpstreamlabs.github.io/bento/docs/components/outputs/reject_errored
[bento-retry]: https://warpstreamlabs.github.io/bento/docs/components/processors/retry
[bento-switch]: https://warpstreamlabs.github.io/bento/docs/components/processors/switch/
[enrichment]: https://warpstreamlabs.github.io/bento/cookbooks/enrichments/
[output-http-server]: https://warpstreamlabs.github.io/bento/docs/components/outputs/http_server
[redpanda-acquires-benthos]: https://www.redpanda.com/press/redpanda-acquires-benthos
[warpstream-forks-benthos]: https://www.warpstream.com/blog/announcing-bento-the-open-source-fork-of-the-project-formerly-known-as-benthos
.debug[[k8s/bento-intro.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-intro.md)]
---
class: pic
.interstitial[]
---
name: toc-resource-limits
class: title
Resource Limits
.nav[
[Previous part](#toc-getting-started-with-bento)
|
[Back to table of contents](#toc-part-3)
|
[Next part](#toc-defining-min-max-and-default-resources)
]
.debug[(automatically generated title slide)]
---
# Resource Limits
- We can attach resource indications to our pods
(or rather: to the *containers* in our pods)
- We can specify *limits* and/or *requests*
- We can specify quantities of CPU and/or memory and/or ephemeral storage
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Requests vs limits
- *Requests* are *guaranteed reservations* of resources
- They are used for scheduling purposes
- Kubelet will use cgroups to e.g. guarantee a minimum amount of CPU time
- A container **can** use more than its requested resources
- A container using *less* than what it requested should never be killed or throttled
- A node **cannot** be overcommitted with requests
(the sum of all requests **cannot** be higher than resources available on the node)
- A small amount of resources is set aside for system components
(this explains why there is a difference between "capacity" and "allocatable")
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Requests vs limits
- *Limits* are "hard limits" (a container **cannot** exceed its limits)
- They aren't taken into account by the scheduler
- A container exceeding its memory limit is killed instantly
(by the kernel out-of-memory killer)
- A container exceeding its CPU limit is throttled
- A container exceeding its disk limit is killed
(usually with a small delay, since this is checked periodically by kubelet)
- On a given node, the sum of all limits **can** be higher than the node size
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Compressible vs incompressible resources
- CPU is a *compressible resource*
- it can be preempted immediately without adverse effect
- if we have N CPU and need 2N, we run at 50% speed
- Memory is an *incompressible resource*
- it needs to be swapped out to be reclaimed; and this is costly
- if we have N GB RAM and need 2N, we might run at... 0.1% speed!
- Disk is also an *incompressible resource*
- when the disk is full, writes will fail
- applications may or may not crash but persistent apps will be in trouble
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Running low on CPU
- Two ways for a container to "run low" on CPU:
- it's hitting its CPU limit
- all CPUs on the node are at 100% utilization
- The app in the container will run slower
(compared to running without a limit, or if CPU cycles were available)
- No other consequence
(but this could affect SLA/SLO for latency-sensitive applications!)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## CPU limits implementation details
- A container with a CPU limit will be "rationed" by the kernel
- Every `cfs_period_us`, it will receive a CPU quota, like an "allowance"
(that interval defaults to 100ms)
- Once it has used its quota, it will be stalled until the next period
- This can easily result in throttling for bursty workloads
(see details on next slide)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## A bursty example
- Web service receives one request per minute
- Each request takes 1 second of CPU
- Average load: 1.66%
- Let's say we set a CPU limit of 10%
- This means CPU quotas of 10ms every 100ms
- Obtaining the quota for 1 second of CPU will take 10 seconds
- Observed latency will be 10 seconds (... actually 9.9s) instead of 1 second
(real-life scenarios will of course be less extreme, but they do happen!)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## Multi-core scheduling details
- Each core gets a small share of the container's CPU quota
(this avoids locking and contention on the "global" quota for the container)
- By default, the kernel distributes that quota to CPUs in 5ms increments
(tunable with `kernel.sched_cfs_bandwidth_slice_us`)
- If a containerized process (or thread) uses up its local CPU quota:
*it gets more from the "global" container quota (if there's some left)*
- If it "yields" (e.g. sleeps for I/O) before using its local CPU quota:
*the quota is **soon** returned to the "global" container quota, **minus** 1ms*
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## Low quotas on machines with many cores
- The local CPU quota is not immediately returned to the global quota
- this reduces locking and contention on the global quota
- but this can cause starvation when many threads/processes become runnable
- That 1ms that "stays" on the local CPU quota is often useful
- if the thread/process becomes runnable, it can be scheduled immediately
- again, this reduces locking and contention on the global quota
- but if the thread/process doesn't become runnable, it is wasted!
- this can become a huge problem on machines with many cores
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## CPU limits in a nutshell
- Beware if you run small bursty workloads on machines with many cores!
("highly-threaded, user-interactive, non-cpu bound applications")
- Check the `nr_throttled` and `throttled_time` metrics in `cpu.stat`
- Possible solutions/workarounds:
- be generous with the limits
- make sure your kernel has the [appropriate patch](https://lkml.org/lkml/2019/5/17/581)
- use [static CPU manager policy](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy)
For more details, check [this blog post](https://erickhun.com/posts/kubernetes-faster-services-no-cpu-limits/) or this two-part series: [part 1](https://engineering.indeedblog.com/blog/2019/12/unthrottled-fixing-cpu-limits-in-the-cloud/), [part 2](https://engineering.indeedblog.com/blog/2019/12/cpu-throttling-regression-fix/).
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Running low on memory
- When the kernel runs low on memory, it starts to reclaim used memory
- Option 1: free up some buffers and caches
(fastest option; might affect performance if cache memory runs very low)
- Option 2: swap, i.e. write some of one process's memory to disk to give that memory to another
(can have a huge negative impact on performance because disks are slow)
- Option 3: terminate a process and reclaim all its memory
(OOM or Out Of Memory Killer on Linux)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Memory limits on Kubernetes
- Kubernetes *does not support swap*
(but it may support it in the future, thanks to [KEP 2400])
- If a container exceeds its memory *limit*, it gets killed immediately
- If a node's memory usage gets too high, it will *evict* some pods
(we say that the node is "under pressure", more on that in a bit!)
[KEP 2400]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#implementation-history
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Running low on disk
- When the kubelet runs low on disk, it starts to reclaim disk space
(similarly to what the kernel does, but in different categories)
- Option 1: garbage collect dead pods and containers
(no consequence, but their logs will be deleted)
- Option 2: remove unused images
(no consequence, but these images will have to be repulled if we need them later)
- Option 3: evict pods and remove them to reclaim their disk usage
- Note: this only applies to *ephemeral storage*, not to e.g. Persistent Volumes!
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Ephemeral storage?
- This includes:
- the *read-write layer* of the container
(any file creation/modification outside of its volumes)
- `emptyDir` volumes mounted in the container
- the container logs stored on the node
- This does not include:
- the container image
- other types of volumes (e.g. Persistent Volumes, `hostPath`, or `local` volumes)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## Disk limit enforcement
- Disk usage is periodically measured by kubelet
(with something equivalent to `du`)
- There can be a small delay before pod termination when disk limit is exceeded
- It's also possible to enable filesystem *project quotas*
(e.g. with EXT4 or XFS)
- Remember that container logs are also accounted for!
(container log rotation/retention is managed by kubelet)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## `nodefs` and `imagefs`
- `nodefs` is the main filesystem of the node
(holding, notably, `emptyDir` volumes and container logs)
- Optionally, the container engine can be configured to use an `imagefs`
- `imagefs` will store container images and container writable layers
- When there is a separate `imagefs`, its disk usage is tracked independently
- If `imagefs` usage gets too high, kubelet will remove old images first
(conversely, if `nodefs` usage gets too high, kubelet won't remove old images)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## CPU and RAM reservation
- Kubernetes passes resources requests and limits to the container engine
- The container engine applies these requests and limits with specific mechanisms
- Example: on Linux, this is typically done with control groups aka cgroups
- Most systems use cgroups v1, but cgroups v2 are slowly being rolled out
(e.g. available in Ubuntu 22.04 LTS)
- Cgroups v2 have new, interesting features for memory control:
- ability to set "minimum" memory amounts (to effectively reserve memory)
- better control on the amount of swap used by a container
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## What's the deal with swap?
- With cgroups v1, it's not possible to disable swap for a cgroup
(the closest option is to [reduce "swappiness"](https://unix.stackexchange.com/questions/77939/turning-off-swapping-for-only-one-process-with-cgroups))
- It is possible with cgroups v2 (see the [kernel docs](https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html) and the [fbatx docs](https://facebookmicrosites.github.io/cgroup2/docs/memory-controller.html#using-swap))
- Cgroups v2 aren't widely deployed yet
- The architects of Kubernetes wanted to ensure that Guaranteed pods never swap
- The simplest solution was to disable swap entirely
- Kubelet will refuse to start if it detects that swap is enabled!
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Alternative point of view
- Swap enables paging¹ of anonymous² memory
- Even when swap is disabled, Linux will still page memory for:
- executables, libraries
- mapped files
- Disabling swap *will reduce performance and available resources*
- For a good time, read [kubernetes/kubernetes#53533](https://github.com/kubernetes/kubernetes/issues/53533)
- Also read this [excellent blog post about swap](https://jvns.ca/blog/2017/02/17/mystery-swap/)
¹Paging: reading/writing memory pages from/to disk to reclaim physical memory
²Anonymous memory: memory that is not backed by files or blocks
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Enabling swap anyway
- If you don't care that pods are swapping, you can enable swap
- You will need to add the flag `--fail-swap-on=false` to kubelet
(remember: it won't otherwise start if it detects that swap is enabled)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Pod quality of service
Each pod is assigned a QoS class (visible in `status.qosClass`).
- If limits = requests:
- as long as the container uses less than the limit, it won't be affected
- if all containers in a pod have *(limits=requests)*, QoS is considered "Guaranteed"
- If requests < limits:
- as long as the container uses less than the request, it won't be affected
- otherwise, it might be killed/evicted if the node gets overloaded
- if at least one container has *(requests<limits)*, QoS is considered "Burstable"
- If a pod doesn't have any request nor limit, QoS is considered "BestEffort"
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Quality of service impact
- When a node is overloaded, BestEffort pods are killed first
- Then, Burstable pods that exceed their requests
- Burstable and Guaranteed pods below their requests are never killed
(except if their node fails)
- If we only use Guaranteed pods, no pod should ever be killed
(as long as they stay within their limits)
(Pod QoS is also explained in [this page](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/) of the Kubernetes documentation and in [this blog post](https://medium.com/google-cloud/quality-of-service-class-qos-in-kubernetes-bb76a89eb2c6).)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Specifying resources
- Resource requests are expressed at the *container* level
- CPU is expressed in "virtual CPUs"
(corresponding to the virtual CPUs offered by some cloud providers)
- CPU can be expressed with a decimal value, or even a "milli" suffix
(so 100m = 0.1)
- Memory and ephemeral disk storage are expressed in bytes
- These can have k, M, G, T, Ki, Mi, Gi, Ti suffixes
(corresponding to 10^3, 10^6, 10^9, 10^12, 2^10, 2^20, 2^30, 2^40)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Specifying resources in practice
This is what the spec of a Pod with resources will look like:
```yaml
containers:
- name: blue
image: jpetazzo/color
resources:
limits:
cpu: "100m"
ephemeral-storage: 10M
memory: "100Mi"
requests:
cpu: "10m"
ephemeral-storage: 10M
memory: "100Mi"
```
This set of resources makes sure that this service won't be killed (as long as it stays below 100 MB of RAM), but allows its CPU usage to be throttled if necessary.
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Default values
- If we specify a limit without a request:
the request is set to the limit
- If we specify a request without a limit:
there will be no limit
(which means that the limit will be the size of the node)
- If we don't specify anything:
the request is zero and the limit is the size of the node
*Unless there are default values defined for our namespace!*
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## We need to specify resource values
- If we do not set resource values at all:
- the limit is "the size of the node"
- the request is zero
- This is generally *not* what we want
- a container without a limit can use up all the resources of a node
- if the request is zero, the scheduler can't make a smart placement decision
- This is fine when learning/testing, absolutely not in production!
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## How should we set resources?
- Option 1: manually, for each container
- simple, effective, but tedious
- Option 2: automatically, with the [Vertical Pod Autoscaler (VPA)][vpa]
- relatively simple, very minimal involvement beyond initial setup
- not compatible with HPAv1, can disrupt long-running workloads (see [limitations][vpa-limitations])
- Option 3: semi-automatically, with tools like [Robusta KRR][robusta]
- good compromise between manual work and automation
- Option 4: by creating LimitRanges in our Namespaces
- relatively simple, but "one-size-fits-all" approach might not always work
[robusta]: https://github.com/robusta-dev/krr
[vpa]: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
[vpa-limitations]: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler#known-limitations
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: pic
.interstitial[]
---
name: toc-defining-min-max-and-default-resources
class: title
Defining min, max, and default resources
.nav[
[Previous part](#toc-resource-limits)
|
[Back to table of contents](#toc-part-3)
|
[Next part](#toc-namespace-quotas)
]
.debug[(automatically generated title slide)]
---
# Defining min, max, and default resources
- We can create LimitRange objects to indicate any combination of:
- min and/or max resources allowed per pod
- default resource *limits*
- default resource *requests*
- maximal burst ratio (*limit/request*)
- LimitRange objects are namespaced
- They apply to their namespace only
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## LimitRange example
```yaml
apiVersion: v1
kind: LimitRange
metadata:
name: my-very-detailed-limitrange
spec:
limits:
- type: Container
min:
cpu: "100m"
max:
cpu: "2000m"
memory: "1Gi"
default:
cpu: "500m"
memory: "250Mi"
defaultRequest:
cpu: "500m"
```
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Example explanation
The YAML on the previous slide shows an example LimitRange object specifying very detailed limits on CPU usage,
and providing defaults on RAM usage.
Note the `type: Container` line: in the future,
it might also be possible to specify limits
per Pod, but it's not [officially documented yet](https://github.com/kubernetes/website/issues/9585).
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## LimitRange details
- LimitRange restrictions are enforced only when a Pod is created
(they don't apply retroactively)
- They don't prevent creation of e.g. an invalid Deployment or DaemonSet
(but the pods will not be created as long as the LimitRange is in effect)
- If there are multiple LimitRange restrictions, they all apply together
(which means that it's possible to specify conflicting LimitRanges,
preventing any Pod from being created)
- If a LimitRange specifies a `max` for a resource but no `default`,
that `max` value becomes the `default` limit too
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: pic
.interstitial[]
---
name: toc-namespace-quotas
class: title
Namespace quotas
.nav[
[Previous part](#toc-defining-min-max-and-default-resources)
|
[Back to table of contents](#toc-part-3)
|
[Next part](#toc-limiting-resources-in-practice)
]
.debug[(automatically generated title slide)]
---
# Namespace quotas
- We can also set quotas per namespace
- Quotas apply to the total usage in a namespace
(e.g. total CPU limits of all pods in a given namespace)
- Quotas can apply to resource limits and/or requests
(like the CPU and memory limits that we saw earlier)
- Quotas can also apply to other resources:
- "extended" resources (like GPUs)
- storage size
- number of objects (number of pods, services...)
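For instance, here is a hedged example limiting GPU requests and PVC storage in a namespace (resource names follow the Kubernetes quota docs; values are illustrative):
```yaml
# Quota on extended resources (GPUs) and on storage requested by PVCs
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpus-and-storage
spec:
  hard:
    requests.nvidia.com/gpu: "4"
    requests.storage: 500Gi
```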
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Creating a quota for a namespace
- Quotas are enforced by creating a ResourceQuota object
- ResourceQuota objects are namespaced, and apply to their namespace only
- We can have multiple ResourceQuota objects in the same namespace
- The most restrictive values are used
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Limiting total CPU/memory usage
- The following YAML specifies an upper bound for *limits* and *requests*:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: a-little-bit-of-compute
spec:
hard:
requests.cpu: "10"
requests.memory: 10Gi
limits.cpu: "20"
limits.memory: 20Gi
```
These quotas will apply to the namespace where the ResourceQuota is created.
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Limiting number of objects
- The following YAML specifies how many objects of specific types can be created:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: quota-for-objects
spec:
hard:
pods: 100
services: 10
secrets: 10
configmaps: 10
persistentvolumeclaims: 20
services.nodeports: 0
services.loadbalancers: 0
count/roles.rbac.authorization.k8s.io: 10
```
(The `count/` syntax allows limiting arbitrary objects, including CRDs.)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## YAML vs CLI
- Quotas can be created with a YAML definition
- ...Or with the `kubectl create quota` command
- Example:
```bash
kubectl create quota my-resource-quota --hard=pods=300,limits.memory=300Gi
```
- With both YAML and CLI form, the values are always under the `hard` section
(there is no `soft` quota)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Viewing current usage
When a ResourceQuota is created, we can see how much of it is used:
```
kubectl describe resourcequota my-resource-quota
Name: my-resource-quota
Namespace: default
Resource Used Hard
-------- ---- ----
pods 12 100
services 1 5
services.loadbalancers 0 0
services.nodeports 0 0
```
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Advanced quotas and PriorityClass
- Pods can have a *priority*
- The priority is a number from 0 to 1000000000
(or even higher for system-defined priorities)
- High number = high priority = "more important" Pod
- Pods with a higher priority can *preempt* Pods with lower priority
(= low priority pods will be *evicted* if needed)
- Useful when mixing workloads in resource-constrained environments
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Setting the priority of a Pod
- Create a PriorityClass
(or use an existing one)
- When creating the Pod, set the field `spec.priorityClassName`
- If the field is not set:
- if there is a PriorityClass with `globalDefault`, it is used
- otherwise, the default priority will be zero
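Here is a hedged example of a PriorityClass and of a Pod referencing it (names and values are illustrative):
```yaml
# A PriorityClass with an arbitrary value, and a Pod using it
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 100000
globalDefault: false
description: "For important workloads that may preempt others"
---
apiVersion: v1
kind: Pod
metadata:
  name: important
spec:
  priorityClassName: high
  containers:
  - name: web
    image: nginx
```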
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: extra-details
## PriorityClass and ResourceQuotas
- A ResourceQuota can include a list of *scopes* or a *scope selector*
- In that case, the quota will only apply to the scoped resources
- Example: limit the resources allocated to "high priority" Pods
- In that case, make sure that the quota is created in every Namespace
(or use *admission configuration* to enforce it)
- See the [resource quotas documentation][quotadocs] for details
[quotadocs]: https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-per-priorityclass
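For example, here is a hedged ResourceQuota that only applies to Pods using a `high` PriorityClass (the `scopeSelector` syntax follows the documentation linked above; values are illustrative):
```yaml
# Only pods with priorityClassName "high" count against this quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-for-high-priority
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["high"]
```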
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: pic
.interstitial[]
---
name: toc-limiting-resources-in-practice
class: title
Limiting resources in practice
.nav[
[Previous part](#toc-namespace-quotas)
|
[Back to table of contents](#toc-part-3)
|
[Next part](#toc-cluster-autoscaler)
]
.debug[(automatically generated title slide)]
---
# Limiting resources in practice
- We have at least three mechanisms:
- requests and limits per Pod
- LimitRange per namespace
- ResourceQuota per namespace
- Let's see one possible strategy to get started with resource limits
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Set a LimitRange
- In each namespace, create a LimitRange object
- Set a small default CPU request and CPU limit
(e.g. "100m")
- Set a default memory request and limit depending on your most common workload
- for Java, Ruby: start with "1G"
- for Go, Python, PHP, Node: start with "250M"
- Set upper bounds slightly below your expected node size
(80-90% of your node size, with at least a 500M memory buffer)
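Putting that strategy together, here is a hedged example for a namespace of mostly Go/Python/Node services running on 16 GB nodes (all values are illustrative starting points):
```yaml
# Small defaults, with upper bounds slightly below the node size
apiVersion: v1
kind: LimitRange
metadata:
  name: sensible-defaults
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: "100m"
      memory: 250M
    default:
      cpu: "100m"
      memory: 250M
    max:
      cpu: "4"
      memory: 13G
```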
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Set a ResourceQuota
- In each namespace, create a ResourceQuota object
- Set generous CPU and memory limits
(e.g. half the cluster size if the cluster hosts multiple apps)
- Set generous objects limits
- these limits should not be here to constrain your users
- they should catch a runaway process creating many resources
- example: a custom controller creating many pods
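- A sketch of such a quota (the numbers are arbitrary; pick values reflecting e.g. half the cluster):
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "100"
    limits.memory: 200Gi
    pods: "500"            # catches runaway controllers, not users
```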
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Observe, refine, iterate
- Observe the resource usage of your pods
(we will see how in the next chapter)
- Adjust individual pod limits
- If you see trends: adjust the LimitRange
(rather than adjusting every individual set of pod limits)
- Observe the resource usage of your namespaces
(with `kubectl describe resourcequota ...`)
- Rinse and repeat regularly
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Underutilization
- Remember: when assigning a pod to a node, the scheduler looks at *requests*
(not at current utilization on the node)
- If pods request resources but don't use them, this can lead to underutilization
(because the scheduler will consider that the node is full and can't fit new pods)
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Viewing a namespace's limits and quotas
- `kubectl describe namespace` will display resource limits and quotas
.lab[
- Try it out:
```bash
kubectl describe namespace default
```
- View limits and quotas for *all* namespaces:
```bash
kubectl describe namespace
```
]
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
## Additional resources
- [A Practical Guide to Setting Kubernetes Requests and Limits](http://blog.kubecost.com/blog/requests-and-limits/)
- explains what requests and limits are
- provides guidelines to set requests and limits
- gives PromQL expressions to compute good values
(our app needs to be running for a while)
- [Kube Resource Report](https://codeberg.org/hjacobs/kube-resource-report)
- generates web reports on resource usage
- [nsinjector](https://github.com/blakelead/nsinjector)
- controller to automatically populate a Namespace when it is created
???
:EN:- Setting compute resource limits
:EN:- Defining default policies for resource usage
:EN:- Managing cluster allocation and quotas
:EN:- Resource management in practice
:FR:- Allouer et limiter les ressources des conteneurs
:FR:- Définir des ressources par défaut
:FR:- Gérer les quotas de ressources au niveau du cluster
:FR:- Conseils pratiques
.debug[[k8s/resource-limits.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/resource-limits.md)]
---
class: pic
.interstitial[]
---
name: toc-cluster-autoscaler
class: title
Cluster autoscaler
.nav[
[Previous part](#toc-limiting-resources-in-practice)
|
[Back to table of contents](#toc-part-3)
|
[Next part](#toc-autoscaling-with-keda)
]
.debug[(automatically generated title slide)]
---
# Cluster autoscaler
- When the cluster is full, we need to add more nodes
- This can be done manually:
- deploy new machines and add them to the cluster
- if using managed Kubernetes, use some API/CLI/UI
- Or automatically with the cluster autoscaler:
https://github.com/kubernetes/autoscaler
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Use-cases
- Batch job processing
"once in a while, we need to execute these 1000 jobs in parallel"
"...but the rest of the time there is almost nothing running on the cluster"
- Dynamic workload
"a few hours per day or a few days per week, we have a lot of traffic"
"...but the rest of the time, the load is much lower"
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Pay for what you use
- The point of the cloud is to "pay for what you use"
- If you have a fixed number of cloud instances running at all times:
*you're doing it wrong (unless your load is always the same)*
- If you're not using some kind of autoscaling, you're wasting money
(except if you like lining the pockets of your cloud provider)
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Running the cluster autoscaler
- We must run nodes on a supported infrastructure
- Check the [GitHub repo][autoscaler-providers] for a non-exhaustive list of supported providers
- Sometimes, the cluster autoscaler is installed automatically
(or by setting a flag / checking a box when creating the cluster)
- Sometimes, it requires additional work
(which is often non-trivial and highly provider-specific)
[autoscaler-providers]: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Scaling up in theory
IF a Pod is `Pending`,
AND adding a Node would allow this Pod to be scheduled,
THEN add a Node.
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Fine print 1
*IF a Pod is `Pending`...*
- First of all, the Pod must exist
- Pod creation might be blocked by e.g. a namespace quota
- In that case, the cluster autoscaler will never trigger
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Fine print 2
*IF a Pod is `Pending`...*
- If our Pods do not have resource requests:
*they will be in the `BestEffort` class*
- Generally, Pods in the `BestEffort` class are schedulable
- except if they have anti-affinity placement constraints
- except if all Nodes already run the max number of pods (110 by default)
- Therefore, if we want to leverage cluster autoscaling:
*our Pods should have resource requests*
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Fine print 3
*AND adding a Node would allow this Pod to be scheduled...*
- The autoscaler won't act if:
- the Pod is too big to fit on a single Node
- the Pod has impossible placement constraints
- Examples:
- "run one Pod per datacenter" with 4 pods and 3 datacenters
- "use this nodeSelector" but no such Node exists
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Trying it out
- We're going to check how much capacity is available on the cluster
- Then we will create a basic deployment
- We will add resource requests to that deployment
- Then scale the deployment to exceed the available capacity
- **The following commands require a working cluster autoscaler!**
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Checking available resources
.lab[
- Check how much CPU is allocatable on the cluster:
```bash
kubectl get nodes -o jsonpath={..allocatable.cpu}
```
]
- If we see e.g. `2800m 2800m 2800m`, that means:
3 nodes with 2.8 CPUs allocatable each
- To trigger autoscaling, we will create 7 pods requesting 1 CPU each
(each node can fit 2 such pods)
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Creating our test Deployment
.lab[
- Create the Deployment:
```bash
kubectl create deployment blue --image=jpetazzo/color
```
- Add a request for 1 CPU:
```bash
kubectl patch deployment blue --patch='
spec:
template:
spec:
containers:
- name: color
resources:
requests:
cpu: 1
'
```
]
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Scaling up in practice
- This assumes that we have strictly less than 7 CPUs available
(adjust the numbers if necessary!)
.lab[
- Scale up the Deployment:
```bash
kubectl scale deployment blue --replicas=7
```
- Check that we have a new Pod, and that it's `Pending`:
```bash
kubectl get pods
```
]
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Cluster autoscaling
- After a few minutes, a new Node should appear
- When that Node becomes `Ready`, the Pod will be assigned to it
- The Pod will then be `Running`
- Reminder: the `AGE` of the Pod indicates when the Pod was *created*
(it doesn't indicate when the Pod was scheduled or started!)
- To see other state transitions, check the `status.conditions` of the Pod
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Scaling down in theory
IF a Node has less than 50% utilization for 10 minutes,
AND all its Pods can be scheduled on other Nodes,
AND all its Pods are *evictable*,
AND the Node doesn't have a "don't scale me down" annotation¹,
THEN drain the Node and shut it down.
.footnote[¹The annotation is: `cluster-autoscaler.kubernetes.io/scale-down-disabled=true`]
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## When is a Pod "evictable"?
By default, Pods are evictable, except if any of the following is true.
- They have a restrictive Pod Disruption Budget
- They are "standalone" (not controlled by a ReplicaSet/Deployment, StatefulSet, Job...)
- They are in `kube-system` and don't have a Pod Disruption Budget
- They have local storage (that includes `EmptyDir`!)
This can be overridden by setting the annotation:
`cluster-autoscaler.kubernetes.io/safe-to-evict`
(it can be set to `true` or `false`)
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Pod Disruption Budget
- Special resource to configure how many Pods can be *disrupted*
(i.e. shutdown/terminated)
- Applies to Pods matching a given selector
(typically matching the selector of a Deployment)
- Only applies to *voluntary disruption*
(e.g. cluster autoscaler draining a node, planned maintenance...)
- Can express `minAvailable` or `maxUnavailable`
- See [documentation][doc-pdb] for details and examples
[doc-pdb]: https://kubernetes.io/docs/tasks/run-application/configure-pdb/
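For example, a sketch of a PDB keeping at least 2 Ollama Pods up during voluntary disruptions (assuming the Deployment's Pods carry the label `app=ollama`):
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ollama
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: ollama
```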
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Local storage
- If our Pods use local storage, they will prevent scaling down
- If we have e.g. an `EmptyDir` volume for caching/sharing:
make sure to set the `.../safe-to-evict` annotation to `true`!
- Even if the volume...
- ...only has a PID file or UNIX socket
- ...is empty
- ...is not mounted by any container in the Pod!
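The annotation goes on the Pods themselves, i.e. in the Pod template of the Deployment:
```yaml
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
```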
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Expensive batch jobs
- Careful if we have long-running batch jobs!
(e.g. jobs that take many hours/days to complete)
- These jobs could get evicted before they complete
(especially if they use less than 50% of the allocatable resources)
- Make sure to set the `.../safe-to-evict` annotation to `false`!
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Node groups
- Easy scenario: all nodes have the same size
- Realistic scenario: we have nodes of different sizes
- e.g. mix of CPU and GPU nodes
- e.g. small nodes for control plane, big nodes for batch jobs
- e.g. leveraging spot capacity
- The cluster autoscaler can handle it!
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
class: extra-details
## Leveraging spot capacity
- AWS, Azure, and Google Cloud are typically more expensive than their competitors
- However, they offer *spot* capacity (spot instances, spot VMs...)
- *Spot* capacity:
- has a much lower cost (see e.g. AWS [spot instance advisor][awsspot])
- has a cost that varies continuously depending on regions, instance type...
- can be preempted at any time
- To be cost-effective, it is strongly recommended to leverage spot capacity
[awsspot]: https://aws.amazon.com/ec2/spot/instance-advisor/
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Node groups in practice
- The cluster autoscaler maps nodes to *node groups*
- this is an internal, provider-dependent mechanism
- the node group is sometimes visible through a proprietary label or annotation
- Each node group is scaled independently
- The cluster autoscaler uses [expanders] to decide which node group to scale up
(the default expander is "random", i.e. pick a node group at random!)
- Of course, only acceptable node groups will be considered
(i.e. node groups that could accommodate the `Pending` Pods)
[expanders]: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-expanders
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
class: extra-details
## Scaling to zero
- *In general,* a node group needs to have at least one node at all times
(the cluster autoscaler uses that node to figure out the size, labels, taints... of the group)
- *On some providers,* there are special ways to specify labels and/or taints
(but if you want to scale to zero, check that the provider supports it!)
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Warning
- Autoscaling up is easy
- Autoscaling down is harder
- It might get stuck because Pods are not evictable
- Do at least a dry run to make sure that the cluster scales down correctly!
- Have alerts on cloud spend
- *Especially when using big/expensive nodes (e.g. with GPU!)*
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Preferred vs. Required
- Some Kubernetes mechanisms let us express "soft preferences":
- affinity (`requiredDuringSchedulingIgnoredDuringExecution` vs `preferredDuringSchedulingIgnoredDuringExecution`)
- taints (`NoSchedule`/`NoExecute` vs `PreferNoSchedule`)
- Remember that these "soft preferences" can be ignored
(and given enough time and churn on the cluster, they will!)
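As a reminder, this is what a "soft" node affinity looks like (the label key and value are hypothetical):
```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: example.com/capacity-type   # hypothetical label
          operator: In
          values: ["spot"]
```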
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Troubleshooting
- The cluster autoscaler publishes its status in a ConfigMap
.lab[
- Check the cluster autoscaler status:
```bash
kubectl describe configmap --namespace kube-system cluster-autoscaler-status
```
]
- We can also check the logs of the autoscaler
(except on managed clusters where it's running internally, not visible to us)
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Acknowledgements
Special thanks to [@s0ulshake] for their help with this section!
If you need help to run your data science workloads on Kubernetes,
they're available for consulting.
(Get in touch with them through https://www.linkedin.com/in/ajbowen/)
[@s0ulshake]: https://twitter.com/s0ulshake
.debug[[k8s/cluster-autoscaler.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/cluster-autoscaler.md)]
---
## Setting resource requests and limits
- Thanks to *requests*:
- our pods will have resources *reserved* for them
- we won't pack too many pods on a single node
- cluster autoscaling will trigger when needed (if possible!)
- Thanks to *limits*:
- our pods won't use more than a given amount of resources
- they won't use up all the available resources on the node
- behavior will be more consistent between loaded and unloaded state
.debug[[k8s/ollama-reqlim.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-reqlim.md)]
---
## Memory
- Personal advice: set request and limit to the same value
- Check current or historical usage and add a bit of padding
(the more historical data we have, the less padding we need)
- Consider 10% padding for "dataless" pods, more for pods with data
(so that the pod has "reserves" for page cache usage)
⚠️ Pods hitting their memory limit will be **killed!**
.debug[[k8s/ollama-reqlim.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-reqlim.md)]
---
## CPU
- It's not necessary to set requests and limits to the same value
(this would cause a lot of waste for idle workloads)
- Let's see a few possible strategies!
.debug[[k8s/ollama-reqlim.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-reqlim.md)]
---
## CPU for mostly idle pods
E.g.: web services, workers handling very few requests...
- Set the limit to at least one whole core
(to avoid throttling, especially on bursty workloads)
- Requests can be very low (e.g. 0.1 core)
⚠️ If requests are too low and the node is very loaded,
the pod will slow down significantly!
(Because CPU cycles are allocated proportionally to CPU requests.)
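For example, a sketch for a mostly-idle web service (the memory values are placeholders):
```yaml
resources:
  requests:
    cpu: 100m        # very low request for an idle workload
    memory: 250M
  limits:
    cpu: "1"         # at least one whole core to absorb bursts
    memory: 250M     # same as the request, as advised above
```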
.debug[[k8s/ollama-reqlim.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-reqlim.md)]
---
## Inelastic CPU-hungry pods
- Pods with a fixed number of threads:
*set requests and limits to that number of threads*
- Pods where a specific level of performance needs to be guaranteed:
*set requests and limits to the number of cores providing that performance*
⚠️ If you set limits to higher levels, performance will be unpredictable!
(You'll only get good performance when the node has spare cycles.)
.debug[[k8s/ollama-reqlim.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-reqlim.md)]
---
## Elastic CPU-hungry pods
- Pods that could potentially use all the cores
(e.g. machine learning training and inference, depending on the models)
- Decide how many pods per node you want to pack
- Set CPU requests as a fraction of the number of cores of the nodes
(minus some padding)
- Example:
- nodes with 32 cores
- we want 4 pods per node
- CPU request: 7.5 cores
- Set limits to a higher level (up to node size)
.debug[[k8s/ollama-reqlim.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-reqlim.md)]
---
## In practice
- Check memory usage of our Ollama pods:
```bash
kubectl top pods
```
(Or even better, look at historical usage in Prometheus or Grafana!)
- Check how many cores we have on our nodes:
```bash
kubectl get nodes -o json | jq .items[].status.capacity.cpu
kubectl get nodes -o custom-columns=NAME:metadata.name,CPU:status.capacity.cpu
```
- Let's decide that we want two Ollama pods per node
- What requests/limits should we set?
.debug[[k8s/ollama-reqlim.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-reqlim.md)]
---
## Setting resources for Ollama
- Assumptions:
- we want two pods per node
- each pod uses ~1500MiB RAM
- nodes have 4 cores
- We'll set memory requests and limits to 2G
- We'll set CPU requests to 1.5 (4 cores / 2 pods, minus padding)
- We'll set CPU limits to twice the requests
```bash
kubectl set resources deployment ollama \
--requests=cpu=1.5,memory=2G \
--limits=cpu=3,memory=2G
```
⚠️ If you have an HAProxy sidecar, this will set its resources too!
.debug[[k8s/ollama-reqlim.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-reqlim.md)]
---
## Results
- After setting these resource requests, we should see cluster autoscaling
- If not: scale up the Ollama Deployment to at least 3 replicas
- Check cluster autoscaler status with:
```bash
kubectl describe configmap --namespace kube-system cluster-autoscaler-status
```
.debug[[k8s/ollama-reqlim.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/ollama-reqlim.md)]
---
class: pic
.interstitial[]
---
name: toc-autoscaling-with-keda
class: title
Autoscaling with KEDA
.nav[
[Previous part](#toc-cluster-autoscaler)
|
[Back to table of contents](#toc-part-3)
|
[Next part](#toc-bento--rabbitmq)
]
.debug[(automatically generated title slide)]
---
# Autoscaling with KEDA
- Cluster autoscaling = automatically add nodes *when needed*
- *When needed* = when Pods are `Pending`
- How do these pods get created?
- When the Ollama Deployment is scaled up
- ... manually (e.g. `kubectl scale`)
- ... automatically (that's what we want to investigate now!)
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Ways to implement autoscaling
- Custom code
(e.g. crontab checking some value every few minutes and scaling accordingly)
- Kubernetes Horizontal Pod Autoscaler v1
(aka `kubectl autoscale`)
- Kubernetes Horizontal Pod Autoscaler v2 with custom metrics
(e.g. with Prometheus Adapter)
- Kubernetes Horizontal Pod Autoscaler v2 with external metrics
(e.g. with KEDA)
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Custom code
- No, we're not going to do that!
- But this would be an interesting exercise in RBAC
(setting minimal amount of permissions for the pod running our custom code)
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## HPAv1
Pros: very straightforward
Cons: can only scale on CPU utilization
How it works:
- periodically measures average CPU *utilization* across pods
- if utilization is above/below a target (default: 80%), scale up/down
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## HPAv1 in practice
- Create the autoscaling policy:
```bash
kubectl autoscale deployment ollama --max=1000
```
(The `--max` is required; it's a safety limit.)
- Check it:
```bash
kubectl describe hpa
```
- Send traffic, wait a bit: pods should be created automatically
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## HPAv2 custom vs external
- Custom metrics = arbitrary metrics attached to Kubernetes objects
- External metrics = arbitrary metrics not related to Kubernetes objects
--
🤔
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## HPAv2 custom metrics
- Examples:
- on Pods: CPU, RAM, network traffic...
- on Ingress: requests per second, HTTP status codes, request duration...
- on some worker Deployment: number of tasks processed, task duration...
- Requires an *adapter* to:
- expose the metrics through the Kubernetes *aggregation layer*
- map the actual metrics source to Kubernetes objects
Example: the [Prometheus adapter][prometheus-adapter]
[prometheus-adapter]: https://github.com/kubernetes-sigs/prometheus-adapter
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## HPAv2 custom metrics in practice
- We're not going to cover this here
(too complex / not enough time!)
- If you want more details, check [my other course material][hpav2slides]
[hpav2slides]: https://2024-10-enix.container.training/4.yml.html#toc-scaling-with-custom-metrics
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## HPAv2 external metrics
- Examples:
- arbitrary Prometheus query
- arbitrary SQL query
- number of messages in a queue
- and [many, many more][keda-scalers]
- Also requires an extra component to expose the metrics
Example: [KEDA (https://keda.sh/)](https://keda.sh)
[keda-scalers]: https://keda.sh/docs/latest/scalers/
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## HPAv2 external metrics in practice
- We're going to install KEDA
- And set it up to autoscale depending on the number of messages in Redis
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Installing KEDA
Multiple options (details in the [documentation][keda-deploy]):
- YAML
- Operator Hub
- Helm chart 💡
```bash
helm upgrade --install --repo https://kedacore.github.io/charts \
--namespace keda-system --create-namespace keda keda
```
[keda-deploy]: https://keda.sh/docs/latest/deploy/
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Scaling according to Redis
- We need to create a KEDA Scaler
- This is done with a "ScaledObject" manifest
- [Here is the documentation][keda-redis-lists] for the Redis Lists Scaler
- Let's write that manifest!
[keda-redis-lists]: https://keda.sh/docs/latest/scalers/redis-lists/
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## `keda-redis-scaler.yaml`
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: ollama
spec:
scaleTargetRef:
name: ollama
triggers:
- type: redis
metadata:
address: redis.`default`.svc:6379
listName: cities
listLength: "10"
```
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Notes
- We need to update the `address` field with our namespace
(unless we are running in the `default` namespace)
- Alternative: use `addressFromEnv` and set an env var in the Ollama pods
- `listLength` gives the target ratio of `messages / replicas`
- In our example, KEDA will scale the Deployment to `messages / 10`
(rounded up!)
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Trying it out
- Apply the ScaledObject manifest
- Start a Bento pipeline loading e.g. 100-1000 cities in Redis
(100 on smaller clusters / slower CPUs, 1000 on bigger / faster ones)
- Check pod and node resource usage
- What do we see?
--
🤩 The Deployment scaled up automatically!
--
🤔 But Pod resource usage remains very low (a few busy pods, many idle ones)
--
💡 Bento doesn't submit enough requests in parallel!
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Improving throughput
We're going to review multiple techniques:
1. Increase parallelism inside the Bento pipeline.
2. Run multiple Bento consumers.
3. Couple consumers and processors more tightly.
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## 1️⃣ Increase pipeline parallelism
- Set `parallel` to `true` in the `http` processor
- Wrap the input around a `batched` input
(otherwise, we don't have enough messages in flight)
- Increase `http` timeout significantly (e.g. to 5 minutes)
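Here is a rough sketch of what that could look like (the Redis and Ollama addresses and the batching values are assumptions based on our earlier pipeline; check the Bento docs for the exact schema):
```yaml
input:
  batched:
    child:
      redis_list:
        url: redis://redis:6379     # assumption: same Redis as before
        key: cities
    policy:
      count: 20                     # number of messages batched together
      period: 1s
pipeline:
  processors:
    - http:
        url: http://ollama:11434/api/generate
        verb: POST
        parallel: true              # send the whole batch in parallel
        timeout: 5m                 # much higher than the default
```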
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Results
🎉 More messages flow through the pipeline
🎉 Many requests happen in parallel
🤔 Average Pod and Node CPU utilization is higher, but not maxed out
🤔 HTTP queue size (measured with HAProxy metrics) is relatively high
🤔 Latency is higher too
Why?
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Too many requests in parallel
- Earlier, we didn't have enough...
- ...now, we have too many!
- However, for a very big request queue, it still wouldn't be enough
💡 We currently have a fixed parallelism. We need to make it dynamic!
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## 2️⃣ Run multiple Bento consumers
- Restore the original Bento configuration
(flip `parallel` back to `false`; remove the `batched` input)
- Run Bento in a Deployment
(e.g. with the [Bento Helm chart][bento-helm-chart])
- Autoscale that Deployment like we autoscaled the Ollama Deployment
[bento-helm-chart]: https://github.com/warpstreamlabs/bento-helm-chart
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Results
🤔🤔🤔 Pretty much the same as before!
(High throughput, high utilization but not maxed out, high latency...)
--
🤔🤔🤔 Why?
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Unbalanced load balancing
- All our requests go through the `ollama` Service
- We're still using the default Kubernetes service proxy!
- It doesn't spread the requests properly across all the backends
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## 3️⃣ Couple consumers and processors
What if:
--
instead of sending requests to a load balancer,
--
each queue consumer had its own Ollama instance?
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Current architecture
flowchart LR
subgraph P1["Pod"]
H1["HAProxy"] --> O1["Ollama"]
end
subgraph P2["Pod"]
H2["HAProxy"] --> O2["Ollama"]
end
subgraph P3["Pod"]
H3["HAProxy"] --> O3["Ollama"]
end
Q["Queue
(Redis)"] <--> C["Consumer
(Bento)"] --> LB["Load Balancer
(kube-proxy)"]
LB --> H1 & H2 & H3
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Proposed architecture
flowchart LR
subgraph P1["Consumer Pod"]
C1["Bento"] --> H1["HAProxy"] --> O1["Ollama"]
end
subgraph P2["Consumer Pod"]
C2["Bento"] --> H2["HAProxy"] --> O2["Ollama"]
end
subgraph P3["Consumer Pod"]
C3["Bento"] --> H3["HAProxy"] --> O3["Ollama"]
end
Queue["Queue"] <--> C1 & C2 & C3
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## 🏗️ Let's build something!
- Let's implement that architecture!
- See next slides for hints / getting started
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Hints
We need to:
- Update the Bento consumer configuration to talk to localhost
- Store that configuration in a ConfigMap
- Add a Bento container to the Ollama Deployment
- Profit!
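A rough sketch of the pieces (the image, port, and volume names are assumptions; adapt them to your own setup):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: bento-consumer
data:
  bento.yaml: |
    input:
      redis_list:
        url: redis://redis:6379
        key: cities
    pipeline:
      processors:
        - http:
            url: http://localhost:11434/api/generate   # talk to the local Ollama/HAProxy
            verb: POST
    output:
      redis_list:
        url: redis://redis:6379
        key: mayors
---
# Then, in the Ollama Deployment's Pod template, add something like:
#   - name: bento
#     image: ghcr.io/warpstreamlabs/bento
#     args: ["-c", "/config/bento.yaml"]
#     volumeMounts:
#       - name: bento-config
#         mountPath: /config
# ...plus a volume referencing the ConfigMap above.
```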
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Results
🎉 Node and Pod utilization is maximized
🎉 HTTP queue size is bounded
🎉 Deployment autoscales up and down
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## ⚠️ Scaling down
- Eventually, there are less messages in the queue
- The HPA scales down the Ollama Deployment
- This terminates some Ollama Pods
🤔 What happens if these Pods were processing requests?
--
- The requests might be lost!
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
## Avoiding lost messages
Option 1:
- cleanly shutdown the consumer
- make sure that Ollama can complete in-flight requests
(by extending its grace period)
- find a way to terminate Ollama when no more requests are in flight
Option 2:
- use *message acknowledgement*
.debug[[k8s/bento-hpa.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-hpa.md)]
---
class: pic
.interstitial[]
---
name: toc-bento--rabbitmq
class: title
Bento & RabbitMQ
.nav[
[Previous part](#toc-autoscaling-with-keda)
|
[Back to table of contents](#toc-part-3)
|
[Next part](#toc-bento--postgresql)
]
.debug[(automatically generated title slide)]
---
# Bento & RabbitMQ
- In some of the previous runs, messages were dropped
(we start with 1000 messages in `cities` and have e.g. 955 in `mayors`)
- This is caused by various errors during processing
(e.g. too many timeouts; Bento being shut down halfway through...)
- ...And by the fact that we are using a Redis queue
(which doesn't offer delivery guarantees or acknowledgements)
- Can we get something better?
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
## The problem
- Some inputs (like `redis_list`) don't support *acknowledgements*
- When a message is pulled from the queue, it is deleted immediately
- If the message is lost for any reason, it is lost permanently
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
## The solution
- Some inputs (like `amqp_0_9`) support acknowledgements
- When a message is pulled from the queue:
- it is not visible anymore to other consumers
- it needs to be explicitly acknowledged
- The acknowledgement is done by Bento when the message reaches the output
- The acknowledgement deletes the message
- No acknowledgement after a while? Consumer crashes/disconnects?
Message gets requeued automatically!
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
## `amqp_0_9`
- Protocol used by RabbitMQ
- Very simplified behavior:
- messages are published to an [*exchange*][amqp-exchanges]
- messages have a *routing key*
- the exchange routes the message to one (or zero or more) queues
(possibly using the routing key or message headers to decide which queue(s))
- [*consumers*][amqp-consumers] subscribe to queues to receive messages
[amqp-exchanges]: https://www.rabbitmq.com/tutorials/amqp-concepts#exchanges
[amqp-consumers]: https://www.rabbitmq.com/tutorials/amqp-concepts#consumers
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
## Using the default exchange
- There is a default exchange (called `""` - empty string)
- The routing key indicates the name of the queue to deliver to
- The queue needs to exist (we need to create it beforehand)
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
class: extra-details
## Defining custom exchanges
- Create an exchange
- exchange types: direct, fanout, topic, headers
- durability: persisted to disk to survive server restart or not?
- Create a binding
- which exchange?
- which routing key? (for direct exchanges)
- which queue?
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
## RabbitMQ on Kubernetes
- RabbitMQ can be deployed on Kubernetes:
- directly (creating e.g. a StatefulSet)
- with the RabbitMQ operator
- We're going to do the latter!
- The operator includes the "topology operator"
(to configure queues, exchanges, and bindings through custom resources)
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
## Installing the RabbitMQ operator
- Let's install it with this Helm chart:
```bash
helm upgrade --install --repo https://charts.bitnami.com/bitnami \
--namespace rabbitmq-system --create-namespace \
rabbitmq-cluster-operator rabbitmq-cluster-operator
```
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
## Deploying a simple RabbitMQ cluster
- Let's use the YAML manifests in that directory:
https://github.com/jpetazzo/beyond-load-balancers/tree/main/rabbitmq
- This creates:
- a `RabbitmqCluster` called `mq`
- a `Secret` called `mq-default-user` containing access credentials
- a durable `Queue` named `q1`
(We can ignore the `Exchange` and the `Binding`, we won't use them.)
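Roughly, those manifests look like this (simplified sketch; refer to the repository for the full versions):
```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: mq
---
apiVersion: rabbitmq.com/v1beta1
kind: Queue
metadata:
  name: q1
spec:
  name: q1
  durable: true
  rabbitmqClusterReference:
    name: mq
```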
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
## 🏗️ Let's build something!
Let's replace the `cities` Redis list with our RabbitMQ queue.
(See next slide for steps and hints!)
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
## Steps
1. Edit the Bento configuration for our "CSV importer".
(replace the `redis_list` output with `amqp_0_9`)
2. Run that pipeline and confirm that messages show up in RabbitMQ.
3. Edit the Bento configuration for the Ollama consumer.
(replace the `redis_list` input with `amqp_0_9`)
4. Trigger a scale up of the Ollama consumer.
5. Update the KEDA Scaler to use RabbitMQ instead of Redis.
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
## 1️⃣ Sending messages to RabbitMQ
- Edit our Bento configuration (the one feeding the CSV file to Redis)
- We want the following `output` section:
```yaml
output:
amqp_0_9:
exchange: ""
key: q1
mandatory: true
urls:
- "${AMQP_URL}"
```
- Then export the AMQP_URL environment variable using `connection_string` from Secret `mq-default-user`
💡 Yes, we can directly use environment variables in Bento configuration!
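In the Bento Pod, that environment variable can be wired up from the Secret like this:
```yaml
env:
  - name: AMQP_URL
    valueFrom:
      secretKeyRef:
        name: mq-default-user
        key: connection_string
```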
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
## 2️⃣ Testing our AMQP output
- Run the Bento pipeline
- To check that our messages made it:
```bash
kubectl exec mq-server-0 -- rabbitmqctl list_queues
```
- We can also use Prometheus metrics, e.g. `rabbitmq_queue_messages`
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
## 3️⃣ Receiving messages from RabbitMQ
- Edit our other Bento configuration (the one in the Ollama consumer Pod)
- We want the following `input` section:
```yaml
input:
amqp_0_9:
urls:
- `amqp://...:5672/`
queue: q1
```
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
## 4️⃣ Triggering Ollama scale up
- If the autoscaler is configured to scale to zero, disable it
(easiest solution: delete the ScaledObject)
- Then manually scale the Deployment to e.g. 4 Pods
- Check that messages are processed and show up in the output
(it should still be a Redis list at this point)
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
## 5️⃣ Autoscaling on RabbitMQ
- We need to update our ScaledObject
- Check the [RabbitMQ Queue Scaler][keda-rabbitmq]
- Multiple ways to pass the AMQP URL:
- hardcode it (easier solution for testing!)
- use `...fromEnv` and set environment variables in target pod
- create and use a TriggerAuthentication
💡 Since we have the AMQP URL in a Secret, TriggerAuthentication works great!
[keda-rabbitmq]: https://keda.sh/docs/latest/scalers/rabbitmq-queue/
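A sketch of the TriggerAuthentication approach (resource names are examples; the trigger metadata follows the KEDA RabbitMQ scaler docs):
```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: rabbitmq-auth
spec:
  secretTargetRef:
    - parameter: host
      name: mq-default-user
      key: connection_string
---
# Then, in the ScaledObject:
# triggers:
#   - type: rabbitmq
#     metadata:
#       queueName: q1
#       mode: QueueLength
#       value: "10"
#     authenticationRef:
#       name: rabbitmq-auth
```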
.debug[[k8s/bento-rmq.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-rmq.md)]
---
class: pic
.interstitial[]
---
name: toc-bento--postgresql
class: title
Bento & PostgreSQL
.nav[
[Previous part](#toc-bento--rabbitmq)
|
[Back to table of contents](#toc-part-3)
|
[Next part](#toc-managing-our-stack-with-helmfile)
]
.debug[(automatically generated title slide)]
---
# Bento & PostgreSQL
- Bento can also use SQL databases for input/output
- We're going to demonstrate that by writing to a PostgreSQL database
- That database will be deployed with the CloudNativePG operator
(https://cloudnative-pg.io/)
.debug[[k8s/bento-cnpg.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-cnpg.md)]
---
## CNPG in a nutshell
- Free, open source
- Originally created by [EDB] (EnterpriseDB, well-known PgSQL experts)
- Non-exhaustive list of features:
- provisioning of Postgres servers, replicas, bouncers
- automatic failover
- backups (full backups and WAL shipping)
- provisioning from scratch, from backups, PITR
- manual and automated switchover (e.g. for node maintenance)
- and many more!
[EDB]: https://www.enterprisedb.com/workload/kubernetes
.debug[[k8s/bento-cnpg.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-cnpg.md)]
---
## What we're going to do
1. Install CNPG.
2. Provision a Postgres cluster.
3. Configure Bento to write to that cluster.
4. Set up a Grafana dashboard to see the data.
.debug[[k8s/bento-cnpg.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-cnpg.md)]
---
## 1️⃣ Installing CNPG
Many options available, see the [documentation][cnpg-install]:
- raw YAML manifests
- kubectl CNPG plugin (`kubectl cnpg install generate`)
- Helm chart
- OLM
[cnpg-install]: https://cloudnative-pg.io/documentation/1.24/installation_upgrade/
.debug[[k8s/bento-cnpg.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-cnpg.md)]
---
## 2️⃣ Provisioning a Postgres cluster
Minimal manifest:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: db
spec:
storage:
size: 1Gi
```
.debug[[k8s/bento-cnpg.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-cnpg.md)]
---
class: extra-details
## For production...
We might also add:
- `spec.monitoring.enablePodMonitor: true`
- `spec.instances: 2`
- `resources.{requests,limits}.{cpu,memory}`
- `walStorage.size`
- `backup`
- `postgresql.parameters`
See [this manifest][cluster-maximal] for a detailed example.
[cluster-maximal]: https://github.com/jpetazzo/pozok/blob/main/cluster-maximal.yaml
.debug[[k8s/bento-cnpg.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-cnpg.md)]
---
## 3️⃣ Configuring Bento to write to SQL
- We'll use the [`sql_insert`][sql-insert] output
- If our cluster is named `mydb`, there will be a Secret `mydb-app`
- This Secret will contain a `uri` field
- That field can be used as the `dsn` in the Bento configuration
- We will also need to create the table that we want to use
(see next slide for instructions)
[sql-insert]: https://warpstreamlabs.github.io/bento/docs/components/outputs/sql_insert
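A sketch of that output (the `PG_URI` env var and the column mapping are assumptions; they depend on how we expose the credentials and on the shape of our messages):
```yaml
output:
  sql_insert:
    driver: postgres
    dsn: "${PG_URI}"                 # e.g. exposed from the `uri` field of Secret mydb-app
    table: cities
    columns: [ city, population ]
    args_mapping: 'root = [ this.city, this.population ]'
```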
.debug[[k8s/bento-cnpg.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-cnpg.md)]
---
## Creating a table
- If we just want to store the city name and its population:
```sql
CREATE TABLE IF NOT EXISTS cities (
city varchar(100) NOT NULL,
population integer
);
```
- This statement can be executed:
- manually, by getting a `psql` shell with `kubectl cnpg psql mydb app`
- automatically, with Bento's `init_statement`
.debug[[k8s/bento-cnpg.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-cnpg.md)]
---
## 4️⃣ Viewing the table in Grafana
- In Grafana, in the menu on the left, click "Connections"
- Add a PostgreSQL data source
- Enter the host:port, database, user, password
- Then add a visualization using that data source
(it should be relatively self-explanatory!)
.debug[[k8s/bento-cnpg.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-cnpg.md)]
---
class: extra-details
## Automating it all
- Expose PostgreSQL credentials through environment variables
(in the Bento container)
- Use the `${...}` syntax in Bento to use these environment variables
- Export the Grafana dashboard to a JSON file
- Store the JSON file in a ConfigMap, with label `grafana_dashboard=1`
- Create that ConfigMap in the namespace where Grafana is running
- Similarly, data sources (like the Redis and the PostgreSQL one) can be defined in YAML
- And that YAML can be put in a ConfigMap with label `grafana_datasource=1`
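For instance, a dashboard can be auto-loaded with a ConfigMap like this (create it in whatever namespace Grafana runs in):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cities-dashboard
  labels:
    grafana_dashboard: "1"
data:
  cities-dashboard.json: |
    (paste the JSON exported from Grafana here)
```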
.debug[[k8s/bento-cnpg.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/bento-cnpg.md)]
---
class: pic
.interstitial[]
---
name: toc-managing-our-stack-with-helmfile
class: title
Managing our stack with `helmfile`
.nav[
[Previous part](#toc-bento--postgresql)
|
[Back to table of contents](#toc-part-3)
|
[Next part](#toc-)
]
.debug[(automatically generated title slide)]
---
# Managing our stack with `helmfile`
- We've installed a few things with Helm
- And others with raw YAML manifests
- Perhaps you've used Kustomize sometimes
- How can we automate all this? Make it reproducible?
.debug[[k8s/helmfile.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/helmfile.md)]
---
## Requirements
- We want something that is *idempotent*
= running it 1, 2, 3 times should only install the stack once
- We want something that handles updates
= modifying / reconfiguring without restarting from scratch
- We want something that is configurable
= with e.g. configuration files, environment variables...
- We want something that can handle *partial removals*
= ability to remove one element without affecting the rest
- Inspiration: Terraform, Docker Compose...
.debug[[k8s/helmfile.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/helmfile.md)]
---
## Shell scripts?
✅ Idempotent, thanks to `kubectl apply -f`, `helm upgrade --install`
✅ Handles updates (edit script, re-run)
✅ Configurable
❌ Partial removals
If we remove an element from our script, it won't be uninstalled automatically.
.debug[[k8s/helmfile.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/helmfile.md)]
---
## Umbrella chart?
Helm chart with dependencies on other charts.
✅ Idempotent
✅ Handles updates
✅ Configurable (with Helm values: YAML files and `--set`)
✅ Partial removals
❌ Complex (requires to learn advanced Helm features)
❌ Requires everything to be a Helm chart (adds (lots of) boilerplate)
.debug[[k8s/helmfile.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/helmfile.md)]
---
## Helmfile
https://github.com/helmfile/helmfile
✅ Idempotent
✅ Handles updates
✅ Configurable (with values files, environment variables, and more)
✅ Partial removals
✅ Fairly easy to get started
🐙 Sometimes feels like summoning unspeakable powers / staring down the abyss
.debug[[k8s/helmfile.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/helmfile.md)]
---
## What `helmfile` can install
- Helm charts from remote Helm repositories
- Helm charts from remote git repositories
- Helm charts from local directories
- Kustomizations
- Directories with raw YAML manifests
.debug[[k8s/helmfile.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/helmfile.md)]
---
## How `helmfile` works
- Everything is defined in a main `helmfile.yaml`
- That file defines:
- `repositories` (remote Helm repositories)
- `releases` (things to install: Charts, YAML...)
- `environments` (optional: to specialize prod vs staging vs ...)
- Helm-style values files can be loaded in `environments`
- These values can then be used in the rest of the Helmfile
- Examples: [install essentials on a cluster][helmfile-ex-1], [run a Bento stack][helmfile-ex-2]
[helmfile-ex-1]: https://github.com/jpetazzo/beyond-load-balancers/blob/main/helmfile.yaml
[helmfile-ex-2]: https://github.com/jpetazzo/beyond-load-balancers/blob/main/bento/helmfile.yaml
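To give an idea, a very small `helmfile.yaml` installing two of our operators could look like this (chart names and namespaces follow what we used earlier):
```yaml
repositories:
  - name: kedacore
    url: https://kedacore.github.io/charts
  - name: bitnami
    url: https://charts.bitnami.com/bitnami

releases:
  - name: keda
    namespace: keda-system
    chart: kedacore/keda
  - name: rabbitmq-cluster-operator
    namespace: rabbitmq-system
    chart: bitnami/rabbitmq-cluster-operator
```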
.debug[[k8s/helmfile.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/helmfile.md)]
---
## `helmfile` commands
- `helmfile init` (optional; downloads plugins if needed)
- `helmfile apply` (updates all releases that have changed)
- `helmfile sync` (updates all releases even if they haven't changed)
- `helmfile destroy` (guess!)
.debug[[k8s/helmfile.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/helmfile.md)]
---
## Helmfile tips
As seen in [this example](https://github.com/jpetazzo/beyond-load-balancers/blob/main/bento/helmfile.yaml#L21):
- variables can be used to simplify the file
- configuration values and secrets can be loaded from external sources
(Kubernetes Secrets, Vault... See [vals] for details)
- current namespace isn't exposed by default
- there's often more than one way to do it!
(this particular section could be improved by using Bento `${...}`)
[vals]: https://github.com/helmfile/vals
.debug[[k8s/helmfile.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/helmfile.md)]
---
## 🏗️ Let's build something!
- Write a helmfile (or two) to set up today's entire stack on a brand new cluster!
- Suggestion:
- one helmfile for singleton, cluster-wide components
(All our operators: Prometheus, Grafana, KEDA, CNPG, RabbitMQ Operator)
- one helmfile for the application stack
(Bento, PostgreSQL cluster, RabbitMQ)
.debug[[k8s/helmfile.md](https://github.com/jpetazzo/container.training/tree/main/slides/k8s/helmfile.md)]
---
class: title, self-paced
Thank you!
.debug[[shared/thankyou.md](https://github.com/jpetazzo/container.training/tree/main/slides/shared/thankyou.md)]
---
class: title, in-person
That's all, folks!
Questions?

.debug[[shared/thankyou.md](https://github.com/jpetazzo/container.training/tree/main/slides/shared/thankyou.md)]
---
name: contact
## Contact information
.column-half[
Instructor:
📛 Jérôme Petazzoni
📩 jerome.petazzoni@gmail.com
🔗 https://linkedin.com/in/jpetazzo
🦣 https://hachyderm.io/@jpetazzo
I can teach custom courses:
- Docker, Kubernetes, MLOps
- from intro level to "black belt"
- on site or remotely
Reach out if you're interested!
]
.column-half[
Assistant:
📛 AJ Bowen
📩 aj@soulshake.net
🔗 https://linkedin.com/in/ajbowen
📃 https://github.com/soulshake
I can consult on the following topics:
- Kubernetes
- CI/CD
- Terraform & Infra-as-code
- Docker
- AWS
]
.debug[[shared/contact.md](https://github.com/jpetazzo/container.training/tree/main/slides/shared/contact.md)]