Context and motivation

Kubernetes has built-in features and mechanisms to keep nodes and workloads healthy:

  • kube-scheduler decides on which node to place each pod, based on the pod's requested resources and the node's unreserved resources.

  • kubelet Out-Of-Memory kills pods that consume more memory than the limits defined in their spec (OOM killed).

  • If a node runs out of resources for any reason, kubelet evicts pods to relieve the pressure on the node (pod eviction). The eviction decision is based on the QoS class of the pods.
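To illustrate the mechanisms above, here is a minimal, hypothetical pod spec: the requests drive kube-scheduler's placement decision, the memory limit is what triggers an OOM kill when exceeded, and because requests and limits are set but not equal, the pod gets the Burstable QoS class, which influences its eviction priority.

```yaml
# Hypothetical example: names and values are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:          # used by kube-scheduler for placement
          cpu: "250m"
          memory: "256Mi"
        limits:            # exceeding memory limit => OOM kill
          cpu: "500m"
          memory: "512Mi"
# requests != limits => QoS class "Burstable"
# (requests == limits would be "Guaranteed"; none set would be "BestEffort")
```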

Keep in mind that cegedim.cloud provides standard Kubernetes clusters with these features, and has qualified the official Kubernetes documentation below:

The problem is that, in real-life applications:

  • not all technologies are natively container-friendly

  • resource usage metrics collected by kubelet (or node exporter, etc.) are not real-time

  • resource usage metrics are not taken into account by kube-scheduler

  • kubelet, as a Linux process, is not always the highest-priority process, especially when a node runs out of CPU.

When kubelet fails to handle resource stress on a node, the node fails and all related workloads must be redeployed. In the worst case, a domino effect of node failures can occur.

cegedim.cloud's solution

cegedim.cloud provides a hardening solution called cgdm-hardening:

  • One hardening-slave pod per worker node: writes CPU and RAM consumption to a centralized database

  • One hardening-master pod deployed on the master nodes: reads metrics from the database and takes action in case of crisis

  • The hardening stack has a very low resource footprint

The hardening-master pod can take action in two modes:

  • Preventive mode (as a kube-scheduler assistant, the default mode): applies the taint cegedim.io/overload=true:NoSchedule to avoid placing more pods on under-pressure nodes (85% RAM or 90% CPU). The taint is removed when CPU falls below 85% and RAM falls below 80%.

  • Protective mode (as a kube-controller assistant): when RAM consumption reaches 95%, kills the newest pods, one after another, to relieve the pressure. This mode is not activated by default.
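As an illustration of preventive mode, this is roughly what an under-pressure node would look like once hardening-master has applied the taint (the node name is hypothetical; the taint key, value, and effect are those stated above):

```yaml
# Sketch of a node tainted by hardening-master in preventive mode.
apiVersion: v1
kind: Node
metadata:
  name: worker-1        # hypothetical node name
spec:
  taints:
    - key: cegedim.io/overload
      value: "true"
      effect: NoSchedule   # new pods without a matching toleration
                           # will not be scheduled here
```

Pods without a toleration for this taint are simply scheduled elsewhere; existing pods on the node are not affected (NoSchedule does not evict).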

You should never use a wildcard toleration in your applications; otherwise the preventive effect of this solution is defeated.

Toleration to avoid in application specs:
  - effect: NoSchedule
    operator: Exists

Limitation: node failures caused by extremely high CPU peaks over very short periods of time cannot be mitigated by this solution.

How to disable / enable the hardening

New Kubernetes clusters will be provisioned with preventive hardening activated.

If workloads deployed by customers cause many node failures (TLS_K8_NODES), protective mode will be activated.

Customers can disable this hardening by creating an ITCare request ticket. In that case, they will have to reboot the nodes themselves in case of crisis.

Customers can re-enable this hardening at any time by creating an ITCare request ticket.
