Kubernetes probes and self healing¶
If the container has not probes defined, they will be considered as success
startupProbe¶
This was included after readinessProbe and livenessProbe. It is commonly used to setup a delay in pods with slow startup process. The readinessProbe and livenessProbe are not checked until the startupProbe is considered "Success".
If your container usually starts in more than initialDelaySeconds + failureThreshold × periodSeconds, you should specify a startup probe that checks the same endpoint as the liveness probe. The default for periodSeconds is 10s. You should then set its failureThreshold high enough to allow the container to start, without changing the default values of the liveness probe. This helps to protect against deadlocks.
If the probe is not ok, kubelet will kill the container and it will apply the pod restart policy:
- Always (default)
The container is restarted
- OnFailure
The container is restart if the container had an exit status different than 0
- Never
Never restart the container
readinessProbe¶
This probe is related with considering the container is ready to accept petitions. It is useful when what to define when to start sending traffic to the container.
- If the probe fails, the container is pulled of from the services to stop receiving traffic
- The default result is "Failure". It must be accomplished to be considered as "Success"
livenessProbe¶
This probe is related with considering the container is alive and running.
It is useful like a way to tell kubelet the pod crashed, it encounters an issue or becomes unhealthy. The kubelet will automatically perform the correct action in accordance with the Pod's restartPolicy.
Recommendations¶
- Use startupProbe for slow starting apps
- The probes must be simple and lightweight
- Ensure the probe target is independent of the main application
- They can fail in heavy loaded environments
- In general, it is a best practice to define a livenessProbe and a readinessProbe. And they must be different.
- If using the same endpoint, set a higher failureThreshold value for the livenessProbe, that is, disconnect traffic and customers earlier, and if things are really bad, then restart.