K8sGPT: AI assistance for SREs troubleshooting Kubernetes

Published on 02/02/2024

Introduction

Simplifying diagnostics with K8sGPT

One of the most common obstacles SREs face is identifying and diagnosing problems in Kubernetes clusters. K8sGPT tackles this with a novel approach that combines artificial intelligence with natural language processing. Users can ask questions in plain language and get understandable explanations and solutions, which removes the need for deep expertise in Kubernetes commands and their many parameters.

Optimizing operational efficiency

Time is a critical factor in systems management. K8sGPT significantly reduces the time needed to understand and resolve problems, which translates into greater operational efficiency. SREs can then focus on more strategic tasks, such as continuous improvement of the infrastructure and innovation.

K8sGPT, an effective assistant

K8sGPT positions itself as an ideal partner for SREs, giving them the ability to scan, diagnose and triage problems in Kubernetes clusters, with simple explanations in the language of their choice. It combines the human expertise of SREs with the analytical power of AI.

K8sGPT embodies the trend toward AI assistance in cloud computing and container orchestration, a movement in line with the use of AI and machine learning to simplify IT operations, a field often referred to as AIOps.

GitHub repo: https://github.com/k8sgpt-ai
Official documentation: https://docs.k8sgpt.ai

The tool can be used as a CLI or as an in-cluster operator.

Benefits of K8sGPT

Simplified diagnostics: K8sGPT makes it easier to identify problems in Kubernetes clusters, letting users ask questions in natural language and get understandable answers.
Improved efficiency: by reducing the time needed to understand and resolve problems, K8sGPT increases operational efficiency.
Open access and collaborative development: as an open-source tool, K8sGPT benefits from the collective expertise of the developer community.
Multiple AI backends: openai, localai, azureopenai, cohere, amazonbedrock, amazonsagemaker, google, huggingface, noopai.
Innovative integration: K8sGPT is designed to fit easily into existing Kubernetes management processes and can be coupled with other services (trivy and prometheus for now).
Automation potential: future K8sGPT features could include auto-remediation of Kubernetes problems, further reducing the workload of operations teams.

Analysis scope

k8sgpt provides the following analyzers:

podAnalyzer
pvcAnalyzer
rsAnalyzer
serviceAnalyzer
eventAnalyzer
ingressAnalyzer
statefulSetAnalyzer
deploymentAnalyzer
cronJobAnalyzer
nodeAnalyzer
mutatingWebhookAnalyzer
validatingWebhookAnalyzer

Optional:

hpaAnalyzer
pdbAnalyzer
networkPolicyAnalyzer
gatewayClass
gateway
httproute

Using the K8sGPT CLI

After installing K8sGPT (https://docs.k8sgpt.ai/getting-started/installation/), you need to define the backends you want to use.

A. Adding a backend with the auth command: a few examples

openai: simply run k8sgpt auth add --backend openai; you will be prompted for your API key

azureopenai :

k8sgpt auth add  --backend azureopenai --baseurl https://zzzzzz.openai.azure.com/  --engine gpt-4 --model gpt-4

You will be prompted for your Azure OpenAI key.

The available k8sgpt auth subcommands are: add, default, list, remove, update

To use k8sgpt with a given backend, just pass the --backend flag.

If you run k8sgpt auth default [provider], that provider is used whenever you do not specify --backend.

B. Analyzing problems and enriching the results with AI

Let's deploy the manifest below, which is full of errors. We use it in the tests we give to our future DevOps/SRE hires.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pi-web
  labels:
    k8s.alterwaylabs.fr: troubleshooting
spec:
  replicas: 0
  selector:
    matchLabels:
      app: pi-web
  template:
    metadata:
      labels:
        app: pi-web
    spec:
      containers:
        - image: kiamol/ch05-pi-app
          command: ["donet", "Pi.Web.dll", "-m", "web"]
          name: web
          ports:
            - containerPort: 80
              name: http
          resources:
            limits:
              cpu: "32"
              memory: "128Gi"
          readinessProbe:
            tcpSocket:
              port: 8020
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthy
              port: 80
            periodSeconds: 30
            failureThreshold: 1
---

apiVersion: v1
kind: Service
metadata:
  name: pi-np
  labels:
    k8s.alterwaylabs.fr: troubleshooting
spec:
  selector:
    app: pi-web-pod
  ports:
    - name: http
      port: 8020
      targetPort: app
      nodePort: 30020
  type: NodePort

---
apiVersion: v1
kind: Service
metadata:
  name: pi-lb
  labels:
    k8s.alterwaylabs.fr: troubleshooting
spec:
  selector:
    app: pi-web-pod
  ports:
    - name: http
      port: 8020
      targetPort: app
  type: ClusterIP

What are the problems? Let's see how k8sgpt behaves.

kubectl get svc,deploy,po

❯ kubectl get svc,deploy,po
NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
service/kubernetes   ClusterIP   10.0.0.1      <none>        443/TCP          84d
service/pi-lb        ClusterIP   10.0.68.192   <none>        8020/TCP         78m
service/pi-np        NodePort    10.0.152.46   <none>        8020:30020/TCP   77m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/pi-web   0/0     0            0           78m

Let's run the following command:

k8sgpt analyze -b azureopenai --explain --with-doc -n default

The response:

 100% |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (2/2, 718 it/s)
AI Provider: azureopenai

0 default/pi-lb(pi-lb)
- Error: Service has no endpoints, expected label app=pi-web-pod
Error: The Kubernetes service is looking for pods with the label "app=pi-web-pod" to route traffic to, but it can't find any.

Solution:
1. Check if pods are running: `kubectl get pods -l app=pi-web-pod`
2. If no pods, deploy one with the label: `kubectl run pi-web-pod --image=<image> --labels="app=pi-web-pod"`
3. If pods exist but have different labels, update them with the correct label: `kubectl label pods <pod-name> app=pi-web-pod --overwrite`
1 default/pi-np(pi-np)
- Error: Service has no endpoints, expected label app=pi-web-pod
Error: The Kubernetes service is looking for pods with the label "app=pi-web-pod" to route traffic to, but it can't find any.

Solution:
1. Check if pods are running: `kubectl get pods -l app=pi-web-pod`
2. If no pods, deploy one with the label: `kubectl run pi-web-pod --image=<image> --labels="app=pi-web-pod"`
3. If pods exist but have different labels, update them with the correct label: `kubectl label pods <pod-name> app=pi-web-pod --overwrite`

Clearly the service selectors are wrong: a service's selector must match the labels set on the pods.

Pod labels: app: pi-web

Service selector: app: pi-web-pod

Let's change the service selectors to pi-web and apply the changes.
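The matching rule behind this fix can be sketched in a few lines of Python. This is only an illustration of equality-based label selection, not the way Kubernetes or k8sgpt implement it; the function name selector_matches is hypothetical.

```python
from typing import Dict


def selector_matches(selector: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """Return True when every key/value pair of the selector is present in the pod labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Labels from the Deployment's pod template in the manifest above.
pod_labels = {"app": "pi-web"}

# Selector as originally written in both Services, and the corrected one.
broken_selector = {"app": "pi-web-pod"}
fixed_selector = {"app": "pi-web"}

print(selector_matches(broken_selector, pod_labels))  # False
print(selector_matches(fixed_selector, pod_labels))   # True
```

A pod may carry extra labels and still match; only the pairs listed in the selector have to be present.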

kubectl get svc,deploy,po
NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
service/kubernetes   ClusterIP   10.0.0.1      <none>        443/TCP          84d
service/pi-lb        ClusterIP   10.0.68.192   <none>        8020/TCP         87m
service/pi-np        NodePort    10.0.152.46   <none>        8020:30020/TCP   85m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/pi-web   0/0     0            0           86m

Rerunning the previous k8sgpt command now gives a different result:

❯ k8sgpt analyze -b azureopenai   --explain --with-doc -n default
 100% |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (2/2, 1454 it/s)
AI Provider: azureopenai

0 default/pi-lb(pi-lb)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.
1 default/pi-np(pi-np)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.

The problem detected is that there is nothing (no pod) behind the service: the service does not select any pod. Looking closer, we see that the replica count is 0; the replicas attribute of the Deployment is indeed set to 0. Let's change it to 1 and apply the change.

spec:
  replicas: 1

Let's look at our deployment. Clearly there are still problems:

❯ kubectl get svc,deploy,po
NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
service/kubernetes   ClusterIP   10.0.0.1      <none>        443/TCP          84d
service/pi-lb        ClusterIP   10.0.68.192   <none>        8020/TCP         92m
service/pi-np        NodePort    10.0.152.46   <none>        8020:30020/TCP   91m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/pi-web   0/1     1            0           92m

NAME                          READY   STATUS    RESTARTS   AGE
pod/pi-web-58595777f5-5v476   0/1     Pending   0          9s

❯ k8sgpt analyze -b azureopenai   --explain --with-doc -n default
 100% |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (3/3, 38 it/min)
AI Provider: azureopenai

0 default/pi-np(pi-np)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.
1 default/pi-lb(pi-lb)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.
2 default/pi-web-58595777f5-5v476(Deployment/pi-web)
- Error: 0/4 nodes are available: 4 Insufficient cpu, 4 Insufficient memory. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
Error: All nodes lack enough CPU and memory to schedule your pod. Additionally, no pods can be preempted to free up resources.

Solution:
1. Check pod resource requests.
2. Reduce pod requests if excessive.
3. Scale up the cluster with more or larger nodes.
4. Adjust pod priority for preemption.

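Clearly, there is a problem with the requests or limits, for both CPU and memory. The manifest only sets limits, and when a container has a limit but no request, Kubernetes uses the limit as the request, which is what the scheduler checks against the nodes. Looking at the manifest, we see: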

          resources:
            limits:
              cpu: "32"
              memory: "128Gi"

We have plenty of resources at Alter Way ;) but the VMs deployed here are nowhere near that size.

It would have been nice for k8sgpt to give us the node information.

Perhaps specific functions will need to be implemented to provide this kind of additional information.
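To make the scheduling failure concrete, here is a rough Python sketch of the comparison involved. It is a simplification: the real scheduler compares requests against what is still free on each node, not the raw node size. The node size used below (4 CPUs, 16Gi) is a made-up value, not taken from the cluster in this article, and the helper names are hypothetical.

```python
from decimal import Decimal
from typing import Dict

# Subset of the suffixes accepted by Kubernetes for memory quantities.
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> Decimal:
    """Convert a CPU quantity such as "32" or "500m" to a number of cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "128Gi" to bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))


def fits(requests: Dict[str, str], allocatable: Dict[str, str]) -> bool:
    """Return True when both the CPU and memory requests fit in the given capacity."""
    return (
        parse_cpu(requests["cpu"]) <= parse_cpu(allocatable["cpu"])
        and parse_memory(requests["memory"]) <= parse_memory(allocatable["memory"])
    )


# Hypothetical node size, for illustration only.
node_allocatable = {"cpu": "4", "memory": "16Gi"}

original = {"cpu": "32", "memory": "128Gi"}  # values from the manifest
reduced = {"cpu": "1", "memory": "1Gi"}      # a far smaller request

print(fits(original, node_allocatable))  # False
print(fits(reduced, node_allocatable))   # True
```

On a real cluster, the allocatable values of each node appear in the output of kubectl describe node.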

Let's update the manifest with the following values and apply it.

          resources:
            limits:
              cpu: "1"
              memory: "1Gi"

Still problems, but of a different kind.

❯ kubectl get svc,deploy,po
NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
service/kubernetes   ClusterIP   10.0.0.1      <none>        443/TCP          84d
service/pi-lb        ClusterIP   10.0.68.192   <none>        8020/TCP         103m
service/pi-np        NodePort    10.0.152.46   <none>        8020:30020/TCP   102m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/pi-web   0/1     1            0           103m

NAME                          READY   STATUS             RESTARTS   AGE
pod/pi-web-58595777f5-5v476   0/1     Pending            0          11m
pod/pi-web-65f5874c6c-7clq8   0/1     ImagePullBackOff   0          2m39s

The infamous ImagePullBackOff...

What does k8sgpt tell us?

❯ k8sgpt analyze -b azureopenai   --explain --with-doc -n default
 100% |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (5/5, 19 it/min)
AI Provider: azureopenai

0 default/pi-web()
- Error: Deployment default/pi-web has 1 replicas but 2 are available
  Kubernetes Doc: Number of desired pods. This is a pointer to distinguish between explicit zero and not specified. Defaults to 1.
Error: The deployment "pi-web" in the "default" namespace is configured to have 1 replica, but there are 2 replicas running, which is inconsistent with the desired state.

Solution:
1. Verify deployment spec: `kubectl get deployment pi-web -o yaml`.
2. Check replicas: `kubectl get rs`.
3. If extra replica set, scale down: `kubectl scale deployment/pi-web --replicas=1`.
4. Confirm: `kubectl get deployment pi-web`.
1 default/pi-lb(pi-lb)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.
2 default/pi-np(pi-np)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.
3 default/pi-web-65f5874c6c-7clq8(Deployment/pi-web)
- Error: Back-off pulling image "kiamol/ch05-pi-app"
Error: Kubernetes has tried and failed several times to pull the container image "kiamol/ch05-pi-app" from the registry.

Solution:
1. Check image name/spelling.
2. Ensure Docker is logged in to registry.
3. Confirm image exists & is accessible.
4. Check network connectivity.
5. Inspect image pull policy & credentials.
4 default/pi-web-58595777f5-5v476(Deployment/pi-web)
- Error: 0/4 nodes are available: 4 Insufficient cpu, 4 Insufficient memory. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
Error: All nodes lack enough CPU and memory to schedule your pod. Additionally, no pods can be preempted to free up resources.

Solution:
1. Check pod resource requests.
2. Reduce pod requests if excessive.
3. Scale up the cluster with more or larger nodes.
4. Adjust pod priority for preemption.

Interesting: the first error is not really an error. It comes from the fact that we keep reapplying the same manifest and, since the pods never start, they pile up and are not cleaned up.

We will delete the deployment and recreate it to get clearer messages.

❯ k delete deploy pi-web && k apply -f moche.yaml

❯ kubectl get svc,deploy,po
NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
service/kubernetes   ClusterIP   10.0.0.1      <none>        443/TCP          84d
service/pi-lb        ClusterIP   10.0.68.192   <none>        8020/TCP         110m
service/pi-np        NodePort    10.0.152.46   <none>        8020:30020/TCP   109m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/pi-web   0/1     1            0           22s

NAME                          READY   STATUS             RESTARTS   AGE
pod/pi-web-65f5874c6c-9c8np   0/1     ImagePullBackOff   0          22s

OK, what does k8sgpt report?

0 default/pi-lb(pi-lb)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.
1 default/pi-np(pi-np)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.
2 default/pi-web-65f5874c6c-9c8np(Deployment/pi-web)
- Error: Back-off pulling image "kiamol/ch05-pi-app"
Error: Kubernetes has tried and failed several times to pull the container image "kiamol/ch05-pi-app" from the registry.

Solution:
1. Check image name/spelling.
2. Ensure Docker is logged in to registry.
3. Confirm image exists & is accessible.
4. Check network connectivity.
5. Inspect image pull policy & credentials.

The endpoint problem is still there, which is "normal" since the pod is not running. As for the image, we need to check whether the image name is correct and whether the registry requires authentication.

Looking at Docker Hub, we see that the image is kiamol/ch05-pi, not kiamol/ch05-pi-app.

Let's fix it and apply (delete, then re-apply).

k delete deploy pi-web && k apply -f moche.yaml
❯ kubectl get svc,deploy,po
NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
service/kubernetes   ClusterIP   10.0.0.1      <none>        443/TCP          84d
service/pi-lb        ClusterIP   10.0.68.192   <none>        8020/TCP         114m
service/pi-np        NodePort    10.0.152.46   <none>        8020:30020/TCP   113m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/pi-web   0/1     1            0           50s

NAME                         READY   STATUS             RESTARTS      AGE
pod/pi-web-ddff7649c-dvxgq   0/1     CrashLoopBackOff   2 (24s ago)   50s

CrashLoopBackOff. What explanation does k8sgpt give us?

❯ k8sgpt analyze -b azureopenai   --explain --with-doc -n default
 100% |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (3/3, 25 it/min)
AI Provider: azureopenai

0 default/pi-web-ddff7649c-dvxgq(Deployment/pi-web)
- Error: back-off 40s restarting failed container=web pod=pi-web-ddff7649c-dvxgq_default(422abf26-61df-4b49-bfab-9682f3233e95)
- Error: the last termination reason is StartError container=web pod=pi-web-ddff7649c-dvxgq
Error: The 'web' container in the 'pi-web-ddff7649c-dvxgq' pod is failing to start, causing Kubernetes to retry with back-off delays.

Solution:
1. Use `kubectl describe pod pi-web-ddff7649c-dvxgq` to see events and errors.
2. Check logs with `kubectl logs pi-web-ddff7649c-dvxgq -c web`.
3. Resolve any config or code issues.
4. Restart the pod if necessary.
1 default/pi-np(pi-np)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.
2 default/pi-lb(pi-lb)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.

OK, let's follow the recommendations step by step.

kubectl describe pod pi-web-ddff7649c-dvxgq

 Warning  Failed     112s (x4 over 2m44s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "donet": executable file not found in $PATH: unknown

Clearly, there is a problem with the container's startup command:

command: ["donet", "Pi.Web.dll", "-m", "web"]

Indeed, it is not donet but dotnet ;)

command: ["dotnet", "Pi.Web.dll", "-m", "web"]

Let's fix it...

❯ kubectl get svc,deploy,po
NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
service/kubernetes   ClusterIP   10.0.0.1      <none>        443/TCP          84d
service/pi-lb        ClusterIP   10.0.68.192   <none>        8020/TCP         119m
service/pi-np        NodePort    10.0.152.46   <none>        8020:30020/TCP   118m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/pi-web   0/1     1            0           15s

NAME                          READY   STATUS    RESTARTS   AGE
pod/pi-web-5b6fb6b4f5-pj7f9   0/1     Running   0          15s

Will k8sgpt find out why the pod is deployed and running, but not ready?

❯ k8sgpt analyze -b azureopenai   --explain --with-doc -n default
 100% |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (3/3, 28 it/min)
AI Provider: azureopenai

0 default/pi-np(pi-np)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.
1 default/pi-lb(pi-lb)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.
2 default/pi-web-5b6fb6b4f5-pj7f9(Deployment/pi-web)
- Error: Readiness probe failed: dial tcp 10.224.1.141:8020: connect: connection refused
Error: The service at IP 10.224.1.141 on port 8020 isn't accepting connections, causing the readiness probe to fail.

Solution:
1. Check the pod's logs for errors.
2. Verify the service is running on port 8020.
3. Ensure network policies allow traffic on port 8020.
4. Confirm the readiness probe's configuration is correct.
5. Restart the pod if necessary.

Well done! The details are very explicit.

Let's fix the readiness probe port, from:

          readinessProbe:
            tcpSocket:
              port: 8020

to:

          readinessProbe:
            tcpSocket:
              port: 80

❯ kubectl get svc,deploy,po
NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
service/kubernetes   ClusterIP   10.0.0.1      <none>        443/TCP          84d
service/pi-lb        ClusterIP   10.0.68.192   <none>        8020/TCP         124m
service/pi-np        NodePort    10.0.152.46   <none>        8020:30020/TCP   122m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/pi-web   1/1     1            1           58s

NAME                          READY   STATUS    RESTARTS      AGE
pod/pi-web-587d5878b5-25b98   1/1     Running   1 (27s ago)   58s

We can see that the pod has restarted, and running the k8sgpt command confirms there is a problem.

❯ k8sgpt analyze -b azureopenai   --explain --with-doc -n default
 100% |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (3/3, 19 it/min)
AI Provider: azureopenai

0 default/pi-lb(pi-lb)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.
1 default/pi-np(pi-np)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.
2 default/pi-web-587d5878b5-25b98(Deployment/pi-web)
- Error: back-off 40s restarting failed container=web pod=pi-web-587d5878b5-25b98_default(8d5fd70a-fd18-44b3-9cc4-257968ad8883)
- Error: the last termination reason is Completed container=web pod=pi-web-587d5878b5-25b98
Error: The 'web' container within the 'pi-web-587d5878b5-25b98' pod is failing to start after attempting to restart, possibly because the last run completed successfully but was not supposed to terminate.

Solution:
1. Check the logs: `kubectl logs pi-web-587d5878b5-25b98 -c web --previous`
2. Verify pod's restart policy.
3. Review container's exit condition.
4. Fix any application or configuration issues.
5. Apply updates with `kubectl apply` if needed.
6. Monitor pod status with `kubectl get pods`.

Let's follow the recommendations.

kubectl logs pi-web-587d5878b5-25b98 -c web --previous

Everything looks fine there, and steps 2 to 6 turn up nothing either.

Here k8sgpt falls short: it does not see the problem with the liveness probe :( A simple kubectl describe po reveals it.

Let's try something else and switch to interactive mode:

k8sgpt analyze -b azureopenai   --explain --with-doc -n default -i

I paste the kubectl describe output for the pod, including the liveness error, and ask what the problem is.

For issue 3, the liveness probe is failing with a 404 HTTP status code, suggesting that the application isn't responding correctly at the expected endpoint. You need to check the liveness probe configuration and ensure that the application is configured to respond to that probe at the right path and port. If necessary, update the probe configuration in the deployment to match the application's behavior.

To check the current configuration of the liveness probe, you can describe the pod:

kubectl describe pod pi-web-587d5878b5-25b98

Look for the `Liveness` section to see the current configuration and adjust it accordingly in your deployment YAML file.

Finally, apply any changes to your deployment:

OK, fixing this one takes a bit of experience. Let's simply replace path: /healthy with path: /
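Why a single 404 is enough to restart the container can be illustrated with a small Python simulation. This is a deliberately simplified model that ignores timing, delays and timeouts; the function names are hypothetical and the sequences of status codes are made up for the example. It relies on two facts: an HTTP probe succeeds for status codes from 200 to 399, and the manifest sets failureThreshold: 1 on the liveness probe.

```python
from typing import Iterable


def http_probe_succeeds(status_code: int) -> bool:
    """An HTTP probe counts as successful for status codes from 200 to 399."""
    return 200 <= status_code < 400


def restarts_triggered(status_codes: Iterable[int], failure_threshold: int) -> int:
    """Count restarts: one each time `failure_threshold` consecutive probes fail."""
    restarts = 0
    consecutive_failures = 0
    for code in status_codes:
        if http_probe_succeeds(code):
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restarts += 1
            consecutive_failures = 0  # the counter starts over after a restart
    return restarts


# /healthy answers 404 on every probe, / answers 200 (illustrative sequences).
print(restarts_triggered([404, 404, 404], failure_threshold=1))  # 3
print(restarts_triggered([404, 404, 404], failure_threshold=3))  # 1
print(restarts_triggered([200, 200, 200], failure_threshold=1))  # 0
```

With a threshold of 1, every failed probe restarts the container, which matches the restarts seen on a pod that otherwise runs fine.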

❯ k delete deploy pi-web && k apply -f moche.yaml
deployment.apps "pi-web" deleted
deployment.apps/pi-web created
service/pi-np unchanged
service/pi-lb unchanged
❯ kubectl get svc,deploy,po
NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
service/kubernetes   ClusterIP   10.0.0.1      <none>        443/TCP          84d
service/pi-lb        ClusterIP   10.0.68.192   <none>        8020/TCP         136m
service/pi-np        NodePort    10.0.152.46   <none>        8020:30020/TCP   135m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/pi-web   1/1     1            1           14s

NAME                          READY   STATUS    RESTARTS   AGE
pod/pi-web-86dcdc9876-46mps   1/1     Running   0          14s

Everything looks fine... but...

❯ k8sgpt analyze -b azureopenai   --explain --with-doc -n default
 100% |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (2/2, 603 it/s)
AI Provider: azureopenai

0 default/pi-lb(pi-lb)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.
1 default/pi-np(pi-np)
- Error: Service has no endpoints, expected label app=pi-web
Error: The service in Kubernetes is not linked to any pods because none match the selector 'app=pi-web'.

Solution:
1. Check pods have correct labels: `kubectl get pods --show-labels`.
2. If incorrect/missing, label pods: `kubectl label pods <pod-name> app=pi-web`.
3. Verify service endpoints: `kubectl get endpoints`.

Here too, a bit of experience is needed, and this time we have to look at the services.

The service ports reference a pod port named app (targetPort), whereas the port is named http in the deployment.
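The name lookup at play here can be sketched in Python. It is an illustration only, with a hypothetical function name: a numeric targetPort is used as is, while a named one has to exist among the container's named ports, otherwise the service has nothing to point at.

```python
from typing import Dict, Optional, Union


def resolve_target_port(
    target_port: Union[int, str], container_ports: Dict[str, int]
) -> Optional[int]:
    """Resolve a Service targetPort to a container port number, or None if the name is unknown."""
    if isinstance(target_port, int):
        return target_port
    return container_ports.get(target_port)


# Named ports declared by the container in the Deployment above.
container_ports = {"http": 80}

print(resolve_target_port("app", container_ports))   # None
print(resolve_target_port("http", container_ports))  # 80
```

This would explain why the services still report no endpoints even though the selector matches and the pod is ready.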

Let's fix it...

  ports:
    - name: http
      port: 8020
      targetPort: app

  ports:
    - name: http
      port: 8020
      targetPort: http
      nodePort: 30020

Come on, fingers crossed!!!

❯ k delete deploy pi-web && k apply -f moche.yaml
deployment.apps "pi-web" deleted
deployment.apps/pi-web created
service/pi-np configured
service/pi-lb configured
❯ k8sgpt analyze -b azureopenai   --explain --with-doc -n default
AI Provider: azureopenai

No problems detected

Hooray!

Conclusion

With k8sgpt I was able to fix all the errors in my deployment and services, with detailed instructions on the causes and how to remedy them.

For a junior SRE, this is a valuable time saver.

K8sGPT represents a significant step forward in the management of Kubernetes clusters. By using AI to understand and analyze problems, the tool gives SREs more power and simplicity for keeping Kubernetes systems healthy and performant. Its open-source nature and its potential for integration and automation make K8sGPT a promising solution for the future of cloud infrastructure management.