This document describes common issues and their solutions when you use a TiDB cluster in Kubernetes.
When a Pod is in the CrashLoopBackOff state, the containers in the Pod quit continually. As a result, you cannot use kubectl exec or tkctl debug normally, which makes it inconvenient to diagnose issues.
To solve this problem, TiDB in Kubernetes provides the Pod diagnostic mode for the PD, TiKV, and TiDB components. In this mode, the containers in the Pod hang directly after starting and do not fall into a loop of repeated crashes. Then you can use kubectl exec or tkctl debug to connect to the Pod containers for diagnosis.
To use the diagnostic mode for troubleshooting:
Add an annotation to the Pod to be diagnosed:
kubectl annotate pod <pod-name> -n <namespace> runmode=debug
The next time the container in the Pod is restarted, it detects this annotation and enters the diagnostic mode.
Wait for the Pod to enter the Running state:
watch kubectl get pod <pod-name> -n <namespace>
Start the diagnosis. Here is an example of using kubectl exec to get into the container for diagnosis:
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
After finishing the diagnosis and resolving the problem, delete the Pod:
kubectl delete pod <pod-name> -n <namespace>
After the Pod is rebuilt, it automatically returns to the normal mode.
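As a quick sanity check, you can confirm whether a Pod currently carries the debug annotation (using the same annotation key as in the step above):

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.annotations.runmode}'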
TiDB Operator uses PV (Persistent Volume) and PVC (Persistent Volume Claim) to store persistent data. If you accidentally delete a cluster using helm delete, the PV/PVC objects and data are still retained to ensure data security.
To restore the cluster at this time, use the helm install command to create a cluster with the same name; the retained PV/PVC and data are then reused:
helm install pingcap/tidb-cluster -n <release-name> --namespace=<namespace> --version=<chart_version> -f values.yaml
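Before reinstalling, you can verify that the PV/PVC objects are indeed retained. The label selector below is an assumption based on the labels typically applied to cluster resources; adjust it to match your deployment:

kubectl get pvc -n <namespace> -l app.kubernetes.io/instance=<release-name>
kubectl get pv | grep <namespace>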
After creating a cluster using helm install, if the Pods are not created, you can diagnose the issue using the following commands:
kubectl get tidbclusters -n <namespace>
kubectl get statefulsets -n <namespace>
kubectl describe statefulsets -n <namespace> <release-name>-pd
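If the StatefulSets have not been created at all, the problem is usually on the operator side. As a further hedged check, assuming TiDB Operator is installed in the tidb-admin namespace with its default labels, you can inspect the controller manager log for reconciliation errors:

kubectl -n tidb-admin logs -l app.kubernetes.io/name=tidb-operator --tail=100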
In a TiDB cluster, you can access most Pods by using the Pod's domain name (allocated by the Headless Service). The exception is when TiDB Operator collects the cluster information or issues control commands, in which case it accesses the PD (Placement Driver) cluster using the service-name of the PD service.
When you find network connection issues between Pods from the log or monitoring metrics, or when the symptoms suggest that the network connection between Pods might be abnormal, follow this process to diagnose and narrow down the problem:
Confirm that the endpoints of the Service and Headless Service are normal:
kubectl -n <namespace> get endpoints <release-name>-pd
kubectl -n <namespace> get endpoints <release-name>-tidb
kubectl -n <namespace> get endpoints <release-name>-pd-peer
kubectl -n <namespace> get endpoints <release-name>-tikv-peer
kubectl -n <namespace> get endpoints <release-name>-tidb-peer
The ENDPOINTS field shown by the above commands should be a comma-separated list of pod_ip:port. If the field is empty or incorrect, check the health of the Pod and whether kube-controller-manager is working properly.
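Note that a Pod is listed in the endpoints of a regular Service only when it is Ready, so it helps to check Pod readiness alongside the endpoints. The label selector below is an assumption; adjust it to your deployment:

kubectl -n <namespace> get pod -l app.kubernetes.io/instance=<release-name> -o wide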
Enter the Pod’s Network Namespace to diagnose network problems:
tkctl debug -n <namespace> <pod-name>
After the remote shell is started, use the dig command to diagnose the DNS resolution:

dig <HOSTNAME>

If the DNS resolution is abnormal, refer to Debugging DNS Resolution for troubleshooting.

Then use the ping command to diagnose the connection with the destination IP (the ClusterIP resolved using dig):

ping <TARGET_IP>

If the ping check fails, refer to Debugging Kubernetes Networking for troubleshooting.

If the ping check succeeds, continue to check whether the target port is open by using telnet:

telnet <target_ip> <target_port>

If the telnet check fails, check whether the port corresponding to the Pod is correctly exposed and whether the applied port is correctly configured:
# Checks whether the ports are consistent.
kubectl -n <namespace> get po <pod-name> -ojson | jq '.spec.containers[].ports[].containerPort'

# Checks whether the application is correctly configured to serve the specified port.
# The default port of PD is 2379 when not configured.
kubectl -n <namespace> exec -it <pod-name> -- cat /etc/pd/pd.toml | grep client-urls

# The default port of TiKV is 20160 when not configured.
kubectl -n <namespace> exec -it <pod-name> -- cat /etc/tikv/tikv.toml | grep addr

# The default port of TiDB is 4000 when not configured.
kubectl -n <namespace> exec -it <pod-name> -- cat /etc/tidb/tidb.toml | grep port
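You can also cross-check the Service definition against the container ports. This is a hedged example; the jq expression assumes a standard Kubernetes Service spec:

kubectl -n <namespace> get svc <release-name>-tidb -ojson | jq '.spec.ports[] | {port, targetPort}'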
The Pending state of a Pod is usually caused by insufficient resources, for example:

- The StorageClass of the PVC used by the PD, TiKV, or Monitor Pod does not exist, or the available PV is insufficient.
- No node in the Kubernetes cluster can satisfy the CPU or memory resources requested by the Pod.
You can check the specific reason for Pending by using the kubectl describe pod command:
kubectl describe po -n <namespace> <pod-name>
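The Events section at the end of the describe output usually states the scheduling failure directly (for example, unbound PVCs or insufficient CPU). To view the events alone, you can also filter them by Pod name:

kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>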
If the CPU or memory resources are insufficient, you can lower the CPU or memory resources requested by the corresponding component for scheduling, or add a new Kubernetes node.
If the StorageClass of the PVC cannot be found, delete the TiDB Pod and the corresponding PVC. Then, in the values.yaml file, change storageClassName to the name of a StorageClass available in the cluster. Run the following command to list the StorageClasses available in the cluster:
kubectl get storageclass
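After choosing an available StorageClass, set it in values.yaml. The following is a minimal sketch of the relevant fields, assuming a typical tidb-cluster chart layout (key names may differ across chart versions):

pd:
  storageClassName: <available-storage-class>
tikv:
  storageClassName: <available-storage-class>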
If the StorageClass exists in the cluster but the available PV is insufficient, you need to add PV resources accordingly. For a Local PV, you can expand it by referring to Local PV Configuration.
A Pod in the CrashLoopBackOff state means that the container in the Pod aborts repeatedly, in a loop of abort, restart by kubelet, and abort again. There are many potential causes of CrashLoopBackOff. In this case, the most effective way to locate the cause is to view the log of the Pod's container:
kubectl -n <namespace> logs -f <pod-name>
If the log fails to help diagnose the problem, you can add the -p parameter to output the log of the previous container instance before it was restarted:
kubectl -n <namespace> logs -p <pod-name>
In addition, TiKV might fail to start when the ulimit value is insufficient. In this case, you can modify the /etc/security/limits.conf file of the Kubernetes node to increase the ulimit value:
root soft nofile 1000000
root hard nofile 1000000
root soft core unlimited
root soft stack 10240
If you cannot confirm the cause from the log and the ulimit value is also normal, troubleshoot further by using the diagnostic mode.
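To check the limits that a running TiKV container actually sees, you can read the limits of its main process. This sketch assumes the TiKV process is PID 1 in the container and that cat is available in the image:

kubectl -n <namespace> exec <tikv-pod-name> -- cat /proc/1/limits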
If you cannot access the TiDB service, first check whether the TiDB service is deployed successfully using the following method:
Check whether all components of the cluster are started and the status of each component is Running:
kubectl get po -n <namespace>
Check the log of the TiDB components to see whether errors are reported:
kubectl logs -f <tidb-pod-name> -n <namespace> -c tidb
If the cluster is successfully deployed, check the network using the following steps:
If you cannot access the TiDB service using NodePort, try to access the TiDB service using the service domain or clusterIP on the node. If the clusterIP works, the network within the Kubernetes cluster is normal. Then the possible issues are as follows:
Check whether the externalTrafficPolicy attribute of the TiDB service is Local. If it is Local, the client must access the TiDB service using the IP of the node where the TiDB Pod is located.
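You can check this attribute directly on the Service object; the jsonpath below refers to a standard Kubernetes Service field:

kubectl -n <namespace> get svc <release-name>-tidb -o jsonpath='{.spec.externalTrafficPolicy}'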
If you still cannot access the TiDB service using the service domain or clusterIP, connect using <PodIP>:4000 on the TiDB service backend. If the PodIP works, you can confirm that the problem is in the connection between the service domain and the PodIP, or between the clusterIP and the PodIP. Check the following items:
Check whether the DNS service works well:

kubectl get po -n kube-system -l k8s-app=kube-dns
dig <tidb-service-domain>
Check whether kube-proxy on each node is working:
kubectl get po -n kube-system -l k8s-app=kube-proxy
Check whether the TiDB service rule is correct in the iptables rules:

iptables-save -t nat | grep <clusterIP>
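If kube-proxy runs in IPVS mode, the service rules do not appear in iptables. In that case, you can inspect the IPVS table instead (assuming the ipvsadm tool is installed on the node):

ipvsadm -ln | grep -A 3 <clusterIP>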
Check whether the corresponding endpoint is correct.
If you cannot access the TiDB service even using the PodIP, the problem is at the Pod network level. Check the following items: