Troubleshooting

This guide helps you diagnose and resolve common issues encountered when working with cegedim.cloud Kubernetes clusters.

Rancher UI Issues

Rancher UI Stuck on "Loading"

Symptoms: After logging in, the Rancher UI displays "Loading" indefinitely.

Solution:

Try accessing the direct dashboard URL:
- For ET region: https://rancher-et.cegedim.cloud/dashboard/home
- For EB production: https://rancher-eb.cegedim.cloud/dashboard/home
- For EB non-production: https://rancher-eb-qa.cegedim.cloud/dashboard/home
If the issue persists, log out and log back in
Try using a different browser or incognito/private mode
Clear your browser cache and cookies for the Rancher domain

Cluster Not Visible After First Login

Symptoms: After your first login to Rancher, your cluster doesn't appear in the cluster list.

Solution:

Log out of Rancher completely
Log back in
Your cluster should now appear in the list
If the cluster still doesn't appear, verify your access rights through ITCare or contact your administrator

Cannot Access Rancher (Connection Refused or Timeout)

Symptoms: Unable to reach rancher-et.cegedim.cloud or rancher-eb.cegedim.cloud.

Solution:

Check network access: Some Rancher instances are only accessible from the server network
- rancher-et.cegedim.cloud - Requires server network access (connect through bastion)
- rancher-eb.cegedim.cloud (production) - Requires server network access (connect through bastion)
- rancher-eb-qa.cegedim.cloud (non-production) - Accessible from standard network
Verify Rancher status: Check if a Rancher upgrade is in progress (typically 15-30 minutes)
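
The reachability checks above can be scripted. The following is a minimal sketch, not an official tool: the function names rancher_host and probe_rancher and the short keys et, eb, and eb-qa are illustrative assumptions; only the hostnames come from this guide.

```shell
# Map a short environment key (hypothetical naming) to the Rancher hostname
# listed in this guide.
rancher_host() {
  case "$1" in
    et)    echo "rancher-et.cegedim.cloud" ;;
    eb)    echo "rancher-eb.cegedim.cloud" ;;
    eb-qa) echo "rancher-eb-qa.cegedim.cloud" ;;
    *)     echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Probe the dashboard URL and print the HTTP status code.
# curl reports 000 when the connection is refused or times out.
probe_rancher() {
  host=$(rancher_host "$1") || return 1
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "https://$host/dashboard/home") || true
  echo "$host -> HTTP $code"
}
```

Usage: probe_rancher eb-qa. A result of 000 for et or eb from a workstation outside the server network is consistent with the access restrictions listed above; the same result from the bastion points to Rancher itself being unavailable.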

kubectl Access Issues

kubectl Commands Fail with "Connection Refused"

Symptoms: kubectl commands return connection errors or timeouts.

Possible Causes and Solutions:

1. Rancher Proxy Issue

If your kubeconfig uses the Rancher URL, Rancher might be down or upgrading
Wait for Rancher to become available again
Consider using direct cluster access if available

2. Invalid or Expired Credentials

Download a fresh kubeconfig from Rancher UI
Verify your token hasn't expired (check token lifecycle in Rancher)

3. Network Connectivity

Test connectivity: curl -v https://<rancher-url>
Verify you're on the correct network (bastion for ET/EB production)
Check firewall rules and proxy settings
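
To tell the first and third causes apart, it can help to see which endpoint the current context actually targets. This is a sketch under one assumption: that Rancher-generated kubeconfigs use a server URL containing /k8s/clusters/<cluster-id>, which is the usual form for Rancher-proxied access. The function names are illustrative.

```shell
# Print the API server URL used by the current kubectl context.
current_server() {
  kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
}

# Return 0 when the URL has the /k8s/clusters/<id> path used for access
# proxied through Rancher, 1 otherwise.
is_rancher_proxied() {
  case "$1" in
    */k8s/clusters/*) return 0 ;;
    *)                return 1 ;;
  esac
}

# Report which path the current context takes and test basic reachability.
diagnose_endpoint() {
  server=$(current_server)
  if is_rancher_proxied "$server"; then
    echo "Proxied through Rancher: $server"
  else
    echo "Direct API server access: $server"
  fi
  # -k skips TLS verification for this reachability test only.
  curl -sk -o /dev/null -m 10 -w 'HTTP %{http_code}\n' "$server" || echo "unreachable"
}
```

Usage: diagnose_endpoint. If the context is proxied and the endpoint is unreachable while the network path is fine, a Rancher outage or upgrade is the more likely cause.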

kubectl Context Not Switching

Symptoms: kubectl commands affect the wrong cluster.

Solution:

# List all available contexts
kubectl config get-contexts

# Switch to the correct context
kubectl config use-context <context-name>

# Verify current context
kubectl config current-context

Cluster Access and Authentication

"Forbidden" Errors When Running kubectl Commands

Symptoms: Commands return an error such as "Error from server (Forbidden): <resource> is forbidden: ...".

Solution:

Verify your access rights in Rancher: Check the "Manage Rights" page for your Project/Cluster permissions
Use SelfSubjectAccessReview: Run the following command to check your permissions for specific resources:

kubectl create -f - -o yaml << EOF
apiVersion: authorization.k8s.io/v1
kind: SelfSubjectAccessReview
spec:
  resourceAttributes:
    group: ""
    resource: "*"
    verb: "*"
EOF

Check Project/Namespace permissions: Ensure you have the correct role in the Project
Verify AD group membership: Confirm you're in the correct G_K8_* groups
Check token scope: Ensure you're using a cluster-scoped token for kubectl operations
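
For a narrower check than the wildcard review above, kubectl auth can-i answers one verb and resource at a time. The helper below is a small sketch (check_verbs and the chosen verb list are illustrative, not part of the platform tooling):

```shell
# Print one line per verb showing whether the current user may perform it
# on the given resource in the given namespace.
check_verbs() {
  ns=$1
  resource=$2
  for verb in get list create update delete; do
    # "kubectl auth can-i" prints "yes" or "no"; it may exit non-zero on "no",
    # so the exit status is ignored here.
    answer=$(kubectl auth can-i "$verb" "$resource" -n "$ns" 2>/dev/null) || true
    echo "$verb $resource in $ns: ${answer:-unknown}"
  done
}
```

Usage: check_verbs <namespace> deployments. To see everything you are allowed to do in a namespace, kubectl auth can-i --list -n <namespace> prints the full rule set.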

Cannot Create Resources in Namespace

Symptoms: Permission denied when creating pods, deployments, etc.

Solution:

Verify the namespace belongs to a Project you have access to
If the namespace was created via kubectl (not Rancher UI), it may be in the "Default" project with restricted access
Contact your Project admin to move the namespace to the correct Project or grant permissions
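
To see which Project a namespace belongs to, you can read the annotation Rancher sets on the namespace. This sketch assumes the standard Rancher annotation field.cattle.io/projectId, whose value has the form <cluster-id>:<project-id>; the function names and the example IDs in the usage note are hypothetical.

```shell
# Print the Rancher project annotation of a namespace, for example
# "c-abc12:p-xyz34". An empty result suggests no project is assigned.
namespace_project() {
  kubectl get namespace "$1" \
    -o jsonpath='{.metadata.annotations.field\.cattle\.io/projectId}'
}

# Split "<cluster-id>:<project-id>" into its two parts.
cluster_part() { echo "${1%%:*}"; }
project_part() { echo "${1#*:}"; }
```

Usage: project_part "$(namespace_project <namespace>)". Comparing the result for a working namespace and a failing one shows quickly whether they sit in the same Project.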

Workload Issues

Pods Stuck in "Pending" State

Symptoms: Pods remain in "Pending" status and don't start.

Diagnosis:

# Check pod details
kubectl describe pod <pod-name> -n <namespace>

# Look for events at the bottom of the output

Common Causes and Solutions:

1. Insufficient Resources

Message: "Insufficient cpu" or "Insufficient memory"
Solution: Request more nodes through ITCare or reduce resource requests

2. Persistent Volume Issues

Message: "persistentvolumeclaim not found" or "no persistent volumes available"
Solution: Verify PVC exists and storage class is correct

3. Node Selector/Affinity Mismatch

Message: "No nodes are available that match all of the following predicates" or, on recent Kubernetes versions, "node(s) didn't match Pod's node affinity/selector"
Solution: Review nodeSelector and affinity rules

4. Image Pull Errors

Message: "Failed to pull image" or "ImagePullBackOff"
Solution: See "Image Pull Issues" section below
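
The four causes above can be matched against event messages automatically. This is a rough sketch: the category names and the classify_pending and triage_pending helpers are illustrative, and the patterns cover only the messages quoted in this section.

```shell
# Map an event message to one of the causes listed above.
classify_pending() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*) echo "insufficient-resources" ;;
    *persistentvolumeclaim*|*"persistent volumes"*) echo "persistent-volume" ;;
    *"node affinity"*|*"node selector"*|*predicates*) echo "node-selector-affinity" ;;
    *"Failed to pull image"*|*ImagePullBackOff*|*ErrImagePull*) echo "image-pull" ;;
    *) echo "unknown" ;;
  esac
}

# List Pending pods in a namespace, then classify the latest event of each.
triage_pending() {
  ns=$1
  kubectl get pods -n "$ns" --field-selector=status.phase=Pending -o name |
  while read -r pod; do
    name=${pod#pod/}
    msg=$(kubectl get events -n "$ns" \
      --field-selector "involvedObject.name=$name" \
      --sort-by=.lastTimestamp -o jsonpath='{.items[-1:].message}')
    echo "$name: $(classify_pending "$msg")"
  done
}
```

Usage: triage_pending <namespace>. Pods reported as "unknown" still need a manual look at kubectl describe pod, since events expire and some messages are not covered by these patterns.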

Image Pull Issues (ImagePullBackOff)

Symptoms: Pods fail with "ImagePullBackOff" or "ErrImagePull" status.

Diagnosis:

kubectl describe pod <pod-name> -n <namespace>
# Look for "Failed to pull image" messages

Common Causes and Solutions:

1. Private Registry Authentication

Create or verify image pull secret exists
Ensure secret is referenced in pod spec or service account

2. Image Name Typo

Verify image name and tag are correct
Check registry URL is properly formatted

3. Network Connectivity to Registry

Verify cluster can reach external registry
Check if network policies block registry access
Request network opening through ITCare if needed
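
For the private-registry case, the secret can be created and attached from the command line. A sketch, assuming placeholder values throughout: the secret name, registry server, and credentials are yours to supply, and the helper names are illustrative.

```shell
# Create a docker-registry secret; every argument is a placeholder you supply.
create_pull_secret() {
  ns=$1; name=$2; server=$3; user=$4; password=$5
  kubectl create secret docker-registry "$name" -n "$ns" \
    --docker-server="$server" \
    --docker-username="$user" \
    --docker-password="$password"
}

# Build the JSON patch that references a pull secret from a service account.
pull_secret_patch() {
  printf '{"imagePullSecrets":[{"name":"%s"}]}' "$1"
}

# Attach the secret to a service account so pods using it can pull the image.
attach_pull_secret() {
  ns=$1; sa=$2; name=$3
  kubectl patch serviceaccount "$sa" -n "$ns" -p "$(pull_secret_patch "$name")"
}
```

Usage: attach_pull_secret <namespace> default <secret-name>. Check the service account's existing imagePullSecrets first, because the patch may replace that list rather than append to it. Pods that already exist must be recreated to pick up the change.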

Ingress Not Routing Traffic

Symptoms: Cannot access application through ingress URL.

Diagnosis:

# Check ingress configuration
kubectl get ingress -n <namespace>
kubectl describe ingress <ingress-name> -n <namespace>

# Verify service and endpoints
kubectl get svc -n <namespace>
kubectl get endpoints <service-name> -n <namespace>

Common Causes and Solutions:

1. Incorrect Ingress Class

For Nginx (default): No class annotation needed or use kubernetes.io/ingress.class: "nginx"
For Nginx external: Use kubernetes.io/ingress.class: "nginx-ext"
For Traefik: Use appropriate Traefik ingress class
For Istio: Use Istio Gateway configuration

2. Service Not Found or Misconfigured

Verify service name and port match ingress backend
Check that service has endpoints (pods selected)

3. Certificate Issues

Default: A wildcard certificate for *.<yourclustername>.ccs.cegedim.cloud is pre-configured
Custom domains: Request certificate configuration through ITCare
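
The first two causes can be checked with two small helpers. This is a sketch with illustrative function names; it reads the ingress class from both the annotation used in this guide and the spec.ingressClassName field, since an Ingress may use either.

```shell
# Return 0 when the service has at least one ready endpoint address.
service_has_endpoints() {
  ns=$1; svc=$2
  ips=$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null) || true
  [ -n "$ips" ]
}

# Print the ingress class of an Ingress: annotation value, a space, then
# spec.ingressClassName. Either part may be empty.
ingress_class() {
  ns=$1; ing=$2
  kubectl get ingress "$ing" -n "$ns" \
    -o jsonpath='{.metadata.annotations.kubernetes\.io/ingress\.class}{" "}{.spec.ingressClassName}'
}
```

Usage: service_has_endpoints <namespace> <service-name> || echo "no ready pods behind the service". A service without endpoints usually means its selector matches no ready pods, in which case the ingress configuration is not the problem.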

Persistent Storage Issues

PVC Stuck in "Pending" State

Symptoms: PersistentVolumeClaim remains "Pending" and pods cannot start.

Diagnosis:

kubectl describe pvc <pvc-name> -n <namespace>
# Look for error messages in Events

Common Causes and Solutions:

1. Storage Class Not Found

Verify storage class name in PVC
List available storage classes: kubectl get storageclass
Use Ceph-based storage classes provided by cegedim.cloud

2. Storage Quota Exceeded

Check if storage quota is available
Request additional storage through ITCare

3. Ceph CSI Not Available

Verify Ceph CSI is enabled for your cluster
Contact support if Ceph CSI is not provisioned
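
The storage class check can be automated by comparing the class a PVC requests with the classes the cluster offers. A minimal sketch; the helper names are illustrative, and no specific class names are assumed.

```shell
# Read "kubectl get storageclass -o name" output on stdin and return 0 when
# the given class name is present.
class_in_list() {
  grep -qxF "storageclass.storage.k8s.io/$1"
}

# Report whether the storage class requested by a PVC exists in the cluster.
check_pvc_class() {
  ns=$1; pvc=$2
  class=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.storageClassName}')
  if [ -z "$class" ]; then
    echo "PVC $pvc sets no storageClassName; the default class applies, if one exists"
  elif kubectl get storageclass -o name | class_in_list "$class"; then
    echo "storage class '$class' exists"
  else
    echo "storage class '$class' not found"
  fi
}
```

Usage: check_pvc_class <namespace> <pvc-name>. If the class exists and the PVC is still Pending, the quota and Ceph CSI causes are the next things to check.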

Network Policy Issues

Pods Cannot Communicate Between Namespaces

Symptoms: Pods in different namespaces cannot reach each other.

Understanding Rancher Project Network Isolation:

Pods in namespaces within the same Rancher Project can communicate by default
Pods in namespaces in different Rancher Projects cannot communicate unless explicitly allowed

Solution:

Option 1: Move namespaces to the same Rancher Project (if appropriate)
Option 2: Create a NetworkPolicy to explicitly allow cross-project communication
Note: Pods in the "System" project can communicate with all other projects
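
Option 2 could look like the sketch below, which prints a NetworkPolicy allowing all pods of one namespace to reach all pods of another. It rests on two assumptions: that namespaces carry the kubernetes.io/metadata.name label (set automatically by recent Kubernetes versions), and that an additional allow rule is sufficient on your cluster. The function name and policy name are illustrative; tighten podSelector and add ports for real workloads.

```shell
# Print a NetworkPolicy that allows ingress into every pod of the target
# namespace from every pod of the source namespace.
render_allow_from_namespace() {
  target_ns=$1; source_ns=$2
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-$source_ns
  namespace: $target_ns
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: $source_ns
EOF
}
```

Usage: render_allow_from_namespace <target-namespace> <source-namespace> | kubectl apply -f -. Review the rendered manifest before applying it.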

Pods Cannot Access External Services

Symptoms: Pods cannot reach internet or external services.

Understanding Network Restrictions:

By default, pods can only reach services within the same VLAN
Internet access requires proxy configuration or network opening

Solution:

For internet access: Configure HTTP proxy in your pods or request network opening through ITCare
For specific external services: Request network opening between VLANs through ITCare
For external databases/APIs: Verify network policies and firewall rules
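
Proxy settings are commonly passed to pods as environment variables. The sketch below assumes your application honors HTTP_PROXY, HTTPS_PROXY, and NO_PROXY, which not every runtime does; the proxy URL is a placeholder to obtain from your platform team, and the helper names are illustrative.

```shell
# Build a NO_PROXY value that keeps local and in-cluster traffic off the
# proxy. An optional first argument is appended to the list.
default_no_proxy() {
  echo "localhost,127.0.0.1,.svc,.cluster.local${1:+,$1}"
}

# Set proxy variables on a deployment. The proxy URL and bypass list are
# placeholders; use the values provided for your environment.
set_proxy_env() {
  ns=$1; deploy=$2; proxy=$3; bypass=$4
  kubectl set env "deployment/$deploy" -n "$ns" \
    HTTP_PROXY="$proxy" HTTPS_PROXY="$proxy" NO_PROXY="$bypass"
}
```

Usage: set_proxy_env <namespace> <deployment> <proxy-url> "$(default_no_proxy)". Changing the environment triggers a rollout of the deployment, so the new pods start with the proxy settings.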

Logging and Monitoring Issues

Logs Not Appearing in OpenSearch/ELK

Symptoms: Application logs are not visible in your log aggregation platform.

Diagnosis:

# Check if logging pods are running
kubectl get pods -n cattle-logging-system

# Check buffer size (should not grow continuously)
kubectl -n cattle-logging-system get po -l app.kubernetes.io/name=fluentd -o name | \
  xargs -I {} sh -c "kubectl -n cattle-logging-system exec {} -c fluentd -- du -hs /buffers"

Common Causes and Solutions:

1. Flow/Output Not Configured

Verify Flow and Output/ClusterOutput resources exist for your namespace
Check configuration matches your OpenSearch cluster

2. Conflicting Log Fields

OpenSearch/ELK rejects logs with field type conflicts
Check fluentd logs for "Rejected" messages
See detailed logging configuration in the "Get Started" guide

3. Application Producing Malformed JSON

Application logs must be properly formatted
Consider excluding problematic pods from logging Flow
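
The first two causes can be checked from the command line. This sketch assumes the logging stack exposes the flows, outputs, clusterflows, and clusteroutputs resource types, and that fluentd writes its messages to the container log; if it logs to a file inside the container instead, adapt fluentd_rejections accordingly. The helper names are illustrative.

```shell
# List the logging resources that apply to a namespace.
list_logging_resources() {
  kubectl get flows,outputs -n "$1"
  kubectl get clusterflows,clusteroutputs -n cattle-logging-system
}

# Count lines on stdin that mention a rejected record.
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
count_rejected() {
  grep -c "Rejected" || true
}

# Count rejections in the recent output of each fluentd pod.
fluentd_rejections() {
  kubectl -n cattle-logging-system get po -l app.kubernetes.io/name=fluentd -o name |
  while read -r pod; do
    n=$(kubectl -n cattle-logging-system logs "$pod" -c fluentd --tail=1000 | count_rejected)
    echo "$pod: $n"
  done
}
```

Usage: list_logging_resources <namespace>, then fluentd_rejections. A non-zero count points to field type conflicts or malformed records rather than a missing Flow or Output.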

Migration Issues

Application Fails After Migration to RKE2

Symptoms: Application worked on RKE but fails on RKE2.

Common Causes and Solutions:

1. Deprecated API Versions

Run the kubent (kube-no-trouble) tool before migration to detect deprecated APIs
Update manifests to use current API versions
See the Kubernetes Deprecated API Migration Guide

2. CNI Differences

If migrating to Cilium from Canal, network policies might behave differently
Review and test network policies after migration

3. Missing ConfigMaps or Secrets

Verify all ConfigMaps and Secrets were migrated
Check namespaces and names match exactly

4. External Integration Issues

Update CI/CD pipelines with new cluster kubeconfig
Reconfigure connections to Vault, databases, and other PaaS services
Update monitoring and alerting integrations
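
For the missing ConfigMaps and Secrets case, a name-level comparison between the old and new cluster is a quick first check. A sketch, assuming both clusters are reachable as kubectl contexts; the context names and helper names are placeholders. It compares names only, not contents.

```shell
# Write the sorted names of ConfigMaps and Secrets in a namespace of the
# given kubectl context to a file.
dump_config_names() {
  context=$1; ns=$2; outfile=$3
  kubectl --context "$context" get configmap,secret -n "$ns" -o name | sort > "$outfile"
}

# Print names present in the first (old) list but absent from the second (new).
# Both files must be sorted, which dump_config_names guarantees.
missing_in_new() {
  comm -23 "$1" "$2"
}
```

Usage: dump_config_names <old-context> <namespace> old.txt; dump_config_names <new-context> <namespace> new.txt; missing_in_new old.txt new.txt. Entries generated by the cluster itself can differ between clusters and are expected in the output.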

Getting Help

If you cannot resolve the issue using this guide:

Self-Service Resources

Check the Kubernetes official documentation
Review Rancher documentation
Consult other sections of this documentation (Features, Get Started, etc.)

Contact Support

For non-urgent issues: Submit a ticket through ITCare
For production incidents: Contact the 24x7 support team (if you have the 24x7 monitoring option)
For migration assistance: Submit a ticket with subject "RKE to RKE2 Migration Request"

Information to Provide When Requesting Support

To help support diagnose your issue quickly, please provide:

Cluster Information:
- Cluster name
- Region (ET, EB)
- Kubernetes version
Problem Description:
- What you were trying to do
- What happened vs. what you expected
- When the issue started
- Any recent changes (deployments, upgrades, configuration changes)
Relevant Details:
- Namespace and resource names affected
- Error messages from kubectl or Rancher UI
- Output of relevant kubectl describe commands
- Screenshots of Rancher UI errors (if applicable)
Troubleshooting Already Performed:
- Steps you've already tried
- Results of those attempts
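
Much of the information above can be collected in one pass. The helper below is a sketch, not a supported tool: collect_diagnostics and its section headings are illustrative, and you may want to add kubectl describe output for the affected resources.

```shell
# Collect basic diagnostics for one namespace into a text file.
collect_diagnostics() {
  ns=$1; outfile=$2
  {
    echo "== context =="
    kubectl config current-context
    echo "== version =="
    kubectl version
    echo "== pods in $ns =="
    kubectl get pods -n "$ns" -o wide
    echo "== recent events in $ns =="
    kubectl get events -n "$ns" --sort-by=.lastTimestamp
  } > "$outfile" 2>&1
  echo "wrote $outfile"
}
```

Usage: collect_diagnostics <namespace> diagnostics.txt. Review the file for sensitive data before attaching it to a ticket.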

Most issues can be resolved quickly with proper diagnostics. Gather the relevant information before submitting a support ticket; it helps the support team assist you faster.
