Troubleshooting

This guide helps you diagnose and resolve common issues encountered when working with cegedim.cloud Kubernetes clusters.

Rancher UI Issues

Rancher UI Stuck on "Loading"

Symptoms: After logging in, the Rancher UI displays "Loading" indefinitely.

Solution:

  1. Try accessing the direct dashboard URL:

  2. If the issue persists, log out and log back in

  3. Try using a different browser or incognito/private mode

  4. Clear your browser cache and cookies for the Rancher domain

Cluster Not Visible After First Login

Symptoms: After your first login to Rancher, your cluster doesn't appear in the cluster list.

Solution:

  1. Log out of Rancher completely

  2. Log back in

  3. Your cluster should now appear in the list

  4. If the cluster still doesn't appear, verify your access rights through ITCare or contact your administrator

Cannot Access Rancher (Connection Refused or Timeout)

Symptoms: Unable to reach rancher-et.cegedim.cloud or rancher-eb.cegedim.cloud.

Solution:

  1. Check network access: Some Rancher instances are only accessible from the server network

    • rancher-et.cegedim.cloud - Requires server network access (connect through bastion)

    • rancher-eb.cegedim.cloud (production) - Requires server network access (connect through bastion)

    • rancher-eb-qa.cegedim.cloud (non-production) - Accessible from standard network

  2. Verify Rancher status: Check whether a Rancher upgrade is in progress; upgrades typically take 15-30 minutes

kubectl Access Issues

kubectl Commands Fail with "Connection Refused"

Symptoms: kubectl commands return connection errors or timeouts.

Possible Causes and Solutions:

1. Rancher Proxy Issue

  • If your kubeconfig uses the Rancher URL, Rancher might be down or upgrading

  • Wait for Rancher to become available again

  • Consider using direct cluster access if available

2. Invalid or Expired Credentials

  • Download a fresh kubeconfig from the Rancher UI

  • Verify your token hasn't expired (check token lifecycle in Rancher)

3. Network Connectivity

  • Test connectivity: curl -v https://<rancher-url>

  • Verify you're on the correct network (bastion for ET/EB production)

  • Check firewall rules and proxy settings

kubectl Context Not Switching

Symptoms: kubectl commands affect the wrong cluster.

Solution:
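
A minimal sketch of checking and switching the active context with the standard kubectl config subcommands. The helper name use_cluster is hypothetical, and the context name must be one that exists in your own kubeconfig.

```shell
# List known contexts, switch to the requested one, then confirm the change.
# use_cluster is a hypothetical helper name; pass a context from your kubeconfig.
use_cluster() {
  kubectl config get-contexts                  # the active context is marked with '*'
  kubectl config use-context "$1" || return 1  # stop if the context does not exist
  kubectl config current-context               # should now print the requested context
}

# Usage: use_cluster my-cluster-context
```

For a one-off command, the global --context flag (kubectl --context my-cluster-context get pods) targets a cluster without changing the default context.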

Cluster Access and Authentication

"Forbidden" Errors When Running kubectl Commands

Symptoms: Commands return "Error from server (Forbidden): <resource> is forbidden".

Solution:

  1. Verify your access rights in Rancher: Check the "Manage Rights" page for your Project/Cluster permissions

  2. Use SelfSubjectAccessReview: kubectl auth can-i <verb> <resource> -n <namespace> submits a SelfSubjectAccessReview and reports whether you are allowed to perform that action

  3. Check Project/Namespace permissions: Ensure you have the correct role in the Project

  4. Verify AD group membership: Confirm you're in the correct G_K8_* groups

  5. Check token scope: Ensure you're using a cluster-scoped token for kubectl operations
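
The permission check in step 2 can be sketched as a loop over common verbs. The helper name check_access and the list of verbs are illustrative choices, not part of the platform.

```shell
# Check several verbs on one resource type in a namespace.
# kubectl auth can-i prints "yes" or "no" for each check.
check_access() {
  ns="$1"; resource="$2"
  for verb in get list create update delete; do
    # can-i exits non-zero on "no", so capture the answer instead of failing
    answer="$(kubectl auth can-i "$verb" "$resource" -n "$ns" 2>/dev/null)"
    echo "$verb $resource in $ns: ${answer:-unknown}"
  done
}

# Usage: check_access my-namespace deployments
```

To list everything your current credentials allow in a namespace, kubectl auth can-i --list -n my-namespace is also available.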

Cannot Create Resources in Namespace

Symptoms: Permission denied when creating pods, deployments, etc.

Solution:

  1. Verify the namespace belongs to a Project you have access to

  2. If the namespace was created via kubectl (not Rancher UI), it may be in the "Default" project with restricted access

  3. Contact your Project admin to move the namespace to the correct Project or grant permissions

Workload Issues

Pods Stuck in "Pending" State

Symptoms: Pods remain in "Pending" status and don't start.

Diagnosis:
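
A minimal sketch of the diagnosis, assuming you know the pod name and namespace. The helper name diagnose_pending_pod is hypothetical; the commands inside it are standard kubectl.

```shell
# Print scheduling details and recent events for a pod that stays Pending.
diagnose_pending_pod() {
  ns="$1"; pod="$2"
  # The Events section at the end of the output usually names the blocking condition
  kubectl describe pod "$pod" -n "$ns"
  # Events for this pod only, sorted by last occurrence
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}

# Usage: diagnose_pending_pod my-namespace my-pod
```

Match the event message against the causes listed in this section.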

Common Causes and Solutions:

1. Insufficient Resources

  • Message: "Insufficient cpu" or "Insufficient memory"

  • Solution: Request more nodes through ITCare or reduce resource requests

2. Persistent Volume Issues

  • Message: "persistentvolumeclaim not found" or "no persistent volumes available"

  • Solution: Verify PVC exists and storage class is correct

3. Node Selector/Affinity Mismatch

  • Message: "No nodes are available that match all of the following predicates"

  • Solution: Review nodeSelector and affinity rules

4. Image Pull Errors

  • Message: "Failed to pull image" or "ImagePullBackOff"

  • Solution: See "Image Pull Issues" section below

Image Pull Issues (ImagePullBackOff)

Symptoms: Pods fail with "ImagePullBackOff" or "ErrImagePull" status.

Diagnosis:
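
A minimal sketch of the diagnosis, assuming you know the failing pod and its namespace. The helper name diagnose_image_pull is hypothetical.

```shell
# Show which images a pod pulls, which pull secrets it uses, and the pull errors.
diagnose_image_pull() {
  ns="$1"; pod="$2"
  # Image references the pod is trying to pull
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].image}'
  echo
  # Pull secrets attached to the pod (an empty line means none)
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.imagePullSecrets[*].name}'
  echo
  # Events carry the registry's error text (not found, unauthorized, timeout)
  kubectl describe pod "$pod" -n "$ns"
}

# Usage: diagnose_image_pull my-namespace my-pod
```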

Common Causes and Solutions:

1. Private Registry Authentication

  • Verify that the image pull secret exists in the pod's namespace, or create it

  • Ensure the secret is referenced in the pod spec or in the pod's service account
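
One way to set this up is sketched below, assuming the pods run under the namespace's default service account. The helper name add_pull_secret and every argument value are placeholders supplied by you.

```shell
# Create a docker-registry secret and attach it to the default service account.
# All argument values are placeholders supplied by the caller.
add_pull_secret() {
  ns="$1"; secret="$2"; server="$3"; user="$4"; password="$5"
  kubectl create secret docker-registry "$secret" -n "$ns" \
    --docker-server="$server" \
    --docker-username="$user" \
    --docker-password="$password" || return 1
  # New pods using the default service account will then pull with this secret
  kubectl patch serviceaccount default -n "$ns" \
    -p "{\"imagePullSecrets\": [{\"name\": \"$secret\"}]}"
}

# Usage: add_pull_secret my-namespace regcred registry.example.com my-user my-password
```

Existing pods are not updated; they must be recreated to pick up the secret.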

2. Image Name Typo

  • Verify image name and tag are correct

  • Check registry URL is properly formatted

3. Network Connectivity to Registry

  • Verify cluster can reach external registry

  • Check if network policies block registry access

  • Request network opening through ITCare if needed

Ingress Not Routing Traffic

Symptoms: Cannot access application through ingress URL.

Diagnosis:
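
A minimal sketch of the diagnosis, assuming you know the ingress and the backend service it points to. The helper name diagnose_ingress is hypothetical.

```shell
# Inspect an ingress together with the service and endpoints behind it.
diagnose_ingress() {
  ns="$1"; ingress="$2"; service="$3"
  # Rules, backends, ingress class and recent events
  kubectl describe ingress "$ingress" -n "$ns"
  # The backend service must exist and expose the port the ingress targets
  kubectl get service "$service" -n "$ns"
  # An empty ENDPOINTS column means the service selector matches no ready pods
  kubectl get endpoints "$service" -n "$ns"
}

# Usage: diagnose_ingress my-namespace my-ingress my-service
```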

Common Causes and Solutions:

1. Incorrect Ingress Class

  • For Nginx (default): No class annotation needed or use kubernetes.io/ingress.class: "nginx"

  • For Nginx external: Use kubernetes.io/ingress.class: "nginx-ext"

  • For Traefik: Use appropriate Traefik ingress class

  • For Istio: Use Istio Gateway configuration

2. Service Not Found or Misconfigured

  • Verify service name and port match ingress backend

  • Check that service has endpoints (pods selected)

3. Certificate Issues

  • Default: *.yourclustername.ccs.cegedim.cloud certificate is pre-configured

  • Custom domains: Request certificate configuration through ITCare

Persistent Storage Issues

PVC Stuck in "Pending" State

Symptoms: PersistentVolumeClaim remains "Pending" and pods cannot start.

Diagnosis:
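
A minimal sketch of the diagnosis, assuming you know the claim name and namespace. The helper name diagnose_pvc is hypothetical.

```shell
# Show why a claim is Pending and which storage classes the cluster offers.
diagnose_pvc() {
  ns="$1"; pvc="$2"
  # Events explain why provisioning has not completed
  kubectl describe pvc "$pvc" -n "$ns"
  # Storage class requested by the claim (an empty line means the cluster default)
  kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.storageClassName}'
  echo
  # Storage classes actually available in the cluster
  kubectl get storageclass
}

# Usage: diagnose_pvc my-namespace my-claim
```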

Common Causes and Solutions:

1. Storage Class Not Found

  • Verify storage class name in PVC

  • List available storage classes: kubectl get storageclass

  • Use Ceph-based storage classes provided by cegedim.cloud

2. Storage Quota Exceeded

  • Check if storage quota is available

  • Request additional storage through ITCare

3. Ceph CSI Not Available

  • Verify Ceph CSI is enabled for your cluster

  • Contact support if Ceph CSI is not provisioned

Network Policy Issues

Pods Cannot Communicate Between Namespaces

Symptoms: Pods in different namespaces cannot reach each other.

Understanding Rancher Project Network Isolation:

  • Pods in namespaces within the same Rancher Project can communicate by default

  • Pods in namespaces in different Rancher Projects cannot communicate unless explicitly allowed

Solution:

  1. Option 1: Move namespaces to the same Rancher Project (if appropriate)

  2. Option 2: Create a NetworkPolicy to explicitly allow cross-project communication

  3. Note: Pods in the "System" project can communicate with all other projects
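
Option 2 can be sketched as follows. The helper allow_from_namespace is hypothetical; it prints a NetworkPolicy that admits traffic from every pod in one namespace to every pod in another. It assumes the source namespace carries the kubernetes.io/metadata.name label, which Kubernetes 1.21 and later set automatically. Narrow the podSelector and add ports before using it for real workloads.

```shell
# Print a NetworkPolicy that lets pods from source_ns reach all pods in target_ns.
# Relies on the kubernetes.io/metadata.name label that Kubernetes sets on namespaces.
allow_from_namespace() {
  target_ns="$1"; source_ns="$2"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-${source_ns}
  namespace: ${target_ns}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ${source_ns}
EOF
}

# Usage: allow_from_namespace backend frontend | kubectl apply -f -
```

Printing the manifest instead of applying it directly lets you review it, or commit it to version control, before it reaches the cluster.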

Pods Cannot Access External Services

Symptoms: Pods cannot reach internet or external services.

Understanding Network Restrictions:

  • By default, pods can only reach services within the same VLAN

  • Internet access requires proxy configuration or network opening

Solution:

  1. For internet access: Configure HTTP proxy in your pods or request network opening through ITCare

  2. For specific external services: Request network opening between VLANs through ITCare

  3. For external databases/APIs: Verify network policies and firewall rules
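
For the proxy option, one approach is to set the conventional proxy environment variables on the workload. This is a sketch: the helper name set_proxy_env, the proxy URL, and the NO_PROXY list are all placeholders, and whether your application honors these variables depends on its HTTP client.

```shell
# Set proxy variables on a deployment; changing the pod template triggers a rollout.
# The proxy URL and the NO_PROXY list are placeholders, not real platform values.
set_proxy_env() {
  ns="$1"; deployment="$2"; proxy_url="$3"
  kubectl set env "deployment/$deployment" -n "$ns" \
    HTTP_PROXY="$proxy_url" \
    HTTPS_PROXY="$proxy_url" \
    NO_PROXY="localhost,127.0.0.1,.svc,.cluster.local"
}

# Usage: set_proxy_env my-namespace my-app http://proxy.example.internal:3128
```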

Logging and Monitoring Issues

Logs Not Appearing in OpenSearch/ELK

Symptoms: Application logs are not visible in your log aggregation platform.

Diagnosis:
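
A minimal sketch of the diagnosis, assuming the Flow, Output and ClusterOutput custom resources are installed on your cluster and that your role allows you to list them. The helper name check_logging_pipeline is hypothetical.

```shell
# List the logging pipeline resources that apply to a namespace.
# Assumes the Flow/Output/ClusterOutput CRDs are installed on the cluster.
check_logging_pipeline() {
  ns="$1"
  # Flows select which pod logs are collected; Outputs say where they are sent
  kubectl get flows,outputs -n "$ns"
  # ClusterOutputs can be referenced by Flows; search all namespaces
  kubectl get clusteroutputs --all-namespaces
}

# Usage: check_logging_pipeline my-namespace
```

If the first command returns nothing, no Flow or Output is defined for the namespace; if the second is forbidden, ask your administrator which ClusterOutput to reference.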

Common Causes and Solutions:

1. Flow/Output Not Configured

  • Verify Flow and Output/ClusterOutput resources exist for your namespace

  • Check configuration matches your OpenSearch cluster

2. Conflicting Log Fields

  • OpenSearch/ELK rejects logs with field type conflicts

  • Check fluentd logs for "Rejected" messages

  • See detailed logging configuration in the "Get Started" guide

3. Application Producing Malformed JSON

  • Application logs must be properly formatted

  • Consider excluding problematic pods from logging Flow

Migration Issues

Application Fails After Migration to RKE2

Symptoms: Application worked on RKE but fails on RKE2.

Common Causes and Solutions:

1. Deprecated API Versions
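
Manifests that still use API versions removed from newer Kubernetes releases are rejected by the API server. A rough way to find them is to scan your manifest files before redeploying. The helper below is a sketch: its name is hypothetical, and the pattern covers only some well-known removals (for example extensions/v1beta1 and networking.k8s.io/v1beta1 Ingress, removed in 1.22, and batch/v1beta1 CronJob, removed in 1.25), so treat the list as incomplete.

```shell
# Scan manifest files for apiVersions that newer Kubernetes releases no longer serve.
# The pattern is a partial list; extend it for the versions relevant to your cluster.
find_removed_apis() {
  dir="$1"
  grep -rnE \
    'apiVersion:[[:space:]]*(extensions/v1beta1|apps/v1beta[12]|networking\.k8s\.io/v1beta1|batch/v1beta1|policy/v1beta1)' \
    "$dir"
}

# Usage: find_removed_apis ./manifests
```

Running kubectl api-versions against the new cluster shows which API groups and versions it actually serves.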

2. CNI Differences

  • If migrating to Cilium from Canal, network policies might behave differently

  • Review and test network policies after migration

3. Missing ConfigMaps or Secrets

  • Verify all ConfigMaps and Secrets were migrated

  • Check namespaces and names match exactly

4. External Integration Issues

  • Update CI/CD pipelines with new cluster kubeconfig

  • Reconfigure connections to Vault, databases, and other PaaS services

  • Update monitoring and alerting integrations

Getting Help

If you cannot resolve the issue using this guide:

Self-Service Resources

Contact Support

  • For non-urgent issues: Submit a ticket through ITCare

  • For production incidents: Contact the 24x7 support team (if you have subscribed to the 24x7 monitoring option)

  • For migration assistance: Submit a ticket with subject "RKE to RKE2 Migration Request"

Information to Provide When Requesting Support

To help support diagnose your issue quickly, please provide:

  1. Cluster Information:

    • Cluster name

    • Region (ET, EB)

    • Kubernetes version

  2. Problem Description:

    • What you were trying to do

    • What happened vs. what you expected

    • When the issue started

    • Any recent changes (deployments, upgrades, configuration changes)

  3. Relevant Details:

    • Namespace and resource names affected

    • Error messages from kubectl or Rancher UI

    • Output of relevant kubectl describe commands

    • Screenshots of Rancher UI errors (if applicable)

  4. Troubleshooting Already Performed:

    • Steps you've already tried

    • Results of those attempts
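
Much of this information can be gathered in one pass. The sketch below uses a hypothetical helper, collect_support_info, and only standard kubectl commands; review the output for sensitive values before attaching it to a ticket.

```shell
# Gather context, version, resource details and recent events for a support ticket.
collect_support_info() {
  ns="$1"; kind="$2"; name="$3"
  echo "== Context =="
  kubectl config current-context
  echo "== Versions =="
  kubectl version
  echo "== Resource =="
  kubectl describe "$kind" "$name" -n "$ns"
  echo "== Recent events =="
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}

# Usage: collect_support_info my-namespace pod my-pod > support-info.txt
```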
