# Troubleshooting

This guide helps you diagnose and resolve common issues encountered when working with **cegedim.cloud** Kubernetes clusters.

## Rancher UI Issues <a href="#k8stroubleshooting-rancheruiissues" id="k8stroubleshooting-rancheruiissues"></a>

### Rancher UI Stuck on "Loading"

**Symptoms**: After logging in, the Rancher UI displays "Loading" indefinitely.

**Solution**:

1. Try accessing the direct dashboard URL:
   * For ET region: <https://rancher-et.cegedim.cloud/dashboard/home>
   * For EB production: <https://rancher-eb.cegedim.cloud/dashboard/home>
   * For EB non-production: <https://rancher-eb-qa.cegedim.cloud/dashboard/home>
2. If the issue persists, log out and log back in
3. Try using a different browser or **incognito/private mode**
4. Clear your browser cache and cookies for the Rancher domain

### Cluster Not Visible After First Login

**Symptoms**: After your first login to Rancher, your cluster doesn't appear in the cluster list.

**Solution**:

1. Log out of Rancher completely
2. Log back in
3. Your cluster should now appear in the list
4. If the cluster still doesn't appear, verify your access rights through ITCare or contact your administrator

### Cannot Access Rancher (Connection Refused or Timeout)

**Symptoms**: Unable to reach rancher-et.cegedim.cloud or rancher-eb.cegedim.cloud.

**Solution**:

1. **Check network access**: Some Rancher instances are only accessible from the server network
   * **rancher-et.cegedim.cloud** - Requires server network access (connect through bastion)
   * **rancher-eb.cegedim.cloud** (production) - Requires server network access (connect through bastion)
   * **rancher-eb-qa.cegedim.cloud** (non-production) - Accessible from standard network
2. **Verify Rancher status**: Check whether a Rancher upgrade is in progress (upgrades typically take 15-30 minutes)

## kubectl Access Issues <a href="#k8stroubleshooting-kubectlaccessissues" id="k8stroubleshooting-kubectlaccessissues"></a>

### kubectl Commands Fail with "Connection Refused"

**Symptoms**: kubectl commands return connection errors or timeouts.

**Possible Causes and Solutions**:

**1. Rancher Proxy Issue**

* If your kubeconfig uses the Rancher URL, Rancher might be down or upgrading
* Wait for Rancher to become available again
* Consider using direct cluster access if available

**2. Invalid or Expired Credentials**

* Download a fresh kubeconfig from Rancher UI
* Verify your token hasn't expired (check token lifecycle in Rancher)

**3. Network Connectivity**

* Test connectivity: `curl -v https://<rancher-url>`
* Verify you're on the correct network (bastion for ET/EB production)
* Check firewall rules and proxy settings

### kubectl Context Not Switching

**Symptoms**: kubectl commands affect the wrong cluster.

**Solution**:

```bash
# List all available contexts
kubectl config get-contexts

# Switch to the correct context
kubectl config use-context <context-name>

# Verify current context
kubectl config current-context
```

## Cluster Access and Authentication <a href="#k8stroubleshooting-clusteraccessandauthentication" id="k8stroubleshooting-clusteraccessandauthentication"></a>

### "Forbidden" Errors When Running kubectl Commands

**Symptoms**: Commands return errors like "Error from server (Forbidden): \<resource> is forbidden".

**Solution**:

1. **Verify your access rights in Rancher**: Check the "Manage Rights" page for your Project/Cluster permissions
2. **Use SelfSubjectAccessReview**: Run the following command to check your permissions for specific resources:

```bash
kubectl create -f - -o yaml << EOF
apiVersion: authorization.k8s.io/v1
kind: SelfSubjectAccessReview
spec:
  resourceAttributes:
    group: ""
    resource: "*"
    verb: "*"
EOF
```

3. **Check Project/Namespace permissions**: Ensure you have the correct role in the Project
4. **Verify AD group membership**: Confirm you're in the correct G\_K8\_\* groups
5. **Check token scope**: Ensure you're using a cluster-scoped token for kubectl operations

### Cannot Create Resources in Namespace

**Symptoms**: Permission denied when creating pods, deployments, etc.

**Solution**:

1. Verify the namespace belongs to a Project you have access to
2. If the namespace was created via kubectl (not Rancher UI), it may be in the "Default" project with restricted access
3. Contact your Project admin to move the namespace to the correct Project or grant permissions

## Workload Issues <a href="#k8stroubleshooting-workloadissues" id="k8stroubleshooting-workloadissues"></a>

### Pods Stuck in "Pending" State

**Symptoms**: Pods remain in "Pending" status and don't start.

**Diagnosis**:

```bash
# Check pod details
kubectl describe pod <pod-name> -n <namespace>

# Look for events at the bottom of the output
```

**Common Causes and Solutions**:

**1. Insufficient Resources**

* Message: "Insufficient cpu" or "Insufficient memory"
* Solution: Request more nodes through ITCare or reduce resource requests
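
If the pod's requests are larger than any node can satisfy, lowering them in the workload spec often unblocks scheduling. A minimal sketch (the workload name, image, and values are illustrative, not cluster defaults):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0   # hypothetical image
          resources:
            requests:           # what the scheduler reserves per pod
              cpu: "250m"
              memory: "256Mi"
            limits:             # hard caps enforced at runtime
              cpu: "500m"
              memory: "512Mi"
```

Requests drive scheduling decisions; limits only cap runtime usage, so a pending pod is almost always a requests problem.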

**2. Persistent Volume Issues**

* Message: "persistentvolumeclaim not found" or "no persistent volumes available"
* Solution: Verify PVC exists and storage class is correct

**3. Node Selector/Affinity Mismatch**

* Message: "No nodes are available that match all of the following predicates"
* Solution: Review nodeSelector and affinity rules

**4. Image Pull Errors**

* Message: "Failed to pull image" or "ImagePullBackOff"
* Solution: See "Image Pull Issues" section below

### Image Pull Issues (ImagePullBackOff)

**Symptoms**: Pods fail with "ImagePullBackOff" or "ErrImagePull" status.

**Diagnosis**:

```bash
kubectl describe pod <pod-name> -n <namespace>
# Look for "Failed to pull image" messages
```

**Common Causes and Solutions**:

**1. Private Registry Authentication**

* Create or verify image pull secret exists
* Ensure secret is referenced in pod spec or service account
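
A pull secret can be created with `kubectl create secret docker-registry` and then referenced from the pod spec. A sketch with placeholder names (adjust the secret and image to your registry):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  imagePullSecrets:
    - name: my-registry-secret   # hypothetical secret; must exist in the same namespace
  containers:
    - name: my-app
      image: registry.example.com/team/my-app:1.0   # hypothetical private image
```

Alternatively, attaching the secret to the namespace's service account avoids repeating `imagePullSecrets` in every pod spec.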

**2. Image Name Typo**

* Verify image name and tag are correct
* Check registry URL is properly formatted

**3. Network Connectivity to Registry**

* Verify cluster can reach external registry
* Check if network policies block registry access
* Request network opening through ITCare if needed

### Ingress Not Routing Traffic

**Symptoms**: Cannot access application through ingress URL.

**Diagnosis**:

```bash
# Check ingress configuration
kubectl get ingress -n <namespace>
kubectl describe ingress <ingress-name> -n <namespace>

# Verify service and endpoints
kubectl get svc -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
```

**Common Causes and Solutions**:

**1. Incorrect Ingress Class**

* For Nginx (default): No class annotation needed or use `kubernetes.io/ingress.class: "nginx"`
* For Nginx external: Use `kubernetes.io/ingress.class: "nginx-ext"`
* For Traefik: Use appropriate Traefik ingress class
* For Istio: Use Istio Gateway configuration
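
For the default Nginx case, a minimal ingress sketch (host, service name, and port are placeholders; note that on recent Kubernetes versions the `spec.ingressClassName` field replaces the deprecated annotation):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    kubernetes.io/ingress.class: "nginx"   # legacy form; spec.ingressClassName is preferred
spec:
  rules:
    - host: my-app.yourclustername.ccs.cegedim.cloud   # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app          # must match an existing Service in this namespace
                port:
                  number: 80          # must match the Service port
```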

**2. Service Not Found or Misconfigured**

* Verify service name and port match ingress backend
* Check that service has endpoints (pods selected)

**3. Certificate Issues**

* Default: `*.yourclustername.ccs.cegedim.cloud` certificate is pre-configured
* Custom domains: Request certificate configuration through ITCare

## Persistent Storage Issues <a href="#k8stroubleshooting-persistentstorageissues" id="k8stroubleshooting-persistentstorageissues"></a>

### PVC Stuck in "Pending" State

**Symptoms**: PersistentVolumeClaim remains "Pending" and pods cannot start.

**Diagnosis**:

```bash
kubectl describe pvc <pvc-name> -n <namespace>
# Look for error messages in Events
```

**Common Causes and Solutions**:

**1. Storage Class Not Found**

* Verify storage class name in PVC
* List available storage classes: `kubectl get storageclass`
* Use Ceph-based storage classes provided by cegedim.cloud
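
A PVC sketch for reference; replace the storage class name with one actually returned by `kubectl get storageclass` on your cluster (the name below is hypothetical):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data
  namespace: my-namespace        # placeholder namespace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ceph-block   # hypothetical; use a class listed by kubectl get storageclass
  resources:
    requests:
      storage: 10Gi
```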

**2. Storage Quota Exceeded**

* Check if storage quota is available
* Request additional storage through ITCare

**3. Ceph CSI Not Available**

* Verify Ceph CSI is enabled for your cluster
* Contact support if Ceph CSI is not provisioned

## Network Policy Issues <a href="#k8stroubleshooting-networkpolicyissues" id="k8stroubleshooting-networkpolicyissues"></a>

### Pods Cannot Communicate Between Namespaces

**Symptoms**: Pods in different namespaces cannot reach each other.

**Understanding Rancher Project Network Isolation**:

* Pods in namespaces within the **same Rancher Project** can communicate by default
* Pods in namespaces in **different Rancher Projects** cannot communicate unless explicitly allowed

**Solution**:

1. **Option 1**: Move namespaces to the same Rancher Project (if appropriate)
2. **Option 2**: Create a NetworkPolicy to explicitly allow cross-project communication
3. **Note**: Pods in the "System" project can communicate with all other projects
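
Option 2 can look like the following sketch, which allows ingress into one namespace from pods in any namespace labeled `team=frontend` (all names and labels are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
  namespace: backend            # namespace receiving the traffic (placeholder)
spec:
  podSelector: {}               # applies to all pods in this namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              team: frontend    # hypothetical label on the source namespace
```

Check the source namespace's labels with `kubectl get namespace --show-labels` before writing the selector.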

### Pods Cannot Access External Services

**Symptoms**: Pods cannot reach internet or external services.

**Understanding Network Restrictions**:

* By default, pods can only reach services within the same VLAN
* Internet access requires proxy configuration or network opening

**Solution**:

1. **For internet access**: Configure HTTP proxy in your pods or request network opening through ITCare
2. **For specific external services**: Request network opening between VLANs through ITCare
3. **For external databases/APIs**: Verify network policies and firewall rules
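
For the proxy route, the usual approach is injecting proxy settings as environment variables; a sketch with a placeholder proxy URL (ask ITCare for the actual endpoint for your environment):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.0   # hypothetical image
      env:
        - name: HTTP_PROXY
          value: "http://proxy.example.internal:3128"   # placeholder proxy URL
        - name: HTTPS_PROXY
          value: "http://proxy.example.internal:3128"
        - name: NO_PROXY
          value: "localhost,127.0.0.1,.svc,.cluster.local"  # keep in-cluster traffic direct
```

The `NO_PROXY` entry matters: without it, in-cluster service calls may be sent through the proxy and fail.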

## Logging and Monitoring Issues <a href="#k8stroubleshooting-loggingandmonitoringissues" id="k8stroubleshooting-loggingandmonitoringissues"></a>

### Logs Not Appearing in OpenSearch/ELK

**Symptoms**: Application logs are not visible in your log aggregation platform.

**Diagnosis**:

```bash
# Check if logging pods are running
kubectl get pods -n cattle-logging-system

# Check buffer size (should not grow continuously)
kubectl -n cattle-logging-system get po -l app.kubernetes.io/name=fluentd -o name | \
  xargs -I {} sh -c "kubectl -n cattle-logging-system exec {} -c fluentd -- du -hs /buffers"
```

**Common Causes and Solutions**:

**1. Flow/Output Not Configured**

* Verify Flow and Output/ClusterOutput resources exist for your namespace
* Check configuration matches your OpenSearch cluster
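
A sketch of a namespace-scoped Flow wired to an Output, assuming the Banzai Cloud logging operator CRDs that Rancher's logging integration is based on; the endpoint, namespace, and labels are placeholders:

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: Output
metadata:
  name: my-opensearch
  namespace: my-namespace              # placeholders throughout
spec:
  elasticsearch:
    host: opensearch.example.internal  # placeholder endpoint; use your actual cluster
    port: 9200
    scheme: https
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: my-app-flow
  namespace: my-namespace
spec:
  match:
    - select:
        labels:
          app: my-app                  # only ship logs from matching pods
  localOutputRefs:
    - my-opensearch                    # must reference the Output above by name
```

A Flow only picks up logs from its own namespace; for cluster-wide shipping you would use ClusterFlow/ClusterOutput instead.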


**2. Conflicting Log Fields**

* OpenSearch/ELK rejects logs with field type conflicts
* Check fluentd logs for "Rejected" messages
* See detailed logging configuration in the "Get Started" guide

**3. Application Producing Malformed JSON**

* Ensure the application emits valid, consistently structured JSON log lines
* Consider excluding problematic pods from the logging Flow

## Migration Issues <a href="#k8stroubleshooting-migrationissues" id="k8stroubleshooting-migrationissues"></a>

### Application Fails After Migration to RKE2

**Symptoms**: Application worked on RKE but fails on RKE2.

**Common Causes and Solutions**:

**1. Deprecated API Versions**

* Run the `kubent` (kube-no-trouble) tool before migration to detect deprecated APIs
* Update manifests to use current API versions
* See [Kubernetes API Deprecation Guide](https://kubernetes.io/docs/reference/using-api/deprecation-guide/)
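
As an example, an Ingress manifest written against an API version removed in Kubernetes 1.22 would need updating like this (names and ports are illustrative):

```yaml
# Before: removed in Kubernetes 1.22
# apiVersion: networking.k8s.io/v1beta1
# kind: Ingress
# spec:
#   rules:
#     - http:
#         paths:
#           - backend:
#               serviceName: my-app
#               servicePort: 80

# After: current API version, with the restructured backend
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix          # required in networking.k8s.io/v1
            backend:
              service:
                name: my-app
                port:
                  number: 80
```

Note that the upgrade is not always a one-line `apiVersion` change: as here, field layouts can differ between versions.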

**2. CNI Differences**

* If migrating to Cilium from Canal, network policies might behave differently
* Review and test network policies after migration

**3. Missing ConfigMaps or Secrets**

* Verify all ConfigMaps and Secrets were migrated
* Check namespaces and names match exactly

**4. External Integration Issues**

* Update CI/CD pipelines with new cluster kubeconfig
* Reconfigure connections to Vault, databases, and other PaaS services
* Update monitoring and alerting integrations

## Getting Help <a href="#k8stroubleshooting-gettinghelp" id="k8stroubleshooting-gettinghelp"></a>

If you cannot resolve the issue using this guide:

### Self-Service Resources

* Check the [Kubernetes official documentation](https://kubernetes.io/docs/home/)
* Review [Rancher documentation](https://ranchermanager.docs.rancher.com/)
* Consult other sections of this documentation (Features, Get Started, etc.)

### Contact Support

* **For non-urgent issues**: Submit a ticket through [ITCare](https://itcare.cegedim.cloud)
* **For production incidents**: Contact the 24x7 support team (if you have the 24x7 monitoring option)
* **For migration assistance**: Submit a ticket with subject "RKE to RKE2 Migration Request"

### Information to Provide When Requesting Support

To help support diagnose your issue quickly, please provide:

1. **Cluster Information**:
   * Cluster name
   * Region (ET, EB)
   * Kubernetes version
2. **Problem Description**:
   * What you were trying to do
   * What happened vs. what you expected
   * When the issue started
   * Any recent changes (deployments, upgrades, configuration changes)
3. **Relevant Details**:
   * Namespace and resource names affected
   * Error messages from kubectl or Rancher UI
   * Output of relevant kubectl describe commands
   * Screenshots of Rancher UI errors (if applicable)
4. **Troubleshooting Already Performed**:
   * Steps you've already tried
   * Results of those attempts

{% hint style="success" %}
Most issues can be resolved quickly with proper diagnostics. Gathering the relevant information before submitting a support ticket helps the support team assist you faster!
{% endhint %}
