Nutanix Cluster Troubleshooting
In this article you will learn basic Nutanix Cluster Troubleshooting tips and tricks. Learn how to identify which Nutanix Cluster Services is down.
Check back from time to time, as this is a article that is updated on a regular basis.
If you run into issues with the Services in the Nutanix Cluster or the Nutanix CVM it´s good to know what to look for and troubleshoot it. A normal operating Nutanix cluster / CVM services will report that all services are running. Let´s first learn how to check that all is good or not.
The NCC health check cluster_services_down_check verifies if the cluster services are running on all Controller VMs in the cluster. If any cluster service is not running on any of the Controller VMs in the cluster, this check fails.
Running the NCC Check in a Nutanix Cluster
You can run this check as part of the complete NCC Health Checks:
nutanix@cvm$ ncc health_checks run_all
Or you can run this check separately:
nutanix@cvm$ ncc health_checks system_checks cluster_services_down_check
You can also run the checks from the Prism web console Health page: select Actions > Run Checks. Select All checks and click Run.
Please note that this check is scheduled to run every minute, by default.
Here is a list of some issues you might run into performing different operations to the Nutanix Cluster.
Check Nutanix Services in a Nutanix Cluster
Let´s check the status of each Nutanix Service.
Run the following command to confirm if any service is currently down:
nutanix@cvm$ cluster status | grep -v UP
If any services are listed as DOWN, start those services by using the following command:
nutanix@cvm$ cluster start
Here is a list of possible issues one can run into in different scenarios.
Upgrade stuck due to Genesis not able to start services after Cassandra service
Resolution : To resolve the Upgrade stuck due to Genesis not able to start services after Cassandra service – Run following command from any Nutanix CVM in the cluster.
nutanix@cvm$ genesis statusornutanix@cvm$ gs
nutanix@cvm$ allssh 'genesis restart'
Using the following command is a simple way to check to see if any services are crashing on a single node in the cluster.
nutanix@cvm$ watch -d genesis status
Note: if the cluster is large then you may have to execute the command one node at a time, hence it may be a wise decision to correlate with the alert on what CVM so as to check to see which services are reported to be crashing.
Unreachable DNS server can prevent 2 node clusters from starting services after failure
Resolution : To resolve the Unreachable DNS server can prevent 2 node clusters from starting services after failure – check the DNS / Name server entry in CVM configuration file and check connectivity.
Command 1: Check DNS / Name server entry on cluster configuration
nutanix@cvm:~$ zeus_config_printer | grep name_server
Command 2: then check the DNS / Name server entry in all CVMs configuration file.
nutanix@cvm:~$ allssh "cat /etc/resolv.conf"
If DNS entry is not found then add DNS server IP address / host name from Prism as showing following screenshot.
Make sure DNS server is reachable before putting DNS IP address / host name.
Expand CVM Memory
Some times you want to check how much memory is allocated to the Nutanix CVM via CLI. There is also an option in Nutanix Prism to expand memory allocation for the CVM.
Step 1: check the Nutanix CVM Memory allocation that must be at least 24 GB or greater would be fine.
nutanix@cvm$ free -m
Step 2: If Nutanix CVM Memory allocation is less then 24 GB then need to scale-up the memory to at least 24 GB or greater.
Increase / scale-up Nutlanix CVM memory from Prism console
Step 3: Restart Genesis service on all Nutanix CVMs
nutanix@cvm$ allssh genesis restart nutanix@cvm$ allssh genesis stop prism nutanix@cvm$ cluster start
Nutanix CVM / Cluster Services are down
Let’s troubleshoot the Nutanix Cluster / CVM services down issue.
Here are a list of the most Critical Nutanix services:
- acropolis
- andruil
- aplos
- aplos_engine
- catalog
- cluster_config
- cluster_sync
- delphi
- ergon
- flow
- lazan
- minerva_cvm
- snmp_manager
- sys_stat_collector
- uhura
- xtrim
Resolution: check the Nutanix CVM / Cluster services status and restart them.
Step 1: Check Nutanix CVM / Cluster services status
nutanix@CVM$ ncc health_checks run_all nutanix@CVM$ ncc health_checks system_checks cluster_services_status nutanix@CVM$ ncc health_checks system_checks cvm_services_status nutanix@cvm$ ncc health_checks hypervisor_checks check_services nutanix@cvm$ ncc health_checks system_checks cluster_services_down_check
The NCC health check cluster_services_status verifies if the Controller VM (CVM) services have restarted recently across the cluster.
The following services are checked:
- alert_manager
- arithmos
- cassandra_monitor
- cerebro
- chronos_node_main
- cluster_manager_monitor
- hyperint_monitor
- pithos
- prism_monitor
- stargate
- stargate_monitor_main
- stats_aggregator_monitor
- zookeeper_monitor
- curator
Step 2: Shortlist the down services on all Nutanix CVM
nutanix@pcvm$ cluster status | grep -v UP
Step 3: Start Nutanix CVM / Cluster services
nutanix@pcvm$ cluster start
Note: Above command will not impact your production running VMs.
Optional Step 4: If step 3 command does not start the down services then you can reboot your either Nutlanix Node or Nutanix CVM.
Step 4.1: Reboot Nutanix CVM
nutanix@cvm$ cvm_shutdown -r now
Step 4.1.1: OR Shutdown Nutanix CVM
nutanix@cvm$ cvm_shutdown -P now
Step 4.1.2: Power-on the Shudown Nutanix CVM
SSH to Nuanix AHV host
root# virsh list --all | grep CVM
In output you will see CVM Name, just copy it and run following command to start the Nutanix CVM
root# virsh start <CVM_Name>
Wait for 5 Minutes to boot-up the Nutanix CVM and services.
OR Step 4.2 : You can put your host in maintenance mode and then reboot node
Final Step : Now check Nutanix cluster status and running services.
nutanix@pcvm$ cluster status
Nutanix Gateway not reachable. Http request error
Resolution: Need to restart the Nutanix Console services on the host, which is Prism leader.
Step 1: Find the Nutanix Prism Leader – Verify which cluster node is the Prism leader, that is, the CVM running the Prism container services.
nutanix@cvm$ curl http://0:2019/prism/leader && echo
Output should look similar as following
{"leader":"xx.xx.xx.10:9080", "is_local":false}
It means xx.xx.xx.10 CVM is the Prism Leader.
Step 2: SSH to Prism Leader and run the following command to restart Prism service.
nutanix@cvm$ genesis stop prism nutanix@cvm$ cluster start
Note: There is no impact on running production of above commands.
Tasks
The following command can be used to check out the tasks in the Prism web console that have not been completed.
nutanix@cvm$ ecli task.list include_completed=false
LCM upgrade fails with error “Services not up” on a 2-node cluster
Resolution: Above both services run in LCM framework.
This is known issue. therefore it is recommended to upgrade Nutanix NCC a
Upgrades
Upgrades at Nutanix are always designed to be done without needing any downtime for User VMs and their workloads. The following article is intended to serve as an introduction describing how each type of upgrade works and to share some useful best practices for administrators. You will find similar information in the Acropolis Upgrade Guide (remember to always choose the guide that matches the AOS currently running on your cluster). The link to the article can be found here.
Check upgrade status
nutanix@cvm$ upgrade_status
Hypervisor upgrade status
Description: Check hypervisor upgrade status from the CLI on any CVM
nutanix@cvm$ host_upgrade_status
LCM upgrade status
nutanix@cvm$ lcm_upgrade_status
Detailed logs (on every CVM)
nutanix@cvm$ less ~/data/logs/host_upgrade.out
Note: In order to understand how upgrades work, please refer to the following link to the article which can be found here.
Logs
Find cluster error logs
Description: Find ERROR logs for the cluster
nutanix@cvm$ allssh "cat ~/data/logs/<COMPONENT NAME or *>.ERROR"
Example for Stargate
nutanix@cvm$ allssh "cat ~/data/logs/stargate.ERROR"
Find cluster fatal logs
Description: Find FATAL logs for the cluster
nutanix@cvm$ allssh "cat ~/data/logs/<COMPONENT NAME or *>.FATAL"
Example for Stargate
nutanix@cvm$ allssh "cat ~/data/logs/stargate.FATAL"
Similarly, you can also run the following script to list the fatals across all the nodes in the cluster
nutanix@cvm$ for i in `svmips`; do echo "CVM: $i"; ssh $i "ls -ltr /home/nutanix/data/logs/*.FATAL"; done
Here is a list of KBs that might come in handy when doing Nutanix Cluster Troubleshooting
Maintenance
The following article provides commands for entering and exiting maintenance mode for CVM (Controller VM) and hypervisor. The link can be found here.
How to Shutdown Nutanix CVM and Nutanix AHV Host
How to Fix a Nutanix CVM being Stuck in Maintenance Mode
cvm_services_status verifies if a service has crashed and generated a core dump in the last 15 minutes. | KB 2472 |
cluster_services_status verifies if the Controller VM (CVM) services have restarted recently across the cluster. | KB 3378 |
check_ntp verifies the NTP configuration of the CVMs (Controller VMs) and hypervisor hosts, and also checks if there are any time drifts on the cluster. | KB 4519 |