AOS/ Nutanix/ Nutanix AHV/ Nutanix Cloud Platform

Nutanix Cluster Troubleshooting

Alexander Ervik Johnsen AHV, allssh, CVM, genesis, NCC, Nutanix, Nutanix Cluster, Nutanix Services, troubleshooting 2022-09-06

In this article you will learn basic Nutanix Cluster Troubleshooting tips and tricks. Learn how to identify which Nutanix Cluster Services is down.

Check back from time to time, as this is a article that is updated on a regular basis.

If you run into issues with the Services in the Nutanix Cluster or the Nutanix CVM it´s good to know what to look for and troubleshoot it. A normal operating Nutanix cluster / CVM services will report that all services are running. Let´s first learn how to check that all is good or not.

The NCC health check cluster_services_down_check verifies if the cluster services are running on all Controller VMs in the cluster. If any cluster service is not running on any of the Controller VMs in the cluster, this check fails.

Running the NCC Check in a Nutanix Cluster
You can run this check as part of the complete NCC Health Checks:

nutanix@cvm$ ncc health_checks run_all

Or you can run this check separately:

nutanix@cvm$ ncc health_checks system_checks cluster_services_down_check

You can also run the checks from the Prism web console Health page: select Actions > Run Checks. Select All checks and click Run.

Please note that this check is scheduled to run every minute, by default.

Here is a list of some issues you might run into performing different operations to the Nutanix Cluster.

Check Nutanix Services in a Nutanix Cluster

Let´s check the status of each Nutanix Service.

Run the following command to confirm if any service is currently down:

nutanix@cvm$ cluster status | grep -v UP

This picture shows a normal operating Nutanix Cluster

If any services are listed as DOWN, start those services by using the following command:

nutanix@cvm$ cluster start

Here is a list of possible issues one can run into in different scenarios.

Upgrade stuck due to Genesis not able to start services after Cassandra service

Resolution : To resolve the Upgrade stuck due to Genesis not able to start services after Cassandra service – Run following command from any Nutanix CVM in the cluster.

nutanix@cvm$ genesis statusornutanix@cvm$ gs

nutanix@cvm$ allssh 'genesis restart'

Using the following command is a simple way to check to see if any services are crashing on a single node in the cluster.

nutanix@cvm$ watch -d genesis status

Note: if the cluster is large then you may have to execute the command one node at a time, hence it may be a wise decision to correlate with the alert on what CVM so as to check to see which services are reported to be crashing.

Unreachable DNS server can prevent 2 node clusters from starting services after failure

Resolution : To resolve the Unreachable DNS server can prevent 2 node clusters from starting services after failure – check the DNS / Name server entry in CVM configuration file and check connectivity.

Command 1: Check DNS / Name server entry on cluster configuration

nutanix@cvm:~$ zeus_config_printer | grep name_server

Command 2: then check the DNS / Name server entry in all CVMs configuration file.

nutanix@cvm:~$ allssh "cat /etc/resolv.conf"

If DNS entry is not found then add DNS server IP address / host name from Prism as showing following screenshot.

Make sure DNS server is reachable before putting DNS IP address / host name.

Expand CVM Memory

Some times you want to check how much memory is allocated to the Nutanix CVM via CLI. There is also an option in Nutanix Prism to expand memory allocation for the CVM.

Step 1: check the Nutanix CVM Memory allocation that must be at least 24 GB or greater would be fine.

nutanix@cvm$ free -m

Step 2: If Nutanix CVM Memory allocation is less then 24 GB then need to scale-up the memory to at least 24 GB or greater.

Increase / scale-up Nutlanix CVM memory from Prism console

Step 3: Restart Genesis service on all Nutanix CVMs

nutanix@cvm$ allssh genesis restart
nutanix@cvm$ allssh genesis stop prism
nutanix@cvm$ cluster start

Nutanix CVM / Cluster Services are down

Let’s troubleshoot the Nutanix Cluster / CVM services down issue.

Here are a list of the most Critical Nutanix services:

acropolis
andruil
aplos
aplos_engine
catalog
cluster_config
cluster_sync
delphi
ergon
flow
lazan
minerva_cvm
snmp_manager
sys_stat_collector
uhura
xtrim

Resolution: check the Nutanix CVM / Cluster services status and restart them.

Step 1: Check Nutanix CVM / Cluster services status

nutanix@CVM$ ncc health_checks run_all
nutanix@CVM$ ncc health_checks system_checks cluster_services_status
nutanix@CVM$ ncc health_checks system_checks cvm_services_status
nutanix@cvm$ ncc health_checks hypervisor_checks check_services
nutanix@cvm$ ncc health_checks system_checks cluster_services_down_check

The NCC health check cluster_services_status verifies if the Controller VM (CVM) services have restarted recently across the cluster.

The following services are checked:

alert_manager
arithmos
cassandra_monitor
cerebro
chronos_node_main
cluster_manager_monitor
hyperint_monitor
pithos
prism_monitor
stargate
stargate_monitor_main
stats_aggregator_monitor
zookeeper_monitor
curator

Step 2: Shortlist the down services on all Nutanix CVM

nutanix@pcvm$ cluster status | grep -v UP

Step 3: Start Nutanix CVM / Cluster services

nutanix@pcvm$ cluster start

Note: Above command will not impact your production running VMs.

Optional Step 4: If step 3 command does not start the down services then you can reboot your either Nutlanix Node or Nutanix CVM.

Step 4.1: Reboot Nutanix CVM

nutanix@cvm$ cvm_shutdown -r now

Step 4.1.1: OR Shutdown Nutanix CVM

nutanix@cvm$ cvm_shutdown -P now

Step 4.1.2: Power-on the Shudown Nutanix CVM

SSH to Nuanix AHV host

root# virsh list --all | grep CVM

In output you will see CVM Name, just copy it and run following command to start the Nutanix CVM

root# virsh start <CVM_Name>

Wait for 5 Minutes to boot-up the Nutanix CVM and services.

OR Step 4.2 : You can put your host in maintenance mode and then reboot node

Final Step : Now check Nutanix cluster status and running services.

nutanix@pcvm$ cluster status

Nutanix Gateway not reachable. Http request error

Resolution: Need to restart the Nutanix Console services on the host, which is Prism leader.

Step 1: Find the Nutanix Prism Leader – Verify which cluster node is the Prism leader, that is, the CVM running the Prism container services.

nutanix@cvm$ curl http://0:2019/prism/leader && echo

Output should look similar as following

{"leader":"xx.xx.xx.10:9080", "is_local":false}

It means xx.xx.xx.10 CVM is the Prism Leader.

Step 2: SSH to Prism Leader and run the following command to restart Prism service.

nutanix@cvm$ genesis stop prism 
nutanix@cvm$ cluster start

Note: There is no impact on running production of above commands.

Tasks

The following command can be used to check out the tasks in the Prism web console that have not been completed.

nutanix@cvm$ ecli task.list include_completed=false

LCM upgrade fails with error “Services not up” on a 2-node cluster

Resolution: Above both services run in LCM framework.

This is known issue. therefore it is recommended to upgrade Nutanix NCC a

Upgrades

Upgrades at Nutanix are always designed to be done without needing any downtime for User VMs and their workloads. The following article is intended to serve as an introduction describing how each type of upgrade works and to share some useful best practices for administrators. You will find similar information in the Acropolis Upgrade Guide (remember to always choose the guide that matches the AOS currently running on your cluster). The link to the article can be found here.

Check upgrade status

nutanix@cvm$ upgrade_status

Hypervisor upgrade status

Description: Check hypervisor upgrade status from the CLI on any CVM

nutanix@cvm$ host_upgrade_status

LCM upgrade status

nutanix@cvm$ lcm_upgrade_status

Detailed logs (on every CVM)

nutanix@cvm$ less ~/data/logs/host_upgrade.out

Note: In order to understand how upgrades work, please refer to the following link to the article which can be found here.

Logs

Find cluster error logs

Description: Find ERROR logs for the cluster

nutanix@cvm$ allssh "cat ~/data/logs/<COMPONENT NAME or *>.ERROR"

Example for Stargate

nutanix@cvm$ allssh "cat ~/data/logs/stargate.ERROR"

Find cluster fatal logs

Description: Find FATAL logs for the cluster

nutanix@cvm$ allssh "cat ~/data/logs/<COMPONENT NAME or *>.FATAL"

Example for Stargate

nutanix@cvm$ allssh "cat ~/data/logs/stargate.FATAL"

Similarly, you can also run the following script to list the fatals across all the nodes in the cluster

nutanix@cvm$ for i in `svmips`; do echo "CVM: $i"; ssh $i "ls -ltr /home/nutanix/data/logs/*.FATAL"; done

Here is a list of KBs that might come in handy when doing Nutanix Cluster Troubleshooting

Maintenance

The following article provides commands for entering and exiting maintenance mode for CVM (Controller VM) and hypervisor. The link can be found here.

How to Shutdown Nutanix CVM and Nutanix AHV Host

How to Fix a Nutanix CVM being Stuck in Maintenance Mode

cvm_services_status verifies if a service has crashed and generated a core dump in the last 15 minutes.	KB 2472
cluster_services_status verifies if the Controller VM (CVM) services have restarted recently across the cluster.	KB 3378

check_ntp verifies the NTP configuration of the CVMs (Controller VMs) and hypervisor hosts, and also checks if there are any time drifts on the cluster.

KB 4519