Chef is a core configuration management tool for many production environments, but HA cluster failures can be difficult to diagnose because operational guidance is sparse.
This note collects commands and recovery steps for a high-availability Chef cluster on AWS. The same patterns apply on other cloud platforms such as DigitalOcean or Google Cloud.
If you are planning a new deployment, review the AWS Native Chef Server Cluster reference architecture first. It was designed and tested for AWS services.
Operational context
- When to use this: Chef Server is degraded, nodes cannot converge reliably, or the backend cluster reports unhealthy Elasticsearch or service status.
- What it reduces: prolonged configuration-management outage caused by shard allocation, backend service drift, or unclear cluster state.
- Tradeoff: these commands are recovery tools, not a replacement for backups, monitoring, and tested failover procedures.
- Version scope: the Elasticsearch commands apply to Chef Server deployments that actually use an accessible Elasticsearch backend for search. Newer Chef Automate, externalized search, or different backend topologies may require different commands.
Start with service health
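Run the standard Chef service checks before changing the backend. status and tail are read-only; reconfigure re-applies the server configuration and may restart services, so run it only if you suspect configuration drift:
sudo chef-server-ctl status
sudo chef-server-ctl tail
sudo chef-server-ctl reconfigure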
If the installation uses Automate or a split backend/frontend topology, run equivalent status checks on the nodes that host PostgreSQL, Elasticsearch/OpenSearch, RabbitMQ, and Bookshelf.
Look for:
- failed or restarting Elasticsearch services
- disk pressure on backend nodes
- shard allocation warnings
- frontend nodes failing to reach backend services
- repeated 5xx errors from Chef API endpoints
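Disk pressure is the easiest item on this list to check mechanically. The following is a minimal sketch, not part of any Chef tooling: disk_over_threshold is a hypothetical helper that reads df -P output on stdin and prints mount points at or above a percentage. The default of 85 mirrors Elasticsearch's default low disk watermark, but your cluster settings may differ.

```shell
# Print mount points whose usage is at or above a threshold percentage.
# Reads `df -P` style output on stdin so it can be exercised without a live host.
# disk_over_threshold is a hypothetical helper name; 85 is an assumed default.
disk_over_threshold() {
  threshold="${1:-85}"
  awk -v limit="$threshold" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)          # strip the percent sign from the Capacity column
      if (use + 0 >= limit + 0) print $6 " " use "%"
    }
  '
}

# Usage on a backend node:
#   df -P | disk_over_threshold 85
```

Mount points that contain spaces are not handled by this sketch.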
Check Elasticsearch cluster health
Chef Server historically used Elasticsearch for search. When search is red or shards are unassigned, Chef can appear partially healthy while searches, roles, environments, or node queries fail.
First confirm where the search backend runs and whether it is reachable from the node where you execute these commands. Do not assume Elasticsearch is listening on 127.0.0.1:9200 in every Chef deployment.
curl -s "http://127.0.0.1:9200/_cluster/health?pretty"
curl -s "http://127.0.0.1:9200/_cat/indices?v"
curl -s "http://127.0.0.1:9200/_cat/shards?v"
Important fields:
- status: green is healthy, yellow means replicas are missing, red means primary shards are unavailable.
- unassigned_shards: any non-zero value needs investigation.
- _cat/shards: shows which index and shard failed to allocate.
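For scripts or monitoring hooks, the status field can be reduced to an exit code. This is a rough sketch that avoids a jq dependency by matching the field with sed; health_status and health_exit_code are hypothetical helper names, and the exit-code mapping is an assumption chosen for illustration.

```shell
# Extract the "status" value from _cluster/health JSON read on stdin.
# Works on both pretty-printed and single-line responses.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Map a status string to an exit code: 0 green, 1 yellow, 2 red or unknown.
health_exit_code() {
  case "$1" in
    green)  return 0 ;;
    yellow) return 1 ;;
    *)      return 2 ;;
  esac
}

# Usage:
#   status=$(curl -s "http://127.0.0.1:9200/_cluster/health" | health_status)
#   health_exit_code "$status" || echo "search backend is $status"
```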
Investigate unassigned shards
Ask Elasticsearch why a shard is unassigned:
curl -s -XGET "http://127.0.0.1:9200/_cluster/allocation/explain?pretty" \
-H "Content-Type: application/json" \
-d '{}'
On some Elasticsearch versions this empty request returns a useful explanation for the first unassigned shard. On others, you may need to pass the index, shard, and primary/replica details from _cat/shards.
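When the empty request is not enough, the explicit form looks roughly like the sketch below. explain_body is a hypothetical helper that builds the request body from the index, shard, and primary fields accepted by the allocation explain API; the index name chef is a placeholder, so take real values from _cat/shards.

```shell
# Build the JSON body for _cluster/allocation/explain.
# Arguments: index name, shard number, and "true" for a primary or "false" for a replica.
# No escaping is done, so this assumes a plain index name without quotes.
explain_body() {
  printf '{"index":"%s","shard":%s,"primary":%s}\n' "$1" "$2" "$3"
}

# Usage ("chef" is a placeholder index name):
#   curl -s -XGET "http://127.0.0.1:9200/_cluster/allocation/explain?pretty" \
#     -H "Content-Type: application/json" \
#     -d "$(explain_body chef 0 true)"
```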
Common causes include low disk watermark, missing nodes, incompatible shard copies, or previous allocation failures.
If disk pressure is the cause, free space first. Do not keep forcing allocation while the cluster is over its high watermark.
Retry failed allocation
After fixing the underlying cause, retry failed allocations:
curl -s -XPOST "http://127.0.0.1:9200/_cluster/reroute?retry_failed=true&pretty"
Then re-check health:
curl -s "http://127.0.0.1:9200/_cluster/health?pretty"
curl -s "http://127.0.0.1:9200/_cat/shards?v"
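Shard recovery is rarely instant, so re-checking usually means polling. A minimal sketch of that loop follows, assuming a hypothetical wait_for_green helper that takes the health command as its trailing arguments; the attempt count and delay in the usage line are illustrative values only.

```shell
# Poll a health command until it reports green or the attempts run out.
# $1: maximum attempts, $2: seconds between attempts,
# remaining arguments: a command that prints _cluster/health JSON.
wait_for_green() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" | grep -q '"status"[[:space:]]*:[[:space:]]*"green"'; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt remains.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Usage:
#   wait_for_green 30 10 curl -s "http://127.0.0.1:9200/_cluster/health" \
#     || echo "cluster did not reach green"
```

Elasticsearch also accepts wait_for_status and timeout query parameters on _cluster/health, which may be simpler when a single blocking request is acceptable.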
Reindex Chef search data
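If Elasticsearch is healthy but Chef search still behaves incorrectly, rebuild the Chef search index. On Chef Server 12 and later the command takes an organization name, or --all-orgs to reindex every organization; check the usage output for your version:
sudo chef-server-ctl reindex --all-orgs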
Run this during a controlled maintenance window on larger installations. Reindexing can add load and may take time depending on node, role, cookbook, and data bag volume.
Verify Chef API behavior
Once services recover, validate the Chef API from an admin workstation:
knife status
knife node list
knife search node '*:*' -i
Then confirm at least one test node can complete a converge:
sudo chef-client
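The validation commands above can be wrapped in a small runner so that a failed check is hard to miss. This is only a sketch: run_checks is a hypothetical helper, it discards command output, and it treats any non-zero exit status as a failure.

```shell
# Run each check command in turn; print PASS or FAIL per check
# and return non-zero if any check failed.
run_checks() {
  failed=0
  for check in "$@"; do
    if sh -c "$check" >/dev/null 2>&1; then
      echo "PASS: $check"
    else
      echo "FAIL: $check"
      failed=1
    fi
  done
  return "$failed"
}

# Usage from an admin workstation:
#   run_checks "knife status" "knife node list" "knife search node '*:*' -i"
```

An exit status of zero only shows that the command ran; it does not prove the results are current, so spot-check the search output by hand as well.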
Recovery checklist
- Confirm backend disk, memory, and service health.
- Confirm Elasticsearch cluster status is not red.
- Retry shard allocation only after fixing the cause.
- Reindex Chef search if API search behavior is still stale.
- Run knife status, knife search, and a real chef-client converge.
- Capture the incident cause so the next recovery is not command archaeology.
Longer-term fixes
If this class of failure repeats, treat it as an infrastructure reliability issue rather than a Chef quirk. Add monitoring for disk watermarks, cluster health, backend service restarts, and failed converges. Test backup restore and failover procedures before the next outage.
Related work
This recovery note supports platform automation and incident recovery work in Selected operational work.