Fixing Chef cluster issues | Alex Budurovici

Chef is a core configuration management tool for many production environments, but HA cluster failures can be difficult to diagnose because operational guidance is sparse.

This note collects commands and recovery steps for a high-availability Chef cluster on AWS. The same patterns apply on other cloud platforms such as DigitalOcean or Google Cloud.

If you are planning a new deployment, review the AWS Native Chef Server Cluster reference architecture first. It was designed and tested for AWS services.

Operational context

When to use this: Chef Server is degraded, nodes cannot converge reliably, or the backend cluster reports unhealthy Elasticsearch or service status.
What it reduces: prolonged configuration-management outage caused by shard allocation, backend service drift, or unclear cluster state.
Tradeoff: these commands are recovery tools, not a replacement for backups, monitoring, and tested failover procedures.
Version scope: the Elasticsearch commands apply to Chef Server deployments that actually use an accessible Elasticsearch backend for search. Newer Chef Automate, externalized search, or different backend topologies may require different commands.

Start with service health

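Run the standard Chef checks before changing the backend. status and tail are read-only (tail follows the logs until interrupted). reconfigure reapplies the current configuration and may restart services, so run it last and only when the status output shows stopped services or configuration drift:

sudo chef-server-ctl status
sudo chef-server-ctl tail
sudo chef-server-ctl reconfigure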

If the installation uses Automate or a split backend/frontend topology, run equivalent status checks on the nodes that host PostgreSQL, Elasticsearch/OpenSearch, RabbitMQ, and Bookshelf.

Look for:

failed or restarting Elasticsearch services
disk pressure on backend nodes
shard allocation warnings
frontend nodes failing to reach backend services
repeated 5xx errors from Chef API endpoints
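
As a rough aid for the first item in that list, the status output can be filtered down to the services that are not running. This is a minimal sketch, assuming the usual runit line format printed by chef-server-ctl status; list_down_services is a hypothetical helper name, not part of Chef.

```shell
# List services that runit reports as anything other than "run".
# Reads `chef-server-ctl status` output on stdin. Assumes lines such as:
#   run: nginx: (pid 1234) 5678s; run: log: (pid 1233) 5678s
#   down: opscode-erchef: 3s, normally up; run: log: (pid 999) 10s
list_down_services() {
  awk '$1 != "run:" && $2 ~ /:$/ { sub(/:$/, "", $2); print $2 }'
}

# Usage (on a Chef Server node):
#   sudo chef-server-ctl status | list_down_services
```

An empty result only means runit considers every service up. It does not prove the services are answering requests.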

Check Elasticsearch cluster health

Chef Server historically used Elasticsearch for search. When cluster health is red or shards are unassigned, Chef can appear partially healthy while searches over nodes, roles, or environments fail or return incomplete results.

First confirm where the search backend runs and whether it is reachable from the node where you execute these commands. Do not assume Elasticsearch is listening on 127.0.0.1:9200 in every Chef deployment.

curl -s "http://127.0.0.1:9200/_cluster/health?pretty"
curl -s "http://127.0.0.1:9200/_cat/indices?v"
curl -s "http://127.0.0.1:9200/_cat/shards?v"

Important fields:

status: green means every shard is allocated, yellow means all primaries are allocated but at least one replica is not, red means at least one primary shard is unallocated.
unassigned_shards: any non-zero value needs investigation.
_cat/shards: shows which index and shard failed to allocate.
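
The fields above can be reduced to something a script or monitoring check can act on. The sketch below assumes the _cat/health and _cat/shards endpoints with explicit column selection (h=...). The function names and the ES_URL variable are illustrative, not part of Chef or Elasticsearch.

```shell
# Map an Elasticsearch health color to an exit code:
# 0 = green, 1 = yellow, 2 = red, 3 = anything else (empty, unreachable).
health_exit_code() {
  case "$1" in
    green)  return 0 ;;
    yellow) return 1 ;;
    red)    return 2 ;;
    *)      return 3 ;;
  esac
}

# Print "index shard prirep reason" for every unassigned shard.
# Reads _cat/shards output requested with
#   h=index,shard,prirep,state,unassigned.reason
unassigned_shards() {
  awk '$4 == "UNASSIGNED" { print $1, $2, $3, $5 }'
}

# Usage (ES_URL is an assumption; point it at your search backend):
#   ES_URL=http://127.0.0.1:9200
#   health_exit_code "$(curl -s "$ES_URL/_cat/health?h=status" | tr -d '[:space:]')"
#   curl -s "$ES_URL/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | unassigned_shards
```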

Investigate unassigned shards

Ask Elasticsearch why a shard is unassigned:

curl -s -XGET "http://127.0.0.1:9200/_cluster/allocation/explain?pretty" \
  -H "Content-Type: application/json" \
  -d '{}'

On some Elasticsearch versions this empty request returns a useful explanation for the first unassigned shard. On others, you may need to pass the index, shard, and primary/replica details from _cat/shards.
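
When the empty request is not enough, the request body can be built from one row of _cat/shards. A minimal sketch follows. explain_body is a hypothetical helper, and the index name chef in the usage comment is only an example value, so substitute whatever _cat/shards reports.

```shell
# Build a request body for _cluster/allocation/explain.
# Arguments: index name, shard number, and the prirep column from
# _cat/shards ("p" for primary, anything else is treated as a replica).
explain_body() {
  case "$3" in
    p) primary=true ;;
    *) primary=false ;;
  esac
  printf '{"index":"%s","shard":%s,"primary":%s}\n' "$1" "$2" "$primary"
}

# Usage (ES_URL and the index name are assumptions):
#   curl -s -XGET "$ES_URL/_cluster/allocation/explain?pretty" \
#     -H "Content-Type: application/json" \
#     -d "$(explain_body chef 0 p)"
```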

Common causes include low disk watermark, missing nodes, incompatible shard copies, or previous allocation failures.

If disk pressure is the cause, free space first. Do not keep forcing allocation while the cluster is over its high watermark.
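
To see which nodes are over the watermark as Elasticsearch itself measures disk usage, the _cat/allocation output can be filtered. This sketch assumes the stock high watermark of 90 percent. If the cluster settings override it, pass the configured value as the first argument. nodes_over_watermark is a hypothetical helper name.

```shell
# Print nodes whose disk usage is at or above a threshold (default 90).
# Reads _cat/allocation output requested with h=disk.percent,node
# Rows without a numeric percentage (such as unassigned shards) are skipped.
nodes_over_watermark() {
  awk -v limit="${1:-90}" \
    '$1 ~ /^[0-9]+$/ && $1 + 0 >= limit + 0 { print $2, $1 "%" }'
}

# Usage (ES_URL is an assumption):
#   curl -s "$ES_URL/_cat/allocation?h=disk.percent,node" | nodes_over_watermark
#   curl -s "$ES_URL/_cat/allocation?h=disk.percent,node" | nodes_over_watermark 85
```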

Retry failed allocation

After fixing the underlying cause, retry failed allocations:

curl -s -XPOST "http://127.0.0.1:9200/_cluster/reroute?retry_failed=true&pretty"

Then re-check health:

curl -s "http://127.0.0.1:9200/_cluster/health?pretty"
curl -s "http://127.0.0.1:9200/_cat/shards?v"
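
Shard recovery is not instant, so a single re-check can be misleading. The loop below is a sketch of polling until the cluster reports green or a retry budget runs out. wait_for_green is a hypothetical helper, and the command that prints the health color is passed in, so the endpoint is not hard-coded.

```shell
# Poll a status command until it prints "green" or attempts run out.
# $1 = max attempts, $2 = seconds between attempts,
# remaining arguments = command that prints the current health color.
wait_for_green() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    status=$("$@" | tr -d '[:space:]')
    if [ "$status" = "green" ]; then
      echo "green after $i check(s)"
      return 0
    fi
    echo "check $i: status=${status:-unknown}" >&2
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (ES_URL is an assumption): up to 30 checks, 10 seconds apart.
#   wait_for_green 30 10 curl -s "$ES_URL/_cat/health?h=status"
```

A non-zero return means the cluster never reached green in the allotted time. On a single-node search backend with replicas configured, yellow may be the best achievable state, so adjust the target accordingly.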

Reindex Chef search data

If Elasticsearch is healthy but Chef search still behaves incorrectly, rebuild the Chef search index:

sudo chef-server-ctl reindex

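Run this during a controlled maintenance window on larger installations. Reindexing adds load and can take a long time depending on node, role, cookbook, and data bag volume. Some Chef Server versions expect an organization name or an all-organizations option on this command, so check the chef-server-ctl reference for your version before running it.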

Verify Chef API behavior

Once services recover, validate the Chef API from an admin workstation:

knife status
knife node list
knife search node '*:*' -i

Then confirm at least one test node can complete a converge:

sudo chef-client
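
The validation steps can be wrapped in a small pass/fail harness so the result is easy to paste into an incident log. This is a sketch. check and failures are hypothetical names, and a PASS only means the command exited successfully, not that its results are complete.

```shell
# Run one labelled check, discard its output, and count failures.
failures=0
check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $label"
  else
    echo "FAIL $label"
    failures=$((failures + 1))
  fi
}

# Usage (from an admin workstation with a working knife configuration):
#   check "knife status"  knife status
#   check "node list"     knife node list
#   check "node search"   knife search node '*:*' -i
#   [ "$failures" -eq 0 ] || echo "$failures check(s) failed"
```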

Recovery checklist

Confirm backend disk, memory, and service health.
Confirm Elasticsearch cluster status is not red.
Retry shard allocation only after fixing the cause.
Reindex Chef search if API search behavior is still stale.
Run knife status, knife search, and a real chef-client converge.
Capture the incident cause so the next recovery is not command archaeology.

Longer-term fixes

If this class of failure repeats, treat it as an infrastructure reliability issue rather than a Chef quirk. Add monitoring for disk watermarks, cluster health, backend service restarts, and failed converges. Test backup restore and failover procedures before the next outage.

Related work

This recovery note supports the platform automation and incident recovery work listed under Selected operational work.