Optimizing ETCD Performance: Compaction, Defragmentation, and Tuning in OpenShift (4.16)

What static factors influence the load on etcd? (2 examples)

Number of nodes
Number of pods

What dynamic factors influence the load on etcd? (3 examples)

Changes in endpoints (pod scaling, HPA, etc.)
Pod restarts
Job executions

Does etcd maintain historical key values?

Yes. Etcd stores historical key values until compaction is performed.

What does history compaction in etcd do?

Removes old key versions

What is the effect of etcd history compaction?

Reduces database size
Improves performance

Does history compaction return reclaimed space to the filesystem?

No. It only marks the space as unused. Therefore, defragmentation is necessary.
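
The gap between allocated and used space can be estimated from two metrics that etcd exposes, etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes. The following is a minimal sketch only: the function name frag_percent is hypothetical, and the two byte values are assumed to have been read from those metrics beforehand.

```shell
# frag_percent TOTAL_BYTES IN_USE_BYTES
# Prints the share of the database file that is allocated but unused,
# as a whole-number percentage (integer division, rounded down).
# Hypothetical helper; the inputs are assumed to come from the etcd metrics.
frag_percent() {
  total=$1
  in_use=$2
  # Guard against division by zero for an empty or unknown size.
  if [ "$total" -le 0 ]; then
    echo 0
    return 0
  fi
  echo $(( (total - in_use) * 100 / total ))
}
```

For example, frag_percent 2147483648 1073741824 prints 50: half of a 2 GiB database file is space that compaction has freed but only defragmentation can return.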

Does OpenShift automatically perform compaction of etcd?

Yes. Etcd is compacted every 5 minutes.

Why is defragmentation of etcd important?

Improves performance

What does defragmentation in etcd do?

Reclaims free space in the etcd database
Makes the space available to the filesystem

Does OpenShift automatically perform defragmentation of etcd?

Yes. The etcd operator performs it.

Can defragmentation of etcd be performed manually?

Yes, using the etcdctl defrag command.
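
A manual run against a single member might look like the sketch below. It rests on several assumptions: etcd pods live in the openshift-etcd namespace and have an etcdctl container, the pod name passed in is a placeholder, and the preset ETCDCTL_ENDPOINTS variable has to be unset so that the explicit --endpoints flag takes effect. The helper name defrag_member is hypothetical. Because a member stops serving requests while it is defragmented, the function only prints the command unless RUN=1 is set.

```shell
# defrag_member POD_NAME
# Prints (default) or runs (RUN=1) a defragmentation of one etcd member.
# POD_NAME is a placeholder; list real pods with:
#   oc get pods -n openshift-etcd -l app=etcd
defrag_member() {
  pod=$1
  # Command executed inside the pod; targets only the local member.
  remote='unset ETCDCTL_ENDPOINTS; etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag'
  if [ "${RUN:-0}" = "1" ]; then
    oc exec -n openshift-etcd -c etcdctl "$pod" -- sh -c "$remote"
  else
    # Dry run: show what would be executed.
    echo "oc exec -n openshift-etcd -c etcdctl $pod -- sh -c '$remote'"
  fi
}
```

Defragmenting one member at a time, with the current leader last, limits the impact on the cluster.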

Does OpenShift have mechanisms suggesting when defragmentation of etcd should be performed?

Yes. Prometheus alert rules, delivered through Alertmanager, compare the space actually in use with the total etcd database size.
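
The kind of condition such an alert evaluates can be sketched as follows. The thresholds here (less than 50% of the file in use, and more than 100 MiB in use) are illustrative assumptions modeled on the upstream etcd fragmentation alert, not values confirmed for a given cluster; check the PrometheusRule objects on your cluster for the real expression. The function name high_fragmentation is hypothetical.

```shell
# high_fragmentation TOTAL_BYTES IN_USE_BYTES
# Succeeds (exit 0) when the database looks worth defragmenting.
# Both thresholds are assumptions chosen for illustration.
high_fragmentation() {
  total=$1
  in_use=$2
  min_in_use=$(( 100 * 1024 * 1024 ))   # ignore very small databases
  # in_use / total < 0.5 is rewritten as in_use * 2 < total
  # to stay within integer arithmetic.
  [ "$in_use" -gt "$min_in_use" ] && [ $(( in_use * 2 )) -lt "$total" ]
}
```

A 1 GiB file with 200 MiB in use satisfies the condition; the same file with 768 MiB in use does not.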

Can the settings for automatic defragmentation be changed?

No.
https://access.redhat.com/solutions/5564771

https://github.com/openshift/cluster-etcd-operator/blob/release-4.16/pkg/operator/defragcontroller/defragcontroller.go#L28C1-L28C6

Can defragmentation impact cluster operation?

Yes. A member blocks reads and writes while it is being defragmented, which can lead to API interruptions.

Can automatic defragmentation of the etcd database be disabled?

Yes, by creating an empty config map:
oc create configmap etcd-disable-defrag -n openshift-etcd-operator

https://access.redhat.com/solutions/6960380

What can be done to ensure that etcd database defragmentation does not impact cluster stability?

Regular reboots can be performed.

During node startup, the etcd operator may detect the need for defragmentation and execute it. This occurs at a stage when the node is not yet serving business workloads.

For example, if a customer regularly updates OpenShift to newer .z versions (x.y.z), this process occurs transparently.

Why is defragmentation necessary even when using SSD/NVMe disks (which don’t have seek-time issues)?

Defragmentation is done at the database level, not the filesystem. A smaller database has better performance due to better data structure management.

What should be monitored in the context of etcd performance?

Fsync time for WAL (writes from RAM to disk)
Number of leader changes
Network latency between etcd members
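
As a rough illustration of checking the first item, the sketch below compares a 99th-percentile WAL fsync duration against a limit. The 10 ms default is a commonly cited guideline for etcd and is used here as an assumption, and the helper name p99_within_limit is hypothetical. The value is assumed to come from a query such as histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])).

```shell
# p99_within_limit VALUE_SECONDS [LIMIT_SECONDS]
# Succeeds (exit 0) when the measured p99 is below the limit.
# The default limit of 0.010 s (10 ms) is an assumed guideline.
p99_within_limit() {
  # awk handles the floating-point comparison that sh cannot do natively.
  awk -v v="$1" -v limit="${2:-0.010}" \
    'BEGIN { exit !((v + 0) < (limit + 0)) }'
}
```

The same helper can be reused for peer round-trip time by passing a different limit as the second argument. Leader changes are available as the counter etcd_server_leader_changes_seen_total.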

What tuning parameters are key for etcd?

etcd database size (spec.backendQuotaGiB)
Hardware speed tolerance (spec.controlPlaneHardwareSpeed)

Where are tuning parameters for etcd set?

oc edit etcd/cluster

What is hardware speed tolerance in etcd?

A parameter defining tolerance for hardware delays during data writes.

When should hardware speed tolerance be considered for modification?

High write latency on disk
Performance issues with etcd

Why modify hardware speed tolerance?

To adjust etcd to operate in environments with slower or unstable disks. For example, it changes timeout values.

In which environments may hardware speed tolerance need to be set to Slower?

Installations with network disks, e.g., iSCSI
Virtualized environments (frequently)
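
A non-interactive alternative to oc edit is a merge patch. The sketch below only builds the patch body, assuming the tolerance is exposed as spec.controlPlaneHardwareSpeed on the etcd/cluster resource with the values Standard and Slower. The helper name hardware_speed_patch is hypothetical.

```shell
# hardware_speed_patch VALUE
# Prints a JSON merge patch for the etcd/cluster resource.
# The field name and accepted values are assumptions to verify
# against the documentation for your OpenShift version.
hardware_speed_patch() {
  case "${1-}" in
    Standard|Slower)
      printf '{"spec": {"controlPlaneHardwareSpeed": "%s"}}\n' "$1"
      ;;
    *)
      echo "unsupported value: ${1-}" >&2
      return 1
      ;;
  esac
}

# Possible usage against a cluster (not executed here):
#   oc patch etcd/cluster --type=merge -p "$(hardware_speed_patch Slower)"
```

Validating the value before patching avoids sending a typo such as Slow to the API server.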

Why increase the etcd database size?

To store more data (for large clusters)

In what phase is the etcd database expansion functionality?

Tech-preview

How much RAM must the master node have in relation to the etcd database size?

The recommendation is at least 3 times the size of the etcd database.
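
The rule of thumb above is simple arithmetic. A small sketch follows, with hypothetical helper names and whole GiB values assumed for both inputs.

```shell
# min_ram_gib DB_GIB
# Prints the minimum RAM suggested by the "3 times the database size" rule.
min_ram_gib() {
  echo $(( $1 * 3 ))
}

# ram_sufficient RAM_GIB DB_GIB
# Succeeds (exit 0) when the node's RAM meets that minimum.
ram_sufficient() {
  [ "$1" -ge "$(min_ram_gib "$2")" ]
}
```

For example, min_ram_gib 8 prints 24, so a master node with 16 GiB of RAM would fall short for an 8 GiB database.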

Why does a larger etcd database affect master node RAM requirements?

The API Server Cache will need to cache more elements.

What can a slow disk used by etcd affect?

Performance
Stability

What is critical for the performance of etcd?

Low disk latency (SSD or NVMe)

What can happen with high disk latency?

Leader loss
Timeouts
Slower OpenShift API operations
Impact on all cluster applications

What can cause high etcd latencies?

Other processes with intensive I/O
High network latency

Can OpenShift have a dedicated disk for etcd?

Yes

Where are procedures and recommendations about etcd for OpenShift available?

https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/scalability_and_performance/recommended-performance-and-scalability-practices#recommended-etcd-practices
