How to troubleshoot errors in etcd
When etcd exceeds the configured amount of storage space, it can be hard to track down and repair the issue. The following guide illustrates some troubleshooting tools. It is tailored to the following error, but the tools may prove useful in solving other problems as well:
io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: etcdserver: mvcc: database space exceeded
Command overview
The standard command for interacting with etcd is etcdctl. This command requires a number of settings or options to find and authenticate with the etcd nodes.
In a Deephaven environment, the etcdctl.sh script, located in /usr/illumon/latest/bin, will automatically set most of these options. The command requires the location of the configuration files for connecting. These are located in /etc/sysconfig/illumon.d/etcd/client. The rest of this document uses the root configuration, but others are available that do not require root permission.
Note
/etc/sysconfig/illumon.d/etcd/client/root represents the etcd root user, and may not be the operating system root user. If appropriate, use sudo -u irisadmin (or whatever user owns the files) instead of sudo in the commands below.
sudo DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root /usr/illumon/latest/bin/etcdctl.sh <etcdctl command options>
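Because every command in this guide repeats the same prefix, a small shell function can save typing. The following is only a sketch: the function name dh_etcdctl and the DH_SUDO and DH_ETCDCTL override variables are hypothetical names introduced here for illustration and are not part of a Deephaven installation. DH_ETCD_DIR and the default paths are the ones used in the command above.

```shell
# Hypothetical convenience wrapper around the command template above.
# DH_SUDO and DH_ETCDCTL are illustrative overrides, not Deephaven settings.
dh_etcdctl() {
    # DH_SUDO is left unquoted on purpose so that a value such as
    # "sudo -u irisadmin" is split into separate words.
    ${DH_SUDO:-sudo} \
        DH_ETCD_DIR="${DH_ETCD_DIR:-/etc/sysconfig/illumon.d/etcd/client/root}" \
        "${DH_ETCDCTL:-/usr/illumon/latest/bin/etcdctl.sh}" "$@"
}
```

With this function defined, dh_etcdctl alarm list would run the same command as the full form, and DH_SUDO="sudo -u irisadmin" dh_etcdctl endpoint status would run it as a different operating system user.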
This command prints a table showing the status of each etcd endpoint. It uses the schema configuration and the irisadmin user rather than root:
sudo -u irisadmin DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/schema /usr/illumon/latest/bin/etcdctl.sh endpoint status --write-out table
+---------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.128.13.53:2379 | 20f1fe672cdca01d | 3.3.18 | 17 MB | true | 23854 | 48441 |
| https://10.128.13.54:2379 | 8cdf5ce8a296848f | 3.3.18 | 17 MB | false | 23854 | 48442 |
| https://10.128.13.55:2379 | 4ea3e72f6e028887 | 3.3.18 | 17 MB | false | 23854 | 48443 |
+---------------------------+------------------+---------+---------+-----------+-----------+------------+
Note
The output formats specified by --write-out render the node ID differently. You might need to use different options to get the node ID in the format needed in other commands.
For example, the table format above shows each ID in hexadecimal, while the JSON format renders the same IDs as decimal 64-bit integers:
sudo DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root /usr/illumon/latest/bin/etcdctl.sh endpoint status -w json
[{"Endpoint":"https://10.128.13.53:2379","Status":{"header":{"cluster_id":9262865042670673362,"member_id":2373958197688705053,"revision":13864,"raft_term":23854},"version":"3.3.18","dbSize":17051648,"leader":2373958197688705053,"raftIndex":48449,"raftTerm":23854}},{"Endpoint":"https://10.128.13.54:2379","Status":{"header":{"cluster_id":9262865042670673362,"member_id":10150934239346328719,"revision":13864,"raft_term":23854},"version":"3.3.18","dbSize":17022976,"leader":2373958197688705053,"raftIndex":48450,"raftTerm":23854}},{"Endpoint":"https://10.128.13.55:2379","Status":{"header":{"cluster_id":9262865042670673362,"member_id":5666626947057354887,"revision":13864,"raft_term":23854},"version":"3.3.18","dbSize":17031168,"leader":2373958197688705053,"raftIndex":48451,"raftTerm":23854}}]
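To match an ID from one output format against the other, the decimal value can be converted to hexadecimal with printf. This is a minimal sketch; member_id_to_hex is a hypothetical helper name, and the sample value is the member_id of the first endpoint in the JSON output above.

```shell
# Convert a decimal member ID (as printed by -w json) to the hexadecimal
# form shown in the table output.
member_id_to_hex() {
    printf '%x\n' "$1"
}

member_id_to_hex 2373958197688705053   # prints 20f1fe672cdca01d
```

The result matches the ID column for https://10.128.13.53:2379 in the table. IDs larger than 2^63 - 1 rely on the shell's printf accepting unsigned 64-bit values, so it is worth checking the helper against a known pair from your own output first.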
Investigate the condition
If one or more nodes of the etcd cluster are out of space, alarm list returns output like this:
sudo DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root /usr/illumon/latest/bin/etcdctl.sh alarm list
memberID:3254910096547498518 alarm:NOSPACE
memberID:13807399138998277405 alarm:NOSPACE
memberID:13160873893432754734 alarm:NOSPACE
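For scripted checks, the member IDs can be pulled out of this output with sed. The sketch below assumes the line format shown above; nospace_member_ids is a hypothetical helper name, and it reads the alarm list output from standard input rather than calling etcdctl itself.

```shell
# Read "alarm list" output on stdin and print the member ID of each
# NOSPACE alarm, one per line (decimal, as etcdctl reports them).
nospace_member_ids() {
    sed -n 's/^memberID:\([0-9][0-9]*\) alarm:NOSPACE.*$/\1/p'
}

# Sample input copied from the output above.
printf '%s\n' \
    'memberID:3254910096547498518 alarm:NOSPACE' \
    'memberID:13807399138998277405 alarm:NOSPACE' \
    'memberID:13160873893432754734 alarm:NOSPACE' | nospace_member_ids
```

An empty result means no NOSPACE alarm was present in the input.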
Clearing the error condition
You can clear the alarms with alarm disarm:
sudo DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root /usr/illumon/latest/bin/etcdctl.sh alarm disarm
The alarm will likely return unless you address the cause of the alarm with the steps below.
Check compaction settings
Every change in etcd creates a new revision, which can be used to retrieve prior key values. This history needs to be compacted periodically to keep storage space from increasing without limit.
The default configuration file has these lines:
auto-compaction-mode: periodic
auto-compaction-retention: "168"
This means that etcd will automatically compact every hour (implied by the periodic mode), and it will remove all versions older than 168 hours (1 week). If your system exceeds database space frequently, this time period can be shortened, or the mode can be changed.
The default configuration file is /etc/etcd/dh/latest/config.yaml on the nodes running etcd.
Note that there are several config files, and this path is a symbolic link to one of them.
For changes to take effect, you will need to edit all the configuration files, distribute them to all etcd nodes, and restart the etcd processes.
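Before editing, it can help to confirm which compaction settings a node is actually using. The sketch below greps the two settings from a configuration file; show_compaction_settings is a hypothetical helper name, and the default path is the one given above.

```shell
# Print the auto-compaction settings from an etcd configuration file.
# The default path is the one named above; pass another path to override.
show_compaction_settings() {
    grep -E '^[[:space:]]*auto-compaction-(mode|retention):' \
        "${1:-/etc/etcd/dh/latest/config.yaml}"
}
```

Run it on each etcd node, with sudo if the file is not readable by your user. On systems with GNU coreutils, readlink -f /etc/etcd/dh/latest/config.yaml shows which file the symbolic link resolves to.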
Compact now
You can compact history immediately instead of waiting for the periodic compaction.
Find the current revision:
sudo DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root /usr/illumon/latest/bin/etcdctl.sh endpoint status -w fields
"ClusterID" : 9262865042670673362
"MemberID" : 2373958197688705053
"Revision" : 13864
"RaftTerm" : 23854
"Version" : "3.3.18"
"DBSize" : 17051648
"Leader" : 2373958197688705053
"RaftIndex" : 48477
"RaftTerm" : 23854
"Endpoint" : "https://10.128.13.53:2379"
...
Note the value of the "Revision" field and use it in place of 1516 in the compact command below.
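The revision can also be extracted without reading the output by eye. The sketch below follows the grep approach commonly used with etcdctl JSON output; current_revision is a hypothetical helper name, and it reads endpoint status -w json output from standard input.

```shell
# Read "endpoint status -w json" output on stdin and print the first
# revision number found.
current_revision() {
    grep -E -o '"revision":[0-9]+' | head -n 1 | grep -E -o '[0-9]+'
}

# Sample input abbreviated from the JSON output shown earlier.
echo '[{"Status":{"header":{"member_id":2373958197688705053,"revision":13864,"raft_term":23854}}}]' \
    | current_revision   # prints 13864
```

In practice, pipe the real endpoint status -w json command into the function and store the result in a shell variable for the compact step.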
Compact away all old revisions
sudo DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root /usr/illumon/latest/bin/etcdctl.sh compact 1516
compacted revision 1516
Defragment away excessive space
sudo DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root /usr/illumon/latest/bin/etcdctl.sh defrag
Finished defragmenting etcd member[https://10.128.13.53:2379]
Finished defragmenting etcd member[https://10.128.13.54:2379]
Finished defragmenting etcd member[https://10.128.13.55:2379]
Disarm alarm
sudo DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root /usr/illumon/latest/bin/etcdctl.sh alarm disarm
Verify the system accepts changes again
sudo DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root /usr/illumon/latest/bin/etcdctl.sh check perf
The check perf command writes to etcd, so you should see the revision number increase after it runs. Any other command that writes to etcd can be used to verify this as well.
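The compact, defragment, and disarm steps above can be chained so that each one runs only if the previous one succeeded. This is a sketch under stated assumptions, not a supported Deephaven tool: reclaim_space is a hypothetical function name, and the etcdctl command prefix is passed in as arguments so the function itself hard-codes no paths.

```shell
# Hypothetical helper: compact to the current revision, defragment, then
# disarm alarms. The arguments are the etcdctl command prefix to use.
reclaim_space() {
    # Take the first revision reported by "endpoint status -w json".
    rev=$("$@" endpoint status -w json \
        | grep -E -o '"revision":[0-9]+' | head -n 1 | grep -E -o '[0-9]+')
    if [ -z "$rev" ]; then
        echo "could not determine the current revision" >&2
        return 1
    fi
    # Stop at the first step that fails.
    "$@" compact "$rev" &&
        "$@" defrag &&
        "$@" alarm disarm
}

# Example invocation (not run here):
# reclaim_space sudo DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root /usr/illumon/latest/bin/etcdctl.sh
```

Running the steps by hand, as described above, remains the safer choice the first time through, since it lets you inspect the output of each command.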
Increase the maximum database size
The default maximum size is 2 GB. You can increase this by adding a setting to the configuration file:
quota-backend-bytes: 8589934592
The etcd documentation suggests 8 GB as a maximum, but larger values are supported.
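The setting is expressed in bytes, so the value shown above is 8 multiplied by 1024 three times. A short sketch of the arithmetic, where gib_to_bytes is a hypothetical helper name:

```shell
# Convert a size in GiB to the byte count expected by quota-backend-bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 8   # prints 8589934592
```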
etcd publishes the current value of this setting in its metrics.
Use the endpoint status command above to find your endpoint addresses, then use curl to retrieve the metrics:
curl -k https://10.200.46.148:2379/metrics | grep etcd_server_quota_backend_bytes
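The metric is reported in bytes and may appear in scientific notation, which is awkward to read. The sketch below converts it to GiB with awk; quota_gib is a hypothetical helper name, it reads the metrics text from standard input, and the sample input is illustrative rather than captured from a real server.

```shell
# Read Prometheus-format metrics on stdin and print the backend quota in GiB.
quota_gib() {
    awk '$1 == "etcd_server_quota_backend_bytes" { printf "%.2f GiB\n", $2 / (1024 * 1024 * 1024) }'
}

# Sample metrics text (illustrative, not captured from a real server).
printf '%s\n' \
    '# TYPE etcd_server_quota_backend_bytes gauge' \
    'etcd_server_quota_backend_bytes 8.589934592e+09' | quota_gib   # prints 8.00 GiB
```

To use it against a live endpoint, pipe the curl command above into quota_gib in place of grep.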