High read latency in Cassandra and increasing load issue

In our product, we use Cassandra, a high-performance distributed database. It is open-source software, but using it is not cost-free: running your own clusters requires considerable expertise and regular maintenance.

In one of our environments, a 6-node Cassandra cluster was deployed. In the beginning, there was nothing worrying about the cluster. After some time, it turned out that disk space usage kept increasing. The storage was resized a couple of times: from 500GB to 1TB, then from 1TB to 2TB, and finally from 2TB to 4TB. But resizing was no remedy at all, as the cluster started lagging; it only delayed the upcoming catastrophe. So we had to figure out what was really going on.

Tracking down the root cause

❯ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  192.168.1.115  2.36 TB    256     ?       f2f370d6-c829-495d-9b1a-499e799f1916  rack1
UN  192.168.1.53   1.92 TB    256     ?       921e6c39-7f02-4114-8a6a-d8cd5afbd8b8  rack1
UN  192.168.1.52   1.24 TB    256     ?       4e9fdf72-4154-4ff2-a1d3-6413d815ec37  rack1
UN  192.168.1.114  1.29 TB    256     ?       65ceedce-ed5e-4730-b1b9-51c988682a55  rack1
UN  192.168.1.51   2.14 TB    256     ?       a3d5f6de-e91d-4778-bcb2-086133c7a81b  rack1
UN  192.168.1.116  2.31 TB    256     ?       e28ed4f6-288b-4ceb-b737-70c70860245e  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

Printing nodetool status showed that some nodes had already gone beyond 2TB and were choking. So it was only a matter of time before they reached 4TB.
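To see which tables were actually eating the disk, the data directories themselves can be inspected. A minimal sketch, assuming the default data directory /var/lib/cassandra/data (the actual path depends on the installation):

❯ du -sh /var/lib/cassandra/data/*/* | sort -h | tail -n 5
# prints the five largest keyspace/table directories on the node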

❯ nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0      817585641         0                 0
ReadStage                         3         0       83673866         0                 0
RequestResponseStage              0         0      629340424         0                 0
ReadRepairStage                   0         0       11325221         0                 0
CounterMutationStage              0         0              0         0                 0
MiscStage                         0         0              0         0                 0
HintedHandoff                     0         0            300         0                 0
GossipStage                       0         0        3163813         0                 0
CacheCleanupExecutor              0         0              0         0                 0
InternalResponseStage             0         0              0         0                 0
CommitLogArchiver                 0         0              0         0                 0
CompactionExecutor                2        39         123353         0                 0
ValidationExecutor                0         0              0         0                 0
MigrationStage                    0         0              0         0                 0
AntiEntropyStage                  0         0              0         0                 0
PendingRangeCalculator            0         0             13         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0          70928         0                 0
MemtablePostFlush                 0         0         111004         0                 0
MemtableReclaimMemory             0         0          70928         0                 0
Native-Transport-Requests         2         0     1197812673         0           2561165

Message type           Dropped
READ                    470097
RANGE_SLICE                 21
_TRACE                       0
MUTATION               7627446
COUNTER_MUTATION             0
BINARY                       0
REQUEST_RESPONSE         27253
PAGED_RANGE                837
READ_REPAIR              11969

That choking could also be seen in nodetool tpstats: over 2.5 million Native-Transport-Requests had been blocked, and a significant number of READ and MUTATION messages had been dropped.
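To check whether those counters were still climbing, they can be sampled periodically. A simple sketch, assuming watch is available on the node:

❯ watch -n 30 "nodetool tpstats | grep -E 'Native-Transport-Requests|READ|MUTATION'"
# refreshes the blocked and dropped counters every 30 seconds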

The next step was to look closely at the keyspaces and their corresponding column families, known simply as tables. Issuing nodetool cfstats gave the first insight.

❯ nodetool cfstats
...
Keyspace: data
	Read Count: 14817739
	Read Latency: 265.7262566112144 ms.
	Write Count: 270622271
	Write Latency: 0.9290800410842758 ms.
	Pending Flushes: 0

Read latency in the data keyspace was pretty odd: reads were taking over 265ms, while writes completed in under 1ms.

❯ nodetool cfstats
Keyspace: data
	Read Count: 14817739
	Read Latency: 265.7262566112144 ms.
	Write Count: 270622271
	Write Latency: 0.9290800410842758 ms.
	Pending Flushes: 0
...
		Table: results
		SSTable count: 3358
		Space used (live): 2066615188640
		Space used (total): 2066615188640
		Space used by snapshots (total): 0
		Off heap memory used (total): 2296918130
		SSTable Compression Ratio: 0.14971098992452106
		Number of keys (estimate): 8655588
		Memtable cell count: 28102
		Memtable data size: 245565603
		Memtable off heap memory used: 239301550
		Memtable switch count: 9079
		Local read count: 14817739
		Local read latency: 265.727 ms
		Local write count: 180389361
		Local write latency: 1.142 ms
		Pending flushes: 0
		Bloom filter false positives: 899066
		Bloom filter false ratio: 0.00660
		Bloom filter space used: 202764152
		Bloom filter off heap memory used: 202737288
		Index summary off heap memory used: 26621412
		Compression metadata off heap memory used: 1828257880
		Compacted partition minimum bytes: 125
		Compacted partition maximum bytes: 43388628
		Compacted partition mean bytes: 98167
		Average live cells per slice (last five minutes): 1.9702823082522913
		Maximum live cells per slice (last five minutes): 2.0
		Average tombstones per slice (last five minutes): 1.3038426442792657E-4
		Maximum tombstones per slice (last five minutes): 2.0

Looking further, one table was really worth a closer look: the results table. This table had 3358 SSTables, quite a large number!
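As a side note, there is no need to pick this out of the full dump every time; nodetool cfstats also accepts keyspace and table arguments, so the output can be narrowed to the suspicious table only:

❯ nodetool cfstats data.results
# prints statistics only for the results table in the data keyspace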

❯ nodetool cfhistograms data results
EstimatedHistogram overflowed larger than 25109160, unable to calculate percentiles
EstimatedHistogram overflowed larger than 25109160, unable to calculate percentiles
data/results histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%            42.00               NaN               NaN             42510                 8
75%            50.00               NaN               NaN            105778                12
95%            60.00               NaN               NaN            315852               103
98%            60.00               NaN               NaN            545791               258
99%            60.00               NaN               NaN            785939               446
Min             0.00              0.00            180.00               125                 2
Max           258.00 9223372036854776000.00 9223372036854776000.00          43388628             14237

Printing nodetool cfhistograms for the results table showed that, at the 95th percentile, a read had to touch 60 SSTables to return data. So one query towards Cassandra triggered reading 60 files from disk! And this impacted performance heavily.
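A single read can also be traced end-to-end to confirm how many SSTables it touches. A sketch using cqlsh request tracing; the query below is a hypothetical placeholder, as the real schema of the results table is not shown here:

❯ cqlsh 192.168.1.115
cqlsh> TRACING ON;
cqlsh> SELECT * FROM data.results WHERE id = 'some-key' LIMIT 1;
-- the trace printed after the result lists every SSTable that was read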

This pointed us towards Cassandra's compaction process and a look into nodetool compactionstats.

❯ nodetool compactionstats
pending tasks: 86
   compaction type        keyspace                                              table       completed           total    unit   progress
        Compaction             log                                          commitlog       624030592      5659451776   bytes     11.03%
        Compaction            data                                            results   2920702929876   4813653709033   bytes     60.68%
Active compaction remaining time :  31h25m28s

The statistics showed that 31 more hours were needed to finish compacting the results table; since that compaction was only about 60% complete, a single compaction of the table was taking around 77 hours in total! The compaction process was clearly too slow to keep up.

The next thing to check was the virtual memory statistics to find out the block I/O throughput.

❯ vmstat 10 1000
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  2 13699692 301824  38448 14898784   14   11 31540  2240    0    0 17  6 57 20  0
 5  9 13715456 281188  35816 14990932   11 1586 668936 54885 138226 104549 19  9 41 31  0
 4  1 13725484 289156  36064 14992604   14 1025 513021 30527 114013 104072 15  7 63 15  0
 3  3 13742652 292892  33252 15011716   17 1696 718570 58119 140681 105421 18 10 38 35  0
 3  1 13754044 294876  33308 15021608   11 1182 564726 42929 122446 101715 18  8 50 25  0
13 13 13764612 291908  33516 15029748   11 1019 541156 43989 123884 102516 18  8 48 27  0
 7 13 13774820 293080  35236 15035136  133 1279 758554 36706 125018 99450 19  9 40 31  0
 8  5 13785004 303612  35360 15013740   16  940 548025 48032 125423 99228 18  8 36 38  0
 2  6 13792172 294076  33080 15061304   11  743 466678 39152 113678 98034 16  7 30 47  0
 6  7 13800876 294300  33996 15061108   19  836 495162 52322 120387 100311 18  8 42 33  0
 4  1 13812444 299568  30396 15077220   16 1189 564618 51925 139990 103598 21  9 48 22  0

Taking the maximum from the blocks-out column, i.e. 58119 blocks per second, and assuming 1024 bytes per block, gave us an approximate write throughput of around 57MB per second.
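Disk throughput can also be read per device instead of being derived from block counts. A sketch, assuming the sysstat package (which provides iostat) is installed on the node:

❯ iostat -xm 10
# -x prints extended statistics, -m reports throughput in MB/s,
# refreshed every 10 seconds for each block device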

The Remedy

Cassandra has a compaction_throughput_mb_per_sec configuration parameter that throttles compaction.

The cassandra.yaml configuration file
The cassandra.yaml file is the main configuration file for Cassandra.
compaction_throughput_mb_per_sec (Default: 16) Throttles compaction to the specified total throughput across the node. The faster you insert data, the faster you need to compact in order to keep the SSTable count down. The recommended value is 16 to 32 times the rate of write throughput (in Mb/second). Setting the value to 0 disables compaction throttling.

It turned out that the root cause of the issue was this parameter being left at its default value of 16MB/s.
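The value in effect can be confirmed on a live node (the exact output format may differ between Cassandra versions):

❯ nodetool getcompactionthroughput
Current compaction throughput: 16 MB/s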

The remedy to the issue was to increase compaction_throughput_mb_per_sec to 196.
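The new cap can be applied at runtime on each node, without a restart, and then persisted in cassandra.yaml so it survives the next restart:

❯ nodetool setcompactionthroughput 196
# takes effect immediately on the node it is run against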

concurrent_compactors (Default: Smaller of number of disks or number of cores, with a minimum of 2 and a maximum of 8 per CPU core) Sets the number of concurrent compaction processes allowed to run simultaneously on a node, not including validation compactions for anti-entropy repair. Simultaneous compactions help preserve read performance in a mixed read-write workload by mitigating the tendency of small SSTables to accumulate during a single long-running compaction. If your data directories are backed by SSD, increase this value to the number of cores. If compaction is running too slowly or too fast, adjust compaction_throughput_mb_per_sec first.

At the same time, we increased the number of concurrent_compactors from 2 to 6.
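For reference, this is roughly what the relevant part of cassandra.yaml looked like after the change (a sketch; concurrent_compactors is commented out by default and the rest of the file is omitted):

# cassandra.yaml - relevant settings only
compaction_throughput_mb_per_sec: 196
concurrent_compactors: 6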

It took a while for the cluster to get back to its normal state, but finally the lagging stopped and the disk usage dropped to around 500GB on each node!