Cassandra Compaction: Strategies for Storage Management

Cassandra has a distributed architecture, which enables it to handle large volumes of data. It stores this data in immutable files on disk known as SSTables (Sorted String Tables). This immutability has benefits: writes are simple, sequential appends, with no need to lock or rewrite existing files. However, it also creates a scenario where multiple versions of the same data accumulate across SSTables due to continuous updates and deletions.

Compaction is the process that solves the challenges originating from the way Cassandra handles data writes and storage: it merges multiple SSTables into a single SSTable, discarding obsolete versions and deleted rows along the way. Compaction not only reduces disk usage but also improves read performance.

This article discusses Cassandra compaction, the different compaction strategies, and disk management. Read on!

How Cassandra Compaction Works

Compaction is the process of reconciling multiple copies of data stored in different SSTables. By compacting SSTables, Cassandra maintains fewer copies of each data row, improving read performance.

SSTables play a crucial part in Cassandra's storage architecture, and their design is integral to the database's performance and reliability. The compaction process depends largely on the behavior and characteristics of these SSTables, as well as on your data modeling patterns.

SSTables are immutable. This means that once written to disk, they cannot be modified. Each SSTable contains data sorted by partition keys, making it efficient for read operations. The sorting comes in handy during range queries. SSTables store the actual data in key-value pairs, where keys are partition keys and values are column values.

Initially, all write operations are recorded in an append-only commit log, ensuring data durability. Data is then stored in an in-memory structure called a Memtable. When the Memtable is full, it's flushed to disk as an SSTable. This process is known as a flush.

As more data is written, multiple small SSTables accumulate. Cassandra periodically merges these smaller SSTables into larger ones during minor compactions. Occasionally, Cassandra performs a major compaction, in which all SSTables for a table (column family) are merged into a single, compact SSTable. While this comprehensive merging process is more resource-intensive, it results in a highly optimized storage structure.

Different Compaction Strategies in Cassandra

  • SizeTieredCompactionStrategy (STCS)
  • LeveledCompactionStrategy (LCS)
  • TimeWindowCompactionStrategy (TWCS)
  • DateTieredCompactionStrategy (DTCS)

Each of these compaction strategies has a different impact on disk usage, performance, and database structure.

SizeTieredCompactionStrategy (STCS)

SizeTieredCompactionStrategy (STCS) is the default compaction method in Apache Cassandra. It plays a crucial role in managing the efficiency and performance of data storage, particularly in write-intensive environments. At its core, STCS triggers a compaction whenever it finds enough SSTables of similar size.

Ideally, STCS streamlines the database's storage by reducing the overall number of SSTables on disk. In practice, Cassandra organizes SSTables into 'buckets' based on their sizes, so each bucket contains SSTables within a certain size range. The buckets are defined by configurable parameters such as bucket_low and bucket_high, which set that range relative to the average size of the SSTables already in the bucket. When enough SSTables in a bucket reach a similar size, they are merged into a larger SSTable.

STCS helps minimize the disk space occupied by redundant data. It also enhances the efficiency of read operations by decreasing the number of SSTables scanned. This strategy is ideal in scenarios where the workload is predominantly write-oriented and the disk space is sufficiently provisioned to handle the compaction processes.
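As a minimal CQL sketch (the keyspace and table names are placeholders; the values shown are the shipped defaults), applying STCS with explicit bucket settings looks like this:

    ALTER TABLE myks.events
    WITH compaction = {
      'class': 'SizeTieredCompactionStrategy',
      'bucket_low': 0.5,    -- SSTables down to 50% of the bucket average qualify
      'bucket_high': 1.5,   -- SSTables up to 150% of the bucket average qualify
      'min_threshold': 4    -- merge once 4 similarly sized SSTables accumulate
    };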

LeveledCompactionStrategy (LCS)

LeveledCompactionStrategy (LCS) is a compaction strategy that focuses on a more structured, level-based organization of SSTables. This strategy represents a significant shift in how data is managed and compacted, particularly for read-heavy workloads. At its core, LCS divides SSTables into multiple levels, each with a specific size constraint. The hierarchy starts with smaller SSTables at the lower levels, with each subsequent level designed to be ten times larger than the previous one.

In LCS, SSTables within each level are non-overlapping: for any given level, no two SSTables contain the same partition keys. As a result, when a read request arrives, Cassandra limits its search to at most one SSTable per level, making the read operation faster and more efficient. When a new SSTable is added to a level, LCS compacts it with any overlapping SSTables in that level. If the level then overflows its size limit, the extra SSTables are promoted to the next, larger level.

LCS is ideal for read-heavy workloads. Because the SSTables within a level are non-overlapping, most read operations are satisfied by accessing just one SSTable per level. This structure speeds up reads and makes read performance more predictable.
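A minimal CQL sketch, again with a hypothetical table name; sstable_size_in_mb is shown at its common default of 160 MB:

    ALTER TABLE myks.user_profiles
    WITH compaction = {
      'class': 'LeveledCompactionStrategy',
      'sstable_size_in_mb': 160   -- target size for SSTables within a level
    };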

TimeWindowCompactionStrategy (TWCS)

TimeWindowCompactionStrategy (TWCS) is a compaction strategy suited to time-series data or workloads where data has a Time to Live (TTL) set. Its distinguishing feature is that it handles data based on time windows, which optimizes data management and compaction in scenarios where data is time-sensitive. TWCS groups data into distinct time windows and compacts data only within those windows.

Two primary options define these windows:

  • compaction_window_unit: Specifies the unit of time (like MINUTES, HOURS, or DAYS) used for the window
  • compaction_window_size: Determines the size of each time window in terms of the number of units

The strategy aims to create a single SSTable for each time window, effectively segmenting data by timestamp. Within the active time window, TWCS uses STCS for efficient compaction. This approach keeps data from a given time window together, making it easy to manage and, once the data expires, to discard wholesale.
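A minimal CQL sketch for a hypothetical time-series table, pairing one-day compaction windows with a seven-day TTL so each window's SSTable can be dropped wholesale when it expires:

    CREATE TABLE myks.sensor_readings (
      sensor_id    uuid,
      reading_time timestamp,
      value        double,
      PRIMARY KEY (sensor_id, reading_time)
    ) WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',   -- unit of time for each window
        'compaction_window_size': 1         -- one window per day
      }
      AND default_time_to_live = 604800;    -- rows expire after 7 days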

DateTieredCompactionStrategy (DTCS)

DateTieredCompactionStrategy (DTCS) is designed for managing time-series data. However, this strategy has been deprecated as of version 3.8 in favor of TWCS. If you are running an earlier version of Cassandra, you can still use DTCS.

DTCS is ideal for time-series workloads, where the data's temporal aspect is a primary concern. It groups data written during the same period into the same SSTable. This approach aligns with the nature of time-series data, where data points are typically collected in chronological order.
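For clusters still on a pre-3.8 version, a minimal CQL sketch might look like the following (the table name is hypothetical and the option values are illustrative, not recommendations):

    ALTER TABLE myks.metrics
    WITH compaction = {
      'class': 'DateTieredCompactionStrategy',
      'base_time_seconds': 3600,     -- size of the initial time window
      'max_sstable_age_days': 365    -- stop compacting SSTables older than this
    };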

Compaction Tuning Parameters

In Cassandra, compaction helps optimize the performance and efficiency of the database. To tailor compaction to your workload, Cassandra provides various configurable parameters, some set per node in cassandra.yaml and others set per table as compaction subproperties in CQL. These include:

snapshot_before_compaction

Determines whether Cassandra takes a snapshot of the data before each compaction. When enabled, Cassandra snapshots the affected SSTables before the compaction process starts. This protects against data loss in case something goes wrong during the compaction.

Enabling this parameter provides a safety net against data corruption during compaction. However, it requires additional disk space to store the snapshot.
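This is a node-level setting in cassandra.yaml rather than a per-table option. A minimal excerpt, using the shipped default of false:

    # cassandra.yaml
    # Take a snapshot of the affected SSTables before each compaction.
    # Note: Cassandra does not clean these snapshots up automatically.
    snapshot_before_compaction: false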

concurrent_compactors

Sets the number of compaction processes that can run concurrently on a Cassandra node. Increasing the number of concurrent compactors speeds up compaction, which is especially important in high-throughput environments where data is written rapidly.

Since higher concurrency leads to increased I/O and CPU usage, configure this parameter based on the nodes’ hardware capabilities. If the node’s resources are overutilized, it might negatively impact the overall performance of the database.
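This is also a cassandra.yaml setting. It is commented out by default, in which case Cassandra derives a value from the node's disk and core counts; the value below is purely illustrative:

    # cassandra.yaml
    # Number of simultaneous compaction tasks; when left unset, defaults to
    # the smaller of the node's disk count and core count.
    concurrent_compactors: 4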

compaction_throughput_mb_per_sec

This parameter caps the rate at which compaction operations run, expressed in megabytes per second. Limiting compaction throughput helps manage the impact of compaction on disk I/O, essentially throttling compaction so it cannot consume too much I/O bandwidth.

The optimal setting depends on the disk I/O capacity of the node and the overall workload. Set it too low and compactions back up, especially in a write-heavy environment; set it too high and compaction may degrade the performance of other operations.
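In cassandra.yaml this looks like the excerpt below (16 MB/s is the shipped default, and 0 disables throttling). The value can also be changed at runtime with nodetool setcompactionthroughput.

    # cassandra.yaml
    # Throttle for compaction I/O across the whole node; 0 means unthrottled.
    compaction_throughput_mb_per_sec: 16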

tombstone_compaction_interval

Specifies the minimum time, in seconds, that must pass after an SSTable is created before Cassandra considers it for a tombstone compaction (a tombstone is a marker for deleted data; the default is 86400 seconds, one day). It helps manage the deletion of data in a way that balances consistency and performance. This and the parameters below are per-table compaction subproperties; a combined CQL example appears at the end of this section.

tombstone_threshold

Sets the ratio of tombstones in an SSTable at which Cassandra considers the SSTable for compaction. If the proportion of tombstones in an SSTable exceeds this threshold, the SSTable is marked for compaction to purge the tombstones and reclaim disk space.

min_threshold and max_threshold

These determine the minimum and maximum number of SSTables included in a minor compaction. min_threshold sets the number of similarly sized SSTables that must accumulate before a minor compaction occurs, while max_threshold caps how many SSTables can be merged in a single minor compaction.

min_sstable_size

Used in STCS to set a size floor for bucketing: SSTables smaller than this value are all grouped into a single bucket rather than being bucketed by relative size. This keeps many tiny SSTables from being treated as separate tiers, which would be inefficient.
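As a rough sketch of how these table-level subproperties fit together in CQL (keyspace and table names are placeholders; the values shown are the usual defaults, with min_sstable_size expressed in bytes):

    ALTER TABLE myks.events
    WITH compaction = {
      'class': 'SizeTieredCompactionStrategy',
      'min_threshold': 4,                      -- need 4 similar SSTables to compact
      'max_threshold': 32,                     -- merge at most 32 SSTables at once
      'min_sstable_size': 52428800,            -- 50 MB bucketing floor
      'tombstone_threshold': 0.2,              -- compact once 20% of an SSTable is tombstones
      'tombstone_compaction_interval': 86400   -- wait one day after SSTable creation
    };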

Conclusion

Compaction is crucial to achieving efficient storage management and optimal database performance in Cassandra. Using the strategies above, you can effectively manage data redundancy, enhance read/write speeds, and make optimal use of disk space. In addition, Cassandra provides various parameters for fine-tuning the compaction process. Together, they offer a balanced approach to data management, keeping compaction efficient without adversely impacting the database's overall performance.

Dennis Muvaa

Dennis is an expert content writer and SEO strategist in cloud technologies such as AWS, Azure, and GCP. He's also experienced in cybersecurity, big data, and AI.
