Top 15 Apache Cassandra Best Practices Checklist

There have been issues reported around insecure deployments of NoSQL databases such as Cassandra, Elasticsearch, and MongoDB. The aim of this article is therefore to outline Cassandra best practices that help you secure your Apache Cassandra clusters. In my experience, most vulnerabilities stem from how Cassandra is deployed and managed rather than from inherent security bugs in the software. Let’s start!

What is Apache Cassandra

Apache Cassandra is a non-relational, open-source distributed database that provides continuous availability, data distribution, and scalability across different cloud providers, availability zones, and data centers.

It delivers a highly reliable data storage engine for different applications that require immense scale.

You can build well-organized, high-performance Cassandra clusters through careful data analysis and modeling. But you also need to follow a basic tuning checklist to ensure that the cluster runs with no early hiccups.

Work through the 15-point Cassandra checklist below to meet your desired goals.

Cassandra Best Practices Checklist

1. Don’t use Cassandra like a relational database

You should not design the Cassandra data model like a relational one. Define how you will access your tables at the beginning of the data modeling process, not at the end.

Because tables can’t be joined in Cassandra, merge the data you need from multiple tables into a single denormalized table.

In Cassandra’s table design, denormalization is the key because Cassandra does not support joins or derived tables. Also, it’s important to design to optimize how data can be distributed around the cluster.

When it comes to sorting in Cassandra, it can only be done on the clustering column in the Primary Key.
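
To make the query-first approach concrete, here is a small sketch: one denormalized table built for one known query. The table, column names, and query are hypothetical examples, and the CQL is held in plain Python strings since no live cluster is assumed.

```python
# Hypothetical query-first table: "show a user's latest videos".
# Everything the query needs lives in one denormalized table, no joins.
videos_by_user = """
CREATE TABLE videos_by_user (
    user_id  uuid,
    added_at timestamp,
    video_id uuid,
    title    text,
    PRIMARY KEY ((user_id), added_at)
) WITH CLUSTERING ORDER BY (added_at DESC);
"""

# Sorting is defined by the clustering column (added_at), so the read
# needs no ORDER BY at query time.
latest_videos = "SELECT title FROM videos_by_user WHERE user_id = ? LIMIT 10;"
```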

2. Don’t put load balancers in front of Cassandra

Cassandra distributes its data across the nodes, and Cassandra drivers can route requests precisely because of their built-in token-aware algorithms.

Adding a load balancer introduces an extra layer that breaks the driver’s intelligent routing and introduces a single point of failure. There should be no single point of failure in the Cassandra world.

3. You should avoid secondary indexes

In Cassandra, a secondary index is local to each node, so a query that uses one may have to contact many cluster nodes, which hurts performance. If you must use a secondary index, use it sparingly and only on a low-cardinality column.

Don’t use one on high-cardinality columns, and in general try to avoid secondary indexes altogether.
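
A quick way to reason about cardinality is the ratio of distinct values in a column. The sketch below is illustrative only; the exact threshold is an assumption, not a Cassandra rule.

```python
def cardinality_ratio(values):
    """Fraction of distinct values; near 1.0 means high cardinality."""
    values = list(values)
    return len(set(values)) / len(values)

# low cardinality: a handful of repeated values -> an index is tolerable
statuses = ["active", "inactive", "active", "banned", "active", "inactive"]

# high cardinality: every value unique -> never index this
emails = [f"user{i}@example.com" for i in range(6)]

assert cardinality_ratio(statuses) == 0.5
assert cardinality_ratio(emails) == 1.0
```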

4. Avoid full table scans

Scanning an entire table causes extreme heap pressure because Cassandra distributes partitions across all the nodes in the cluster, so every node has to take part in the scan.

In a large cluster holding billions of rows, a full table scan becomes a serious problem. Tweak the data model so that full table scans are unnecessary, for better performance and no bottlenecks.

5. Limit the partition sizes to 100 MB

You should keep partitions within 100 MB to ensure a streamlined heap and smooth compaction. The hard limit is two billion cells per partition, but well before that, large partitions put extra pressure on the heap and slow down compaction.

Limiting the partition sizes can enhance the cluster’s performance and deliver optimized and better results.
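
A back-of-envelope size check helps you pick the partition grain before data arrives. The row sizes and write rates below are illustrative assumptions.

```python
MAX_PARTITION_BYTES = 100 * 1024 * 1024  # the ~100 MB guideline

def partition_size_bytes(rows_per_partition, avg_row_bytes):
    return rows_per_partition * avg_row_bytes

# e.g. a sensor writing one 200-byte row per minute, partitioned per day:
per_day = partition_size_bytes(rows_per_partition=24 * 60, avg_row_bytes=200)
assert per_day < MAX_PARTITION_BYTES   # ~288 KB/day: comfortably small

# the same sensor partitioned per year just tips over the guideline:
per_year = partition_size_bytes(rows_per_partition=365 * 24 * 60, avg_row_bytes=200)
assert per_year > MAX_PARTITION_BYTES  # ~100.25 MB: choose a finer grain
```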

6. Avoid using batch for bulk loading

You should avoid using a batch for bulk loading when multiple partition keys are involved. Such batches put significant pressure on the coordinator node and degrade performance.

A batch is appropriate when you need to keep a set of denormalized tables in sync. Otherwise, avoid bulk loading with batches.
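
When you do batch, grouping statements by partition key keeps each batch local to one partition. A minimal sketch, with the statement strings standing in for real writes:

```python
from collections import defaultdict

writes = [
    ("user1", "INSERT row a for user1"),
    ("user2", "INSERT row a for user2"),
    ("user1", "INSERT row b for user1"),
]

# one batch per partition key, instead of one multi-partition batch that
# would fan out across coordinators
batches = defaultdict(list)
for partition_key, stmt in writes:
    batches[partition_key].append(stmt)

assert sorted(batches) == ["user1", "user2"]
assert len(batches["user1"]) == 2
```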

7. Design your model

Cassandra is a distributed data system that splits incoming data into partitions. It groups rows into partitions by hashing a data attribute known as the partition key and assigns the partitions to nodes in the cluster.

A decent Cassandra model limits partition size, distributes data across the cluster’s nodes, and minimizes the number of partitions returned by a query.

You need to ensure that your Cassandra model design follows these patterns and helps you achieve your desired results with finesse.
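
The hash-and-place idea can be sketched in a few lines. Real Cassandra uses Murmur3 tokens on a ring with virtual nodes; the md5-mod-N stand-in below only illustrates the principle.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def owner(partition_key: str) -> str:
    """Map a partition key to a node by hashing it (illustrative only)."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# the same key always lands on the same node...
assert owner("user42") == owner("user42")

# ...and a high-cardinality key spreads partitions across all nodes
placements = {owner(f"user{i}") for i in range(100)}
assert placements == set(NODES)
```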

8. Distribute data evenly across the Cassandra cluster

You must choose a partition key with high cardinality to avoid hot spots, where a few nodes are under high pressure while others sit idle.

You need to ensure symmetrical distribution of partition keys to minimize the pressure on the nodes.

Choose partition keys with a bounded but large number of possible values for increased performance, and keep partition sizes within the 100 MB guideline discussed in one of the practices above.

Also, minimize the number of partitions read by a single query, because reading many partitions can get expensive as each partition may reside on a separate node.

The coordinator will issue separate commands to the nodes holding the requested partitions, which adds overhead and increases latency variation.

9. Importance of the primary key

The tables in Cassandra have a set of columns termed the primary key. The primary key also shapes the data structure and determines the uniqueness of the row.

The primary key has two parts: the partition key and the clustering key, also called the clustering columns. The first column, or set of columns, in the primary key is the partition key, and it carries great importance.

Clustering keys, or clustering columns, are the columns that follow the partition key. They are optional, unlike the partition key. The clustering key determines the default sort order of rows within a partition.

During the design process, make sure the partition key distributes data across the different nodes of the cluster, and avoid keys with a small domain of possible values such as school grades, statuses, or genders.

The number of possible key values should be well above the number of nodes in the cluster, and you should avoid keys with highly skewed values.
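
Putting this together, a composite primary key might look like the following. The schema is a hypothetical illustration held as a Python string:

```python
# (customer_id)        -> partition key: high cardinality, spreads data
# order_date, order_id -> clustering columns: default sort within a partition
orders_by_customer = """
CREATE TABLE orders_by_customer (
    customer_id uuid,
    order_date  timestamp,
    order_id    uuid,
    total       decimal,
    PRIMARY KEY ((customer_id), order_date, order_id)
);
"""

# a key like 'status' or 'gender' would concentrate all rows on a few
# partitions: exactly the skew this practice warns against
assert "PRIMARY KEY ((customer_id)" in orders_by_customer
```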

10. Take advantage of prepared statements

Use prepared statements whenever you execute a query with the same structure multiple times. Cassandra parses the query string once and caches the prepared statement.

On subsequent executions you simply bind new variables to the cached prepared statement. This increases performance by skipping the parsing phase for each query.
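
The parse-once, bind-many behavior can be modeled with a toy cache. A real driver (such as the DataStax Python driver) prepares statements server-side; the in-memory Session class below only illustrates the idea and is not a real API.

```python
class Session:
    """Toy stand-in for a driver session with a prepared-statement cache."""

    def __init__(self):
        self._cache = {}
        self.parse_count = 0

    def prepare(self, query: str):
        if query not in self._cache:
            self.parse_count += 1                 # expensive parse, done once
            self._cache[query] = query.split("?")
        return self._cache[query]

    def execute(self, prepared, params):
        # interleave pre-parsed fragments with the bound values
        out = prepared[0]
        for frag, value in zip(prepared[1:], params):
            out += str(value) + frag
        return out

s = Session()
stmt = s.prepare("SELECT * FROM users WHERE id = ?")
for user_id in (1, 2, 3):
    s.execute(stmt, (user_id,))
    s.prepare("SELECT * FROM users WHERE id = ?")  # cache hit, no re-parse

assert s.parse_count == 1
```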

11. Avoid IN clause queries with a large number of values

Using an IN clause query with many values across different partitions puts significant pressure on the coordinator node and degrades its performance. If the coordinator node fails to process the query because of the excessive load, you have to retry the entire thing.

Instead, use separate queries per partition: this avoids a single point of failure and spreads the load instead of concentrating it on one coordinator node.
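
Splitting the fan-out yourself is straightforward. The query strings here are illustrative; real code should bind values through prepared statements rather than interpolating them.

```python
user_ids = ["u1", "u2", "u3", "u4"]

# instead of: SELECT ... WHERE user_id IN ('u1','u2','u3','u4'),
# issue one small query per partition (real code would bind values
# with prepared statements, not string interpolation):
queries = [f"SELECT * FROM users WHERE user_id = '{uid}'" for uid in user_ids]

assert len(queries) == len(user_ids)
assert queries[0].endswith("user_id = 'u1'")
```

Each small query is routed to the right node independently, and a failure retries only that one query.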

12. Choose the leveled compaction strategy for read-heavy workloads

The leveled compaction strategy ensures that around 90% of reads are served from a single sorted strings table (SSTable), provided the rows are reasonably uniform. It suits read-latency-sensitive, read-heavy use cases, although it causes more compaction and requires more I/O while compacting.

It’s best to choose the leveled compaction strategy at table-creation time, because once the table is created it becomes tricky to change the approach later.

It can be changed later, but one mistake can overload a node with too much I/O.
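
Setting the strategy at creation time is a one-line table option. The schema below is illustrative, but LeveledCompactionStrategy is the real Cassandra class name:

```python
create_readings = """
CREATE TABLE readings (
    sensor_id uuid,
    ts        timestamp,
    value     double,
    PRIMARY KEY ((sensor_id), ts)
) WITH compaction = {'class': 'LeveledCompactionStrategy'};
"""

assert "LeveledCompactionStrategy" in create_readings
```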

13. Limit the number of tables

You need to limit the number of tables in the cluster to avoid excessive memory and heap pressure, which degrades performance. A number of tables far beyond a reasonable limit results in serious overhead.

It’s hard to pin down the exact right number of tables, but testing suggests keeping the count to around 200 and never crossing 500, beyond which failures start to appear.

14. Use local consistency levels

You should use local consistency levels in a multi-datacenter environment so that requests can be acknowledged without the latency of inter-datacenter communication.

Not every use case permits local consistency levels, but where the use case allows it, prefer them in multi-datacenter environments.
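
The latency saving comes from where the quorum is counted. A small sketch of the arithmetic, assuming a replication factor of 3 in each of two datacenters:

```python
def quorum(replicas: int) -> int:
    """Nodes that must acknowledge: a majority of the given replicas."""
    return replicas // 2 + 1

rf_per_dc = {"dc1": 3, "dc2": 3}

# QUORUM counts replicas across all DCs, so it may wait on a remote DC
assert quorum(sum(rf_per_dc.values())) == 4

# LOCAL_QUORUM counts only the local DC's replicas: 2 local acks suffice
assert quorum(rf_per_dc["dc1"]) == 2
```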

15. Avoid queue data models

You should avoid creating queue-like data models because they generate many tombstones. A slice query that has to scan through tombstones to find a match is suboptimal.

It increases heap pressure and latency, because the scan wades through garbage data to spot the small amount of live data that can actually be used.
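
A toy model shows why the queue pattern hurts: reads must wade through the tombstones left by deletes until compaction clears them. This is purely illustrative and in-memory only:

```python
TOMBSTONE = object()  # stand-in for a Cassandra tombstone marker

queue = [f"job{i}" for i in range(1000)]

# consuming 990 items: in Cassandra each delete writes a tombstone
for i in range(990):
    queue[i] = TOMBSTONE

# a "read the head of the queue" slice now scans past all 990 tombstones
scanned, head = 0, None
for cell in queue:
    scanned += 1
    if cell is not TOMBSTONE:
        head = cell
        break

assert head == "job990"
assert scanned == 991  # 990 dead cells skipped to find one live row
```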

Great effort! We have worked through the top 15 Apache Cassandra best practices.

Top 15 Cassandra Best Practices Checklist Conclusion

Now that you know the best checklist for handling a Cassandra cluster and bypassing common difficulties, it’s time to update your own checklist and add the missing points. These practices can help you increase your efficiency and achieve better results managing Cassandra.

Hitesh Jethva

I am a fan of open source technology and have more than 10 years of experience working with Linux and Open Source technologies. I am one of the Linux technical writers for Cloud Infrastructure Services.
