Cassandra Architecture with Diagram – Components of Cassandra
In this tutorial we will go through an overview of the Cassandra architecture and learn about its components. The Apache Cassandra architecture is designed to provide scalability, availability, and reliability when storing massive amounts of data.
What is Cassandra?
Apache Cassandra is a wide-column distributed database designed to store and handle big data across multiple nodes without any single point of failure. This architecture was chosen to protect your data from hardware failures, which can occur at any time: there is no guarantee that a given node will stay up. Cassandra therefore uses a distributed architecture to deliver continuous availability and performance.
If a node fails, read and write requests can be served by the other nodes in the cluster. In this guide, we will discuss the essential components of Cassandra, the replication factor, the replication strategies (Simple Strategy and Network topology strategy) and the write and read operations in detail.
In this part of the Cassandra Architecture with Diagram tutorial, it is vital to understand that Apache Cassandra is a fault-tolerant, highly scalable, reliable and secure database that makes data distribution easy. It comprises various components that play a key role in the smooth functioning of the distributed database. Here are Cassandra’s main components:
Node: A node is a single machine running Cassandra; it is where the data is stored.
Data center: A data center is a collection of related nodes.
Cassandra Cluster: A cluster is a collection of one or more data centers.
Commit log: Every write operation is appended to the commit log, which is used for crash recovery.
Mem table: The mem table is a memory-resident data structure. After data is written to the commit log, it is written to the mem table, where it stays temporarily.
SSTable: An SSTable is a disk file to which the data in the mem table is flushed once the mem table reaches a certain threshold.
Bloom filter: A Bloom filter is a fast, probabilistic data structure that tests whether an element is a member of a set. Cassandra consults it on reads to check whether an SSTable may contain the requested data. It can return false positives but never false negatives.
CQL Table: A CQL table is a collection of ordered columns fetched by row. Each table comprises columns and a primary key.
Gossip Protocol: Gossip is a peer-to-peer communication protocol that nodes use to discover each other and to share location and state information about the nodes in the cluster.
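To make the Bloom filter entry above more concrete, here is a minimal sketch of the idea in Python. It is a toy illustration, not Cassandra's implementation: the class name `BloomFilter`, the bit-array size, the number of hash functions and the sample keys are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical partition keys stored in one SSTable.
sstable_filter = BloomFilter()
for partition_key in ("user:1", "user:2", "user:3"):
    sstable_filter.add(partition_key)

print(sstable_filter.might_contain("user:2"))   # True: the key was added
print(sstable_filter.might_contain("user:99"))  # usually False; a false positive is possible
```

A negative answer lets a read skip an SSTable without touching the disk, which is why a filter of this kind is useful on the read path.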
Replication serves as a safeguard for situations in which a link goes down or a hardware problem occurs while data is being processed. Cassandra replicates data to ensure high availability and no single point of failure. It places copies of each data item on different nodes on the basis of these two factors:
The replication strategy determines where each replica is placed.
The replication factor determines the total number of replicas kept across the cluster for each piece of data.
For example, a replication factor of one means that only one copy of each row exists in the cluster, on a single node. A replication factor of three means that three copies of the data are kept on three different nodes.
The replication factor must be greater than one to avoid a single point of failure; three is a common choice. Each node communicates with the others using the Gossip protocol.
Have a look at the two types of replication strategies used in Cassandra.
1. Simple Strategy in Cassandra
The simple strategy is mostly used when you have only one data center. Under this strategy, the partitioner determines the node that holds the first replica. The remaining replicas are placed on the next nodes in the clockwise direction around the ring.
Look at the pictorial representation of the Simple Strategy in Cassandra.
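The clockwise placement rule can also be sketched in a few lines of Python. This is a simplified model under stated assumptions: the ring is a sorted list of `(token, node)` pairs, the tokens are small made-up integers, and the function name `simple_strategy_replicas` is hypothetical rather than part of any Cassandra API.

```python
import bisect


def simple_strategy_replicas(ring, key_token, replication_factor):
    """Return the nodes holding replicas for key_token.

    ring: list of (token, node_name) pairs sorted by token.
    """
    tokens = [token for token, _ in ring]
    # First replica: the first node whose token is >= the key's token,
    # wrapping around to the start of the ring if necessary.
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    count = min(replication_factor, len(ring))
    # Remaining replicas: the next nodes in clockwise order.
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical four-node ring with made-up tokens.
ring = [(0, "A"), (25, "B"), (50, "C"), (75, "D")]

print(simple_strategy_replicas(ring, 60, 3))  # ['D', 'A', 'B']
print(simple_strategy_replicas(ring, 80, 3))  # ['A', 'B', 'C'] (wraps around)
```

In a real cluster the token comes from hashing the partition key; here it is passed in directly to keep the example short.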
2. Network topology strategy in Cassandra
This strategy is used when you have two or more data centers. Under the Network topology strategy, the number of replicas is set separately for each data center, and replicas are placed by walking the ring in the clockwise direction.
Within a data center, the strategy walks the ring until it reaches the first node in another rack and places the next replica there. In this way, the Network topology strategy tries to place replicas on different racks in the same data center.
The purpose is to ensure that if a rack fails or a problem occurs, replicas on the other racks can still serve the data.
Look at the pictorial representation of the Network Topology strategy in Cassandra.
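The rack-aware placement described above can be approximated with the following sketch. It is a deliberately simplified model, not Cassandra's actual algorithm: the function name `network_topology_replicas`, the ring layout, and the node, data center and rack names are all hypothetical.

```python
import bisect


def network_topology_replicas(ring, key_token, replicas_per_dc):
    """Pick replicas per data center, preferring distinct racks.

    ring: list of (token, node, data_center, rack) tuples sorted by token.
    replicas_per_dc: mapping such as {"dc1": 2, "dc2": 2}.
    """
    tokens = [entry[0] for entry in ring]
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    # The whole ring in clockwise order, starting at the key's position.
    clockwise = [ring[(start + i) % len(ring)] for i in range(len(ring))]

    placement = {}
    for dc, wanted in replicas_per_dc.items():
        chosen, racks_used, skipped = [], set(), []
        for _token, node, node_dc, rack in clockwise:
            if len(chosen) == wanted:
                break
            if node_dc != dc:
                continue
            if rack in racks_used:
                skipped.append(node)  # same rack as an earlier replica
            else:
                chosen.append(node)
                racks_used.add(rack)
        # Fewer racks than replicas: fall back to the nodes skipped earlier.
        chosen.extend(skipped[: wanted - len(chosen)])
        placement[dc] = chosen
    return placement


# Hypothetical six-node ring spread over two data centers.
topology = [
    (0, "n1", "dc1", "r1"),
    (10, "n2", "dc1", "r1"),
    (20, "n3", "dc2", "r1"),
    (30, "n4", "dc1", "r2"),
    (40, "n5", "dc2", "r1"),
    (50, "n6", "dc2", "r2"),
]

print(network_topology_replicas(topology, 5, {"dc1": 2, "dc2": 2}))
# {'dc1': ['n2', 'n4'], 'dc2': ['n3', 'n6']}
```

In the example, n5 is passed over for dc2 because it shares rack r1 with n3, so the second replica lands on n6 in rack r2.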
Users access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the database (called a keyspace) as a container of tables. Programmers work with CQL either through cqlsh, an interactive command-line shell, or through drivers for their application language.
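As an illustration of how the two replication strategies are declared in CQL, the sketch below builds `CREATE KEYSPACE` statements as plain strings. The helper `create_keyspace_cql`, the keyspace names `shop` and `shop_global`, and the data center names `dc1` and `dc2` are assumptions made for the example; `SimpleStrategy` and `NetworkTopologyStrategy` are the class names CQL uses for the two strategies.

```python
def create_keyspace_cql(name, strategy, **options):
    """Build a CREATE KEYSPACE statement as a plain string."""
    settings = {"class": strategy, **options}
    body = ", ".join(f"'{key}': {value!r}" for key, value in settings.items())
    return f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = {{{body}}};"


# One data center: a single replication factor for the whole cluster.
print(create_keyspace_cql("shop", "SimpleStrategy", replication_factor=3))
# CREATE KEYSPACE IF NOT EXISTS shop WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

# Several data centers: one replica count per data center.
print(create_keyspace_cql("shop_global", "NetworkTopologyStrategy", dc1=3, dc2=2))
# CREATE KEYSPACE IF NOT EXISTS shop_global WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};
```

The printed statements can be pasted into cqlsh or sent through a language driver. The data center names must match the names the cluster itself reports.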
Cassandra Write Operation
The coordinator sends the write request to all replicas regardless of the consistency level. The consistency level decides how many replicas must acknowledge success, i.e., confirm that the data has been written to the commit log, before the write is reported as successful.
The commit log captures every write operation on a node, and the data is then stored in the mem table on a temporary basis. The commit log maintains a durable record of writes so that they can be replayed after a crash. When the mem table reaches its threshold, its data is flushed to an SSTable data file. Cassandra periodically consolidates SSTable files through compaction and discards data that is no longer needed.
An example: a write for a single row with a replication factor of three is sent to all three replicas. If the consistency level is ONE, the coordinator waits for only one replica to acknowledge success before responding to the client; the other two replicas still receive and apply the write.
Further, suppose the two remaining replicas lose the data due to node failure or another issue. Cassandra will then use its built-in repair mechanism to make the row consistent across the replicas.
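The role of the consistency level in the example above can be sketched as follows. This is a simplified, sequential model: a real coordinator contacts replicas in parallel and answers the client as soon as enough acknowledgments arrive. The function `coordinate_write` and the replica names are hypothetical.

```python
def coordinate_write(replicas, key, value, required_acks):
    """Send a write to every replica and report whether enough acknowledged.

    replicas: dict mapping node name to that node's data (a dict),
              or to None if the node is currently down.
    required_acks: number of acknowledgments the consistency level demands.
    """
    acks = 0
    for _name, store in replicas.items():
        if store is None:
            continue  # node is down; it will need repair later
        store[key] = value  # the replica applies the write
        acks += 1           # and acknowledges it to the coordinator
    return acks >= required_acks


# Replication factor three, with one replica temporarily unavailable.
replica_set = {"n1": {}, "n2": None, "n3": {}}

print(coordinate_write(replica_set, "row1", "v1", required_acks=1))  # True
print(coordinate_write(replica_set, "row1", "v2", required_acks=3))  # False
```

The write is sent to every reachable replica in both calls; only the number of acknowledgments needed for success changes.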
In short, this is how the write process takes place in Cassandra:
When a node receives a write request, it appends the write to the commit log, which maintains a record of writes for recovery.
In the next step, the data is written to the mem table, where it is held temporarily.
As soon as the mem table is full, all of its data is flushed to an SSTable data file.
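The three steps above can be modeled with a small in-memory sketch. It assumes a toy node whose "disk" is just a Python list, and a mem table that counts as full after a fixed number of keys; the class name `ToyNode` and the `memtable_limit` parameter are illustrative and not Cassandra settings.

```python
class ToyNode:
    """Toy model of the write path: commit log, mem table, SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []  # append-only record used for crash recovery
        self.memtable = {}    # in-memory copy of recent writes
        self.sstables = []    # flushed files, modeled here as dicts
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # step 1: log the write
        self.memtable[key] = value            # step 2: write to the mem table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # step 3: mem table is full

    def flush(self):
        # Write the mem table out in sorted key order, then start a fresh one.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replaying


node = ToyNode(memtable_limit=2)
for key, value in (("b", 2), ("a", 1), ("c", 3)):
    node.write(key, value)

print(node.sstables)    # [{'a': 1, 'b': 2}]
print(node.memtable)    # {'c': 3}
print(node.commit_log)  # [('c', 3)]
```

The first two writes fill the mem table and trigger a flush, so only the third write is still in memory and in the commit log.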
Read Operation in Cassandra
For read operations, Cassandra first checks the mem table for the requested values. It also uses the bloom filter to find the SSTables that may hold the required information. There are three kinds of read requests in Cassandra sent by the coordinator to the replicas:
Direct request.
Digest request.
Read repair request.
The direct request is sent by the coordinator to one of the replicas. The digest request is then sent to as many other replicas as the consistency level requires.
Once the digest requests have been sent, the coordinator checks whether the values returned by the replicas agree or whether some replica holds an out-of-date value. If a value is out of date, a read repair request updates that replica in the background. This whole process is referred to as the read repair mechanism.
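The interplay of the three request types can be sketched as follows. The model makes several simplifying assumptions: every replica is contacted, each stored value carries a timestamp, the newest timestamp wins, and the repair runs inline rather than in the background. The names `value_digest` and `coordinate_read` are hypothetical.

```python
import hashlib


def value_digest(value):
    # A digest request returns only a hash of the data, not the data itself.
    return hashlib.md5(repr(value).encode()).hexdigest()


def coordinate_read(replicas, key):
    """replicas: dict mapping node name to {key: (timestamp, value)}."""
    names = list(replicas)
    # Direct request: fetch the full data from one replica.
    data = replicas[names[0]].get(key)
    # Digest requests: compare hashes reported by the other replicas.
    mismatch = any(
        value_digest(replicas[name].get(key)) != value_digest(data)
        for name in names[1:]
    )
    if mismatch:
        # Fetch the full data everywhere and keep the newest version.
        candidates = [replicas[name][key] for name in names if key in replicas[name]]
        data = max(candidates, key=lambda versioned: versioned[0])
        # Read repair: bring every replica up to date.
        for name in names:
            replicas[name][key] = data
    return data[1] if data else None


stale_set = {
    "n1": {"row1": (1, "old")},
    "n2": {"row1": (2, "new")},
    "n3": {"row1": (1, "old")},
}

print(coordinate_read(stale_set, "row1"))  # new
print(stale_set["n1"]["row1"])             # (2, 'new')
```

After the read, all three replicas hold the newest version, which is the effect the read repair mechanism aims for.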
Cassandra Architecture with Diagram – Components of Cassandra Conclusion
Apache Cassandra is an open source distributed database that helps store and manage large data volumes across different data centers. It has a peer-to-peer architecture that lets programmers write large volumes of data quickly without affecting read efficiency. In this article we have explained Cassandra’s internal architecture and how it handles replication, writes and reads.
Further, we have discussed how Cassandra applies the consistency level throughout the process. We have also described how an authorized user can connect to the nodes of a data center and access data with the help of CQL. The node, data center, cluster, commit log, mem table, SSTable, Bloom filter, CQL table and Gossip protocol are the main components of Cassandra.
Cassandra offers tunable consistency and helps eliminate downtime and data loss with features such as no single point of failure, high availability, linear scalability, decentralized deployments, audit logging, and atomic, isolated and durable writes at the partition level. In Cassandra’s architecture, a replication factor of one means that only a single copy of the data exists in the cluster, while a replication factor of three means three copies of the data on three different nodes.
Further, we have highlighted the two replication strategies in Cassandra: Simple Strategy (used for one data center) and Network topology strategy (used for two or more data centers). We have also given a brief description of how the read and write operations work. Cassandra is highly suitable for real-time and big data workloads, online gaming, time series data management, media streaming management and real-time data analytics.
I am a fan of open source technology and have more than 10 years of experience working with Linux and Open Source technologies. I am one of the Linux technical writers for Cloud Infrastructure Services.