Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. The docker image used for this tutorial is cassandra.
Key components of Cassandra are used in this tutorial:
The tutorial uses containerized computing environments. We test with Linux-based environments.
This instructions assume that you have docker and docker compose installed in your machine. Using docker pull we will get the cassandra image and use it.
Inside the cassandra docker image, you can find cqlsh. You can install a version of cqlsh in the local machine using the provided pyproject.toml with rye:
$rye sync
$source source .venv/bin/activate
In this example, we use the OSF water open dataset. You can download that dataset and use any subset of the dataset for testing.
There is a very good set of exercises for basic data tasks. You can use it to practice your work.
We basically create a Cassandra system different nodes (machines):
- cluster name: tutorials
- Node 1: cassandra1 in data center "DC1"
- Node 2: cassandra2 in data center "DC1"
- Node 3: cassandra3 in data center "DC2"
The configuration of the cluster is given in the apache-compose.yml file.
- Start the containers by running
$ docker-compose -f apache-compose.yml up - To see the containers running and get their names, open another terminal and run
$ docker-compose ps
The docker names should be basiccassandra-cassandra1-1,basiccassandra-cassandra2-1 and basiccassandra-cassandra3-1
Having obtained the container names from step 2 of the previous section, let us use them to enter into our Cassandra nodes.
- For each container, log into the cassandra node
$ docker ps $ docker exec -it <container_name> /bin/bash - Let's inspect our cluster using nodetool status. Use nodetool to check Cassandra nodes in different data centers.
$ nodetool status
which shows some information like:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.23.0.4 119.75 KiB 16 59.0% 6164ee92-17ce-40f8-9c0f-4ec5f9f434bf rack1
UN 172.23.0.2 119.82 KiB 16 71.3% 2db8b1b7-8807-4dea-94ae-e4380470f026 rack1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.23.0.3 114.69 KiB 16 69.7% 63e34913-dec5-43f7-a928-260b1076ea86 rack1
- In order to interact with our database, let us enter the cqlsh shell command
$ cqlsh
Notes:
In this simple test: the cassandra system can be accessed without username/password You can also install a separate cqlsh and use the cqlsh outside the container, such as install python package for cqlsh:
pip install cqlsh(or as mentioned before), and runningcqlsh [host] -u [username] -p [password].
Let us assume that we use the water data as input data and we want to manage the data.
Some examples of commands for cqlsh is avaiable under sample water scripts
- Create a keyspace called tutorial12345 with a replication factor of 3
cqlsh>CREATE KEYSPACE tutorial12345 WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
Here you should see the replication_factor set to 3. It means that given a data item, a row in a table, the data item will be replicated into 3 nodes. Consider our test with a system of three nodes, it means a data item will be available at all nodes.
-
Let's create a table named water1234 inside this keyspace
cqlsh>CREATE TABLE tutorial12345.water1234 ( timestamp timestamp, -- the time of data record city text, zip text, egridregion text, temperaturef int, humidity int, data_availability_weather int, wetbulbtemperaturef float, coal float, hybrid float, naturalgas float, nuclear float, other float, petroleum float, solar float, wind float, data_availability_energy float, onsitewuefixedapproach float, onsitewuefixedcoldwater float, offsitewue float, PRIMARY KEY ((city,zip), timestamp)); -
We now need to insert some values into our database
cqlsh>INSERT INTO tutorial12345.water1234
(timestamp, city, zip, egridregion, temperaturef, humidity, data_availability_weather, wetbulbtemperaturef, coal, hybrid, naturalgas,nuclear, other, petroleum, solar, wind, data_availability_energy,onsitewuefixedapproach, onsitewuefixedcoldwater,offsitewue) values(
'2019-01-01 00:00:00','Austin','78704','ERCT',42,76,1,37.9783997397202,51725,3010,36030,17927,1086,0,205,33383,1,1.34616023459296,0,1.6258769164237);
You can also load the csv file sample as:
cqlsh>COPY tutorial12345.water1234
(timestamp, city, zip, egridregion, temperaturef, humidity, data_availability_weather, wetbulbtemperaturef, coal, hybrid, naturalgas,nuclear, other, petroleum, solar, wind, data_availability_energy,onsitewuefixedapproach, onsitewuefixedcoldwater,offsitewue)
FROM 'datasamples/water_dataset_v_05.14.24_1000.csv'
WITH HEADER = true AND DELIMITER = ',';
- We can the retrieve the data from the database by sending the query
cqlsh> select * from tutorial12345.water1234 ;
You should get something like:

Note that usually we dont do "select *". We use this as we put only a few record of data as an example.
- Let's do the same query on another node. The result of the query should be as shown in the previous step.
- To simulate node failures or outages. Let us stop two nodes. Run the following command twice each time giving the name of the node that you want to terminate:
$ docker stop <container_name> - Run the following command to get the name of the running container
$ docker-compose ps - Try to query again and see the result/error.
Remember that we have replication_factor==3 so a data item is replicated in 3 nodes. This shows that the data was correctly replicated across all our nodes and configuration was correct. Apache Cassandra has a lot of different configurations that were not covered in this tutorial and these can be found in cassandra's documentation.
- Study the replication mechanism of Cassandra
- Study the data sharding
- Learn how to scale up and down nodes of Cassandra
- Find a way to deploy in different machines so that you can test the network and performance
- Work on data storage for Cassandra
Further note: If you want to use the data set Avian Vocalizations from CA & NV, USA, you can use statements in (etc/sample_bird.cqlsh). Note that, we only use the metadata from the CSV file. Furthermore, we extract only a few fields. A sample of extracted data is here.
You can check Scylla which is another Cassandra-compatible implementation.
Note: We have a simple docker compose for exercises but it is not up-to-date.
- Strasdosky Otewa
- Linh Truong