Introduction to the data model of Cassandra

My last post was about setting up Cassandra. This article discusses Cassandra's data model and its objects. In essence, Cassandra is a hybrid between a key-value and a column-oriented NoSQL database. The key-value nature is represented by the row object, whose value is generally organized in columns. In short, Cassandra knows the following objects:

Keyspace can be seen as a DB schema in SQL.
Column family resembles a table in the SQL world (as explained below, this analogy is misleading).
Row has a key and, as its value, a set of Cassandra columns, but without the corset of a relational schema.
Column is a triplet := (name, value, timestamp).
Super column is a tuple := (name, collection of columns).
Data Types: Validators and Comparators
Indexes

Keyspace

Keyspaces are easy to understand: they are the top-level container for all other objects. Every model begins with a keyspace.

Rows and Columns

Cassandra organizes data in columns, and columns are grouped into rows. Rows in turn are collected in an object called a column family.

A similarity to SQL tables is noticeable here. Looking at the columns, we see that each of them carries an implicit, externally supplied timestamp (“ts”). We also see that rows in the same column family are under no obligation to share the same set of columns or column types. Nor does a column have to carry a value; it can consist of just a name (and a timestamp). Cassandra also allows additional per-column settings such as a TTL, but these are not essential for understanding the model in general.
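To make this more concrete, here is a minimal Python sketch of the row and column objects described above. It is only a mental model, not how Cassandra stores data on disk. The names Column, now_micros and users are made up for illustration, and the choice of microseconds for the timestamp is an assumption of the sketch.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    """Illustrative column: a (name, value, timestamp) triplet plus optional TTL."""
    name: str
    value: bytes              # may be empty: a column can be just a name
    timestamp: int            # supplied from outside, here in microseconds (assumption)
    ttl: Optional[int] = None  # optional time to live in seconds


def now_micros() -> int:
    """Hypothetical helper that produces a client-side timestamp."""
    return int(time.time() * 1_000_000)


# A column family pictured as: row key -> {column name -> Column}.
# The two rows deliberately carry different sets of columns.
users = {
    "alice": {
        "email": Column("email", b"alice@example.org", now_micros()),
        "city": Column("city", b"Hamburg", now_micros()),
    },
    "bob": {
        "email": Column("email", b"bob@example.org", now_micros()),
        # A valueless column with a TTL: only name and timestamp matter.
        "logged_in": Column("logged_in", b"", now_micros(), ttl=3600),
    },
}

for row_key, columns in users.items():
    print(row_key, sorted(columns))
```

Note that nothing forces "alice" and "bob" to share the same columns, which is exactly where the SQL table analogy starts to break down.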

Super Column

As we see, a super column is a group of simple columns under one single name. Such nesting provides an additional level of abstraction and access, but it also adds unnecessary complexity.

Hence super columns are no longer favoured. Nowadays it is recommended to work with the C* data model through CQL and to use composite keys instead of super columns (more on this in the next tutorial).
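To get a feeling for why composite keys can replace super columns, the following Python sketch flattens a nested super column layout into a single level whose column names are (super column name, column name) pairs. This is only an illustration of the idea and not CQL; the helper names flatten and slice_prefix as well as the sample data are hypothetical.

```python
# Super column layout of one row: super column name -> column name -> value.
super_row = {
    "home": {"street": "Main St 1", "city": "Berlin"},
    "work": {"street": "Office Rd 9", "city": "Munich"},
}


def flatten(super_columns):
    """Turn {super: {column: value}} into {(super, column): value}."""
    return {
        (super_name, column_name): value
        for super_name, columns in super_columns.items()
        for column_name, value in columns.items()
    }


def slice_prefix(row, prefix):
    """Return all columns whose composite name starts with prefix, in sorted order."""
    return [(name, row[name]) for name in sorted(row) if name[0] == prefix]


flat_row = flatten(super_row)

# Everything that used to live in the "home" super column is still
# reachable as one contiguous group of composite column names.
print(slice_prefix(flat_row, "home"))
```

The grouping that the super column provided survives as the first component of the composite name, without the extra nesting level.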

Column families

As a typical NoSQL database, Cassandra does not enforce relationships between column families the way that relational databases do between tables. Therefore Apache Cassandra has no notion of foreign keys. Each column family has a self-contained set of columns that are intended to be accessed together to satisfy the queries of your application. In addition, there is no rigid schema, so don't think of a column family as some sort of relational table. It is better to think of it as a structure like

Map<RowKey, SortedMap<ColumnKey, ColumnValue>>

and, in the case of a super column family, as:

Map<RowKey, SortedMap<SuperColumnKey, SortedMap<ColumnKey, ColumnValue>>>
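The nested map view can be tried out in a few lines of Python. Python has no built-in sorted map, so this sketch keeps the column keys ordered with the bisect module. The class SortedRow, the function insert and the sensor data are assumptions made for illustration and do not correspond to any Cassandra API.

```python
import bisect


class SortedRow:
    """Columns of one row, kept ordered by column key (stand-in for a SortedMap)."""

    def __init__(self):
        self._keys = []     # column keys in sorted order
        self._values = {}   # column key -> column value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def slice(self, start, end):
        """Return the columns with start <= key <= end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


# The outer map: row key -> sorted columns of that row.
column_family = {}


def insert(row_key, column_key, value):
    column_family.setdefault(row_key, SortedRow()).put(column_key, value)


# Columns arrive in arbitrary order but are read back sorted.
insert("sensor-1", "2013-01-03", 20.5)
insert("sensor-1", "2013-01-01", 19.0)
insert("sensor-1", "2013-01-02", 21.3)
insert("sensor-2", "2013-01-01", 7.4)

first_two_days = column_family["sensor-1"].slice("2013-01-01", "2013-01-02")
print(first_two_days)
```

Because the inner map is sorted, reading a contiguous range of columns within one row is cheap, whereas the outer map only offers lookup by row key.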

Data Types

And of course there are predefined data types in Cassandra. Depending on where a type is used, it goes by a different name:

The data type of a row key or a column value is called a validator.
The data type of a column name is called a comparator.

You can assign predefined data types when you create your column family (which is recommended), but Cassandra does not require it. Internally Cassandra stores column names and values as byte arrays (BytesType), which clients display as hex. This is the default encoding used when no data types are defined for the column family.
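Why does the type of a column name get its own term? Because the comparator also decides in which order the columns of a row are sorted, while a validator checks that a value fits its declared type. The Python sketch below illustrates both roles. It is a simplification: the numbers are encoded as ASCII digits for readability, which is an assumption of this example and not Cassandra's actual integer encoding, and validate_int is a hypothetical helper.

```python
# Three column names, stored as raw bytes.
column_names = [b"10", b"9", b"100"]

# A BytesType-like comparator orders names byte by byte.
bytes_order = sorted(column_names)

# An integer-like comparator interprets the same names as numbers.
int_order = sorted(column_names, key=lambda raw: int(raw.decode("ascii")))


def validate_int(raw: bytes) -> bool:
    """Validator sketch: accept the value only if it parses as an integer."""
    try:
        int(raw.decode("ascii"))
        return True
    except (UnicodeDecodeError, ValueError):
        return False


print(bytes_order)
print(int_order)
print(validate_int(b"42"), validate_int(b"abc"))
```

The same three names come back in a different order depending on the comparator, which matters as soon as you read ranges of columns from a row.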

The following table shows the built-in Cassandra types:

Type        Description
ascii       US-ASCII character string
bigint      64-bit signed long
blob        Arbitrary bytes (no validation), expressed as hexadecimal
boolean     true or false
counter     Distributed counter value (64-bit long)
decimal     Variable-precision decimal
double      64-bit IEEE-754 floating point
float       32-bit IEEE-754 floating point
inet        IP address string in IPv4 or IPv6 format
int         32-bit signed integer
list        A collection of one or more ordered elements
map         A JSON-style array of literals: { literal : literal, literal : literal … }
set         A collection of one or more elements
text        UTF-8 encoded string
timestamp   Date plus time, encoded as 8 bytes since epoch
uuid        A UUID in standard UUID format
timeuuid    Type 1 UUID only (CQL 3)
varchar     UTF-8 encoded string
varint      Arbitrary-precision integer

Indexes

Understanding indexes is essential for working with Cassandra. There are two kinds of them.

The primary index of a column family is the index of its row keys. Each node maintains this index for the data it manages.
Secondary indexes in Cassandra are indexes on column values. Cassandra implements a secondary index as a hidden column family.

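Rows are distributed across the cluster by their row key, and the primary index lets each node locate the rows it holds. Secondary indexes are very important for queries on anything other than the row key. Cassandra's native secondary index behaves like a hash index: it works well for equality lookups but is limited when it comes to range queries.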
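The difference between the two kinds of index, and the range query limitation, can be pictured with a small Python sketch. It models the secondary index as a second map from column value to row keys, in the spirit of the hidden column family mentioned above. This is a toy model under that assumption and not Cassandra's implementation; build_secondary_index, range_lookup and the sample rows are hypothetical.

```python
from collections import defaultdict

# The column family itself doubles as the primary index: row key -> columns.
users = {
    "alice": {"city": "Berlin", "age": 31},
    "bob": {"city": "Hamburg", "age": 45},
    "carol": {"city": "Berlin", "age": 27},
}


def build_secondary_index(column_family, column):
    """Map each value of the given column to the set of row keys holding it."""
    index = defaultdict(set)
    for row_key, columns in column_family.items():
        if column in columns:
            index[columns[column]].add(row_key)
    return dict(index)


def range_lookup(index, low, high):
    """A hash-like index has no order, so a range must inspect every entry."""
    return sorted(
        row_key
        for value, row_keys in index.items()
        if low <= value <= high
        for row_key in row_keys
    )


city_index = build_secondary_index(users, "city")
age_index = build_secondary_index(users, "age")

# Equality lookup: a single probe into the index.
berliners = sorted(city_index.get("Berlin", set()))

# Range lookup: falls back to scanning all indexed values.
thirty_to_fifty = range_lookup(age_index, 30, 50)

print(berliners, thirty_to_fifty)
```

The equality lookup touches one index entry, while the range lookup has to walk the whole index, which is the kind of limitation to keep in mind when designing queries.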

Let me know if you would like to read more on the topic.