In this post, i will try the analogy between how Cassandra provides durability to data and how it compares to well-know (at least for me) Oracle “method” of data persistence.
So Cassandra uses something called commitlog, where data that is inserted, updated or deleted is appended.
In that way Cassandra provide us the “D”, of Durability. As the operation of a write is always an “append”, there is no seek time (or is minimal) and the sequential write is much more efficient.
Also, there is a parameter called commitlog_sync that defaults to “periodic“, that will sync commitlog with disk every commitlog_sync_period_in_ms (1000ms of default in 2.x).There is also other modes of commitlog_sync, mainly “batch” that will tell Cassandra to group the “mutations” into batches and then sync it.
Does this commitlog job sounds familiar to you? Oracle redo log data is first written in a buffer memory (like commitlog) right before it is flushed to the disk. LGWR is also responsible for writting the commit records to the log file, also it also groups the “commits” (when needed) and flush it to the redo log files in “batch”.
LGWR doesn’t have a single parameter like “commitlog_sync_period_in_ms” or “commitlog_batch_window_in_ms” to control when flushing from redo buffer to disk happens, it has a bunch of conditions when to flush the contents to redo files (every 3 seconds, redo buffer is 1/3, before DBWR writes out a dirty block,etc).
As Cassandra on commitlog and Oracle on redo log files, both are used in case of disaster or node restart need to be replayed to unsure that recovery is done.
The little difference here is subtle: If you are really unlucky and all your Cassandra replicas crash during the “period_in_ms” or “windows_in_ms” (depending of your commitlog_sync parameter), your data is lost, unless you set is to 0.
That doesn’t happen (by design) in Oracle and LGWR. No matter what you do in the database, when it crashes you recovery everything that showed to you to “committed”.
But for Cassandra this is only part of the game, the other part of the game is something called memtable that is basically a structure in memory with the content stored in key-value pair and sorted by key. memtable also stores every single write until it reaches a specific threshold and then it is flushed to immutable and ordered disk structure called SSTables.
So the question is, how Cassandra knows what data is actually persistent as SSTable and in case of node disaster what to “replay”? The answer is simple: As soon the memtable is flushed to a SSTable the data in commitlog is purged!
Here, there is no archivelogs – data is purged as soon as it gets persistent in the nodes.
Update: As a few readers pointed out, after Cassandra 2.0 (my fault to not verify newer versions), an archive of commit log is possible when the node starts,when the commit log is written to disk or even at a specific point in time.
Comparing to Oracle, the concept is almost the same (except that Oracle and LGWR don’t necessarily archive at startup of the instance), but it is also used for “backup” and “recovery” supporting PIT restores.
Hope Cassandra is now more clear to you, as a Oracle DBA.
PS: Thanks isinghora and Ishinder Singh for point me out my error.