Cassandra TWCS must have TTLs

The Last Pickle did a great blog post on TWCS a little while ago, explaining how Time Window Compaction is great for certain time series data.

To recap, TWCS is suitable when the following hold:

Data should only be inserted and not updated afterwards
Data must have a TTL attached
Data shouldn’t be explicitly deleted; it should only expire via its TTL

These are only recommendations. Cassandra will let you break them, but if you do, disk usage will keep growing because expired data is never removed. In this post I will explain the problems that occur if the data does not have a TTL.

As the name suggests, TWCS works with time windows. Once a time window has passed, all the sstables created in that window are compacted together into a single sstable holding the data written in that window. Once this happens, the default behaviour is for that sstable never to be compacted again.
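
To make the windowing concrete, here is a minimal Python sketch of how sstables might be bucketed into windows. It is only a model: the names `window_start`, `group_by_window` and `WINDOW_SECONDS` are made up for illustration, and it assumes each sstable is assigned to a window by its newest write time.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 60  # hypothetical: a 1 minute time window


def window_start(ts: datetime, window_seconds: int = WINDOW_SECONDS) -> datetime:
    """Round a timestamp down to the start of the window it falls in."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % window_seconds, tz=timezone.utc)


def group_by_window(sstables):
    """Group (name, newest_write_time) pairs by time window (assumed rule)."""
    windows = defaultdict(list)
    for name, newest in sstables:
        windows[window_start(newest)].append(name)
    return dict(windows)


# Three illustrative flushed sstables: two in the same minute, one in the next.
flushed = [
    ("a", datetime(2019, 6, 3, 13, 34, 5, tzinfo=timezone.utc)),
    ("b", datetime(2019, 6, 3, 13, 34, 50, tzinfo=timezone.utc)),
    ("c", datetime(2019, 6, 3, 13, 35, 10, tzinfo=timezone.utc)),
]

for start, names in sorted(group_by_window(flushed).items()):
    print(start.isoformat(), names)
# 2019-06-03T13:34:00+00:00 ['a', 'b']
# 2019-06-03T13:35:00+00:00 ['c']
```

In this model, "a" and "b" would be compacted together once the 13:34 window has passed, while "c" belongs to the next window.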

Therefore the only way for the data to be removed from disk now is for the whole sstable to be deleted. Cassandra does this once all the data in the sstable has expired via its TTL and gc_grace_seconds has also passed.

So, as long as all the data within an sstable has a TTL, the sstable will eventually be deleted, freeing up the disk space. However, if there is a record within the sstable without a TTL, then that record never expires, and the sstable can never be deleted.
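
The rule can be sketched as a small Python check. This is a simplified model for illustration, not Cassandra's actual code; the `Record` class and `fully_expired` function are hypothetical names, and times are plain seconds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    written_at: int            # write time in seconds
    ttl: Optional[int] = None  # None models a record written without a TTL

    def expires_at(self) -> Optional[int]:
        return None if self.ttl is None else self.written_at + self.ttl


def fully_expired(records: List[Record], now: int, gc_grace_seconds: int) -> bool:
    """True only if every record has expired and gc_grace_seconds has also passed."""
    for record in records:
        expiry = record.expires_at()
        if expiry is None or expiry + gc_grace_seconds >= now:
            return False
    return True


with_ttls = [Record(0, 300), Record(10, 300)]
with_immortal = [Record(0, 300), Record(10, None)]

print(fully_expired(with_ttls, now=400, gc_grace_seconds=60))          # True
print(fully_expired(with_immortal, now=10**9, gc_grace_seconds=60))    # False
```

However far `now` advances, the second sstable never qualifies, because one record has no expiry time at all.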

The obvious answer to this is to delete the record, but this doesn’t work. An sstable is immutable once written to disk, so the delete will create a tombstone, but this tombstone will be in a newly created sstable for the current Time Window, rather than in the original sstable. Therefore the original sstable stays the same and cannot be compacted.

So there is no way for the sstable, with all the expired data in it, to be deleted.

But it is actually a lot worse than this. Cassandra will only drop a fully expired sstable if it is the oldest sstable, so once one cannot be deleted, none of the subsequent ones can be deleted either. Suddenly none of the TTL’d data is ever removed from disk.

At first glance this is, at best, counterintuitive. However, there is a good reason for it. Consider the original record without a TTL that caused the problem. If a later upsert changes some of its values and also adds a TTL, the new version goes to a new sstable with a TTL attached. At some point all the records in that sstable will have expired, and the original record will not be shown in CQL because it has been overwritten by the now expired record.

This sstable would then become, in normal circumstances, ready for deletion. But if this happens, then the original record without the TTL would come alive again, causing incorrect data to be displayed.
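
Here is a toy Python model of that resurrection, assuming reads simply pick the newest version of a key across all sstables. The `read` function and the dictionaries standing in for sstables are illustrative only.

```python
from typing import Dict, List, Optional, Tuple

# Each toy sstable maps a key to (write_timestamp, value, expires_at).
# expires_at of None models a record written without a TTL.
Cell = Tuple[int, str, Optional[int]]


def read(key: str, sstables: List[Dict[str, Cell]], now: int) -> Optional[str]:
    """Return the newest version of key, or None if it is absent or expired."""
    newest: Optional[Cell] = None
    for sstable in sstables:
        cell = sstable.get(key)
        if cell is not None and (newest is None or cell[0] > newest[0]):
            newest = cell
    if newest is None:
        return None
    _, value, expires_at = newest
    if expires_at is not None and expires_at <= now:
        return None  # the newest version has expired, so nothing is shown
    return value


old = {"k": (100, "original", None)}  # written without a TTL
new = {"k": (200, "updated", 500)}    # later upsert with a TTL

print(read("k", [old, new], now=300))  # updated
print(read("k", [old, new], now=600))  # None
print(read("k", [old], now=600))       # original
```

The last line is the problem case: once the newer sstable is dropped, nothing shadows the old record any more and it comes back.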

So let’s look at this in more detail:

CREATE TABLE twcs (
    id int,
    value int,
    when timeuuid,
    PRIMARY KEY (id, value)
) WITH CLUSTERING ORDER BY (value ASC)
    AND bloom_filter_fp_chance = 0.01
    AND comment = ''
    AND gc_grace_seconds = 60
    AND default_time_to_live = 300
    AND compaction = {'compaction_window_size': '1',
        'compaction_window_unit': 'MINUTES',
        'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy'};

If we create this twcs table, with TWCS and a compaction window of 1 minute, it has a default TTL of 5 minutes and a gc_grace_seconds of 1 minute. A record therefore disappears from query results after 5 minutes and becomes eligible for removal from disk a further minute later.
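
The timings work out as follows. This is just the arithmetic in Python, using an illustrative write time of 13:34:50 UTC; the constant names are made up and mirror default_time_to_live and gc_grace_seconds from the table definition.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TTL = timedelta(seconds=300)  # default_time_to_live = 300
GC_GRACE = timedelta(seconds=60)      # gc_grace_seconds = 60

# Illustrative write time for a single record.
written = datetime(2019, 6, 3, 13, 34, 50, tzinfo=timezone.utc)

expires_at = written + DEFAULT_TTL       # no longer returned by queries
droppable_after = expires_at + GC_GRACE  # eligible for removal from disk

print(expires_at.isoformat())       # 2019-06-03T13:39:50+00:00
print(droppable_after.isoformat())  # 2019-06-03T13:40:50+00:00
```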

So let’s insert a record and look at the result including the TTL information:

insert into twcs (id, value, when) values (1, 1, now());

select id, value, dateof(when), ttl(when) from twcs;

 id | value | system.dateof(when)             | ttl(when)
----+-------+---------------------------------+-----------
  1 |     1 | 2019-06-03 13:34:50.635000+0000 |       287

So this shows us the record has been inserted and has a TTL, counting down from 300.

If we run nodetool flush and then look at the data directory, we can see a single sstable, and sstabledump shows the record inside it.

ls -l *Data*
-rw-r--r-- 1 cassandra cassandra 53 Jun  3 15:36 md-1-big-Data.db

sstabledump md-1-big-Data.db
[
  {
    "partition" : {
      "key" : [ "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 46,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2019-06-03T13:34:50.612608Z", "ttl" : 300, "expires_at" : "2019-06-03T13:39:50Z", "expired" : false },
        "cells" : [
          { "name" : "when", "value" : "5ccb9db0-8604-11e9-8ac8-2943194aee43" }
        ]
      }
    ]
  }
]

When the TTL has expired, the record is no longer returned by queries, and the sstable shows it as expired:

select id, value, dateof(when), ttl(when) from twcs;

 id | value | system.dateof(when) | ttl(when)
----+-------+---------------------+-----------

(0 rows)

sstabledump md-1-big-Data.db
[
  {
    "partition" : { "key" : [ "1" ], "position" : 0 },
    "rows" : [
      {
        "type" : "row",
        "position" : 46,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2019-06-03T13:34:50.612608Z", "ttl" : 300, "expires_at" : "2019-06-03T13:39:50Z", "expired" : true },
        "cells" : [
          { "name" : "when", "value" : "5ccb9db0-8604-11e9-8ac8-2943194aee43" }
        ]
      }
    ]
  }
]

After a further minute the record becomes eligible for removal, and because it is the only record in the sstable, the whole sstable is deleted:

sstabledump md-1-big-Data.db

Cannot find file /var/lib/cassandra/data/keyspace1/twcs-.../md-1-big-Data.db

So let’s try this again, with a bit more data inserted and the third record inserted without a TTL (for example by writing it with USING TTL 0, which overrides the table default):

select id, value, dateof(when), ttl(when) from twcs;

 id | value | system.dateof(when)             | ttl(when)
----+-------+---------------------------------+-----------
  1 |     2 | 2019-06-03 13:55:48.219000+0000 |        84
  1 |     3 | 2019-06-03 13:56:50.162000+0000 |       146
  1 |     4 | 2019-06-03 13:57:41.793000+0000 |      null
  1 |     5 | 2019-06-03 13:58:30.482000+0000 |       246
  1 |     6 | 2019-06-03 13:59:20.132000+0000 |       296

ls -l *Data*
-rw-r--r-- 1 cassandra cassandra 53 Jun  3 15:56 md-1-big-Data.db
-rw-r--r-- 1 cassandra cassandra 53 Jun  3 15:57 md-2-big-Data.db
-rw-r--r-- 1 cassandra cassandra 51 Jun  3 15:58 md-3-big-Data.db
-rw-r--r-- 1 cassandra cassandra 53 Jun  3 15:59 md-4-big-Data.db
-rw-r--r-- 1 cassandra cassandra 53 Jun  3 16:00 md-5-big-Data.db

md-3-big-Data.db contains the record without a TTL. Slowly the data with TTLs disappears, leaving only the immortal record. The first two sstables can be deleted, but md-3 blocks everything after it:

ls -l *Data*
-rw-r--r-- 1 cassandra cassandra 51 Jun  3 15:58 md-3-big-Data.db
-rw-r--r-- 1 cassandra cassandra 53 Jun  3 15:59 md-4-big-Data.db
-rw-r--r-- 1 cassandra cassandra 53 Jun  3 16:00 md-5-big-Data.db
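
The listing above can be reproduced with a small Python model of the rule. It is a simplified sketch, not Cassandra's real implementation: it assumes each of the five rows sits in its own sstable (md-1 to md-5), that a fully expired sstable is only dropped if all of its data is older than everything in the sstables that are still live, and the names `SSTable`, `droppable` and `t` are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


def t(hours: int, minutes: int, seconds: int) -> int:
    """Seconds since midnight, to keep the example timestamps readable."""
    return hours * 3600 + minutes * 60 + seconds


@dataclass
class SSTable:
    name: str
    min_ts: int                # oldest write timestamp in the sstable
    max_ts: int                # newest write timestamp in the sstable
    max_expiry: Optional[int]  # None if any record has no TTL


def droppable(sstables: List[SSTable], now: int, gc_grace: int) -> List[str]:
    """Names of sstables this simplified model would allow to be deleted."""
    def expired(s: SSTable) -> bool:
        return s.max_expiry is not None and s.max_expiry + gc_grace < now

    live = [s for s in sstables if not expired(s)]
    oldest_live = min((s.min_ts for s in live), default=None)
    return [
        s.name for s in sstables
        if expired(s) and (oldest_live is None or s.max_ts < oldest_live)
    ]


TTL, GC_GRACE = 300, 60
writes = [("md-1", t(13, 55, 48), TTL), ("md-2", t(13, 56, 50), TTL),
          ("md-3", t(13, 57, 41), None),  # the record without a TTL
          ("md-4", t(13, 58, 30), TTL), ("md-5", t(13, 59, 20), TTL)]
tables = [SSTable(n, ts, ts, None if ttl is None else ts + ttl)
          for n, ts, ttl in writes]

print(droppable(tables, now=t(14, 30, 0), gc_grace=GC_GRACE))  # ['md-1', 'md-2']
```

Half an hour later every TTL’d record has long expired, yet only md-1 and md-2 qualify, because md-3 still holds live data that is older than md-4 and md-5.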

Even if you “fix” the broken record by upserting a TTL, the new write lands in a new sstable (md-6) and the original md-3 is never touched. Eventually the visible record expires but the sstables remain on disk permanently.

In conclusion: every single record must have a TTL attached to it.

The safest way to enforce this is to set default_time_to_live in the table definition, making it very difficult to accidentally insert a record without a TTL.