This is a post to go with a recent presentation at the DataStax Accelerate conference, and describes the process used to move clusters from Rackspace to Google Cloud.
In Part 1 we looked at setting up the new datacenter and migrating the data to new nodes. In Part 2 we looked at decommissioning the original datacenter.
In this final post, we look at some of the things that can go wrong and how to mitigate and recover from them.
Network Performance
If you are moving to a new datacenter in a new location or with a different provider there may be network performance considerations, as all the data stored in Cassandra needs to be transmitted across the datacenters. This can cause two problems: ensuring there is enough bandwidth, and not stealing all the bandwidth.
The bandwidth available can be tested with iperf3:
iperf3 -s
iperf3 -c xxx.xxx.xxx.xxx
iperf3 -c xxx.xxx.xxx.xxx -b 10G
iperf3 -c xxx.xxx.xxx.xxx -C yeah
-s: runs iperf3 in server mode; run this on one node in one datacenter
-c: runs in client mode; provide the IP address where the server is running
-b: specifies the maximum bandwidth to use (in bits per second), useful if you don't want to saturate the connection
-C: specifies the congestion control algorithm; when moving between geographically separate datacenters we found yeah worked best, but test different scenarios
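Before committing to a migration window, it helps to sanity-check the expected transfer time against the bandwidth iperf3 reports. A rough sketch, where the 20 TB dataset size and 5 Gbit/s usable bandwidth are hypothetical figures:

```shell
# Rough transfer-time estimate: how long to stream the full dataset between DCs.
data_tb=20       # hypothetical total data to stream, in terabytes
link_gbit=5      # hypothetical usable bandwidth measured with iperf3, in Gbit/s
# TB -> Gbit: multiply by 8 (bytes to bits) and by 1000 (tera to giga)
seconds=$(( data_tb * 8 * 1000 / link_gbit ))
hours=$(( seconds / 3600 ))
echo "~${seconds}s (~${hours}h) at ${link_gbit} Gbit/s"
```

In practice streaming rarely sustains the raw link speed, so treat the result as a lower bound.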
If you need to limit the bandwidth available for Cassandra to stream data between datacenters (the value is in megabits per second):
nodetool setinterdcstreamthroughput xxx
Views
This problem only occurred in DSE prior to 5.0.12 (Apache 3.0.13)
Prior to 5.0.12, when data is streamed to a new node, only the base table data is streamed, not the materialized view data. The view is rebuilt on the node from the underlying table data using SELECT statements. This normally works fine; however, a large number of tombstones on the table may cause the SELECT to fail, even if the tombstone issue is not apparent in normal reading of the table.
When this happened we found the best solution was to upgrade to a later version, which is recommended anyway given the number of view issues fixed after this release.
Memory
During the rebuild of a node, the high level of streaming and compaction uses a lot of heap memory. At this stage it is acceptable to increase the heap size without worrying about GC pauses: the node is not yet servicing client requests, so long pauses will have no impact.
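As a sketch, a temporary rebuild-time heap bump might look like this in cassandra-env.sh; the 24G and 4G values are hypothetical, so size them to your hardware and revert before the node starts serving traffic:

```shell
# cassandra-env.sh: temporary heap settings while the node rebuilds.
# Long GC pauses are tolerable here because no client requests are served yet.
MAX_HEAP_SIZE="24G"   # hypothetical value; normally kept smaller to bound pauses
HEAP_NEWSIZE="4G"     # hypothetical young-generation size
```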
Compaction
There can also be a backlog of compaction during the process. You may have lots of small sstables queued for compaction, which also increases heap memory requirements. During the rebuild process it is acceptable to increase the compaction throughput:
nodetool setcompactionthroughput xxxxx
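A sketch of how this looks in practice; the throughput figures are hypothetical starting points, and a value of 0 removes the cap entirely:

```shell
# Inspect the backlog of pending compactions, then raise the cap while rebuilding.
nodetool compactionstats
nodetool setcompactionthroughput 256   # hypothetical value, in MB/s (0 = unthrottled)
# Once the rebuild completes, restore your normal setting, e.g.:
nodetool setcompactionthroughput 64    # hypothetical steady-state value
```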
Streaming
Reducing the streaming throughput can reduce the amount of memory required, giving GC more chance of keeping up:
nodetool setinterdcstreamthroughput xxxxx
Application Latency
The area where we had the most problems was application latency. We had 22 ms of latency between our datacenters; add that to every query that crosses the network and latency problems build up quickly.
It was important that the development teams followed the instructions sent out early in the process.
Local Consistency
If you use QUORUM consistency, a majority of the replicas across all DCs have to return a result. With LOCAL_QUORUM, only a majority of the replicas in the local DC must respond. So LOCAL_QUORUM stays in the local datacenter, whereas QUORUM will always have to cross the network to the remote datacenter.
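The arithmetic makes the difference concrete. Assuming a hypothetical but common setup of replication factor 3 in each of two datacenters:

```shell
# Majority calculations for a hypothetical cluster: RF=3 per DC, 2 DCs.
rf_per_dc=3
dcs=2
# QUORUM: majority of all replicas across every DC, so requests cross the WAN
quorum=$(( rf_per_dc * dcs / 2 + 1 ))
# LOCAL_QUORUM: majority of the local DC's replicas only
local_quorum=$(( rf_per_dc / 2 + 1 ))
echo "QUORUM needs ${quorum} of $(( rf_per_dc * dcs )); LOCAL_QUORUM needs ${local_quorum} of ${rf_per_dc}"
```

With 22 ms between DCs, those two extra cross-datacenter responses are what turn a fast local query into a slow one.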
Light Weight Transactions (LWT)
LWTs use an additional consistency level for the Paxos phase of the IF condition check. This can be either SERIAL or LOCAL_SERIAL.
LOCAL_SERIAL: the check only happens in the local DC
SERIAL: the check happens in all DCs, with the additional latency involved
During the initial build of the new datacenter the serial consistency should definitely be LOCAL_SERIAL. When the DCs are at steady state, more thought needs to go into the decision. If all applications are moved to the new DC simultaneously, then LOCAL_SERIAL should continue to be used. However, with a gradual migration, one application may write to the new DC while the IF condition is being checked in the other DC; in this case LOCAL_SERIAL is not enough and you should use SERIAL, accepting the additional latency.
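As a sketch of the gradual-migration case, the serial consistency can be set per session in cqlsh; the keyspace, table, and values here are made up for illustration:

```shell
# Paxos check spans all DCs (SERIAL); the write itself stays local (LOCAL_QUORUM).
cqlsh -e "CONSISTENCY LOCAL_QUORUM;
          SERIAL CONSISTENCY SERIAL;
          UPDATE ks.users SET email = 'new@example.com'
          WHERE id = 42 IF email = 'old@example.com';"
```

Drivers expose the same knob, typically as a serial consistency level setting on the statement or session.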
Load Balancing Policy
The Load Balancing Policy determines which node a client request is passed to. It is important that a DC-aware policy is used, ensuring that requests go to the nodes local to the client. This is particularly important before the rebuild is completed; otherwise a request may be sent to the new DC before the data has arrived there, returning no data even though it exists in the original DC.
Conclusion
We successfully migrated 93 clusters using this method. Some were easier than others, but even when a node needed several rebuilds, we always succeeded in the end.
If you have any further questions, get in touch directly.