This is a post to go with a recent presentation at the DataStax Accelerate conference, and describes the process used to move clusters from Rackspace to Google Cloud.
In Part 1 we looked at setting up the new datacenter and migrating the data to new nodes. In Part 2 we looked at decommissioning the original datacenter.
In this final post, we look at some of the things that can go wrong and how to mitigate and recover from them.
Network Performance
If you are moving to a new datacenter in a new location or with a different provider there may be network performance considerations, as all the data stored in Cassandra needs to be transmitted across the datacenters. This can cause two problems: ensuring there is enough bandwidth, and not stealing all the bandwidth.
The bandwidth available can be tested with iperf3:
iperf3 -s
iperf3 -c xxx.xxx.xxx.xxx
iperf3 -c xxx.xxx.xxx.xxx -b 10G
iperf3 -c xxx.xxx.xxx.xxx -C yeah
-s: runs iperf3 in server mode; run this on one node in one datacenter
-c: runs in client mode; provide the IP address where the server is running
-b: specifies the maximum bandwidth to use (in bits per second), useful if you don't want to saturate the connection
-C: specifies the congestion control algorithm; when moving between geographically separate datacenters we found yeah worked best, but test different scenarios
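Before committing to a migration window, it helps to sanity-check the expected transfer time against the bandwidth iperf3 reports. A rough sketch, where the 20 TB dataset size and 5 Gbit/s usable bandwidth are hypothetical figures:

```shell
# Rough transfer-time estimate: how long to stream the full dataset between DCs.
data_tb=20       # hypothetical total data to stream, in terabytes
link_gbit=5      # hypothetical usable bandwidth measured with iperf3, in Gbit/s
# TB -> Gbit: multiply by 8 (bytes to bits) and by 1000 (tera to giga)
seconds=$(( data_tb * 8 * 1000 / link_gbit ))
hours=$(( seconds / 3600 ))
echo "~${seconds}s (~${hours}h) at ${link_gbit} Gbit/s"
```

In practice streaming rarely sustains the raw link speed, so treat the result as a lower bound.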
If you need to limit the bandwidth available for Cassandra to stream data between datacenters (the value is in megabits per second):
nodetool setinterdcstreamthroughput xxx
Views
This problem only occurred in DSE prior to 5.0.12 (Apache 3.0.13)
Prior to 5.0.12, when data is streamed to a new node, only the base table data is streamed, not the materialized view data. The view is rebuilt on the node from the underlying table data using SELECT statements. This normally works fine; however, a large number of tombstones on the table may cause the SELECT to fail, even if the tombstone issue is not apparent in normal reading of the table.
When this happened we found the best solution was to upgrade to a later version, which is recommended anyway given the number of view issues fixed after this release.
Memory
During the rebuild of a node, the high level of streaming and compaction uses a lot of heap memory. At this stage it is acceptable to increase the heap size without worrying about GC pauses: the node is not yet servicing client requests, so long pauses will have no impact.
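As a sketch, a temporary rebuild-time heap bump might look like this in cassandra-env.sh; the 24G and 4G values are hypothetical, so size them to your hardware and revert before the node starts serving traffic:

```shell
# cassandra-env.sh: temporary heap settings while the node rebuilds.
# Long GC pauses are tolerable here because no client requests are served yet.
MAX_HEAP_SIZE="24G"   # hypothetical value; normally kept smaller to bound pauses
HEAP_NEWSIZE="4G"     # hypothetical young-generation size
```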
Compaction
There can also be a backlog of compaction during the process. You may have lots of small sstables queued for compaction, which also increases heap memory requirements. During the rebuild process it is acceptable to increase the compaction throughput:
nodetool setcompactionthroughput xxxxx
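A sketch of how this looks in practice; the throughput figures are hypothetical starting points, and a value of 0 removes the cap entirely:

```shell
# Inspect the backlog of pending compactions, then raise the cap while rebuilding.
nodetool compactionstats
nodetool setcompactionthroughput 256   # hypothetical value, in MB/s (0 = unthrottled)
# Once the rebuild completes, restore your normal setting, e.g.:
nodetool setcompactionthroughput 64    # hypothetical steady-state value
```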
Streaming
Reducing the streaming throughput can reduce the amount of memory required, giving GC more chance of keeping up:
nodetool setinterdcstreamthroughput xxxxx
Application Latency
The area where we had the most problems was application latency. We had 22 ms of latency between our datacenters; add that to every query that crosses the network and latency problems build up quickly.
It was important that the development teams followed the instructions sent out early in the process.
Local Consistency
If you use QUORUM consistency, a majority of the replicas across all DCs have to return a result. With LOCAL_QUORUM, only a majority of the replicas in the local DC must respond. So LOCAL_QUORUM stays in the local datacenter, whereas QUORUM will always have to cross the network to the remote datacenter.
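The arithmetic makes the difference concrete. Assuming a hypothetical but common setup of replication factor 3 in each of two datacenters:

```shell
# Majority calculations for a hypothetical cluster: RF=3 per DC, 2 DCs.
rf_per_dc=3
dcs=2
# QUORUM: majority of all replicas across every DC, so requests cross the WAN
quorum=$(( rf_per_dc * dcs / 2 + 1 ))
# LOCAL_QUORUM: majority of the local DC's replicas only
local_quorum=$(( rf_per_dc / 2 + 1 ))
echo "QUORUM needs ${quorum} of $(( rf_per_dc * dcs )); LOCAL_QUORUM needs ${local_quorum} of ${rf_per_dc}"
```

With 22 ms between DCs, those two extra cross-datacenter responses are what turn a fast local query into a slow one.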
Light Weight Transactions (LWT)
LWTs use an additional consistency level for the Paxos phase of the IF condition check. This can be either SERIAL or LOCAL_SERIAL.
LOCAL_SERIAL: the check only happens in the local DC
SERIAL: the check happens in all DCs, with the additional latency involved
During the initial build of the new datacenter the serial consistency should definitely be LOCAL_SERIAL. When the DCs are at steady state, more thought needs to go into the decision. If all applications are moved to the new DC simultaneously, then LOCAL_SERIAL should continue to be used. However, with a gradual migration, one application may write to the new DC while the IF condition is being checked in the other DC; in this case LOCAL_SERIAL is not enough and you should use SERIAL, accepting the additional latency.
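As a sketch of the gradual-migration case, the serial consistency can be set per session in cqlsh; the keyspace, table, and values here are made up for illustration:

```shell
# Paxos check spans all DCs (SERIAL); the write itself stays local (LOCAL_QUORUM).
cqlsh -e "CONSISTENCY LOCAL_QUORUM;
          SERIAL CONSISTENCY SERIAL;
          UPDATE ks.users SET email = 'new@example.com'
          WHERE id = 42 IF email = 'old@example.com';"
```

Drivers expose the same knob, typically as a serial consistency level setting on the statement or session.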
Load Balancing Policy
The Load Balancing Policy determines which node a client request is passed to. It is important that a DC-aware policy is used, ensuring that requests go to the nodes local to the client. This is particularly important before the rebuild is completed; otherwise a request may be sent to the new DC before the data has arrived there, returning no data even though it exists in the original DC.
Conclusion
We successfully migrated 93 clusters using this method. Some were easier than others, but even when a node needed several rebuilds, we always succeeded in the end.
If you have any further questions, get in touch directly.