

Hello everyone. My name is Mehrdad Nurolahzade and I'm a member of the real-time infrastructure team at Twitter. Today I'm going to talk to you about the distributed database behind Twitter. I will take you through Twitter's journey from MySQL to Cassandra, and then from Cassandra to building its own distributed database, called Manhattan. The story of Twitter starts in 2006 with this Jack Dorsey tweet. Like many other platforms in the mid-2000s, it was implemented on top of the open-source MySQL database. The architecture at this point is quite simple; it's a monolith, internally referred to as the "monorail," written in Ruby on Rails. Twitter is perhaps one of the largest Ruby on Rails shops in the world at this point. And it is backed by a single MySQL server with one leader and one follower. It is a small platform: the entire thing is powered by eight servers and is capable of processing as many as 600 requests per second. To reduce the load on MySQL, the team uses Memcache, and almost 90% of reads are served from the cache. Soon, though, engineers are running out of disk on the MySQL servers; the number of tweets is exploding, and the platform simply does not have enough disk to store them all. Between 2008 and 2009, the rapid rise in popularity of the Twitter website causes engineers to struggle with scalability. The monolith is a source of pain, of course, and there are limitations at the storage layer as well, which I'm going to talk about shortly. Twitter experiences failures during multiple high-profile events, and the infamous "fail whale" screen becomes part of the Twitter user experience.
So, in addition to all the optimizations they're doing to improve the Ruby on Rails service, engineers try to improve the performance of MySQL at the data layer: more caching, all sorts of data denormalization, sharding, and so on. Unfortunately, the time-based sharding strategy they use introduces hot shards into Twitter's architecture, which again becomes a scalability bottleneck. By 2010, engineers have started questioning the choice of Ruby on Rails for Twitter and have begun investing in the Java virtual machine; some of the new components are now implemented in Scala, or on the JVM in general. One of these components is Gizzard, a sharding framework developed in Scala that lets you do range sharding on top of any storage engine, like MySQL. Twitter moves its main data sources, including tweets, users, and the social graph, to sharded MySQL powered by Gizzard. The Gizzard library is also used to develop an efficient graph database called FlockDB, which supports the user social graph. Redis is being used as well; the timeline is now served out of Redis. In addition, Twitter has started exploring Cassandra. Cassandra is interesting to Twitter because of its horizontal scalability, and Twitter plans to use it for both new and existing use cases; the idea is that even tweets are going to move to Cassandra soon. And finally, in 2010, Twitter builds its first data center; up to this point, Twitter has been using managed hosting. 2011 is the year of the great migration for Twitter: a lot of great improvements to Twitter's scalability and availability happen in 2011.
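The range sharding described above can be pictured as a forwarding table that maps contiguous id ranges to physical shards. The sketch below is purely illustrative Python (Gizzard itself was Scala, and none of these names come from its API):

```python
import bisect

class RangeShardRouter:
    def __init__(self, lower_bounds, shards):
        # lower_bounds[i] is the inclusive lower bound of the id range
        # served by shards[i]; lower_bounds must be sorted ascending.
        assert len(lower_bounds) == len(shards)
        self.lower_bounds = lower_bounds
        self.shards = shards

    def shard_for(self, record_id):
        # The rightmost lower bound that is <= record_id owns the record.
        i = bisect.bisect_right(self.lower_bounds, record_id) - 1
        if i < 0:
            raise KeyError("no shard covers id %d" % record_id)
        return self.shards[i]

router = RangeShardRouter([0, 1000, 2000], ["mysql-a", "mysql-b", "mysql-c"])
```

Note that if record ids are assigned in time order, all new writes land on the last range, which is exactly the hot-shard problem the time-based sharding strategy ran into.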
First of all, Twitter starts abandoning Ruby on Rails and moving toward a service-oriented architecture built on top of the JVM. It starts developing Finagle, the RPC framework built to support communication between the elements of this service-oriented architecture. Twitter's second data center comes up, which increases availability significantly. And this is the result of the migration: by 2012, Twitter has a proper multilayered, service-oriented architecture with multiple microservices. These microservices run under Apache Mesos, which Twitter helped mature as one of its earliest large-scale adopters. But even at this point, the main data sources of Twitter (tweets, users, and the social graph) are still in MySQL and not in Cassandra, and I'm going to talk about why. Twitter's Cassandra adoption is interesting. As I mentioned, we started looking at Cassandra around 2010, when Cassandra was at version 0.5, I believe, and Twitter was desperate to increase its request-processing throughput. Cassandra was promising because, unlike MySQL, it is horizontally scalable. Twitter starts using Cassandra for many use cases; by 2012 or 2013 it is being used for almost 20 of them, and I believe the largest cluster at this point is roughly 200 nodes. Twitter lacks Cassandra experience, so it starts hiring engineers with Cassandra expertise, and its engineers start engaging with the open source community. Twitter engages with the other companies behind Cassandra, namely Facebook, to understand the roadmap for the project, and contributes multiple patches back to Cassandra.
And it's not just Cassandra at this point; Twitter is active around other distributed database solutions as well, like HBase, because it believes Cassandra is going to be the database of choice at Twitter. Unfortunately, things do not turn out the way Twitter engineers anticipated. They soon hit roadblocks onboarding new use cases to Cassandra. The first source of pain is Cassandra's gossip protocol. Twitter has some very large data sources, like tweets, and engineers run into many issues trying to understand, and then improve, the scalability of the gossip protocol beyond a few hundred nodes. Beyond that, engineers notice that some of the main features they are looking for do not exist in Cassandra; for example, the counter support that observability customers at Twitter needed was missing. And finally, after two years of managing Cassandra clusters, the storage team notices that a significant amount of its bandwidth is spent managing Cassandra, primarily because the Cassandra ecosystem lacks automation at this point. So engineers start to question the suitability of Cassandra, and to think that satisfying the unique requirements of Twitter's various use cases might be more cost-effective with a custom, in-house database. That's not the entire story, though. There is another dimension, and that's Twitter's Lambda architecture. Around 2012, before stream-processing platforms are in place, the Lambda architecture is very popular for processing large data sources on Hadoop, and Twitter needs something to serve the output of the batch layer of its Lambda architecture.
They look at Memcache, Redis, and Cassandra, and none of these solutions is suitable, because making them meet the requirements, or integrating them with Hadoop, is considered non-trivial. So they eventually build something in-house: a read-only NoSQL system called Manhattan. It is developed very fast, in a matter of maybe a couple of months. The speed at which they develop this solution makes them think: if we build something in-house, specific to the requirements of Twitter, it's going to be much faster and much more efficient than our experience with Cassandra. This is the point at which priorities shift from making Cassandra work to building something scalable and low-latency, with minimal operational overhead, that can replace Cassandra. The work starts in 2012, and by 2014 pretty much all Cassandra use cases, and many of the MySQL use cases, have been migrated to Manhattan. Manhattan collected ideas from the multiple storage systems Twitter engineers had worked with, and it was designed around these principles. It should be massively scalable, to thousands of nodes per cluster. It should be highly available, at the cost of weaker consistency, and it should use a last-write-wins strategy to resolve inconsistencies between replicas. Given that Twitter has both read-heavy and write-heavy workloads, it should serve both efficiently. It should support both multitenant and multi-region clusters, with different storage engines suited to different workloads. It should be flexible and schemaless, use random partitioning, and minimize operations through automation and customer self-service. Now, the data model.
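Last-write-wins conflict resolution, as named in the principles above, can be sketched in a few lines. This assumes each replica response carries a (timestamp, value) pair; the function name and shapes are illustrative, not Manhattan's API:

```python
# Minimal last-write-wins sketch: among conflicting replica responses,
# the value with the highest write timestamp wins.
def resolve_last_write_wins(responses):
    """responses: iterable of (timestamp, value); returns the winning value."""
    timestamp, value = max(responses, key=lambda pair: pair[0])
    return value

responses = [(1001, "old"), (1005, "new"), (1003, "stale")]
```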
Sorry, before that, let me talk about the tech stack. Manhattan is primarily developed in Java, with some elements in Scala; Python is used for automation and tooling. Finagle and Netty are used for communication between client and server. ZooKeeper is there for topology. Currently Manhattan runs outside containers, but we're in the process of migrating to Kubernetes. The data model is quite simple: every record has a partition key, an optional local key, and an optional value. These components can be scalar types, like strings, numbers, or even binaries, or they can be composite types, that is, combinations of multiple scalar types. The partition key decides which partition, out of many thousands in the cluster, a key belongs to, and a hash function maps partition keys to partitions. Like I said, a cluster can have thousands of partitions, and this partition mapping is stored as the topology in ZooKeeper. Similar to the 2007 Amazon Dynamo paper, Manhattan's architecture uses leaderless replication: a stateless frontend coordinator replicates each request to all the replicas in the cluster and gathers their responses. If there is any conflict in the responses returned by the replicas, the last-write-wins strategy is used: whichever value has the highest timestamp becomes the value returned to the client. In terms of consistency, the coordinator waits until it receives a required number of responses from the replicas, and this is tunable per request. A request can have a consistency level of any, one, all, or quorum within a single region, and the coordinator waits until that number of responses has been returned.
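The data model and the key-to-partition mapping described above can be sketched as follows. The record layout mirrors the description (partition key, optional local key, optional value), but the hash scheme itself (MD5, 4,096 partitions) is an assumption for illustration only:

```python
import hashlib
from collections import namedtuple

# Record layout as described: partition key, optional local key, optional value.
Record = namedtuple("Record", ["partition_key", "local_key", "value"])

NUM_PARTITIONS = 4096  # a cluster can have thousands of partitions

def partition_for(partition_key, num_partitions=NUM_PARTITIONS):
    # Hash the partition key and map it onto a partition id. The cluster's
    # partition-to-node mapping is then looked up in the ZooKeeper topology.
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

rec = Record(partition_key="user:42", local_key="tweet:7", value=b"...")
```

Because the mapping is a pure function of the partition key, any stateless coordinator can route a request without consulting the nodes themselves.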
Now, given that at Twitter we use quorum reads and writes by default, this guarantees read-your-own-writes consistency for the majority of our use cases. Of course, this is eventual consistency, and with eventual consistency, inconsistency can happen; similar to Cassandra, we have read repair and anti-entropy repairs for fixing inconsistent data. In terms of storage engines, we have several, but four have been used more than any others. SeaDB is the very first one, developed for the Lambda architecture; it is a read-only backend, very efficient for look-ups in memory or on disk. We also have the SSTable backend, borrowed from Bigtable (Cassandra uses it as well), which is very efficient for write-heavy workloads. We developed a B-tree-based backend around 2016; it is based on MySQL's InnoDB and suits read-heavy workloads with lighter write traffic. And then RocksDB has become the dominant backend in Manhattan clusters; as of 2019, we use it for both read-heavy and write-heavy workloads. I mentioned that Manhattan was initially designed to be eventually consistent, but we realized that some of our customers are interested in strong consistency. So what we did later on was extend Manhattan's architecture and plug a replicated-log solution, something Kafka-like, into its request-processing path. Incoming requests are written to a shard of this replicated log based on their partition key, and the replicas are guaranteed to see those incoming requests in the exact same order. This provides linearizable consistency on a per-partition-key basis. We have local strong consistency within one region, and global strong consistency across all regions.
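The reason quorum reads and writes give read-your-own-writes is a simple counting argument: with N replicas, any read quorum R and write quorum W satisfying R + W > N must overlap in at least one replica, so a quorum read always contacts some replica that acknowledged the latest quorum write. A tiny sketch of that condition:

```python
# Quorum intersection: a read of R replicas and a write to W replicas out of
# N must share at least one replica whenever R + W > N.
def quorums_intersect(n, r, w):
    return r + w > n

def majority(n):
    """The default 'quorum' size: a strict majority of N replicas."""
    return n // 2 + 1
```

With the typical N=3 and majority quorums of 2, every read quorum intersects every write quorum; a consistency level of one (R=1, W=2) does not, which is why it only offers eventual consistency.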
And at the same time, we allow eventual-consistency reads, which are faster, against this strongly consistent data. Of course, because the replicated log now sits in the request-processing path, the end-to-end latency of these operations is almost an order of magnitude higher than that of the eventual-consistency operations. Let me quickly talk about secondary indexing in Manhattan. It is similar to DynamoDB: we have both local secondary indexes and global secondary indexes, and I'm going to talk about the differences shortly, but secondary indexes can be defined on the components of the local key and the value. In a global secondary index, the partition key of the index can be different from that of the master record, so in a sense a global secondary index can be used to read from multiple partitions. In a local secondary index, on the contrary, the partition key of the index and the master record must be the same, so you can only read from one partition at a time. And finally, we also have support for automated imports and exports in Manhattan, primarily with Hadoop: for our read-only clusters you can automatically import data from Hadoop, and for our read-write clusters you can export data to Hadoop. Hadoop is used in this architecture for all sorts of analytics, aggregations, and batch processing. We're currently integrating this architecture with public cloud storage as well, so soon our customers will be able to import data from, or export it to, the public cloud and use the third-party integrations available there for analytics. About the scale of Manhattan: at this point we are operating, I believe, close to 20 production clusters, and we probably have more than a thousand customers.
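The local-versus-global distinction can be made concrete with a toy in-memory example. The index layouts below are illustrative only, not Manhattan's physical format: the local index shares the record's partition key (so a lookup touches one partition), while the global index repartitions by the indexed attribute (so a lookup can span partitions):

```python
from collections import defaultdict

# Base table rows: (partition_key, local_key, attributes).
rows = [
    ("user1", "tweet9", {"lang": "en"}),
    ("user1", "tweet7", {"lang": "fr"}),
    ("user2", "tweet3", {"lang": "en"}),
]

# Local secondary index: keyed by (partition_key, indexed attribute); a query
# must name a single partition, e.g. "user1's tweets in English".
lsi = defaultdict(list)
for pk, lk, attrs in rows:
    lsi[(pk, attrs["lang"])].append(lk)

# Global secondary index: repartitioned by the indexed attribute alone; a
# query like "all tweets in English" reads across base-table partitions.
gsi = defaultdict(list)
for pk, lk, attrs in rows:
    gsi[attrs["lang"]].append((pk, lk))
```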
We definitely have more than a thousand databases in Manhattan, and these are powered by tens of thousands of nodes; I believe the largest cluster we have at this point is over 4,000 nodes. In terms of data size, we are serving many petabytes of data in Manhattan, at a rate of probably tens of millions of requests per second. Now let's talk about the fun part: operations. Operations are easy at a smaller scale. If you have operated small storage clusters, you already know that every once in a while one of your nodes dies and you have to replace it with a new node; every once in a while you have to add capacity, or update the cluster. The space of operations you have to perform is relatively limited, and you can easily reason about what to do and in what order, so you can probably do it sequentially: first remove the node, then add a replacement for it, and then maybe upgrade the cluster. That is easy. But what if you are operating clusters the size of Manhattan's, with hundreds or even thousands of nodes? Because the cluster at this scale is so large, so is the space of things that can happen together. At any given point in time, many nodes may be failing, and an operator may be acting to remove several of them. And while one operator is taking care of those bad nodes, there may be a surge in traffic, and all of a sudden another operator is tasked with adding emergency capacity to the cluster to serve that surge. And to make matters even worse.
It could be that another operator on the team is trying to update the cluster at the same time. All of these operations can happen simultaneously, driven by different people. The question is: how do you coordinate between these operations and these people? How do they know whether an operation is safe to do at a given point in time, or in what order things have to be done, or how long they should babysit the cluster until an operation is complete? In summary, doing operations safely and correctly on clusters at this scale is very, very difficult. For that reason, we developed Genie. Genie is our Manhattan cluster manager. It is a service with a complete understanding of the cluster; for example, it knows the state of every single node, whether it is restarting, being removed from or added to the topology, or failing. Whenever operators want to do something, they communicate their intent to Genie, and Genie knows exactly whether the operation is safe to do at that point. If it is safe, Genie just enqueues the operation and executes it alongside all the other operations it is running. By pushing this knowledge, and the burden of these tasks, into automation, Manhattan operators no longer need deep domain knowledge about how to perform these operations, nor do they need to babysit Manhattan clusters while the operations execute. Now fast-forward to 2020: what are we working on? Data-protection compliance is, of course, a top priority at Twitter, as I assume it is at many other companies, and we've been working very hard over the past few years to make sure Manhattan is compliant with these regulations. I mentioned the RocksDB migration; we're finalizing it for the last remaining clusters at Twitter. And the Kubernetes work is ongoing.
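The kind of safety check a cluster manager like Genie performs before accepting an operator's intent can be sketched roughly as follows. This is a hypothetical toy, not Genie's actual logic: every name, and the specific policy of rejecting removals that would drop the cluster below a write quorum, is an assumption for illustration:

```python
# Toy cluster manager: operators submit intents; unsafe ones are rejected,
# safe ones are enqueued for execution alongside other accepted operations.
class ClusterManager:
    def __init__(self, replication_factor, nodes):
        self.rf = replication_factor
        self.healthy = set(nodes)
        self.queue = []  # operations accepted for execution

    def request_remove(self, node):
        quorum = self.rf // 2 + 1
        # Reject the intent if removing this node could leave fewer healthy
        # nodes than a quorum needs.
        if node not in self.healthy or len(self.healthy) - 1 < quorum:
            return False
        self.healthy.discard(node)
        self.queue.append(("remove", node))
        return True

mgr = ClusterManager(replication_factor=3, nodes=["n1", "n2", "n3", "n4"])
```

The point of the design is that the operator only states intent; the safety reasoning and sequencing live in the automation.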
We're porting Manhattan to Kubernetes. We're rebuilding our cross-DC, or cross-region, replication feature on top of Kafka. And, as I mentioned, we are integrating public cloud storage into our import and export pipelines. Looking beyond 2020, we still believe Manhattan is the right solution for many use cases at Twitter; after all, Manhattan was custom-built to meet the specific requirements of those use cases, so we're going to continue building on top of it. However, the diversity of use cases in a company at the scale of Twitter cannot really be served efficiently by a single database. We believe that if our customers try to use Manhattan for use cases it wasn't designed for, they are going to face inefficiency because of the various workarounds and tradeoffs they will have to make. At the same time, if we try to extend Manhattan to satisfy every single use case, Manhattan itself becomes inefficient. So our position is to evolve Manhattan thoughtfully and logically, while still providing the best storage solution to Twitter. What does that mean? Looking at the distributed-database landscape, we see it has evolved significantly since 2012, which is when we decided to build Manhattan. Multiple new open source solutions have appeared, some of them backed by companies operating at the scale of Twitter, and existing solutions like Cassandra have evolved a lot. At the same time, companies like Amazon, Microsoft, and Google have started offering their internal databases as managed solutions. So at Twitter we hypothesize that some of these solutions can potentially complement, or maybe even replace, our internal offerings in the future. Well, that's all I have.
But before I wrap up, I just want to share a few resources with you. Over the years we have published a few blog posts and video presentations about Manhattan, so do check them out if you're interested in learning more: there is one on strong consistency, one on automated operations in Manhattan, and a look at secondary indexing in Manhattan. Thank you very much.
Talk / Speaker / Abstract

The Distributed Database Behind Twitter

Mehrdad Nurolahzade
Platform Engineer, Twitter

Twitter is giving hundreds of millions of people around the world the power to create and share ideas and information instantly, without barriers. Operating such a service requires a global database system that can serve millions of queries per second with extremely low latency in a real-time environment. In this talk we will look at Twitter's distributed database journey from MySQL to Cassandra to Manhattan; what led Twitter to build its own NoSQL database to meet the unique requirements of serving the public conversation; and what challenges are reshaping the future of distributed databases at Twitter.
