Thursday, May 09, 2013

Scaling Memcache at Facebook

Facebook has a new publication at this year's NSDI'13 conference. You can find the original paper by searching for the title online. Here are my detailed notes on the paper:

System properties of FB that affect the design:
- Read intensive: users consume much more data than they create.
- Data is fetched from heterogeneous data sources.
These properties call for a flexible cache design.

Memcache is used as a query cache to take read load off the databases.
- On a read miss, select from the db, then set(k, v) in the cache.
- For write requests, delete the cache entry instead of updating it.
Memcache can also be used as a generic cache for applications.
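
To make the look-aside pattern concrete, here's a minimal sketch; `cache` and `db` are hypothetical handles standing in for a memcache client and the database layer, not Facebook's actual APIs.

    def read(cache, db, k):
        v = cache.get(k)
        if v is None:              # cache miss
            v = db.select(k)       # fall back to the database
            cache.set(k, v)        # demand-fill the cache
        return v

    def write(cache, db, k, v):
        db.update(k, v)            # update the authoritative store first
        cache.delete(k)            # invalidate rather than update the cached value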

In Cluster Performance
 
- Reduce latency on fetching
In memcache, items are distributed across servers through consistent hashing. Upon a request, web servers employ an all-to-all communication pattern, which may cause incast congestion or a single-server bottleneck. The solutions focus on the memcache client (which acts like a job control node) residing on each web server.
> parallel requests and batching: use a DAG to represent dependencies between data, and maximize the number of items fetched concurrently.
> client-server communication: keep clients stateless. Use UDP for get requests to avoid establishing connections, and TCP for set/delete to maintain reliability.
> incast congestion: use a sliding window on the client side to control the total number of outstanding requests.
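
Here's a rough sketch of the client-side sliding window, assuming a fixed window size; in the paper the window actually grows on successful requests and shrinks on unanswered ones, much like TCP's congestion window.

    import threading

    class SlidingWindow:
        """Limit the number of outstanding memcache requests from one web server."""
        def __init__(self, window_size):
            self.slots = threading.Semaphore(window_size)

        def send(self, request, transport):
            self.slots.acquire()            # block while the window is full
            try:
                return transport(request)   # issue the get over UDP (or set/delete over TCP)
            finally:
                self.slots.release()        # a completed response frees a slot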

- Reduce load on cache miss
> Leases: a lease is a token bound to the specific key the client requested, granted on a miss, and required for the subsequent set. It addresses two problems: (1) stale sets: setting a value that is no longer the latest; (2) thundering herds: heavy concurrent read/write activity on a specific key. (A server-side sketch of leases follows this list.)
Additionally, when a key is deleted, it is transferred to a data structure holding recently deleted items for a short time before being flushed. A get request can then return either a lease token or data marked as stale.
> Memcache pools: when memcache is used as a general cache, partition a cluster's servers into separate pools depending on the characteristics of the keys' access patterns.
> Replication: within a pool, we may choose to replicate instead of further dividing the key space when (1) keys are routinely fetched together, (2) the entire data set fits in one or two servers, and (3) the request rate is higher than a single server can handle.
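
Here is a server-side sketch of the lease mechanism as I understand it; the 10-second token rate limit matches the paper's default, but the data structures and return conventions are my own.

    import time, uuid

    LEASE_INTERVAL = 10.0   # issue at most one lease token per key per 10 seconds

    class LeasingCache:
        def __init__(self):
            self.data = {}      # key -> value
            self.leases = {}    # key -> (token, issue_time)

        def get(self, k):
            if k in self.data:
                return ("hit", self.data[k])
            token, issued = self.leases.get(k, (None, 0.0))
            if time.time() - issued >= LEASE_INTERVAL:
                token = uuid.uuid4().hex                # grant a fresh lease
                self.leases[k] = (token, time.time())
                return ("miss", token)                  # holder should fetch from the db and set
            return ("wait", None)                       # another client holds the lease; retry soon

        def set(self, k, v, token):
            current = self.leases.get(k, (None, 0.0))[0]
            if token != current:
                return False                            # stale set: the lease was invalidated
            self.data[k] = v
            del self.leases[k]
            return True

        def delete(self, k):
            self.data.pop(k, None)
            self.leases.pop(k, None)                    # a delete invalidates outstanding leases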

- Failure Handling
Dedicate a small set of machines, named Gutter, to take over for failed servers. When a client receives no response, it assumes the server has failed and sends a second request to the Gutter pool. If that also misses, it queries the db and inserts the result into Gutter.
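
A sketch of this fallback path; `primary`, `gutter`, and `db` are placeholder handles, and the short expiration reflects the paper's point that Gutter entries expire quickly instead of being invalidated.

    GUTTER_TTL = 10  # seconds (illustrative); Gutter entries expire quickly

    def get_with_gutter(primary, gutter, db, k):
        try:
            v = primary.get(k, timeout=0.05)     # normal path
            if v is None:                        # ordinary miss: demand-fill the primary
                v = db.select(k)
                primary.set(k, v)
            return v
        except TimeoutError:                     # no response: assume the server has failed
            v = gutter.get(k)                    # second request goes to the Gutter pool
            if v is None:
                v = db.select(k)
                gutter.set(k, v, expire=GUTTER_TTL)
            return v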

In Region Replication

Web and memcached servers are split into multiple frontend clusters. Along with a storage cluster containing the db, this defines a region. Multiple frontend clusters share the same storage cluster.

- Regional invalidations
The storage cluster is responsible for invalidation; a web server also invalidates its own cluster after a modification. On every database, deploy invalidation daemons that inspect committed SQL statements, extract the deletes, and broadcast them to the memcache deployment in every frontend cluster.
Doing this directly would result in high packet rates. To address this, invalidation daemons batch deletes into fewer packets and send them to a set of servers running mcrouter instances. These mcrouters then unpack each batch and route the individual deletes to the destination memcache servers.
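
A rough sketch of this pipeline; the paper calls the daemon mcsqueal, but the log format and helper names below (extract_deletes, send_batch) are assumptions of mine.

    import random, re

    def extract_deletes(sql_statement):
        # Assume the keys to invalidate are embedded in a trailing SQL comment;
        # this format is made up for the sketch.
        m = re.search(r"/\* mc-invalidate: (.*?) \*/", sql_statement)
        return m.group(1).split() if m else []

    def invalidation_daemon(commit_log, mcrouters, batch_size=100):
        batch = []
        for stmt in commit_log:                   # inspect committed SQL statements
            batch.extend(extract_deletes(stmt))
            if len(batch) >= batch_size:
                router = random.choice(mcrouters) # one of a set of mcrouter servers
                router.send_batch(batch)          # mcrouter unpacks and routes each delete
                batch = []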

- Regional pools
If users' requests are randomly routed to all frontend clusters, the cached data will be roughly the same in each. For large, rarely accessed items, we can reduce the number of replicas by having multiple frontend clusters share the same set of memcache servers, called a regional pool.
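
The paper decides what goes into the regional pool by heuristics; this toy policy only illustrates the trade-off, and the thresholds are invented.

    def choose_pool(requests_per_sec, item_size_bytes):
        # Large, rarely accessed items are expensive to replicate in every frontend
        # cluster yet cheap enough to fetch across the region, so share them.
        if requests_per_sec < 1.0 and item_size_bytes > 100_000:   # made-up thresholds
            return "regional pool"
        return "cluster pool"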

- Cold cluster warmup
Allow clients in the cold (newly brought up) cluster to retrieve data from warm clusters rather than from persistent storage.
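
A sketch of the warmup read path with placeholder handles; the paper also adds a short hold-off on deletes in the cold cluster to avoid a stale-data race, which this sketch ignores.

    def cold_get(cold, warm, db, k):
        v = cold.get(k)
        if v is not None:
            return v
        v = warm.get(k)          # borrow the warm cluster's cached copy
        if v is None:
            v = db.select(k)     # last resort: persistent storage
        cold.set(k, v)           # warming the cold cluster as a side effect
        return v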

Across Region Consistency

As defined earlier, each region consists of a storage cluster and several frontend clusters. In a multi-region deployment, designate one region to hold the master database, and the others contain read-only replicas. Consistency is the challenge here, as replicas may lag behind the master.

- write from master
Same as discussed above: storage cluster invalidates data via daemons.

- write from non-master region
Employ a remote marker to indicate that data in the local replica db is potentially stale and that queries should be directed to the master region. When a web server updates data with key k, it: (1) sets marker r_k in the region, (2) writes to the master, embedding k and r_k to be invalidated, and (3) deletes k in the local cluster. When a later request for k misses the cache, a server checks for the existence of r_k to decide whether to direct the query to the master or the local region.
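
A sketch of the marker protocol; `regional`, `local_cache`, `master_db`, and `replica_db` are placeholders, and the marker key format is mine.

    def write_from_replica_region(regional, local_cache, master_db, k, v):
        r_k = "remote:" + k
        regional.set(r_k, 1)                           # (1) set the remote marker r_k
        master_db.write(k, v, invalidate=[k, r_k])     # (2) write to master, embedding k and
                                                       #     r_k for invalidation by the daemons
        local_cache.delete(k)                          # (3) delete k in the local cluster

    def read_from_replica_region(regional, local_cache, master_db, replica_db, k):
        v = local_cache.get(k)
        if v is not None:
            return v
        if regional.get("remote:" + k):                # marker present: local replica may be stale
            return master_db.select(k)                 # pay cross-region latency for consistency
        return replica_db.select(k)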

- operational consideration
Share the same communication channel for deletes and database replication to gain network efficiency.

Single Server Improvements

- Performance optimization
> allow automatic expansion of the hash table
> make the server multi-threaded, using a global lock to protect shared data structures (the paper also evaluates fine-grained locking)
> give each thread its own UDP port to reduce contention

- Adaptive slab allocator
The allocator organizes memory into slab classes, each containing pre-allocated, uniformly sized chunks of memory. Items are stored in the smallest chunk size that fits. Each slab class maintains a free-list of available chunks and requests more memory when the list is empty. When the server can no longer allocate free memory, it evicts the LRU item within that slab class. The allocator adaptively re-balances slab assignments if the items currently being evicted are more recent than the average of the LRU items in other classes; the slab holding the overall least-recently-used item is then freed and transferred to the needy class.
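
A sketch of the rebalancing check; per my reading of the paper, the trigger is that the class's next victim was used at least 20% more recently than the average LRU item elsewhere, and the attribute names here are placeholders.

    import time

    def should_rebalance(needy_class, other_classes):
        now = time.time()
        victim_age = now - needy_class.lru_item_last_access
        avg_age = sum(now - c.lru_item_last_access for c in other_classes) / len(other_classes)
        return victim_age < 0.8 * avg_age       # victim is notably "hotter" than average

    def rebalance(needy_class, other_classes):
        if should_rebalance(needy_class, other_classes):
            donor = max(other_classes,           # the class holding the globally LRU item
                        key=lambda c: time.time() - c.lru_item_last_access)
            donor.free_lru_slab()                # free that slab...
            needy_class.add_slab()               # ...and transfer the memory to the needy class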

- Transient Item Cache
A hybrid scheme: lazy eviction for most keys, plus proactive eviction of short-lived keys when they expire. Short-lived keys are put into a circular buffer of linked lists, called the transient item cache, indexed by seconds until expiration. Every second, all items in the bucket at the head are evicted and the head advances by one.
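
A sketch of the structure, assuming a 60-second horizon (the actual bucket count is my assumption) and placeholder method names.

    class TransientItemCache:
        def __init__(self, horizon=60):
            self.buckets = [set() for _ in range(horizon)]  # one bucket per second
            self.horizon = horizon
            self.head = 0

        def add(self, key, ttl_seconds):
            slot = (self.head + ttl_seconds) % self.horizon
            self.buckets[slot].add(key)      # index the key by its expiration second

        def tick(self, cache):
            # Called once per second: evict everything in the head bucket, then advance.
            for key in self.buckets[self.head]:
                cache.delete(key)            # proactive eviction of a short-lived key
            self.buckets[self.head] = set()
            self.head = (self.head + 1) % self.horizon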
