Thursday, May 23, 2013

Intelligent scan selectivity reduction: Sybase optimized index selection

Sybase ASE has a nice optimization feature called "Intelligent scan selectivity reduction", which optimizes index selection when the predicate columns do not include the prefix column of a composite index. There isn't much documentation about it, so I'll give a simple introduction to this hidden feature here.

This feature is only available for data page lock (DPL) / data row lock (DRL) tables, when the following conditions are met:
- the table has a composite index
- index prefix column is not in predicate
- cardinality of the prefix column is fairly low

Without this feature, we may need to do a full table scan or create a new index on the predicate columns.
With this feature, the optimizer treats the composite index as a number of small sub-indexes and uses them to retrieve the row IDs:
- determine the domain of distinct values for the index prefix column
- iterate through each distinct value in this domain
- for each distinct value, the optimizer will perform a regular index scan

The Intelligent Scan feature works as if you issued a sequence of SQL statements, each specifying a single value for the index prefix column. Probing the index several times is often more efficient than performing a full table scan.

A simple example:
create table t1(c1 int, c2 int, c3 varchar(10), c4 varchar(100)) lock datarow
go
create index t1_i1 on t1(c1, c2, c3)
go
select c4 from t1 where c2 = 2
go
If c1 only has 10 distinct values, the optimizer will still use index t1_i1. For each c1 value, it will use t1_i1 to find the matching c2 values, as sketched below. For a large table this is often better than a full table scan on t1.
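To make the "sequence of probes" idea concrete, here is a rough Python sketch of the logical plan; the distinct values and query shape are just the assumptions from the example above, not what the optimizer literally executes.

# Illustrative only: the real work happens inside the ASE optimizer.
# Assume c1 has 10 distinct values (1..10), as in the example above.
distinct_c1_values = range(1, 11)

for v in distinct_c1_values:
    # Each probe is a regular index scan on t1_i1 with both leading
    # columns bound, so it can position directly within the B-tree.
    print(f"select c4 from t1 where c1 = {v} and c2 = 2")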

You can always turn off this feature by using trace flag 15353.

Thursday, May 09, 2013

Scaling Memcache at Facebook

Facebook has a new publication in this year's NSDI '13 conference. You can find the original paper by searching the title online. Here are my detailed notes on this paper:

System properties of FB that affect the design:
- Read intensive: users consume much more data than they create.
- Data is fetched from heterogeneous data sources.
These call for a flexible cache design.

Memcache is used as a query cache to reduce read load on the databases.
- on a read miss, select from the db, then set(k, v) in the cache
- for write requests, delete the cache entry instead of updating it
Memcache can also be used as a generic cache for applications.
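A minimal Python sketch of this look-aside pattern; the cache and db objects and their methods are hypothetical placeholders, not Facebook's actual API.

# Look-aside caching sketch. `cache` and `db` are hypothetical clients
# with get/set/delete and query/update methods.
def read(cache, db, key):
    value = cache.get(key)
    if value is None:              # cache miss
        value = db.query(key)      # fall back to the database
        cache.set(key, value)      # populate the cache for later reads
    return value

def write(cache, db, key, value):
    db.update(key, value)          # the write goes to the database
    cache.delete(key)              # invalidate instead of updating the cache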

In Cluster Performance
 
- Reduce latency on fetching
In memcache, items are distributed across servers through consistent hashing. Upon a request, web servers employ an all-to-all communication pattern, which may cause incast congestion or a single-server bottleneck. The solutions focus on the memcache client (which acts like a job control node) residing on each web server.
> parallel requests and batching: use a DAG to represent dependencies between data, and maximize the number of items fetched concurrently.
> client-server communication: keep the client stateless. Use UDP for get requests to avoid connection establishment, and TCP for set/delete to maintain reliability.
> incast congestion: use a sliding window on the client side to control the total number of outstanding requests (a rough sketch follows this list).
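Here is a rough Python sketch of that sliding window, assuming it behaves like TCP congestion control (grow on success, shrink on failure); the sizes are made-up illustrative values.

import threading

# Sketch of a client-side window limiting outstanding memcache requests.
class OutstandingRequestWindow:
    def __init__(self, initial=2, maximum=64):
        self.size = initial            # current window size
        self.maximum = maximum
        self.in_flight = 0             # requests currently outstanding
        self.cond = threading.Condition()

    def acquire(self):
        with self.cond:
            while self.in_flight >= self.size:   # too many outstanding: wait
                self.cond.wait()
            self.in_flight += 1

    def release(self, success):
        with self.cond:
            self.in_flight -= 1
            if success:
                self.size = min(self.size + 1, self.maximum)   # grow additively
            else:
                self.size = max(self.size // 2, 1)             # shrink multiplicatively
            self.cond.notify_all()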

- Reduce load on cache miss
> Leases: a lease is a token bound to the specific key the client requested, granting it the right to set the value (a toy sketch follows this list). It addresses two problems: (1) stale sets: setting a value that is not the latest; (2) thundering herds: heavy read/write traffic on a specific key.
Additionally, when a key is deleted, it is transferred to a data structure that holds recently deleted items for a short time before being flushed; requests can then return either a lease token or data marked as stale.
> Memcache pools: when used as a general cache, we partition a cluster's servers into separate pools, depending on the characteristics of the key access patterns.
> Replication: under some conditions we may choose to replicate within a pool instead of further dividing the key space, if (1) keys are routinely fetched together, (2) the whole data set fits in one or two servers, and (3) the request rate is too heavy for a single server.
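A toy Python sketch of the server-side lease logic described above; the token generation and the 10-second issue rate are simplifications for illustration.

import itertools, time

class LeasingCache:
    LEASE_RATE_SECONDS = 10   # issue at most one lease per key in this window

    def __init__(self):
        self.data = {}            # key -> value
        self.leases = {}          # key -> (token, issue_time)
        self.tokens = itertools.count(1)

    def get(self, key):
        if key in self.data:
            return ("hit", self.data[key])
        token, issued = self.leases.get(key, (None, 0.0))
        if token is None or time.time() - issued > self.LEASE_RATE_SECONDS:
            token = next(self.tokens)                 # miss: hand out a lease
            self.leases[key] = (token, time.time())
            return ("miss", token)
        return ("wait", None)     # another client holds the lease: back off briefly

    def set(self, key, value, token):
        current = self.leases.get(key)
        if current is None or current[0] != token:
            return False          # lease invalidated by a delete: reject the stale set
        self.data[key] = value
        del self.leases[key]
        return True

    def delete(self, key):
        self.data.pop(key, None)
        self.leases.pop(key, None)   # invalidates any outstanding lease token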

- Failure Handling
Dedicate a small set of machines, named Gutter, to take over for failed servers. When a client receives no response, it assumes the server has failed and sends a second request to the Gutter pool. If that also misses, it queries the db and inserts the result into Gutter.
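A hedged sketch of that client-side fallback path; main_pool, gutter_pool, and db are hypothetical clients, the short TTL is an illustrative value, and a timeout is treated the same as a miss here.

GUTTER_TTL_SECONDS = 10   # Gutter entries expire quickly to limit staleness

def get_with_gutter(main_pool, gutter_pool, db, key):
    value = main_pool.get(key)        # returns None on miss or no response
    if value is not None:
        return value
    value = gutter_pool.get(key)      # assume the main server failed: try Gutter
    if value is not None:
        return value
    value = db.query(key)             # miss in Gutter too: go to the database
    gutter_pool.set(key, value, expire=GUTTER_TTL_SECONDS)
    return value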

In Region Replication

Web and memcached servers are split into multiple frontend clusters. Along with a storage cluster containing the db, this defines a region. Multiple frontend clusters share the same storage cluster.

- Regional invalidations
The storage cluster is responsible for invalidation; the web server also invalidates its own cluster after a modification. On every database, invalidation daemons are deployed; they inspect the committed SQL statements, extract the deletes, and broadcast them to the memcache deployment in every frontend cluster.
Doing this directly may result in high packet rates. To address this, the invalidation daemons batch deletes into fewer packets and send them to a set of servers running mcrouter instances. These mcrouters then unpack each batch and route the individual deletes to their destinations.
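A rough sketch of the batching step; the two helpers and the batch size are illustrative assumptions, not the real daemon's interface.

BATCH_SIZE = 64   # made-up value; fewer, larger packets lower the packet rate

def run_invalidation_daemon(tail_committed_deletes, send_to_mcrouter):
    batch = []
    for key in tail_committed_deletes():   # deletes extracted from committed SQL
        batch.append(key)
        if len(batch) >= BATCH_SIZE:
            send_to_mcrouter(batch)        # mcrouter unpacks and routes each delete
            batch = []
    if batch:
        send_to_mcrouter(batch)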

- Regional pools
If users' requests are randomly routed to all frontend clusters, the cached data will be roughly the same in each. For large, rarely accessed items, we can reduce the number of replicas by having multiple frontend clusters share the same set of memcache servers, called a regional pool.

- Cold cluster warmup
Allow clients in a cold (new) cluster to retrieve data from the warm ones rather than from persistent storage.

Across Region Consistency

As defined earlier, each region consists of a storage cluster and several frontend clusters. In a multi-region deployment, one region is designated to hold the master databases, and the others contain read-only replicas. Consistency is the challenge here, as replicas may lag behind the master.

- write from master
Same as discussed above: storage cluster invalidates data via daemons.

- write from non-master region
Employ a remote marker to indicate that data in the local replica db is potentially stale and that queries should be directed to the master region. When a web server updates data with key k, it: (1) sets the marker r_k in the region, (2) writes to the master, embedding k and r_k so they can be invalidated, and (3) deletes k in the local cluster. When a request for k gets a cache miss, the server checks for the existence of r_k to decide whether to direct the query to the master or the local region.
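A sketch of this remote-marker protocol; the regional pool, db clients, and marker naming are illustrative assumptions.

def marker_key(key):
    return "r_" + key                 # illustrative marker naming

def write_from_non_master(regional_pool, master_db, cache, key, value):
    regional_pool.set(marker_key(key), 1)                      # (1) set marker r_k
    master_db.update(key, value,
                     invalidate=[key, marker_key(key)])        # (2) write to master
    cache.delete(key)                                          # (3) delete k locally

def read(regional_pool, master_db, replica_db, cache, key):
    value = cache.get(key)
    if value is not None:
        return value
    if regional_pool.get(marker_key(key)) is not None:
        return master_db.query(key)   # marker present: local replica may be stale
    return replica_db.query(key)      # no marker: the local replica is safe to read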

- operational consideration
Share the same communication channel for deletes and db replication to gain network efficiency.

Single Server Improvements

- Performance optimization
> allow automatic expansion of the hash table
> make the server multi-threaded, first with a global lock protecting shared data structures and later with fine-grained locking
> give each thread its own UDP port to reduce contention

- Adaptive slab allocator
The allocator organizes memory into slab classes, each containing pre-allocated, uniformly sized chunks of memory. Items are stored in the smallest chunk that fits. Each slab class maintains a free-list of available chunks and requests more memory when the list is empty. When the server cannot allocate more free memory, it evicts the least recently used (LRU) item within that slab class. The allocator adaptively re-balances slab assignments when a class is evicting items that are noticeably more recent than the average of the LRU items in the other classes; the slab holding the overall least recently used item is then freed and transferred to the needy class.
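A much-simplified sketch of the re-balancing decision; the data structures and the recency threshold are illustrative, not memcached internals.

import time

class SlabClass:
    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.lru_last_access = time.time()   # last-access time of this class's LRU item
        self.slabs = []                      # memory slabs owned by this class

RECENCY_FACTOR = 1.2   # evictee must be noticeably more recent than average

def maybe_rebalance(slab_classes, needy):
    # If `needy` is evicting items more recent than the average LRU age elsewhere,
    # move one slab from the class holding the oldest LRU item.
    others = [c for c in slab_classes if c is not needy and c.slabs]
    if not others:
        return
    now = time.time()
    needy_age = now - needy.lru_last_access
    avg_age = sum(now - c.lru_last_access for c in others) / len(others)
    if needy_age * RECENCY_FACTOR < avg_age:
        donor = max(others, key=lambda c: now - c.lru_last_access)
        needy.slabs.append(donor.slabs.pop())   # transfer one slab to the needy class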

- Transient Item Cache
A hybrid scheme: lazy eviction for most keys, while short-lived keys are proactively evicted when they expire. Short-lived keys are put into a circular buffer of linked lists, called the transient item cache, indexed by seconds until expiration. Every second, all items in the bucket at the head of the buffer are evicted and the head advances by one.
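A toy sketch of the transient item cache, assuming a fixed-size circular buffer of buckets indexed by seconds until expiration (the buffer length is an illustrative value).

BUFFER_SECONDS = 60   # illustrative horizon for "short-lived" items

class TransientItemCache:
    def __init__(self, cache):
        self.cache = cache                                 # underlying key-value store
        self.buckets = [set() for _ in range(BUFFER_SECONDS)]
        self.head = 0                                      # bucket for "expires now"

    def add(self, key, ttl_seconds):
        if 0 < ttl_seconds < BUFFER_SECONDS:               # only track short-lived keys
            slot = (self.head + ttl_seconds) % BUFFER_SECONDS
            self.buckets[slot].add(key)

    def tick(self):
        # Called once per second: evict everything in the head bucket, then advance.
        for key in self.buckets[self.head]:
            self.cache.delete(key)                         # proactive eviction
        self.buckets[self.head] = set()
        self.head = (self.head + 1) % BUFFER_SECONDS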

Tuesday, May 07, 2013

Some random notes on large data system design

Early stage:
Considerations at an early stage are mostly focused on the architecture of:
- DB: permanent storage and prevention of data loss at the backend.
- Cache: increasing response speed at the front end.
- Sharding: load distribution over a distributed network.

New techniques for low concurrency systems:
- GFS/HDFS
cons: single point of failure; all metadata lives on the master, so capacity is limited when there are many small files.
- Map/Reduce
- BigTable (HBase, Cassandra, etc)
- Dynamo:
Distributed Hash Table (DHT) based sharding to avoid a single point of failure. (Amazon, Cassandra)
- Virtual nodes: an extension of traditional mod-based hashing. When a node is added or removed, there is no need to remap all the data. Instead, use a much larger mod space and give each node a set of ranges; remapping only moves the keys in the affected ranges (a sketch follows below).
(Cassandra details it here: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2)
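A small Python sketch of consistent hashing with virtual nodes, assuming MD5 over a hash ring and an arbitrary virtual-node count per physical node.

import bisect, hashlib

VNODES_PER_NODE = 128   # arbitrary illustrative value

def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Each physical node owns many points (virtual nodes) on the ring.
        self.ring = sorted((_hash(f"{n}#{i}"), n)
                           for n in nodes
                           for i in range(VNODES_PER_NODE))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key):
        # A key belongs to the first virtual node clockwise from its hash.
        idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

# Adding or removing a node only remaps the ranges owned by its virtual nodes,
# instead of reshuffling every key the way plain "hash mod N" would.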

Considerations for high concurrency systems:
- Cache
- Index
- Sharding (horizontal vs vertical)
- Locking granularity

More general considerations:
- replication storage:
the famous 3-replica scheme (local, local cluster, remote)
- random write vs. sequential write:
buffer concurrently written large data in memory, then write it to disk sequentially (a sketch follows below).
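A minimal sketch of the "buffer in memory, flush sequentially" idea, assuming an append-only log file and an arbitrary flush threshold.

import threading

FLUSH_THRESHOLD = 4 * 1024 * 1024   # flush after ~4 MB of buffered data (illustrative)

class BufferedLog:
    def __init__(self, path):
        self.path = path
        self.buffer = []
        self.buffered_bytes = 0
        self.lock = threading.Lock()

    def write(self, record: bytes):
        with self.lock:                        # many threads may write concurrently
            self.buffer.append(record)
            self.buffered_bytes += len(record)
            if self.buffered_bytes >= FLUSH_THRESHOLD:
                self._flush()

    def _flush(self):
        with open(self.path, "ab") as f:       # one sequential append to disk
            f.writelines(self.buffer)
        self.buffer = []
        self.buffered_bytes = 0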