Tuesday, May 07, 2013

Some random notes on large data system design

Early stage:
Considerations in early stage are mostly focused on the architecture of
- DB: permanent storage and prevention of data loss at backend.
- Cache: increasing of response speed in front end.
- Sharding: load distribution over distributed network.

New techniques for low concurrency systems:
cons: single point failure, all metadata on master so capacity limited for many small files.
- Map/Reduce
- BigTable (HBase, Cassandra, etc)
- Dynamo:
Distributed Hash Table(DHT) based sharding to resolve single failure. (Amazon, Cassandra)
- Virtual nodes: extension of the traditional mod based hash. In case of node change, no need to remap all the data. Instead, using a bigger mod value and each node gets a range. Remapping is only moving of partial mode keys.
(Cassandra details it here: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2)

Considerations for high concurrency systems:
- Cache
- Index
- Sharding (horizontal vs vertical)
- Locking granularity

More general considerations:
- replication storage
the famous 3-replication: (local, local cluster, remote)
- random write vs. sequential write:
write concurrent large data into memory, then sequentially write to disk.


Post a Comment

<< Home