Qing (Matt) Zhang's technical blog: Dremel: PB analysis in 3 seconds

Google published its paper "Dremel: Interactive Analysis of WebScaleDatasets" in VLDB 2010. The purpose of this work is to perform interactive data analysis at scale, on shared clusters of commodity machines, with fast speed and fault tolerance. It access data in situ (in place), and execute queries on such data at a fraction of time compare with Map Reduce. But it's not a replacement of MR, instead it's used to analyze outputs of MR pipelines. A typical use case is to first use MR and pre-process the input data containing signals, then run some trial interactive queries in Dremel in seconds to verify the signal and improve the algorithm. Finally come out with a mature long term job.

Dremel is built on the idea of web search and parallel DBMSs: a query gets pushed down the tree, and is rewritten at each step. Then replies are aggregated and assembled as result. It also provides SQL-like language to express ad-hoc queries, execute them natively without translating into MR jobs as pig or hive.

Dremel adopts a nested columnar storage model, claiming the properties of lossless representation, fast encoding, and efficient record assembly. It uses the Protocol Buffers as the serialization approach. The syntax is as:

t = d o m | < A 1 : t [* | ?], . . ., A n : t [* | ?] >

where * denote repeated fields, and ? denote optional fields. The column striped representation is based on the notion of Repetition Levels and Definition Levels:
- Repetition Levels: the value is repeated at which repeated field
- Definition Levels: how many fields that are undefined(optional or repeated) are actually presented.
Here's an example from the paper:

http://yankaycom-wordpress.stor.sinaapp.com/uploads/2012/12/13458646314107_f.jpg

The paper gives formulas, and computation of some example entries. I'd like to give a complete explanation of the values in Name.Language.Code and Links.Backword here:

- Name.Language.Code:

> (en-us, 0, 2): it's the first Code in document r1, so r=0. code is required, so we only need to look at its parents for d. Parent is Name.Language, so d=2

> (en, 2, 2): it's in 2nd language, so r=2. d=2 as above.

> (null, 1, 1): the next Name has only url field, so Code is null. This null Code is in the 1st language, so r=1. Only Name is present, so d=1

> (en-gb, 1, 2): it's in the 1st language of its Name, so r=1. Name.Language exists, so d=2

> (null, 0, 1): this is document r2, the 1st name has no Code, so it's null. r=0 denotes a new record. Only Name exists, so d=1.

- Links.Backward:

> (null, 0, 1): r=0 as it's the 1st in doc r1. But it has no Backward value, so null. d=1 as Links exists.

> (10, 0, 2): r=0 as we're finished with doc r1, and start doc r2. Links.Backwards are optional/repeated, so d=2.

> (30, 1, 2): r=1 as 30 is in 1st Links. d=2 as above.

When storing it, each column is stored as a set of blocks, containing the r and d values, and compressed filed values. Many compression techniques can be used here to reduce the storage cost, by utilizing the properties of the encoding algorithm.

To assembly the records efficiently, the key idea is to use a FSM labeled by repetition labels. The FSM is traversed once for each record. Record assembly is for record-oriented data processing such as MR. Dremel's query execution itself can bypass this step.

Dremel adopts a SQL-like language. It inputs one or more tables with schemas, and output a nested table and its output schema. The selection operator prunes away branches of the tree, and the projection operator emits a value at the same level of nesting, as the most repeated input field.

In query execution phase, Dremel adopts a multi-level serving tree. Root server receives incoming queries, reads metadata from tables, and routes queries to next level. The root determines all horizontal partitions of the table, and intermediate servers performs parallel aggregation of partial results. The query dispatcher maintains a histogram of tablet processing times, and can re-dispatch the job based on the system load. Each server has an internal execution tree, corresponds to a physical query execution plan. The selection-projection-aggregate queries is specially optimized as it's commonly used. It can do within record and cross record aggregation directly on column data, and bypass record assembly entirely.

Qing (Matt) Zhang's technical blog

Sunday, October 20, 2013

Dremel: PB analysis in 3 seconds

0 Comments:

Previous Posts

About Me