Thursday, December 16, 2010

Techniques for large data analysis

The need to analyze large data sets is exploding nowadays. Recently I did a brief investigation of techniques for large data analysis.

1. Hashing
- Usage:
   Fast searching, insertion, and deletion; best when the data set fits in memory.
- Design:
   * hash function selection for different data type
   * collision resolution
- Extension:
   d-left hashing: split the hash table into d segments and hash each item into all of them; insert into the segment whose bucket has fewer collisions, breaking ties to the left (see the sketch below).
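
To make this concrete, here is a minimal d-left sketch in Python. The class name, the bucket sizes, and the seeded-MD5 trick standing in for d independent hash functions are all my own illustrative choices:

    import hashlib

    def bucket_index(item, seed, num_buckets):
        # Derive one of d "independent" hash values from a seeded digest.
        digest = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
        return int(digest, 16) % num_buckets

    class DLeftHashTable:
        def __init__(self, d=2, buckets_per_segment=1024):
            self.d = d
            self.num_buckets = buckets_per_segment
            # d segments, each an array of small buckets (lists).
            self.segments = [[[] for _ in range(buckets_per_segment)]
                             for _ in range(d)]

        def insert(self, item):
            # Probe every segment; insert into the least loaded bucket.
            # min() breaks ties on the segment index, i.e. to the left.
            loads = [(len(self.segments[i][bucket_index(item, i, self.num_buckets)]), i)
                     for i in range(self.d)]
            _, seg = min(loads)
            self.segments[seg][bucket_index(item, seg, self.num_buckets)].append(item)

        def lookup(self, item):
            return any(item in self.segments[i][bucket_index(item, i, self.num_buckets)]
                       for i in range(self.d))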

2. Bit map
- Usage:
   Fast searching, duplicate checking, deletion.
- Design:
   Use a bit array to represent the data: one bit per possible value (see the sketch below).
- Extension:
   * Bloom filter: Use multiple bits to represent one data point.
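
A minimal bitmap sketch in Python, assuming the items are non-negative integers with a known upper bound (the class name and demo values are illustrative):

    class Bitmap:
        def __init__(self, size):
            # One bit per possible value; a full 32-bit range needs 512 MB.
            self.bits = bytearray((size + 7) // 8)

        def set(self, n):
            self.bits[n >> 3] |= 1 << (n & 7)

        def clear(self, n):
            self.bits[n >> 3] &= ~(1 << (n & 7))

        def test(self, n):
            return bool(self.bits[n >> 3] & (1 << (n & 7)))

    seen = Bitmap(1 << 20)            # demo range: integers 0..2**20-1
    for x in [5, 9, 5]:
        if seen.test(x):
            print("duplicate:", x)    # prints: duplicate: 5
        seen.set(x)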

3. Bloom filter (I think it is worth a dedicated entry here)
- Usage:
   Duplicate checking, set union, data dictionaries
- Design:
   k hash functions mapping into the same bit array, so each data item sets k bits (see the sketch below).
- Extension:
   Counting bloom filter: expand each bit into a counter, then it can support deletion.
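
A small Bloom filter sketch in Python; simulating the k hash functions with seeded SHA-1 digests is my own shortcut, not a canonical construction:

    import hashlib

    class BloomFilter:
        def __init__(self, m, k):
            self.m, self.k = m, k            # m bits, k hash functions
            self.bits = bytearray((m + 7) // 8)

        def _positions(self, item):
            for seed in range(self.k):
                digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
                yield int.from_bytes(digest, "big") % self.m

        def add(self, item):
            for p in self._positions(item):
                self.bits[p >> 3] |= 1 << (p & 7)

        def might_contain(self, item):
            # False positives are possible; false negatives are not.
            return all(self.bits[p >> 3] & (1 << (p & 7))
                       for p in self._positions(item))

    bf = BloomFilter(m=10000, k=4)
    bf.add("apple")
    print(bf.might_contain("apple"))     # True
    print(bf.might_contain("pear"))      # almost certainly False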

4. Heap
- Usage:
   Top N in a large data set.
- Design:
   Maintain a min-heap of the N largest items seen so far; compare each incoming item with the root and replace it if larger (see the sketch below).
- Extension:
   Use a max-heap and a min-heap at the same time to maintain the median.
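
A quick top-N sketch in Python with the standard heapq module (the function name and sample data are illustrative; heapq.nlargest gives the same result in one call):

    import heapq

    def top_n(stream, n):
        # Keep a min-heap of the n largest items seen so far.
        heap = []
        for item in stream:
            if len(heap) < n:
                heapq.heappush(heap, item)
            elif item > heap[0]:          # beats the smallest of the top n
                heapq.heapreplace(heap, item)
        return sorted(heap, reverse=True)

    print(top_n(iter([7, 1, 9, 3, 8, 2]), 3))     # [9, 8, 7]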

5. Bucket
- Usage:
   Looking for the k-th element, the median, or duplicate/distinct elements
- Design:
   * step 1: define the range of each bucket and count how many elements fall into each.
   * step 2: locate the bucket containing the target and search within it; split that bucket further if the remaining data is still too big (see the sketch below).
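
A bucket-counting sketch in Python for the k-th smallest element, assuming the data are integers in a known range and can be re-read on every pass (the function name and parameters are illustrative):

    def kth_smallest(read_stream, k, lo=0, hi=2 ** 32, num_buckets=1024):
        # Narrow [lo, hi) one pass at a time until a single value remains.
        while hi - lo > 1:
            width = max(1, (hi - lo + num_buckets - 1) // num_buckets)
            counts = [0] * num_buckets
            for x in read_stream():               # one full pass over the data
                if lo <= x < hi:
                    counts[(x - lo) // width] += 1
            for i, c in enumerate(counts):        # find the bucket holding the k-th
                if k <= c:
                    lo, hi = lo + i * width, min(hi, lo + (i + 1) * width)
                    break
                k -= c
        return lo

    data = [42, 7, 99, 7, 13]
    print(kth_smallest(lambda: iter(data), 3))    # 13 (3rd smallest)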

6. External Sorting
- Usage:
   Ordering, removing duplicates
- Design: (a sort-merge strategy; see the sketch below)
   * In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to a temporary file.
   * In the merge phase, the sorted subfiles are combined into a single larger file.
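
A compact sort-merge sketch in Python, assuming one record per newline-terminated line; it leans on heapq.merge for the k-way merge phase:

    import heapq
    import tempfile
    from itertools import islice

    def external_sort(input_path, output_path, chunk_lines=100000):
        runs = []
        with open(input_path) as f:
            while True:
                # Sorting phase: read a memory-sized chunk, sort, spill to disk.
                chunk = list(islice(f, chunk_lines))
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.TemporaryFile("w+")
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        # Merge phase: k-way merge of the sorted runs into one output file.
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*runs))
        for run in runs:
            run.close()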

7. Inverted Index
- Usage:
   Allows fast full-text searches. It is the most popular data structure used in document retrieval systems such as search engines.
- Design:
   A mapping from content, such as words or numbers, to their locations in a database file, a document, or a set of documents (see the sketch below).
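
A toy inverted index in Python, assuming whitespace tokenization (real systems add stemming, stop words, and positional postings):

    from collections import defaultdict

    def build_inverted_index(docs):
        # Map each word to the set of document ids containing it.
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for word in text.lower().split():
                index[word].add(doc_id)
        return index

    docs = {1: "large data analysis", 2: "data structures", 3: "large heaps"}
    index = build_inverted_index(docs)
    # A multi-word query is the intersection of the posting sets.
    print(index["large"] & index["data"])     # {1}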

8. Trie
- Usage:
   * Data with similar structure and lots of duplicated prefixes.
   * Storing dictionary or implementing approximate matching algorithms
- Design
   Often defined recursively: each node's subtree shares the prefix spelled out on the path from the root (see the sketch below)
- Extension
   Compression: radix/Patricia tries merge chains of single-child nodes to save space
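
A minimal trie sketch in Python (the class and method names are my own; a compressed variant would merge those single-child chains):

    class TrieNode:
        def __init__(self):
            self.children = {}      # char -> TrieNode; shared prefixes stored once
            self.is_word = False

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, word):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

        def starts_with(self, prefix):
            node = self.root
            for ch in prefix:
                if ch not in node.children:
                    return False
                node = node.children[ch]
            return True

    t = Trie()
    t.insert("data")
    t.insert("database")
    print(t.starts_with("data"))      # True; both words share one path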

9. Database index
- Usage:
   Speeding up lookups and range queries over data stored on disk.
- Design:
   Classic structures: B-tree/B+-tree indexes for range scans, hash indexes for exact-match lookups.

10. MapReduce
- Usage:
   Large data sets with limited data types; distributed processing.
- Design
   * "Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes.
   * "Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output .
