Techniques for large data analysis
The demand for analyzing large data sets is exploding nowadays. Recently I did a brief survey of techniques for large data analysis.
1. Hashing
- Usage:
Fast searching, insertion, and deletion, when the data set fits in memory.
- Design:
* hash function selection for different data types
* collision resolution
- Extension:
d-left hashing: split the hash table into d segments. Hash an item into every segment and insert it into the one with the fewest collisions; on a tie, choose the leftmost.
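A minimal Python sketch of d-left hashing; the md5-salted hashes simulating per-segment hash functions and the bucket count are my own illustrative choices:

```python
import hashlib

class DLeftHashTable:
    """Toy d-left hash table: d segments of buckets; an item goes into
    the least loaded of its d candidate buckets, ties broken leftward."""

    def __init__(self, d=2, buckets_per_segment=1024):
        self.d = d
        self.buckets_per_segment = buckets_per_segment
        # One array of buckets per segment; each bucket is a small list.
        self.segments = [[[] for _ in range(buckets_per_segment)]
                         for _ in range(d)]

    def _bucket(self, key, segment):
        # Salt the key to derive an independent hash per segment.
        digest = hashlib.md5(f"{segment}:{key}".encode()).hexdigest()
        return self.segments[segment][int(digest, 16) % self.buckets_per_segment]

    def insert(self, key):
        candidates = [self._bucket(key, s) for s in range(self.d)]
        # min() keeps the leftmost bucket on ties.
        min(candidates, key=len).append(key)

    def lookup(self, key):
        return any(key in self._bucket(key, s) for s in range(self.d))
```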
2. Bit map
- Usage:
Fast searching, duplicate checking, deletion.
- Design:
Use a bit array to represent the data: one bit marks one possible value.
- Extension:
* Bloom filter: Use multiple bits to represent one data point.
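For example, a bitmap over a bounded integer domain makes duplicate checking O(1) per item. A small sketch, assuming non-negative integers below a known bound:

```python
class Bitmap:
    """Bit array over [0, size): one bit marks one possible value."""

    def __init__(self, size):
        self.bits = bytearray((size + 7) // 8)

    def set(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)

    def test(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))

# Duplicate checking over a stream of integers.
seen = Bitmap(1_000_000)
for n in [42, 7, 42, 99, 7]:
    if seen.test(n):
        print(n, "is a duplicate")
    seen.set(n)
```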
3. Bloom filter (I think it deserves a dedicated entry here)
- Usage:
Duplicate checking, set union, data dictionaries
- Design:
Apply k hash functions that map into the same bit array, so each data item sets k bits.
- Extension:
Counting Bloom filter: expand each bit into a small counter so that deletion is supported.
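A minimal Bloom filter sketch; the bit-array size m and hash count k below are arbitrary (in practice they are derived from the expected item count and the acceptable false-positive rate), and the k hash functions are simulated by salting sha1:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k                 # m bits, k hash functions
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, item):
        # May return false positives, never false negatives.
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))

bf = BloomFilter()
bf.add("hello")
print(bf.might_contain("hello"), bf.might_contain("world"))  # True False (almost surely)
```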
4. Heap
- Usage:
Top N in a large data set.
- Design:
Keep a min-heap of size N for the N largest elements; compare each incoming element against the root and replace the root when the element is larger (symmetrically, a max-heap for the N smallest).
- Extension:
Use a max-heap and a min-heap at the same time to maintain the running median.
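A top-N sketch using Python's heapq; a min-heap of size N holds the N largest items seen so far:

```python
import heapq

def top_n(stream, n):
    heap = []
    for x in stream:
        if len(heap) < n:
            heapq.heappush(heap, x)
        elif x > heap[0]:               # beats the current n-th largest
            heapq.heapreplace(heap, x)  # pop root, push x in one step
    return sorted(heap, reverse=True)

print(top_n([5, 1, 9, 3, 7, 8, 2], 3))  # [9, 8, 7]
```

The running-median extension works the same way: the max-heap holds the lower half of the data, the min-heap the upper half, and the median is read off the heap tops.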
5. Bucket
- Usage:
Finding the k-th element, the median, or duplicate/unique elements.
- Design:
* Step 1: define the range of each bucket and collect counts for each bucket.
* Step 2: compute the target's index inside the corresponding bucket and search for it there. The bucket may be split further if the remaining data is still too big.
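A selection sketch along these lines, assuming non-negative integers below 2**32 and an arbitrary bucket count; it narrows the value range pass by pass instead of sorting everything:

```python
def kth_smallest(data, k, lo=0, hi=2**32, buckets=256):
    """Find the k-th smallest value (1-based) using bucket counts.
    Re-reads `data` once per pass, so it never holds a sorted copy."""
    while hi - lo > 1:
        width = max(1, (hi - lo + buckets - 1) // buckets)
        counts = [0] * buckets
        for x in data:                        # counting pass
            if lo <= x < hi:
                counts[(x - lo) // width] += 1
        for i, c in enumerate(counts):        # locate the target bucket
            if k <= c:
                lo, hi = lo + i * width, min(hi, lo + (i + 1) * width)
                break
            k -= c                            # skip earlier buckets
    return lo

print(kth_smallest([13, 7, 42, 7, 99, 3, 56], 4))  # 13
```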
6. External Sorting
- Usage:
Sorting, duplicate removal
- Design: (a sort-merge strategy)
* In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to a temporary file.
* In the merge phase, the sorted subfiles are combined into a single larger file.
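A compact external merge sort sketch over lines of text; the chunk size is a stand-in for whatever fits in memory, and each input line is assumed to end with a newline:

```python
import heapq
import tempfile

def external_sort(lines_in, lines_out, chunk_size=100_000):
    chunks, buf = [], []

    def flush():
        # Sorting phase: sort one in-memory chunk, spill it to a temp file.
        buf.sort()
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(buf)
        run.seek(0)
        chunks.append(run)
        buf.clear()

    for line in lines_in:
        buf.append(line)
        if len(buf) >= chunk_size:
            flush()
    if buf:
        flush()

    # Merge phase: k-way merge of the sorted runs into one output.
    lines_out.writelines(heapq.merge(*chunks))
    for run in chunks:
        run.close()
```

For duplicate removal, drop repeated lines while streaming the merged output, since equal lines come out adjacent.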
7. Inverted Index
- Usage:
Allows fast full-text searches. It is the most popular data structure used in document retrieval systems such as search engines.
- Design:
A mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents.
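A toy inverted index with naive whitespace tokenization:

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs bark",
}

# Build: map each term to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Query: documents containing every query term (set intersection).
def search(query):
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("quick brown"))  # {1, 3}
```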
8. Trie
- Usage:
* Data items with similar structure and many shared prefixes or duplicates.
* Storing a dictionary or implementing approximate-matching algorithms.
- Design:
Nodes are usually defined recursively, with each path from the root spelling a shared prefix.
- Extension:
Compression: merge chains of single-child nodes (as in a radix tree).
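A minimal dict-based trie, where each node is a dict of child characters and "$" is an (arbitrary) end-of-word marker:

```python
class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True                 # marks a complete word

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

    def has_prefix(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node:
                return False
            node = node[ch]
        return True

t = Trie()
for w in ["tea", "ten", "ted"]:
    t.insert(w)
print(t.contains("tea"), t.has_prefix("te"), t.contains("te"))  # True True False
```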
9. Database index
- Usage:
Fast lookup, range queries, and ordering over large tables without a full scan.
- Design:
Classically a B-tree or B+-tree keyed on the indexed columns; hash indexes are also common for pure equality lookups.
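A quick illustration with Python's built-in sqlite3 (the table and column names are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(10_000)])

# The index turns this equality lookup from a full scan into a tree seek.
conn.execute("CREATE INDEX idx_users_email ON users(email)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?",
    ("user42@example.com",)).fetchall()
print(plan)  # the plan should mention idx_users_email
```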
10. MapReduce
- Usage:
Huge data sets with a limited variety of record types; processing that can be distributed across many machines.
- Design:
* "Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes.
* "Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output .