Thursday, May 24, 2012

Big data talk by Dheeraj from Zynga

Went to the meetup "Big Data - Challenges, Solutions and A Comparative Analysis" organized by Taleo this evening. Dheeraj Soti, a principal engineer from Zynga, gave the talk. An interesting fact: Zynga uses Vertica as its big data analytics platform (they are actually Vertica's largest customer, and their system has grown from a single-digit number of nodes to 1000 nodes today). I was wondering why they didn't use any of the popular NoSQL solutions such as Hadoop. As Dheeraj explained, the data are collected from many game centers and then bulk-loaded into Vertica, and general reports are run overnight. I had thought that since Zynga is a social game company, NoSQL would be very effective for its social data analysis; such systems are actually deployed by most leading social companies such as Facebook. Maybe I misunderstood part of his talk, but can it really be true that Zynga doesn't use any NoSQL?

Labels: ,

Sunday, May 13, 2012

Hadoop install on Ubuntu Linux

Some friends asked me how to set up Hadoop on Ubuntu Linux. I would strongly suggest Michael G. Noll's online tutorial: running-hadoop-on-ubuntu-linux-single-node-cluster
It's a very clearly written step-by-step tutorial and is especially good for new users.

For his multi-node cluster setup, I do have some additional explanations on the network setup that is omitted in his tutorial. So if you're done with his single-node tutorial and run into problems with the cluster setup, you may try examining the steps here:

1. Setup multiple single nodes
You can simply follow Michael's tutorial in the link above.

2. Cluster Networking setup
2.1 Update the hostname if needed:
# display the current hostname:
hostname
# change the hostname (edit both files, root access needed):
sudo vi /etc/hostname
sudo vi /etc/hosts
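
For example, on the node that will act as the master (the name "master" here just follows the examples below), /etc/hostname should contain that single word, and the change can be applied without a reboot:
# /etc/hostname on the master node contains a single line:
master

# apply the new hostname right away (or simply reboot):
sudo hostname master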


2.2 Networking setup
2.2.1 Set up a static IP address in /etc/network/interfaces on each node.
Example contents of the interfaces file (note that the gateway must be on the same subnet as the address; adjust the values to your own network):
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
address 192.168.0.1
netmask 255.255.255.0
gateway 192.168.0.254
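
After editing the file, the new configuration can be applied without a reboot (a minimal sketch; the interface name eth0 and the init script path assume a stock Ubuntu of that era):
# restart networking so the static IP takes effect
sudo /etc/init.d/networking restart
# or restart just the one interface
sudo ifdown eth0 && sudo ifup eth0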


2.2.2 Update /etc/hosts on all machines so every node can resolve the others by name. Example (each node gets its own distinct IP):
192.168.0.1    master
192.168.0.2    slave
192.168.0.3    slave2
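
A quick way to confirm that the static IPs and name resolution work is to ping each node by name (run from the master; the node names are the ones from the example above):
# each node should answer from its own IP
ping -c 3 slave
ping -c 3 slave2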

2.3 SSH access control
2.3.1 Remove (or truncate) all old key files in $HOME/.ssh, and regenerate the SSH keys as in the single-node setup.
2.3.2 Add the master's key to the slave's authorized_keys file, to enable password-less SSH login:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
2.3.3 On the slave machines do the same, adding their keys to the master's authorized_keys file:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@master
2.3.4 Try ssh from both the master and the slaves to master/slave, so that every host key gets accepted (see the check below).
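
A minimal check, assuming the hduser account and the host names from the examples above; each command should log in without asking for a password:
# run these from the master (and the equivalents from each slave)
ssh hduser@master
ssh hduser@slave
ssh hduser@slave2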

3. Hadoop setup
You need to modify the namespaceID of the datanodes so that it matches the namespaceID of the namenode on the master.
The namespaceIDs of the namenode and datanode are defined in:
${hadoop.tmp.dir}/dfs/name/current/VERSION
${hadoop.tmp.dir}/dfs/data/current/VERSION

Remember that ${hadoop.tmp.dir} here is the directory set in conf/core-site.xml.

If you don't do this, you might get the error java.io.IOException: Incompatible namespaceIDs in logs/hadoop-hduser-datanode-master (or: -slave).log.
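
A minimal sketch of how to check and align the IDs (this assumes hadoop.tmp.dir is set to /app/hadoop/tmp, which is only an example value; use whatever you configured in core-site.xml):
# on the namenode (master), note the namespaceID
grep namespaceID /app/hadoop/tmp/dfs/name/current/VERSION

# on each datanode, either edit its namespaceID to the value printed above
vi /app/hadoop/tmp/dfs/data/current/VERSION
# ... or wipe the datanode data directory and let it re-initialize
# (warning: this deletes the HDFS blocks stored on that datanode)
rm -rf /app/hadoop/tmp/dfs/data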

After all these are done, you can then follow the multi-node cluster setup tutorial in Michael's link.

Labels:

Friday, May 04, 2012

A nice trick to truncate a file in Linux

I always use /dev/null to truncate a file (cat /dev/null > filename). However, I just found that you can use a simpler command:
: > filename
to truncate a file to zero size in Linux.
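
For reference, a few equivalent ways to empty a file from the shell (filename is just a placeholder here):
# the null command plus a redirection (the trick above)
: > filename
# in bash, the redirection alone also works
> filename
# the explicit /dev/null form
cat /dev/null > filename
# GNU coreutils also has a dedicated command
truncate -s 0 filename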

Labels:

Wednesday, May 02, 2012

SAP HANA architecture

SAP HANA's in-memory computing is the key to the success of SAP's high performance analytics. Here are some notes about its architecture.

Data loading:
- Load directly from other sources such as ASE; ERP data are loaded through the replication server.

Query processing:
- standard SQL goes through the normal execution path:
    parser -> db optimizer -> (output physical plan) -> executor
- other scripts/languages are processed in the Calc Engine:
       Model optimizer/executor -> (logical plan) -> db optimizer -> (output physical plan) -> executor

HANA In memory computing engine:
HANA supports both a column store and a row store. The persistence layer maintains data consistency, and the in-memory computing studio provides a powerful development platform. Some technical details of each module:

Row Store:
- Writes go into a temporary transactional version memory.
- Version memory consolidation moves visible versions into the persisted segment and clears outdated data from the transactional version memory.
- Tables are linked lists of 16k memory pages, grouped into segments.
- Each table has a primary index that contains the row-id-to-page mapping. The index only exists in memory and is filled on the fly when the table is loaded.
- The persistence layer is invoked to write the log, and writes savepoints -> checkpoints.

Column Store:
- read optimized, with efficient compression
- delta storage for (asynchronous) fast writes
- reads always go to both the main and the delta storage, and the results are merged
- compression happens in the main storage, during the delta merge
- a second delta storage is needed during the delta merge so that read-write operations can continue

Persistence layer:
- Regular savepoints (full image)
- Logs since the last savepoint, for restore
- Snapshots for backups
Upon system restart:
- For the row store: the complete row store is loaded into memory.
- For the column store:
  > preload-marked column store tables are loaded;
  > other tables are loaded on demand: restored on first access.

In-memory computing studio (Information Modeler)
The modeling process flow:
--> Metadata(table) creation
--> table data loading
--> View creation
--> column view deploy
--> data consumption

I believe the in-memory computing studio is the biggest selling point of SAP HANA. The information modeler provides different attribute/analytic/calculation views of the data and integrates a complete development studio on top of HANA, which greatly simplifies the work of app developers.

Labels: ,

SAP high performance analytics and database strategy

SAP's goal is to become the No. 2 database company by 2015. Its high performance analytics offering, based on the in-memory computing platform HANA, is the key component and attracts much attention these days. Here are some brief notes:

SAP high performance analytics is composed of the following parts:
- SAP In-memory computing engine (core of HANA)
- SAP In-memory computing studio (development platform)
- Business applications or other data sources.

After the acquisition of Sybase, SAP's database technology now has a complete suite of solutions that covers the needs of all areas of business analytics. Here's the positioning of all the DB products:

- HANA: this in-memory platform enables real-time decisions. It is the target platform for SAP NetWeaver BW.

- Sybase ASE: the main supported option for SAP Business Suite applications, which became GA in April 2012. HANA will augment the extreme transaction processing of SAP Sybase ASE with real-time reporting capabilities.

- Sybase SQL Anywhere: front-end database for HANA, extending its reach to mobile and embedded applications in real time.

- Sybase IQ: a “big data” analytics tool, providing progressive integration with HANA for working with aged or cold data. HANA and Sybase IQ are also planned to extend support for accessing “big data” sources such as Hadoop, offering a pre-processing infrastructure.

If we classify by business analytics phase:
- Front-end data sources: transactional data from ASE, or streaming data from ESP
- Real-time analysis: HANA warehouse and data marts (for example, data from the most recent 3 months)
- Progressive analysis: IQ (for example, data from the past 19 months)

Labels: ,