Posts

Showing posts from April, 2014

Experts Video - Hadoop World

Watch the experts Doug Cutting, Cloudera's Chief Architect, and Todd Papaioannou, Splunk's CTO, as they share their thoughts on big data in Asia on Bloomberg Television's Singapore Sessions.

Using Apache Hadoop and Impala with MySQL for Data Analysis

Source: Cloudera blog. Thanks to Alexander Rubin of Percona.

Apache Hadoop is commonly used for data analysis. It is fast for data loads and scalable. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from MySQL to Hadoop, load the data into Cloudera Impala (columnar format), and run reporting on top of that. For the examples below, I will use the “ontime flight performance” data from my previous post.

I’ve used Cloudera Manager to install Hadoop and Impala. For this test I’ve (intentionally) used old hardware (servers from 2006) to show that Hadoop can utilize old hardware and still scale. The test cluster consists of 6 datanodes. Below are the specs:

| Purpose | Server specs |
| --- | --- |
| Namenode, Hive metastore, etc + Datanodes | 2x PowerEdge 2950, 2x L5335 CPU @ 2.00GHz, 8 cores, 16GB RAM, RAID 10 with 8 SAS drives |
| Datanodes only | 4x PowerEdge SC1425, 2x Xeon ... |

Automating things using IFTTT

I am a big fan of automation and was recently looking for a way to tweet automatically when I publish a new blog post. I found ifttt.com, which lets you create recipes like this and share them with others. The recipes run every 15 minutes, so there is a delay of at most 15 minutes between publishing a new post and getting it posted to Twitter. The acronym IFTTT is a bit cryptic to remember and expands to `IF This Then That`. The service has been running for almost 4 years, but it has been a bit flaky in my use. I was not able to create recipes on the first attempt and was also not able to add LinkedIn as a channel. Also, it's not possible to create multiple channels of the same type. For example, the new blog event cannot be sent to multiple Twitter accounts. I had to create multiple accounts with IFTTT in order to add multiple Twitter channels. Also, creating multiple triggers for a single recipe is not possible for now. Also, it would be nice to have some co...

What is a Big Data cluster?

Very often I get the question `What is a cluster?` when discussing Hadoop and Big Data. To keep it simple, `A cluster is a group or a network of machines wired together, acting as a single entity to work on a task which, when run on a single machine, would take much longer.` The given task is split and processed by multiple machines in parallel, so that it gets completed faster. Jesse Johnson explains in simple and clear terms what a cluster is all about and how to design distributed algorithms here. In a Big Data cluster, the machines (or nodes) are neither as powerful as a server grade machine nor as dumb as a desktop machine. Having multiple (like in thousands) server grade ma...
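The split-and-process-in-parallel idea can be illustrated with a small Python sketch. It is only an analogy: worker threads on one machine stand in for the nodes of a cluster, and word counting is an arbitrary example task. The function names (`split_into_chunks`, `count_words`, `parallel_word_count`) are made up for this illustration and do not come from Hadoop or any other framework. In CPython, threads will not actually speed up this CPU-bound work, so the sketch shows the structure of the approach rather than the speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_chunks):
    """Split a list into at most num_chunks roughly equal, contiguous pieces."""
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines):
    """Work done by one 'node': count the words in its share of the lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, num_workers=4):
    """Split the task, process the pieces in parallel, then merge the results."""
    chunks = split_into_chunks(lines, num_workers)
    # Each worker thread plays the role of one machine in the cluster (assumption).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # Combine the per-worker results into a single answer.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total
```

Calling `parallel_word_count(lines, num_workers=2)` on a list of text lines gives the same counts as running `count_words(lines)` on a single worker; only the way the work is divided changes.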

Big Data Trendz