Hadoop in 5 minutes

What is Hadoop really?

Hadoop is an open source project managed by the Apache Software Foundation. Its design is based on ideas originally published by Google for its own file system and MapReduce framework. There are multiple modules (code bases) available, each performing a different task, but all are focused on the primary directive of Hadoop: processing large volumes of data in a highly available environment. So what does that mean in non-geek speak? Well, Hadoop is a free software package that provides tools for data discovery, question answering, and rich analytics on very large volumes of information, often in a fraction of the time required by traditional databases.

Did I mention it was free? If you don’t believe me, then click here.

What is all this stuff?

The two required components in any Hadoop cluster are the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. These are the basic building blocks for any Hadoop setup, but there are many other “tools” or modules available for use. We’ll list most of them here with a quick explanation of their primary use.

Module	Explanation/Use
HDFS	Hadoop’s file system, which can run on a single machine or be distributed across many machines depending on your setup
MapReduce	Hadoop’s Aggregation/Synchronization tool enabling highly parallel processing…this is the true “engine” or time saver in Hadoop
Hive	Hadoop’s SQL query window, equivalent to Microsoft Query Analyzer
Pig	Dataflow scripting tool similar to a batch job or simplistic ETL processor
Flume	Collector/Facilitator of Log file information
Ambari	Web-based Admin tool utilized for managing, provisioning, and monitoring a Hadoop cluster
Cassandra	High-availability, scalable, multi-master NoSQL database platform…a database on steroids
Mahout	Machine Learning engine, which means it does complex calculations, algorithmic processing, and statistical/stochastic operations using R and other frameworks…it does serious math!
Spark	Programmable compute engine allowing for ETL, machine learning, stream processing, and graph computation
ZooKeeper	Coordinator service for all your distributed processing
Oozie	Workflow scheduler managing Hadoop jobs

Again, only HDFS and MapReduce are required to run a Hadoop environment, but these other modules certainly add extra power and flexibility to the setup.

How does it all work?

At this point, you’re probably asking how this is all possible. How could a free data platform be this cool? Well, the key is the way Hadoop processes information. Essentially, it splits one large job into many smaller ones, distributes them across a set of machines that work in parallel, and then merges the partial results back together to return the full dataset.

Imagine a Saturday morning where you’re planning out your day with all the chores to complete and errands to run. Now, imagine that you cloned yourself, creating a separate clone for each of the chores and errands, and then completed every task simultaneously. A day filled with chores and errands would now be completed in minutes…this is Hadoop in a nutshell.
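
For the curious, here is a rough sketch of that split, process, and merge idea in plain Python. It is not Hadoop code and does not use any Hadoop API. It only mimics the pattern on one machine, with a thread pool standing in for the cluster nodes. The function names (split_into_chunks, map_chunk, merge_counts, word_count) and the sample sentences are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(lines, n_chunks):
    """Divide the input into roughly equal pieces, one per worker."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(lines, n_workers=4):
    """Split the input, count each piece in parallel, then merge the results."""
    chunks = split_into_chunks(lines, n_workers)
    # The thread pool is a stand-in for cluster nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


# Hypothetical sample data, just to exercise the functions.
sample = [
    "hadoop splits the work",
    "hadoop merges the results",
    "the work finishes sooner",
]
totals = word_count(sample, n_workers=3)
print(totals["the"], totals["hadoop"])
```

Each clone (worker) only ever sees its own chunk, which is why the chores can run at the same time. The merge at the end is the only step that needs every partial result. A real cluster adds a great deal on top of this, such as moving data between machines and recovering from failures.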

Conclusion

Hadoop is a powerful tool that is getting more powerful and flexible with each new release and module. It is not to be ignored, and it has a strong position in the marketplace and within organizations as either a replacement for or an extension of the existing Enterprise Data Warehouse (EDW). If you’d like to explore Hadoop more, then I recommend that you check out the free Hortonworks Sandbox VM. It will allow you to run a complete instance, including Hive, Pig, Mahout, etc., on your local machine.

What next?

Let Axian come to the rescue and help define your BI strategy, develop a roadmap, work with your business community to identify the next project, and provide clarity and direction to a daunting task. For more details about Axian, Inc. and the Business Intelligence practice, click here to view our portfolio or email us directly to set up a meeting.