What is Hadoop? Part one: Introducing Hadoop
When it comes to big data, and to companies like Digital Contact who deal with it every day, “Hadoop” has been a buzzword closely associated with big data storage. But the majority of people and companies who are not directly involved with it may not really know what Hadoop is (other than an unusual word).
In this short series of blog posts we aim to introduce you to Hadoop. Hopefully, by the end of this series, you will have more of an understanding of how Hadoop works (without needing a degree in coding), where it originated and why it is so useful today.
So… what is Hadoop?
A typical, succinct and direct response to this question that you might find on the Internet would be something like: “Part of the Apache software project and sponsored by the Apache Software Foundation, Hadoop is a free Java-based, open-source programming framework. It enables the distributed processing of large data sets across clusters of servers.”
Clear as mud, right?
Ultimately, Hadoop is a programming framework that allows (in theory) an infinite amount of data to be stored and processed with ease. It has been designed to scale up from a single server (also referred to as a ‘node’) to thousands of machines with a high degree of fault tolerance (no loss of data should a node fail).
You may also hear the term “cluster” bandied about when dealing with Hadoop. This refers to a set of connected nodes that work collectively – ultimately acting as one entity/computer system, but utilising the power of all the machines to deliver faster and more efficient results.
But still, to the average person, most of what has just been described is a confusing mess of technical terms and buzzwords.
In English, please
It’s hard not to get bogged down in technical words when describing something that is, at its heart, a very technical thing. But if I were to try and explain Hadoop to someone who doesn’t know anything about computers or technology (like my mum), I would break it down to something like this:
When people talk about “Hadoop” they are often referring to the wider “Hadoop infrastructure”. That distinction matters because Hadoop on its own does not produce analytics and insights; in simple terms, it stores data, lots of data.
Hadoop is, essentially, a collection of hard drives connected together that can have as many additional hard drives added to it as the user requires – in theory, creating an unlimited amount of storage space.
Just to put the amount of data involved into some perspective: the average desktop PC currently has about 1 terabyte of storage capacity – in a Hadoop cluster we’re talking about thousands of terabytes of data storage.
And when you have such a large storage capacity, accessing that data could be an issue. For example: imagine having thousands of terabytes of data stored, across thousands/millions of files on your desktop PC at home and then trying to search for a file that you saved a long time ago; it would take a *very* long time to find the right file – even with standard automated search functions.
So, Hadoop (on its own) is able to store this vast amount of data, but how do users utilise this mass of 1s and 0s and gain valuable insights to benefit their companies?
To help retrieve and understand this data, Hadoop works with other applications/programs, each designed with specific roles within the Hadoop infrastructure.
We will go into more depth on some of the key programs in a later blog post, but for now, you just need to understand how the Hadoop infrastructure handles large volumes of data. One component (usually HDFS, the Hadoop Distributed File System) breaks files down into more manageable chunks and distributes them across the available storage space, keeping track of where each chunk lives so that the complete file can be put back together when it is needed. Another component (usually MapReduce) processes the data: it works on the separate chunks in parallel, on the machines where they are stored, and then combines the partial results into a single answer.
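For readers who do like to see an idea in code, here is a toy sketch in Python of the chunk-and-combine approach described above. It is not Hadoop's own code or API: the function names, the tiny chunk size and the use of plain dictionaries as pretend "nodes" are all assumptions made purely for illustration. It splits some text into chunks, spreads copies of each chunk across a few nodes, shows that the file can still be rebuilt after one node is lost, and counts words chunk by chunk before combining the results.

```python
from __future__ import annotations

from collections import Counter


def split_into_blocks(text: str, lines_per_block: int = 1) -> list[str]:
    """Break a file's text into chunks of a few lines each (toy chunk size)."""
    lines = text.splitlines(keepends=True)
    return ["".join(lines[i:i + lines_per_block])
            for i in range(0, len(lines), lines_per_block)]


def distribute(blocks: list[str], node_count: int, replicas: int = 2) -> list[dict[int, str]]:
    """Store each chunk on `replicas` different pretend nodes, round-robin."""
    if replicas > node_count:
        raise ValueError("cannot keep more copies than there are nodes")
    nodes: list[dict[int, str]] = [{} for _ in range(node_count)]
    for index, block in enumerate(blocks):
        for copy in range(replicas):
            nodes[(index + copy) % node_count][index] = block
    return nodes


def find_block(nodes: list[dict[int, str]], index: int) -> str:
    """Fetch a chunk from the first node that still holds a copy of it."""
    return next(node[index] for node in nodes if index in node)


def reassemble(nodes: list[dict[int, str]], block_count: int) -> str:
    """Rebuild the complete file from the surviving copies of each chunk."""
    return "".join(find_block(nodes, i) for i in range(block_count))


def word_count(nodes: list[dict[int, str]], block_count: int) -> Counter:
    """Count words per chunk (the 'map' step), then merge the counts (the 'reduce' step)."""
    partial_counts = [Counter(find_block(nodes, i).split()) for i in range(block_count)]
    total: Counter = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    text = "big data is big\nhadoop stores data\nnodes form a cluster\nbig cluster\n"
    blocks = split_into_blocks(text)
    nodes = distribute(blocks, node_count=3)
    nodes[1] = {}  # simulate one node failing and losing everything it held
    print(reassemble(nodes, len(blocks)) == text)
    print(word_count(nodes, len(blocks)).most_common(3))
```

Because every chunk is kept on two nodes in this sketch, wiping one node does not lose any data, which is the fault tolerance idea in miniature. A real cluster works with far larger chunks and many more machines, and runs the counting step on different machines at the same time rather than one after another.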
Through the rest of this series, we will go into more detail on the inner workings of Hadoop, its origins, its benefits and why many are finding it necessary to use. We will also clear up some common misconceptions surrounding the Hadoop infrastructure.
In our next post we delve into the brief(ish) history of Hadoop's origins and some of its benefits.