by Graham Cookson
in Big Data, Hadoop, Apache
on 27 July 2015

What is Hadoop? Part four: At the heart of Hadoop

Continuing our blog series “What is Hadoop?”, this time we look at the heart of Hadoop: the key programs that *most* companies build into their Hadoop ecosystems, and that help make Hadoop the powerful and useful infrastructure it is.

As mentioned in a previous post in this series, in its basic form, Hadoop is essentially a very large hard drive – one that is (in theory) infinitely expandable.

But when people talk about ‘Hadoop’, they tend to refer to the infrastructure as a whole. You see (and this was also touched upon in the same earlier blog post), what makes Hadoop truly powerful and versatile enough to handle big data problems are the programs that run within the Hadoop infrastructure.

Because Hadoop is an open-source project, developers and companies have been able to create programs that use the infrastructure to help solve their own specific big data problems.

Essentially what we have is an ecosystem of programs, or to be more accurate ‘sub-projects’ (programs in continuous development, as part of the larger, ongoing Hadoop project), which make up the ‘Hadoop’ that people commonly refer to.

What projects make up Hadoop then?

Now this is where things get a little tricky when trying to explain Hadoop to people who know nothing about it. There is technically no right or wrong answer to the question.

A company can run a Hadoop cluster consisting of nothing but Hadoop in its barest form (so it acts as a large, scalable hard drive). Frankly, though, this is uncommon. To get real value and benefit for the business, the majority of companies use sub-projects to help with ‘problems’ such as big data analytics, faster retrieval and processing of data, and data security/stability: pretty much whatever they want Hadoop to do.

At the time of writing this article, the current Apache Hadoop ecosystem consists of approximately 15 open-source sub-projects (that number is always changing). The most common names you will hear mentioned are: the Hadoop Common package, MapReduce, Hadoop Distributed File System (HDFS) and YARN.

That said, the two main sub-projects are usually considered to be MapReduce and HDFS, because without these two projects the Hadoop ecosystem would not work in the way it does.

Bread and butter

HDFS is essentially the bread and butter of the Hadoop framework. It stores large data files (typically ranging from gigabytes to terabytes) across multiple servers/nodes and is the component that makes Hadoop ‘fault tolerant’.

HDFS is built on the assumption that the nodes the data is stored on will, at some point, fail. Because of this assumption, HDFS splits the data into blocks and replicates each block across multiple nodes, so that the data stays reliable and can be retrieved faster. A typical HDFS block size is 128 MB, which keeps each block small enough to handle easily and to replicate multiple times; however, the block size can be configured to meet the user's requirements.
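To make the idea of blocks and replicas concrete, here is a minimal Python sketch. It is not how HDFS is implemented: the node names are hypothetical, the replication factor of 3 is an assumption (it is the usual HDFS default), and the round-robin placement is a simplification of the rack-aware placement that real HDFS uses.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size mentioned above
REPLICATION = 3                 # assumed replication factor (the usual HDFS default)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block a file of file_size bytes occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left over
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplification: real HDFS placement also takes racks into account.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + i) % len(nodes)] for i in range(replication)]
        for block_id in range(num_blocks)
    }


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
blocks = split_into_blocks(300 * 1024 * 1024)     # a 300 MB file
placement = place_replicas(len(blocks), nodes)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id] // (1024 * 1024), "MB", holders)
```

With these assumptions, a 300 MB file becomes two full 128 MB blocks plus one 44 MB block, and each block lives on three different nodes.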

This process ensures that no data is lost if a node fails: because the data is replicated, another node can jump in and serve or process the data instead, making Hadoop a very reliable method of data storage and retrieval.
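The effect of replication on a node failure can be illustrated with a small Python sketch. The placement table and node names below are hypothetical, and the functions only model the idea; they are not part of any Hadoop API.

```python
# Hypothetical placement: block id -> nodes holding a replica of that block.
placement = {
    0: ["node-a", "node-b", "node-c"],
    1: ["node-b", "node-c", "node-d"],
    2: ["node-c", "node-d", "node-a"],
}


def readable_blocks(placement, failed_nodes):
    """Map each block to the surviving nodes that can still serve it."""
    failed = set(failed_nodes)
    return {
        block_id: [node for node in holders if node not in failed]
        for block_id, holders in placement.items()
    }


def lost_blocks(placement, failed_nodes):
    """Return the ids of blocks that have no surviving replica."""
    survivors = readable_blocks(placement, failed_nodes)
    return [block_id for block_id, holders in survivors.items() if not holders]


one_failure = lost_blocks(placement, ["node-c"])
three_failures = lost_blocks(placement, ["node-a", "node-b", "node-c"])
print(one_failure)
print(three_failures)
```

In this toy setup, losing one node costs nothing because every block still has two replicas elsewhere. Data is only lost when every node holding a given block fails at once, which is why a higher replication factor trades disk space for safety.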

The MapReduce software framework is, in a nutshell, a programming model that is used to process and retrieve information from a Hadoop cluster in an efficient, parallel way.

As mentioned, the way that a Hadoop cluster is set up (using HDFS) means that all the information stored on the nodes is broken down and replicated across the servers. A standard method of searching for data (such as a home PC might use) will therefore either not work at all or take far too long, because programs are generally designed to run on one machine, not across many.

MapReduce is a process that is split into two parts, ‘Map’ and ‘Reduce’ (cleverly named, eh?).

Map essentially takes the problem (requested search query), splits it into sub-parts, and sends the sub-parts to the different nodes within the Hadoop cluster – so all the pieces run at the same time.

Reduce then takes the results from the sub-parts and combines them back together to get a single answer.

Ultimately, it allows an application/query to be broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster, on any block of data. It is a very effective way to take a big task and divide it into discrete tasks that can be done in parallel.
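The classic illustration of the Map and Reduce steps is a word count. The sketch below is plain Python, not Hadoop's own Java API; the function names and the sample chunks are made up for illustration, and everything runs on one machine, whereas on a real cluster each map call would run on the node that holds that block of data.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Group the emitted values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values for one key into a single result."""
    return key, sum(values)


# Each string stands in for a block of data held on a different node.
chunks = ["big data big cluster", "data node", "big node"]

mapped = [map_phase(chunk) for chunk in chunks]  # independent, so could run in parallel
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

Because each map call only looks at its own chunk, the chunks can be processed in any order and on any machine; only the reduce step needs to see the combined results for a given word.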


What’s in your Hadoop?

As mentioned, companies will use Hadoop in different ways. The beauty of Hadoop’s open-source infrastructure is that companies can select and utilise the best Hadoop programs/projects to fulfil their goals. No two companies are alike, and therefore neither are their uses of Hadoop.

At Digital Contact we have set up our Hadoop infrastructure with a variety of different programs and programming languages.

Without boring you with too much detail, here’s a brief rundown of the programs and programming languages unique to Hadoop that we use within our infrastructure.

*Note: Digital Contact is continuously evolving and updating its technology. This list is correct at the time of writing, but is subject to change.

  • HDFS: (see above)
  • MapReduce: (see above)
  • Apache Pig: Pig is a high-level platform for creating MapReduce programs. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to what SQL does for relational database management systems.
  • Apache Hive: Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It supports analysis of large datasets stored in Hadoop's HDFS and in compatible file systems such as the Amazon S3 file system.
  • Apache Spark: Spark is an open-source cluster computing framework. In contrast to Hadoop's two-stage disk-based MapReduce archetype, Spark's in-memory primitives are said to provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well-suited to machine learning algorithms.
  • Apache HBase: HBase runs on top of HDFS, providing a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records). 

So everyone’s Hadoop is unique?

As mentioned, the overall Hadoop project is still ongoing: it is continuously evolving and will do so for the foreseeable future. Even if something new comes along to replace Hadoop, it is an open-source project and no one technically owns the rights to it or to any of its sub-projects, so there will always be someone tinkering and developing for it.

Unique is a strong word - can we say for certain that everyone’s Hadoop is unique? Certainly not.

But the open-source nature of the Hadoop project means that any company or individual developer is able to develop their own sub-projects that tap into Hadoop’s infrastructure.

While there are some well-established programs which the majority of companies will use, it’s fair to say that most companies’ Hadoop clusters will differ from one another.

Even if another company used exactly the same programs/sub-projects that we (Digital Contact) use in our Hadoop infrastructure, there’s a strong chance they won’t use them in quite the same way we do.

Digital Contact has established its own physical Hadoop infrastructure, and we have control over its development and evolution. While we are technically not contributing directly to the Hadoop project ourselves, we are free to choose and utilise, to our advantage, whatever open-source sub-projects become available.

In the next blog post in our ‘What is Hadoop?’ series, we will take a pause and look at additional information about Hadoop you might need to know and clear up any widespread misconceptions you might have read or heard about Hadoop or its sub-projects.

