by Graham Cookson
in Big Data, Hadoop, Apache
on 27 March 2015

What is Hadoop? Part two: The brief(ish) history of everything Hadoop

In the second part of our blog series “What is Hadoop?” we take a step back and look at the origins of the platform and how it came to be as it is today.

Where Hadoop truly originates depends slightly on who you talk to. Some people say it comes from Google, others say Yahoo!, and some say Doug Cutting and Mike Cafarella invented Hadoop by themselves.

Well, the fact is that without all of the above, Hadoop wouldn’t exist today – at least not in its current form.

You see, it was a mixture of all of these contributions that led to the initial foundation and implementation of what we call ‘Hadoop’.

Arguably, yes, it was Doug Cutting and Mike Cafarella who ‘invented’ the idea of the Hadoop infrastructure and Doug Cutting is (more often than not) credited as being the creator of Hadoop – with good reason.

In the beginning…

The genesis of what became Hadoop dates back to the start of what is known as “Web 2.0” (circa 1999 onwards). This is when the Internet had been established for some time, but a dramatic shift occurred – rather than web pages being passively viewed by the user, they were becoming interactive and allowing users to collaborate. It was the dawn of user-generated content, from social networking sites, blogs, wikis and video-sharing sites.

This shift in the way content and web pages were being developed led to the need for better search engine technology.

In 2002, the then-Internet Archive search director, Doug Cutting, and University of Washington graduate student, Mike Cafarella, set out to build a better open-source search engine. Together the two developed a project called “Nutch” – which was, essentially, an advanced ‘web crawler’, designed to fetch/crawl and index web pages.

By June 2003 a successful 100-million-page demonstration version of the system had been developed. However, there were some problems.

Nutch had very much been developed with the Internet of the time in mind, with little consideration for the rapid expansion that inevitably came with Web 2.0.

Apparently, Nutch was tricky to get working: it could only run on a handful of machines and required someone to watch it constantly to make sure it didn’t break or fall down.

Google to the rescue

This is where Google came in and, inadvertently, revolutionised the Nutch project. In 2003, Google released a paper on its Google File System (GFS) and in December 2004 went on to release another paper documenting its MapReduce technology. These two technologies, GFS and MapReduce, laid the foundation for Hadoop.

On developing Nutch and seeing Google’s papers released, Doug Cutting said: “We were trying to build an open-source project that would crawl and index the entire web. Sort of an open-source version of what Google and Microsoft Bing do today.

“And we were trying to build that and we realised that you couldn’t do it on just one computer. It took more time to download the entire Internet, more time to process it than you could do on just a single computer. So we knew we needed to have it distributed across many machines.

“Around that time Google published a paper talking about how they were doing this. We, right away, saw the light and went about re-implementing what we’d done so far on top of the methods that Google described. So that was the real genesis of Hadoop.”

It should be noted, though, that Google wasn’t the only company working on this technology. It has been said that many big companies were facing the same problems and all trying (some succeeding) to make their own ‘MapReduce’ programs.

Raymie Stata, former Yahoo! CTO, praised MapReduce as “a fantastic kind of abstraction” over the distributed computing methods and algorithms most search companies were already using.

Stata went on to say: “Everyone had something that pretty much was like MapReduce because we were all solving the same problems. We were trying to handle literally billions of web pages on machines that are probably, if you go back and check, epsilon more powerful than today’s cell phones. There was no option but to latch hundreds to thousands of machines together to build the index. So it was out of desperation that MapReduce was invented.”
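To make the idea of “building the index” with MapReduce a little more concrete, here is a minimal, single-machine sketch of the pattern in Python. It is only an illustration: the function names (`map_page`, `shuffle`, `reduce_word`, `build_index`) and the example pages are made up for this post, and this is not how Google, Yahoo! or Hadoop actually implement it. In a real cluster, the map and reduce steps would be spread across many machines and the framework would handle the grouping in between.

```python
from collections import defaultdict


def map_page(url, text):
    """Map step: emit one (word, url) pair per distinct word on a page."""
    return [(word, url) for word in set(text.lower().split())]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_word(word, urls):
    """Reduce step: collapse every URL seen for a word into one sorted index entry."""
    return word, sorted(set(urls))


def build_index(pages):
    """Run map, shuffle and reduce in sequence over a dict of {url: text}."""
    pairs = [pair for url, text in pages.items() for pair in map_page(url, text)]
    return dict(reduce_word(word, urls) for word, urls in shuffle(pairs).items())


# Hypothetical pages, purely for illustration.
pages = {
    "a.example": "open source search",
    "b.example": "open web crawler",
}
index = build_index(pages)
print(index["open"])
```

Because each page is mapped independently and each word is reduced independently, both steps can in principle be handed out to as many machines as are available, which is the property the quotes above are describing.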

After taking an interest in MapReduce, Cutting and Cafarella spent the next few months building the underlying file system and processing framework that would become Hadoop, and ported Nutch on top of it.

This framework now allowed for Nutch to run across 20-40 machines and no longer required a person to continuously monitor it for breakages.

Enter Yahoo!

In 2006, Cutting went to work at Yahoo!. The company too is said to have been impressed with Google’s GFS and MapReduce papers and, much like Cutting and Cafarella, wanted to develop open-source technologies based on them.

At this point in time Cafarella moved away from corporate IT and instead focused his career on education – he is now a professor at the University of Michigan – though still working on projects associated with Hadoop.

From here, Yahoo! took what Cutting and Cafarella had created with Nutch and pulled out the storage and processing parts of the framework to create Hadoop – keeping the Nutch web crawler as a separate project in its own right.

This was not the end though. Even in 2006, after the Hadoop project had been properly established, it was not able to handle the search workloads required of it.

According to Stata it was a “slow march” to grow Hadoop’s scalability and Yahoo! invested heavily, pouring as many resources into the project as possible.

During the development of Hadoop, Yahoo! set up a ‘research grid’ for the company’s data science team. This allowed the data scientists to record interesting research results and prototype new applications, which would improve Yahoo!’s search relevance and advertising revenue. On the back of the results and insights the data scientists were seeing, Yahoo! merged its search and advertising into one unit so that Yahoo!’s sponsored search business could benefit from the new technology.

It was around 2008 that the Hadoop infrastructure was said to have been established and implemented across Yahoo!’s operations properly. By 2011, Yahoo!’s Hadoop infrastructure is said to have consisted of 42,000 nodes and hundreds of petabytes of storage.

The rise of Hortonworks

In 2011 Yahoo! made the decision to spin out its work on Hadoop and create a new, separate company called Hortonworks. It was set up as its own self-sustaining, Hadoop-specific software company – keeping it very much separate from Yahoo!’s other products and services.

Hortonworks was co-founded by Eric Baldeschwieler, the VP of Hadoop software development at Yahoo!. He brought most of Yahoo!’s Hadoop team with him to help form the company, and Hortonworks effectively took over Yahoo!’s unofficial role as ‘supervisor’ of the Apache Hadoop project.

It has been said (by its founding members) that Hortonworks was so much more than a money-making opportunity – the ideology behind it was to create a win-win for Yahoo! and the entire Hadoop-using world.

You see, while Yahoo! was the main contributor to Hadoop, there was a danger of this Hadoop project getting in the way of Yahoo!’s core services. Hortonworks was a way to continue building Hadoop software and also sell Hadoop services, without diluting or moving away from Yahoo!’s product portfolio.

Remaining open-source, Hortonworks has proven to be a willing partner to many IT vendors trying to build Hadoop businesses of their own and works closely with companies such as Microsoft, HP, SAP and Teradata to develop their internal Hadoop knowledge and create products with Hortonworks’ technology at the core.

So when did Yahoo! open Hadoop up to the world?

The simple answer to the question is “never” – because it has always been ‘open’ to the world.

This is one misconception I have come across regarding Yahoo!’s involvement in Hadoop: that Yahoo! closed the doors on Hadoop while the company was developing it and only later released it to the public. Thankfully this is not a widely held belief, but it’s good to clear it up anyway.

Since its beginnings, Hadoop has always been (and will be, for the foreseeable future) an open-source project: from the origins in Nutch, to Yahoo!’s ongoing involvement. Yahoo! has never “closed the doors” or squirreled the development of Hadoop away so that only its internal team could work on it.

Yes, Yahoo! has been, arguably, the largest contributor to the project and is credited with establishing it as a viable platform for storing and processing big data, but it has always been available for everyone (individuals and companies) to contribute to and utilise – that’s the whole point of Hadoop being an Apache Software Foundation project.

And that is the beauty of Hadoop. There is no single corporation or entity controlling Hadoop – it’s open to everyone to use and to help develop new technologies for the infrastructure. Yes, there are companies who sell Hadoop services, but they don’t ‘control’ or own Hadoop – they are there to offer a service and help those companies who could benefit from Hadoop, but don’t have the in-house skills or technology to set up a Hadoop cluster.


And that leads on to questions you often hear from some companies or individuals: “Why is Hadoop needed?” or “What’s wrong with using our traditional database/storage methods?”

Hadoop isn’t for everyone. Not every company in the world needs to jump on the Hadoop train, but on the flip side there are plenty of companies who can benefit from Hadoop, but don’t know why it’s needed.

And that is something we aim to answer in the next part of our blog series, “The evolution of data”, where we will look at exactly why Hadoop is required these days and some of its benefits.

Part one: introducing Hadoop | Part three: the evolution of data