What is Hadoop? Part three: The evolution of data
Continuing our series, “What is Hadoop?”, in this blog post we will look at why Hadoop is needed and why so many companies are adopting the technology as part of their ongoing strategies.
It’s a question I’m often asked when talking about Hadoop to people outside of Digital Contact: “Why is it even needed?”
It’s a fair question. These days, computers are basically useless without the ability to store the data that is created on them – so why do companies need something that is, essentially, a framework for storing/retrieving data?
In the beginning
When you hear the words “data storage” you would no doubt immediately think of PC hard drives, or banks of servers holding massive amounts of information for a large corporation. But data storage is so much more than that.
What we are talking about is the storage of information in a way that it is freely available for a user to access and retrieve, with little difficulty, at a later time.
Storing information is fundamental in human society, and it has been around for a very long time. I dare say, longer than you think…
In theory, the idea of data storage can be dated back some 40,000 years, to the time of the earliest known cave drawings. There is some dispute here, though. Cave drawings are generally seen by many as a method of communication for early humans – nothing more. Others, however, view the drawings as a means of ‘storing’ data: as the technology to create paper or even stone tablets didn’t exist, stories told about hunts were ‘stored’ on cave walls, giving anyone free access to the stories the paintings told.
However, even if we were to ignore cave paintings as a genuine means of storing information, one of the earliest known methods of information storage dates back some 20,000 years – to the Upper Palaeolithic era.
During this time, tally sticks – long ‘sticks’, originally made of bone, with notches carved into them – were used as an ancient memory aid, recording and documenting numbers, quantities and messages.
Tally sticks have been used throughout history, and even as other methods of communication and storage came about, such as parchment and paper, the tally stick remained a viable way to store information. One of the last known uses of tally sticks as an official record-keeping method was as recent as 1826, by the Exchequer.
And there is the obvious, more traditional example of information being stored and retrieved: libraries. Dating back as far as 2500 B.C., libraries have been places where an organised collection of sources of information is made accessible to communities for reference.
Enough of the (ancient) history lesson
The reason I mention these ancient methods of data storage is to emphasise just how long humans have been creating records of important information. As technology progressed and human intellect expanded, more information has been gathered and stored; from the early cave drawings and tally sticks, to parchments, scrolls, books and now computers. And with each advancement, the volume of information created has jumped dramatically – which is where Hadoop steps in.
Simply put, Hadoop has become a necessity out of evolution.
It’s the evolution of computers, the way we work in our everyday lives and (most significantly) the way data creation has evolved.
As mentioned, data storage is essential for all modern computers, but the earliest known (analogue) computers, dating back as far as Ancient Greece, were only able to compute information, not store it directly; it was up to the user to make a note of any information generated by such devices.
In 1946 Alan Turing presented a paper on the Automatic Computing Engine (ACE), giving the first reasonably complete design of a stored-program computer. It wasn’t until 21 June 1948 that the world’s first stored-program computer, nicknamed “Baby”, ran its first program.
But you’d have to wait for the 1960s to see ‘modern style’ computers – containing integrated circuit boards and central processing units (CPUs) – being manufactured. By the 1970s, people in academic, research and scientific institutions had regular access to single-user computers (systems that one person could easily operate alone) – however, these systems were still too expensive for home or general office use.
If you go back 20-30 years, when computer use in offices became more prevalent, data was originally generated and accumulated by office employees inputting it into systems and databases. Companies would take the information stored on their systems and use it to their own gain – there was little-to-no sharing of files and systems between individuals and corporations.
Things progressed with the dawn of the Internet, and businesses soon found that they could easily share information with their global offices, external companies and people around the world. The data being generated by employees was spreading faster and farther than ever before.
The explosion of data generation
It’s after the start of the 1980s that we see the truly staggering speed at which data generation has evolved. After the Internet was established as a fantastic tool for businesses and individual users alike, there have been two key moments where data generation effectively ‘exploded’:
As the Internet developed, ‘external users’ (everyday people) were able to easily generate data, and companies no longer relied solely on employees: this was the dawn of what is now known as Web 2.0.
The boom in the number of websites being created, online registration forms, social media sites – such as MySpace and Bebo, then Facebook and Twitter – and blog posts… basically anything created online by an individual that we now take for granted helped boost the amount of data being generated by orders of magnitude.
Websites were no longer just being created by companies for individuals to passively use; they were becoming interactive. It was becoming easy for anybody, with no need for coding or in-depth technical knowledge, to upload and generate content for others to consume.
And with the advancements in mobile technology (such as smartphones and tablets), user-generated content has grown to dizzying heights in recent years.
Rise of the machines
We have since entered another era of data generation, one that most people don’t take into consideration, and it’s occurring every day: machine data generation.
Buildings and offices use computer systems to monitor our environments – from temperature, humidity, electricity usage and security cameras recording footage, to hospital equipment recording patients’ vitals and satellites orbiting the Earth monitoring changes in climate and weather patterns on a global scale. All of these are machines accumulating and generating data in offices, and even homes, today.
But the key point to remember in all of this is that, while individuals are said to now generate more than 80% of all data in the world, it’s still up to businesses to collect and store all that data.
And that is something that wasn’t anticipated back in the 1980s, when the Internet was first established.
But have you tried…?
Another question I’ve been asked when talking about Hadoop is: “So why don’t companies just buy more memory to store the data? Surely that’s easier than implementing a Hadoop cluster?”
Memory is just one element you have to factor in with data evolution. Until recently, companies pretty much just deposited information in large databases/servers – essentially giant hard drives on which data can be stored and accessed.
It sounds simple to say: “Just buy more storage for the server” – job done, right?
Well yes and no. But mostly no.
Certainly, buying more storage will allow a company to collect more information and it is possible to retrieve that information too.
However, this idea all unravels when you factor in:
- Cost: upgrading servers is costly. The additional storage itself is pricey, and you can’t just slap in more drives – it’s a trickier process than that
- Efficiency: the larger a server’s storage capacity, the slower and harder it becomes to retrieve data – unless you pay extortionate amounts of money for better processing speed/power to compensate for the drop in efficiency
- Poor fault-tolerance: if a server is corrupted or breaks, retrieving the data can at times be almost impossible. Thankfully that’s uncommon, but should something go wrong, it means spending more time, effort and money rectifying the situation
- Data structure: one of the ‘Three Vs’ of big data is ‘Variety’. A lot of the data gathered from the Internet is a mix of structured and unstructured data. Traditional databases can store this unstructured information, but retrieving and making sense of it becomes more complicated – and, again, more costly.
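To make that last point concrete, here is a small illustrative sketch (the records and field names are invented for the example): with structured data a query is a simple field lookup, while the same fact buried in free text needs ad-hoc parsing before it can be queried at all.

```python
import csv
import io
import re

# Structured data: the schema is predefined, so a query is a direct lookup.
structured = io.StringIO("user_id,country,signup_year\n42,UK,2013\n")
row = next(csv.DictReader(structured))
country = row["country"]  # direct field access: "UK"

# Unstructured data: the same fact hidden in free text must first be
# extracted with brittle, case-specific parsing before it is queryable.
unstructured = "User 42 signed up from the UK back in 2013."
match = re.search(r"from the (\w+)", unstructured)
country_guess = match.group(1) if match else None
```

Every new shape of unstructured input needs its own extraction logic, which is exactly the extra complication and cost the ‘Variety’ point describes.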
Traditional servers/databases act like giant personal computers. Yes, you can interlink multiple machines to access the same data but, because of the way the data is processed, you essentially end up with a PC system with a very big hard drive.
The problem with this ‘giant PC’ is that the more data you store, the more processing power is required. It gets very expensive, very quickly.
The need for Hadoop
We are at a stage where colossal amounts of data are being generated on a daily basis. This rapid growth of data has changed how the data needs to be handled and managed.
If you go back to the start of the data evolution process, when data was generated only by employees, we used to bring the data to the computer’s CPU, which processed and managed the information, delivering results that were easy for humans to read and understand.
But now there is an enormous amount of data and a single CPU physically cannot process the information, because there is simply too much of it.
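A rough back-of-envelope calculation shows why. The figures below are illustrative assumptions (not from this post): suppose a 10TB dataset and a single disk that reads at around 100MB/s – just reading the data through once takes over a day, while spreading it across many machines reading in parallel, which is the distributed approach Hadoop takes, cuts that to minutes.

```python
# Illustrative numbers only: dataset size and disk speed are assumptions.
data_size_tb = 10          # total dataset size in terabytes
transfer_rate_mb_s = 100   # assumed single-disk sequential read speed

data_size_mb = data_size_tb * 1_000_000
single_disk_hours = data_size_mb / transfer_rate_mb_s / 3600

# Split the data across many machines, each reading its own slice in
# parallel -- the total read time shrinks by the number of machines.
machines = 100
parallel_minutes = single_disk_hours * 60 / machines

print(f"One disk:  {single_disk_hours:.1f} hours")
print(f"{machines} disks: {parallel_minutes:.1f} minutes")
```

The arithmetic is crude (it ignores coordination overhead and network transfer), but it captures the core problem: one machine’s disk and CPU simply cannot churn through data at this scale in any useful time.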
Here’s an example just to emphasise the size of data and the evolution we are experiencing: in 1999, the average hard drive capacity shipped by manufacturers was between 5GB and 10GB*.
[*Note: there were much larger hard drives available around that time, however 5-10GB was designated a ‘sweet spot’ where ‘dollars-per-megabyte’ and supply costs converged to create the most popular configurations.]
Imagine taking an old PC from 1999 with a 10GB hard drive and trying to load all the photos and apps from your smartphone onto it – there’s a very good chance you physically wouldn’t be able to. Average smartphone capacity currently ranges from 8GB to 32GB – up to more than three times the capacity of a desktop PC 15 years ago.
Now try to take the photos and apps from all your friends’ phones too and put them all onto that same, single 1999 PC at once. It would be physically impossible.
And those are just smartphones – small, pocket-sized devices that, while very advanced, pale in comparison to the average storage capacities of modern notebooks (500GB) and desktop PCs (1TB).
In 15 years the storage capacity of PCs has increased 100x over systems in 1999 – and that’s because the amount of data we create and consume has increased in a similar way.
Companies soon faced a very comparable situation when the Internet started to take off and data was being generated by users – the older servers and PCs that data was stored on could no longer handle the sheer volume and speed of everything being thrown at them.
More storage space could be bought, allowing companies to hold the information, but it soon became very inefficient and expensive to process that data: there was too much of it, and it was simply piling up on storage drives.
Hadoop became a solution to these problems. It gave companies the ability not only to store the vast amounts of data being generated, but also to process and handle the information in an efficient and manageable way.
How does Hadoop achieve this, where traditional database/server setups have failed? Well, that is something we will answer in our next blog post: “The heart of Hadoop”.
Part two: the brief(ish) history of Hadoop | Part four: the heart of Hadoop - coming soon