by Gemma Stannard
on 03 December 2014

Digital Contact's big data glossary

In the ever-expanding world of technology and big data, it is becoming more important to know and understand the words that big data companies use every day.

At Digital Contact we found that there are very few big data glossaries freely available and none of them appear to be fully comprehensive, or updated regularly enough.

We thought, why not make our own? So we did.

Using our own knowledge of traditional big data terms and double-checking those used by some of the biggest contributors to big data in recent years, we bring you our very own ‘Big data glossary’ – an A to Z of commonly used big data terminology.

The full glossary is available to download for free, with no log-in details required, and it will be updated as regularly as possible.

Below is just a small sample of what you can expect from the glossary, featuring examples of big data terms from each letter of the alphabet:


Anonymisation
The process of either encrypting or removing any links between individuals and their data, so that the data is made anonymous.

Brontobyte
A unit of digital information equivalent to one quadrillion terabytes, or a 1 followed by 27 zeros when written in bytes.

Clickstream Analytics
The tracking and analysis of users’ visits and clicks on websites. All actions are logged so that the data can be used for customer research and website testing.

Data Cleansing
The process of removing or amending incomplete, inaccurate, corrupt, duplicated or improperly formatted data from a database.
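
As a rough illustration of the definition above, here is a minimal Python sketch that trims stray whitespace, drops incomplete or badly formatted records and removes duplicates. The field names (`name`, `email`) and the rule that a valid email must contain an `@` are our own assumptions for the example, not part of any standard.

```python
def cleanse(records):
    """Return records with whitespace trimmed, bad rows dropped and duplicates removed."""
    seen_emails = set()
    cleaned = []
    for record in records:
        # Amend improperly formatted values: trim whitespace, lower-case the email.
        name = (record.get("name") or "").strip()
        email = (record.get("email") or "").strip().lower()
        # Remove incomplete or inaccurate rows (hypothetical validity rule).
        if not name or "@" not in email:
            continue
        # Remove duplicates, using the email as the identifying field.
        if email in seen_emails:
            continue
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


raw = [
    {"name": " Ada ", "email": "ADA@example.com"},
    {"name": "Ada", "email": "ada@example.com "},   # duplicate once tidied
    {"name": "", "email": "nobody@example.com"},    # incomplete
    {"name": "Bob", "email": "not-an-email"},       # improperly formatted
    {"name": "Cy", "email": None},                  # missing value
]
print(cleanse(raw))
```

Only one record survives here, because the other four are duplicates, incomplete or malformed.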

Exabyte
A unit of digital storage equivalent to approximately one billion gigabytes, or one thousand petabytes.

Fault-tolerant Design
A computer system designed to continue fully functioning if there is a fault in one or more components.

Gamification
The concept of using game mechanics and elements to engage and motivate people in a non-gaming context. Usually used in online marketing to encourage engagement, create brand awareness and collect data.

Hadoop
A free, open source software project from Apache, which enables the distributed processing and storage of large datasets across clusters of servers. It is widely used by companies that deal with big data, such as Google, IBM, Yahoo! and Digital Contact.

In-database Analytics
An analytics method for processing data within a database to eliminate the time and cost of moving data to an analytics application.

JSON
A standard, easy-to-use way to store data in human-readable text. It is mainly used to exchange data between a server and a web application. Short for JavaScript Object Notation.
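
To show what this looks like in practice, the short Python sketch below turns a record into JSON text and back again using the standard library's `json` module. The record itself is made up purely for illustration.

```python
import json

# A hypothetical record, as a server might hold it in memory.
record = {"term": "Hadoop", "tags": ["storage", "processing"], "open_source": True}

# Serialise to human-readable JSON text, ready to send to a web application.
text = json.dumps(record, indent=2)
print(text)

# Parse the text back into an equivalent Python object on the receiving side.
restored = json.loads(text)
print(restored == record)
```

Note that Python's `True` is written as `true` in the JSON text, which is part of what keeps the format independent of any one programming language.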

Key/Value Database
A simple database management system where each piece of data is stored with a unique primary key to make it easier and quicker to look up.
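
The idea can be sketched in a few lines of Python using an in-memory dictionary. The `KeyValueStore` class and its method names are hypothetical and only illustrate the lookup-by-key principle; a real key/value database adds persistence, distribution and much more.

```python
class KeyValueStore:
    """A toy in-memory key/value store: every value is found via its unique key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Storing under an existing key replaces the old value, so keys stay unique.
        self._data[key] = value

    def get(self, key, default=None):
        # A direct lookup by key; no scan through the other records is needed.
        return self._data.get(key, default)

    def delete(self, key):
        # Remove the key if present; do nothing otherwise.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("customer:42", {"name": "Ada", "city": "Bristol"})
store.put("customer:43", {"name": "Bob", "city": "Leeds"})
print(store.get("customer:42"))
print(store.get("customer:99", default="not found"))
```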

Legacy System
A computer system, application or technology that is outdated or no longer supported, but is still used because there is no newer, better alternative or because it is too expensive to replace.

Mechanical Turk
Amazon’s crowdsourcing marketplace that allows companies to integrate human intelligence into their applications to perform tasks that computers aren't able to do, for example product categorisation or advanced facial recognition.

NameNode
The centre of an HDFS file system, which holds the directory tree of all files in the file system.

Outlier
A point of data within a dataset that deviates markedly from the rest of the data. An outlier may be an anomaly, or it may indicate that further analysis is required.
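
One common way to flag such points, which we use here purely as an example, is the interquartile range (IQR) rule: anything more than 1.5 times the IQR below the first quartile or above the third quartile is treated as an outlier. The 1.5 multiplier is a widely used convention rather than part of the definition, and the sample figures are made up.

```python
import statistics


def find_outliers(values, multiplier=1.5):
    """Return the values lying outside the IQR fences (a conventional rule of thumb)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical daily page-load times in seconds, with one suspicious reading.
load_times = [10, 12, 11, 13, 12, 11, 95, 12]
print(find_outliers(load_times))
```

Whether the flagged reading is a genuine anomaly or a data-entry error still has to be decided by a person looking at the context.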

Predictive Modelling
The process of developing a model based on predictive analysis that can predict probabilities of future trends or outcomes.

Query Analysis
Analysing a query so it can be optimised for the fastest and best possible result, and the least resource cost.

Random Access Memory (RAM)
The memory in a computer that holds data currently in use, such as open programs and the operating system, because it is much faster to read from and write to than the hard drive. Data remains there only for as long as the program or application is running.

Sentiment Analysis
The process of measuring people’s opinions on the web using natural language processing and text analytics, usually used to track the reputation of a company, brand, product or event. Also referred to as opinion mining.

Transactional Data
Reference data directly generated as a result of a transaction, usually in the fields of finance, business or logistics. Data that is logged may include prices, times, places or quantities.

Unstructured Data
Data that is irregular and has no identifiable structure, typically text, or a mixture of text, numbers and dates. This is much more difficult to process than data already defined and organised in a database.

Variable Pricing
Using real-time data monitoring to adjust product pricing in response to supply and customer demand.

Web Hosting Service
A service that provides clients with internet connectivity and server space within a data centre to make their website accessible online.

XML
A language that allows information to be shared in a consistent way by defining it in a tag structure with relationship rules, similar to HTML. It was designed for web pages to be able to display correct content from any XML-compatible application. Short for Extensible Markup Language.

Yahoo! Pipes
A web application by Yahoo! that allows users to aggregate content from around the web to create a custom feed, then publish it as a web-based application.

Zettabyte
A unit of digital storage equivalent to 1,000 exabytes. All of the data traffic in the world can currently fit into a zettabyte, but it is expected to surpass this in 2016.
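
To see how the storage units in this glossary relate to one another, here is a small Python sketch that converts between them. It assumes the decimal (power-of-ten) definitions used in this post; the `UNITS` table and `convert` helper are our own illustration.

```python
# Size of each unit in bytes, using decimal (power-of-ten) definitions.
UNITS = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
    "brontobyte": 10**27,
}


def convert(amount, from_unit, to_unit):
    """Convert an amount of storage from one unit to another."""
    return amount * UNITS[from_unit] / UNITS[to_unit]


print(convert(1, "zettabyte", "exabyte"))    # one zettabyte in exabytes
print(convert(1, "exabyte", "gigabyte"))     # one exabyte in gigabytes
print(convert(1, "brontobyte", "terabyte"))  # one brontobyte in terabytes
```

The three results match the definitions given in this glossary: a thousand, a billion and a quadrillion respectively.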

If you feel we have missed terms that you have heard about, or that should be added to the complete glossary, feel free to contact us.

To keep up-to-date with our newest blog articles you can follow us on Facebook, Twitter or LinkedIn.
