Where does Big Data go and how is it stored?


When we talk about Big Data, we usually talk about huge amounts of data that are, in many cases, written constantly. The data can also vary a lot in structure. Think of a typical Big Data source as a machine on a production line that continuously produces sensor readings for temperature, humidity, etc. That is not the kind of data you would typically find in your DWH.

What would happen if you transformed all this data to fit into a relational database? If you have worked with ETL a lot, you know that extracting data from the source, transforming it to fit a schema, and then storing it takes time; it is a bottleneck. Enforcing a schema on write is too slow. This approach is usually also too costly, because you need expensive appliances to run your DWH, and you would not want to fill them with raw sensor data.

You need fast writes on cheap hardware. With Big Data, you first store the data schemaless (often referred to as unstructured data) on a distributed file system. This file system splits large files into blocks (typically around 128 MB) and distributes them across the cluster nodes. Because each block is replicated to several nodes, individual nodes can go down without losing data.
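To make the block mechanics concrete, here is a tiny, purely illustrative Python sketch (not a real HDFS client) of how a file of a given size would be cut into 128 MB blocks, with each block placed on several nodes. The node names and the round-robin placement policy are invented for illustration; real HDFS placement is rack-aware.

```python
# Illustrative only: conceptual block splitting and replication,
# not an actual HDFS implementation.
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the typical HDFS default
REPLICATION = 3                  # HDFS default replication factor
NODES = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster

def place_blocks(file_size_bytes):
    """Return a mapping of block index -> nodes holding a replica."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    node_cycle = itertools.cycle(NODES)
    placement = {}
    for block in range(num_blocks):
        # Each block is copied to REPLICATION distinct nodes, so the
        # file survives the loss of any single node.
        placement[block] = [next(node_cycle) for _ in range(REPLICATION)]
    return placement

# A 1 GB file becomes 8 blocks, each stored on 3 of the 5 nodes.
print(place_blocks(1024**3))
```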

If you come from the traditional DWH world, you are used to technologies that work well with data that is well prepared and structured. Hadoop and its ecosystem are good at finding insights in raw data, like searching for the needle in the haystack. You gain that power by parallelising the processing, which lets you work through huge amounts of data.

Imagine you have collected terabytes of data and want to run some analysis on it (e.g. a clustering). On a single machine it would take hours. The key idea of Big Data systems is to parallelise execution in a shared-nothing architecture: if you want more performance, you add machines and scale out horizontally, which speeds up processing even over huge amounts of data.
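As a concrete sketch of such a parallel analysis, here is a minimal PySpark clustering job. It assumes a running Spark cluster and a hypothetical Parquet dataset at hdfs:///data/sensors with numeric temperature and humidity columns.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("sensor-clustering").getOrCreate()

# Hypothetical dataset; Spark reads its blocks in parallel.
df = spark.read.parquet("hdfs:///data/sensors")

# Spark ML expects a single vector column of features.
assembler = VectorAssembler(inputCols=["temperature", "humidity"],
                            outputCol="features")
features = assembler.transform(df)

# fit() is distributed: each executor works on its local partitions,
# so adding nodes shortens the run.
model = KMeans(k=3, seed=42).fit(features)
print(model.clusterCenters())
```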

Looking at a modern Big Data stack, at the bottom you have data storage. This can be Hadoop with a distributed file system such as HDFS, or a similar file system. On top of that sits a resource manager, such as YARN, that allocates the cluster's compute resources. On top of that again, a data processing engine such as Apache Spark orchestrates the execution against the storage layer.
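Here is a minimal sketch of how these layers interact from the application's point of view, assuming a Hadoop cluster with YARN as the resource manager. The path and app name are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stack-demo")
         .master("yarn")   # let YARN allocate executors on the cluster
         .getOrCreate())

# Spark asks HDFS where the blocks live and tries to run each task on
# a node that already holds the data ("data locality").
logs = spark.read.text("hdfs:///data/raw/machine-logs")  # hypothetical path
print(logs.count())
```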

On top of the core processing engine, you have applications and frameworks such as machine learning APIs that help you find patterns in your data. You can run unsupervised learning algorithms to detect structure (such as a clustering algorithm), or supervised machine learning algorithms that learn from labeled examples and can predict outcomes (e.g. linear regression or random forests).
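For the supervised case, here is a hedged PySpark sketch of a random forest classifier. The dataset path, the feature columns, and the 0/1 "label" column (e.g. whether the machine failed) are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("failure-prediction").getOrCreate()

df = spark.read.parquet("hdfs:///data/labeled_sensors")  # hypothetical path

features = VectorAssembler(
    inputCols=["temperature", "humidity", "vibration"],  # assumed columns
    outputCol="features").transform(df)

train, test = features.randomSplit([0.8, 0.2], seed=42)

# "label" is assumed to be a 0/1 column, e.g. machine failure yes/no.
model = RandomForestClassifier(labelCol="label").fit(train)
predictions = model.transform(test)
predictions.select("label", "prediction").show(5)
```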

That is Big Data in a nutshell, for people experienced with traditional database systems.


Big data, simply put, is an umbrella term used to describe large quantities of structured and unstructured data collected by large organizations. Typically, the amounts of data are too large to be processed by traditional means, so state-of-the-art solutions using embedded AI, machine learning, or real-time analytics engines must be deployed to handle them. Sometimes, the phrase "big data" is also used to describe tech fields that deal with data of large volume or velocity.

Big data can go into all sorts of systems and be stored in numerous ways, but it is often stored without structure first and then turned into structured data sets during the extract, transform, load (ETL) stage. ETL is the process of copying data from multiple sources into a single destination, often in a different form than it had in the original source. Most organizations that need to store and use big data sets will have an advanced data analytics solution. These platforms let you combine data from otherwise disparate systems into a single source of truth, where you can use all of your data to make the most informed decisions possible. Advanced solutions can even provide data visualizations for at-a-glance understanding of the information that was pulled, without the need to worry about the underlying data architecture.
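As a rough sketch of what that ETL stage can look like in practice, here is a minimal PySpark job that extracts raw JSON events, transforms them into a typed schema, and loads them as partitioned Parquet. All paths and field names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-demo").getOrCreate()

# Extract: schemaless JSON as it landed on the distributed file system.
raw = spark.read.json("hdfs:///landing/events")  # hypothetical path

# Transform: pick and type the fields downstream consumers need.
clean = (raw
         .withColumn("event_time", F.to_timestamp("timestamp"))
         .withColumn("event_date", F.to_date("event_time"))
         .select("machine_id", "event_date", "event_time", "temperature"))

# Load: write a structured, query-friendly copy, partitioned by day.
(clean.write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("hdfs:///warehouse/events"))  # hypothetical destination
```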