How companies like Google store data…

Distributed Storage Cluster and Hadoop

Bhavesh Kakrotra
4 min read · Sep 17, 2020

It is said that the amount of data created in the world up to 2003/2004 is roughly the same amount that netizens all around the globe now produce in just two days. Over the past decade, internet data usage has surged with every passing day. With the rise of platforms like YouTube, Facebook, Instagram and many more, both the use of the internet and the demand for data storage are on a powerful upward trajectory.

During the pandemic, internet usage has risen by about 70%, with everyone sitting at home attending classes and meetings, playing multiplayer games, and watching their favorite shows and movies on OTT video streaming services. More data is being generated even as you read this story.

The Data Comparison Table

The table above shows just how large the units for the world's data storage get today. The comparison goes even further: about a thousand yottabytes make one brontobyte, then a geobyte and beyond… but today's data is dealt with in zettabytes and yottabytes.

In God we trust; all others must bring data.

Hence the problem of Big Data arises…

Big Data is relative. If an organization's data grows big enough that it no longer fits its systems, that is big data for it. And what was big data five years ago is not big data today.

Big data comes with five main problems, often called the five V's:

Volume:

Volume, also known as size, is the main reason big data occurs in the first place, and it is pretty self-explanatory: there is simply more data than a single system can hold.

Velocity:

The data storage requirement is so high that, for obvious reasons, it cannot be handled by our everyday commodity hardware, for both durability and speed reasons. One of the biggest hard disks available is about 18 TB. SSDs can be much bigger, as much as 100 TB, but that is still very small when we consider the volume. The makers of storage devices also avoid building very large single devices, because they give rise to a newer problem with input-output, or I/O, streams: all reads and writes must squeeze through one channel.
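To see why a single huge device hits an I/O wall, a back-of-the-envelope calculation helps. The sketch below assumes a sequential read throughput of about 250 MB/s, a plausible figure for a large spinning disk; the exact numbers are illustrative assumptions, not from this article.

```java
public class IoBottleneck {
    public static void main(String[] args) {
        double diskTerabytes = 18.0;   // one of the largest HDDs available
        double throughputMBps = 250.0; // assumed sequential read speed (MB/s)

        double diskMegabytes = diskTerabytes * 1_000_000; // 1 TB = 1,000,000 MB (decimal units)
        double seconds = diskMegabytes / throughputMBps;
        double hours = seconds / 3600;

        // Reading the whole disk through a single I/O stream:
        System.out.printf("Full read of %.0f TB at %.0f MB/s: %.1f hours%n",
                diskTerabytes, throughputMBps, hours); // ~20 hours
        // Capacity grew, but the pipe did not.
    }
}
```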

Variety:

Different types of data may come from different platforms, or from the same one. Take Facebook as an example: its users generate around four whopping petabytes of data per day, covering various types of content including likes, shares, comments, photos, videos, files and so much more.
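To get a feel for what four petabytes a day means in real time, a quick unit conversion helps; the figures below are simple arithmetic on the reported number, not measurements from the article.

```java
public class IngestRateSketch {
    public static void main(String[] args) {
        double petabytesPerDay = 4.0;                          // reported daily volume
        double gigabytesPerDay = petabytesPerDay * 1_000_000;  // 1 PB = 1,000,000 GB
        double secondsPerDay = 24 * 3600;

        // Average ingest rate implied by 4 PB/day:
        System.out.printf("%.1f GB/s sustained, around the clock%n",
                gigabytesPerDay / secondsPerDay);              // ~46 GB/s
    }
}
```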

Veracity:

The worthiness of data, its accuracy and trustworthiness, depends on its source, and it comes into question whenever your goal is to analyze the data.

Value:

Now, data is considered useful only if it has value. Companies spend millions and billions just to fetch valuable data for purposes like advertising and more. Like veracity, value comes into question when your goal is to analyze the data and do something with it.

Solving the issues of big data

To solve the issues of big data, the two biggest problems needed to be tackled first… volume and velocity.

To do so, some genius minds came up with the concept of distributing the storage devices, so that the data is divided and distributed equally across different storage devices. This reduced the time taken by data transfers and took transfer speeds to the next level.

Distributed storage topology

Let's suppose there is a file that takes 10 minutes to transfer. If we split the file equally and send the pieces to 10 different storage devices simultaneously, the transfer takes only 1 minute. This came to be known as Distributed Storage. A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes. This solves the issues of both volume and velocity.
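Here is a minimal sketch of that idea in Java: a payload is cut into equal chunks and each chunk is written to a different "device" in parallel. The device directories, chunk count, and payload size are made-up placeholders for illustration; a real system would also handle replication and metadata.

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

public class DistributedWriteSketch {
    public static void main(String[] args) throws Exception {
        byte[] payload = new byte[10 * 1024 * 1024]; // pretend this is a big file
        int devices = 10;                            // number of storage devices
        int chunkSize = (int) Math.ceil(payload.length / (double) devices);

        ExecutorService pool = Executors.newFixedThreadPool(devices);
        List<Future<Path>> writes = new ArrayList<>();

        for (int i = 0; i < devices; i++) {
            final int idx = i;
            writes.add(pool.submit(() -> {
                int from = idx * chunkSize;
                int to = Math.min(from + chunkSize, payload.length);
                byte[] chunk = Arrays.copyOfRange(payload, from, to);
                // Each "device" is simulated here as a local directory.
                Path dir = Files.createDirectories(Paths.get("device-" + idx));
                return Files.write(dir.resolve("chunk.bin"), chunk);
            }));
        }

        for (Future<Path> w : writes) {
            System.out.println("wrote " + w.get()); // all 10 chunks land in parallel
        }
        pool.shutdown();
    }
}
```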

Companies across the industry use a technology called Hadoop to achieve this.

Hadoop uses a master-slave architecture for this purpose, where a cluster comprises a single NameNode (the master node) and all the other nodes are DataNodes (slave nodes).
It has a block-structured file system (known as HDFS, the Hadoop Distributed File System) where each file is divided into blocks of a pre-determined size (128 MB by default in Hadoop 2.x). These blocks are stored across a cluster of one or several machines.

Though one can run several DataNodes on a single machine, in the practical world these DataNodes are spread across various machines.
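As a rough sketch of how a client talks to HDFS, the Java snippet below writes a file and then asks the NameNode where its blocks landed. It uses the standard org.apache.hadoop.fs API; the NameNode address, file path, and sizes are placeholder assumptions, not values from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address
        conf.set("dfs.blocksize", "134217728");           // 128 MB blocks (the 2.x default)

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.bin");    // placeholder path

        // Write some data; HDFS splits it into blocks behind the scenes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            byte[] buf = new byte[1024 * 1024];
            for (int i = 0; i < 300; i++) {               // ~300 MB -> three 128 MB blocks
                out.write(buf);
            }
        }

        // Ask the NameNode which DataNodes hold each block.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block @" + loc.getOffset()
                    + " on DataNodes " + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}
```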

Hadoop is built on top of Java, so it requires Java to run.
You can get it from here.

Feel free to contact me on LinkedIn.
