Big Data? What’s that?

Deepak
3 min read · Sep 17, 2020


Big Data is a term used to describe a collection of data that is huge in volume and yet growing exponentially with time.

This data is generated at every moment: whenever you click, search 🔎, or comment 💬. Whatever you do on a website is recorded, and that is how a huge amount of raw data is generated.

Types of Data Generated

  1. Structured data
    Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository, typically a database. It covers all data that can be stored in a SQL database in a table with rows and columns.
  2. Semi-Structured data
    Semi-structured data is information that does not reside in a relational database but that has some organizational properties that make it easier to analyze. With some processing, you can store it in a relational database.
  3. Unstructured data
    Unstructured data is data that is not organized in a predefined manner and does not have a predefined data model, so it is not a good fit for a mainstream relational database. Analysing this type of data is very difficult.
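
To make the distinction concrete, here is a small Python sketch (the records below are made-up examples, not from any real dataset) of what each type typically looks like in code:

```python
import json

# Structured data: fixed rows and columns, as in a SQL table.
structured_row = ("101", "Deepak", "2020-09-17")  # (id, name, signup_date)

# Semi-structured data: no fixed schema, but keys/tags give it
# organizational properties (e.g. JSON, XML).
semi_structured = json.loads('{"id": 101, "name": "Deepak", "tags": ["big-data", "hadoop"]}')

# Unstructured data: no predefined model at all (free text, images, video).
unstructured = "Loved the article! Big Data is everywhere these days..."

print(structured_row[1])        # addressable by position/column
print(semi_structured["tags"])  # addressable by key, schema may vary
# The unstructured string must be parsed/analysed before it is addressable.
```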

Big Data Statistics

  • Unstructured data is a problem for 95% of businesses.
  • People are generating 2.5 quintillion bytes of data each day.
  • By 2023, the big data industry will be worth an estimated $77 billion.
  • Nearly 90% of all data has been created in the last two years.

Big Data is not only BIG, it is more complex than it looks.

Big data analytics is often framed through the concept of the four V’s:

  • Volume: Big data is just big.
  • Variety: Big data is highly varied and diverse.
  • Velocity: Big data is growing at exponential speed.
  • Veracity: The accuracy of big data can vary greatly.

Solution?

Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts, assigns them to many computers, and then collects the results, which, when integrated, form the result dataset.
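
Here is a minimal sketch of that idea in plain Python — a toy word count, not Google’s actual implementation. The map step emits key-value pairs, the pairs are grouped (shuffled) by key, and the reduce step combines each group into the final result:

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in document.split()]

def reduce_phase(word, counts):
    # Combine all values emitted for one key.
    return word, sum(counts)

# In a real cluster each document would go to a different machine;
# here we simply loop over the splits.
documents = ["big data is big", "data is growing"]
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):   # map step (runs in parallel)
        grouped[word].append(count)      # shuffle: group values by key

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(dict(results))  # {'big': 2, 'data': 2, 'is': 2, 'growing': 1}
```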

Doug Cutting and his team developed an open-source project called Hadoop.

Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel across the nodes of a cluster.

Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. The Hadoop framework works in an environment that provides distributed storage and computation across clusters (thousands of nodes) of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations that can handle large-scale processing. As an alternative, you can tie together many commodity single-CPU computers into a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput, and the cluster is cheaper than one high-end server. This is the first motivational factor behind Hadoop: it runs across clusters of low-cost machines.

  • Input data is broken into blocks of 128 MB, and the blocks are moved to different nodes (see the sketch after this list).
  • Once all the blocks of the data are stored on data nodes, the user can process the data.
  • The Resource Manager then schedules the program (submitted by the user) on the individual nodes.
  • Once all the nodes have processed the data, the output is written back to HDFS.
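
As a rough sketch of the first step, this is how the block count works out under HDFS’s default 128 MB block size (the 1 GB file below is just a made-up example):

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size

def hdfs_blocks(file_size_mb):
    # A file is split into fixed-size blocks; the last block may be smaller.
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

# e.g. a hypothetical 1 GB log file:
print(hdfs_blocks(1024))  # 8 blocks, spread across different data nodes
```

Each block is also replicated (three copies by default), so another node can take over if one fails.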

Conclusion

Hadoop is the solution to most of these problems, but it has limitations of its own and problems it cannot solve. A single piece of software cannot fix every issue you have.
