Apache Spark
This is going to be the best blog for understanding Apache Spark ever. After reading it, you are going to say, "OH EM GEE, Apache Spark is so darn simple."
Let's get into it. Did you know we are now living in a world of quintillions of bytes of data? We have moved far beyond gigabytes or terabytes. Data at this scale is not stored in an ordinary database; instead we use systems like Apache Kafka, Snowflake, Amazon Redshift and others to store and retrieve it. So what if we want to run machine learning algorithms on this huge amount of data, or build our own ChatGPT? You guessed it: this is not an easy job for a single computer. Sending, retrieving and processing all that data on one system and then writing the results back to the data warehouse is simply not practical. Even if we assume the machine is reliable, it would need thousands of gigabytes of RAM, tens of thousands of gigabytes of storage, serious cooling, and much more. This is where the concept of cluster computing comes in.
What are clusters?
Clusters are groups of nodes (computers) that are linked with each other so that a big task can be divided and each node can work on its share independently. Since doing the work on a single computer is, as we saw, practically impossible, we prefer to spread the work across a cluster of machines that together retrieve the data, process it, and send the results back to the data warehouse. The framework we use to run this kind of processing on big data across a cluster is known as Apache Spark.
Before Apache Spark, we commonly used MapReduce to process big data and move it in and out of the data warehouse or data lake. But MapReduce has some cons: it is slow, the code is hard to write and read, and it fails at real-time processing. Then the king arrived in the market: Spark, which was later donated to the Apache Software Foundation, and the framework became open source for the general public.
Spark was built to work on top of HDFS (Hadoop Distributed File System), which is simply a file system spread across a cluster of computers, and it is very compatible with Hadoop. Three main components work under Spark:
- Spark Core
- Spark SQL
- Spark Streaming
Spark Core:
This is the underlying engine that provides data processing capabilities. It consists of an execution engine, an optimizer, and a data storage system. The core enables the execution of user-defined functions on distributed data and provides the necessary support for fault tolerance and resource management.
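To make this concrete, here is a minimal PySpark sketch of Spark Core's RDD API; the list of numbers and the `local[*]` master are just placeholders for a quick local run, not anything prescribed by Spark itself.

```python
from pyspark import SparkConf, SparkContext

# Spark Core exposes the RDD API: a fault-tolerant collection distributed
# across the cluster (or, here, across local CPU cores).
conf = SparkConf().setAppName("CoreExample").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Distribute a small list so each partition can be processed independently
numbers = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8])

# Transformations (map) are lazy; the action (reduce) triggers execution
squares = numbers.map(lambda x: x * x)
total = squares.reduce(lambda a, b: a + b)

print(total)  # 204
sc.stop()
```

If a node dies halfway through, Spark Core recomputes the lost partitions from the recorded lineage of transformations, which is the fault tolerance mentioned above.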
Spark SQL:
This component provides a SQL-like interface to query and process data stored in HDFS. It is designed to be compatible with other popular SQL systems, like Hive and Impala.
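As a rough sketch of what that SQL-like interface looks like in PySpark (the file name `people.json` and the columns `name`/`age` are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLExample").getOrCreate()

# Load a (hypothetical) JSON file into a DataFrame; an HDFS path works the same way
people = spark.read.json("people.json")

# Register the DataFrame as a temporary view so we can query it with plain SQL
people.createOrReplaceTempView("people")

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()
```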
Spark Streaming:
This component allows developers to create applications that process real-time data streams. It enables the processing of data streams from sources like Kafka and Flume.
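Here is a small sketch using the DStream API that Spark Streaming provides. It counts words arriving on a local socket; the host and port are placeholders, and in a real pipeline you would typically read from Kafka instead.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 5)  # process incoming data in 5-second micro-batches

# Listen to a text stream on localhost:9999 (e.g. started with `nc -lk 9999`)
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()  # print each micro-batch's word counts to the console

ssc.start()
ssc.awaitTermination()
```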
What do you mean by real-time processing?
There are two types of processing upon data:
- Batch Processing
- Real-Time Processing
Batch Processing:
In batch processing, data is retrieved and processed in chunks. We get the data pack by pack, not in a continuous manner. Don't worry, I will elaborate with an example.
Real-time Processing:
In real-time processing, data is retrieved and processed in a continuous manner, without any break in the flow.
Example:
If we go to the well to fetch a bucket of water for a shower, that is batch processing. But if we connect the well to the house with pipes, we get water continuously, non-stop; that is real-time processing.
Apache Spark supports real-time processing in a much better and more efficient way than MapReduce does.
Remember: real-time processing delivers results faster than batch processing, because in batch processing the data is first collected into a group and only then processed, which adds delay, whereas in real-time processing each piece of data is handled the moment it arrives instead of waiting in a collected batch.
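To put the same idea in Spark terms, here is a rough, hypothetical sketch (the folder `/data/events/` is made up): the batch job reads whatever files exist right now and finishes, while the streaming job keeps watching the folder and processes new files the moment they land.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BatchVsStreaming").getOrCreate()

# Batch processing: read the data that exists today, process it once, and stop
batch_df = spark.read.option("header", True).csv("/data/events/")
print(batch_df.count())

# Real-time processing: keep watching the same folder and handle new files as they arrive
stream_df = (spark.readStream
                  .option("header", True)
                  .schema(batch_df.schema)   # streaming sources need an explicit schema
                  .csv("/data/events/"))

query = stream_df.writeStream.format("console").start()
query.awaitTermination()
```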
For more such content follow me: (Muhammad Rameez)