Sunday, May 15, 2022

Introduction To Spark

Spark is a fast, general-purpose distributed data processing engine compatible with Hadoop data. On top of the Spark core data processing engine, there are libraries for SQL, machine learning, graph computation, and stream processing, which can be used together in an application.

It uses in-memory caching and optimized query execution to run fast analytic queries against data of any size, which is why it is widely used for big data workloads.
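Spark's in-memory caching works much like ordinary memoization: the first action computes a dataset and keeps it in memory, and later actions reuse that copy instead of recomputing it. Here is a plain-Python sketch of that idea (the `CachedDataset` class is a hypothetical illustration, not part of any Spark API):

```python
# Minimal sketch of in-memory caching, analogous in spirit to Spark's
# rdd.cache(). CachedDataset is hypothetical, not a Spark API.

class CachedDataset:
    def __init__(self, compute_fn):
        self._compute_fn = compute_fn   # how to (re)build the data
        self._cache = None              # in-memory copy after first use
        self.compute_count = 0          # how many times we actually computed

    def collect(self):
        if self._cache is None:
            self.compute_count += 1
            self._cache = self._compute_fn()
        return self._cache

squares = CachedDataset(lambda: [x * x for x in range(5)])
print(squares.collect())        # first action: computes the data
print(squares.collect())        # second action: served from memory
print(squares.compute_count)    # prints 1 -- the computation ran only once
```

In Spark the same effect comes from calling `.cache()` (or `.persist()`) on an RDD or DataFrame before running several actions over it.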


Spark is written in Scala but provides rich APIs for Scala, Java, Python, and R. It can run workloads up to 100 times faster than Hadoop MapReduce when data fits in memory, and up to 10 times faster when processing from disk. It integrates with Hadoop and can process existing Hadoop HDFS data.

Resilient Distributed Dataset – RDD:

RDD is an acronym for Resilient Distributed Dataset, the fundamental unit of data in Spark. An RDD is a collection of elements distributed across the nodes of a cluster that can be operated on in parallel. Spark RDDs are immutable: an existing RDD cannot be changed, but applying a transformation to it generates a new RDD.
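The two core RDD properties mentioned above, immutability and transformations that produce new datasets, can be sketched in plain Python. The `MiniRDD` class below is a conceptual illustration, not the real Spark API (in actual PySpark the equivalent would be `sc.parallelize(...)` followed by `.map(...)`):

```python
# Conceptual sketch of RDD immutability: a transformation never modifies
# the original dataset, it returns a new one. Not the real Spark API.

class MiniRDD:
    def __init__(self, elements):
        self._elements = tuple(elements)  # immutable storage

    def map(self, fn):
        # A transformation: builds a NEW MiniRDD, leaving this one untouched.
        return MiniRDD(fn(x) for x in self._elements)

    def collect(self):
        # An action: returns the materialized elements.
        return list(self._elements)

numbers = MiniRDD([1, 2, 3])
doubled = numbers.map(lambda x: x * 2)   # new dataset derived from the old
print(doubled.collect())   # [2, 4, 6]
print(numbers.collect())   # [1, 2, 3] -- the original is unchanged
```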


Lazy Evaluation:

Lazy evaluation means that transformations on RDDs are not computed as soon as they are called. The computation is performed only when an action (such as count or collect) is triggered. This lets Spark plan the whole pipeline first and limit how much work it actually has to do.
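The same idea in a plain-Python sketch (the `LazyRDD` class is hypothetical, not a Spark API): transformations are merely recorded, and nothing runs until an action asks for a result.

```python
# Sketch of lazy evaluation: map() only records the function to apply;
# the work happens when an action (collect) is called. Hypothetical
# illustration, not part of the Spark API.

class LazyRDD:
    def __init__(self, elements, pending=()):
        self._elements = elements
        self._pending = pending          # transformations recorded, not run

    def map(self, fn):
        # Transformation: returns instantly, just queues fn for later.
        return LazyRDD(self._elements, self._pending + (fn,))

    def collect(self):
        # Action: now apply every queued transformation in order.
        result = list(self._elements)
        for fn in self._pending:
            result = [fn(x) for x in result]
        return result

rdd = LazyRDD([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
# Nothing has been computed yet; collect() triggers the whole pipeline.
print(rdd.collect())   # [20, 30, 40]
```

Deferring work this way is what allows real Spark to optimize the chain of transformations as a whole before executing anything.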
