Wednesday, May 25, 2022

Spark Terminology

 Some terminology used in Spark


 Resilient Distributed Dataset – RDD: 

 RDD is an acronym for Resilient Distributed Dataset. It is the fundamental unit of data in Spark: a distributed collection of elements spread across the cluster nodes, on which Spark performs parallel operations. Spark RDDs are immutable; applying a transformation to an existing RDD produces a new RDD rather than modifying the original.

As the name suggests, an RDD is a resilient (fault-tolerant) collection of records that resides on multiple nodes and can hold objects of any type.
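
As a rough sketch in Scala (assuming a SparkSession named spark is already available, created as described later in this post), a transformation never changes the existing RDD; it returns a new one:

// Assumes an existing SparkSession called `spark`
val numbers = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))

// map() does not modify `numbers`; it returns a brand-new RDD
val doubled = numbers.map(_ * 2)

doubled.collect()   // Array(2, 4, 6, 8, 10)
numbers.collect()   // Array(1, 2, 3, 4, 5) -- the original RDD is unchanged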


DataFrame: It works only on structured and semi-structured data. Unlike an RDD, the data is organized into named columns, much like a table in a relational database. It is an immutable distributed collection of data. DataFrames let developers impose a structure onto a distributed collection of data, giving a higher-level abstraction, and they allow Spark to manage the schema.
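
A minimal DataFrame sketch (the column names and sample rows are invented for illustration, and a SparkSession named spark is assumed to exist):

// Assumes an existing SparkSession called `spark`
import spark.implicits._

// A small DataFrame with named columns, much like a relational table
val employees = Seq(
  ("Alice", "Engineering", 5000),
  ("Bob",   "Sales",       4000)
).toDF("name", "department", "salary")

employees.printSchema()                    // Spark manages the schema
employees.filter($"salary" > 4500).show()  // column-based operations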


DataSet: It is an extension of the DataFrame API that provides the type-safe, object-oriented programming interface of the RDD API. It also processes structured and unstructured data efficiently.
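
A small type-safe sketch (the Employee case class and its fields are hypothetical, and a SparkSession named spark is assumed):

// Hypothetical case class used only for illustration
case class Employee(name: String, department: String, salary: Int)

// Assumes an existing SparkSession called `spark`
import spark.implicits._

// A Dataset[Employee] carries a compile-time type, unlike an untyped DataFrame
val ds = Seq(
  Employee("Alice", "Engineering", 5000),
  Employee("Bob",   "Sales",       4000)
).toDS()

// Fields are accessed as plain Scala members, checked at compile time
ds.filter(_.salary > 4500).map(_.name).show()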


There are 3 ways of creating an RDD (a short sketch follows the list):


1) Parallelizing an existing collection of data

2) Referencing an external dataset stored in external storage

3) Creating an RDD from an already existing RDD
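
A minimal sketch of all three approaches (the HDFS path is only an example, and a SparkSession named spark is assumed):

// Assumes an existing SparkSession called `spark`
val sc = spark.sparkContext

// 1) Parallelizing an existing in-memory collection
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2) Referencing an external dataset (the path is only an example)
val fromFile = sc.textFile("hdfs:///data/sample.txt")

// 3) Creating a new RDD by transforming an existing one
val fromExisting = fromCollection.filter(_ % 2 == 0)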



SparkSession:

From Spark 2.0.0 onwards, SparkSession provides a single point of entry to interact with the underlying Spark functionality and allows programming Spark with the DataFrame and Dataset APIs. All the functionality available with SparkContext is also available with SparkSession.

In order to use the SQL, Hive, and Streaming APIs, there is no need to create separate contexts, as SparkSession includes all of these APIs.
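
A typical way to create one looks roughly like this (the application name, master and Hive support setting are placeholders):

import org.apache.spark.sql.SparkSession

// Single entry point for the DataFrame/Dataset, SQL and Hive APIs
val spark = SparkSession.builder()
  .appName("MySparkApp")       // placeholder application name
  .master("yarn")              // or "local[*]" for local testing
  .enableHiveSupport()         // optional; requires Hive libraries on the classpath
  .getOrCreate()

// The underlying SparkContext is still reachable when needed
val sc = spark.sparkContext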


SparkContext:

SparkContext is the entry gate to Apache Spark functionality. The first step of any Spark driver application is to create a SparkContext. It allows your Spark application to access the Spark cluster with the help of a resource manager (YARN/Mesos). To create a SparkContext, a SparkConf must be built first. The SparkConf holds the configuration parameters that our Spark driver application will pass to the SparkContext.

SparkConf is required to create the SparkContext object. It stores configuration parameters such as appName (to identify your Spark driver) and the number of cores and memory size of the executors running on the worker nodes.

Once the SparkSession is instantiated, we can also configure Spark's run-time config properties through it.
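
A rough sketch of the SparkConf/SparkContext style (all values are placeholders; the SparkSession alternative is noted in the comments):

import org.apache.spark.{SparkConf, SparkContext}

// Build the SparkConf first; it carries the parameters the driver passes on
val conf = new SparkConf()
  .setAppName("MyDriverApp")              // identifies your Spark driver
  .setMaster("yarn")                      // placeholder cluster manager
  .set("spark.executor.memory", "2g")     // placeholder executor settings
  .set("spark.executor.cores", "2")

// The SparkContext is then created from that configuration
val sc = new SparkContext(conf)

// If you work through SparkSession instead, run-time properties can be
// changed after instantiation, e.g.
// spark.conf.set("spark.sql.shuffle.partitions", "50")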



--num-executors: The number of executors is the number of distinct YARN containers (think processes/JVMs) that will execute your application.

--executor-cores: The number of executor cores is the number of threads you get inside each executor (container).


--executor-memory: An executor is a process launched for a Spark application on a worker node. Each executor's memory is the sum of the YARN overhead memory and the JVM heap memory; the JVM heap memory comprises the RDD cache memory and the shuffle memory.

--driver-memory: The --driver-memory flag controls the amount of memory allocated to the driver. It is 1 GB by default and should be increased if you call a collect() or take(N) action on a large RDD inside your application.

These properties can also be set in the properties file (the default is $SPARK_HOME/conf/spark-defaults.conf).
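
For example, the same resource settings could be kept in spark-defaults.conf instead of being passed on every submit (the values below are illustrative):

# $SPARK_HOME/conf/spark-defaults.conf  (illustrative values)
spark.driver.memory       4g
spark.executor.memory     2g
spark.executor.cores      2
spark.executor.instances  2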



--- Job submit ---


spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode client --driver-memory 4g --num-executors 2 --executor-memory 2g \
--executor-cores 2 /opt/apps/spark-1.6.0-bin-hadoop2.6/lib/spark-examples*.jar 10
