Some Terminology Used in Spark
Resilient Distributed Dataset – RDD:
RDD is an acronym for Resilient Distributed Dataset, the fundamental data structure of Spark. An RDD is an immutable, fault-tolerant (resilient) collection of objects of any type, distributed across the nodes of the cluster so that it can be operated on in parallel. Because RDDs are immutable they cannot be modified in place; instead, applying a transformation to an existing RDD produces a new RDD.
DataFrame: A DataFrame works only on structured and semi-structured data. Unlike an RDD, the data is organized into named columns, much like a table in a relational database. It is an immutable, distributed collection of data. DataFrames let developers impose a structure (schema) onto a distributed collection of data, providing a higher-level abstraction and allowing Spark to manage the schema.
DataSet: A Dataset is an extension of the DataFrame API that provides a type-safe, object-oriented programming interface similar to the RDD API. It also processes structured and unstructured data efficiently.
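As a rough illustration of how these three abstractions relate, here is a minimal Scala sketch (e.g., in spark-shell, which already provides a SparkSession named spark); the Person case class and its fields are illustrative, not part of any Spark API.

case class Person(name: String, age: Int)
import spark.implicits._

// RDD: a distributed collection of objects, with no column schema attached
val rdd = spark.sparkContext.parallelize(Seq(Person("Ana", 30), Person("Raj", 25)))

// DataFrame: the same data organized into named columns (name, age)
val df = rdd.toDF()
df.printSchema()

// Dataset: a typed, object-oriented view on top of the DataFrame API
val ds = df.as[Person]
ds.filter(_.age > 26).show()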
There are three ways of creating an RDD (see the sketch after this list):
1) Parallelizing an existing collection in the driver program
2) Referencing an external dataset (e.g., a file in HDFS or local storage)
3) Transforming an already existing RDD
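A minimal Scala sketch of the three approaches, assuming a SparkSession named spark is already available; the HDFS path is only an illustrative placeholder.

val sc = spark.sparkContext

// 1) Parallelize an existing collection from the driver
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2) Reference an external dataset (the path is an assumption)
val fromFile = sc.textFile("hdfs:///data/input.txt")

// 3) Transform an already existing RDD into a new one
val fromExisting = fromCollection.map(_ * 2)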
Spark Session:
From Spark 2.0.0 onwards, SparkSession provides a single point of entry for interacting with the underlying Spark functionality and allows programming Spark with the DataFrame and Dataset APIs. All of the functionality available through SparkContext is also available through SparkSession.
To use the SQL, Hive, and Streaming APIs there is no need to create separate contexts, since SparkSession exposes all of them.
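A minimal sketch of creating a SparkSession in Scala; the application name and master are illustrative, and enableHiveSupport() is only needed if you want Hive integration.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TerminologyDemo")
  .master("local[*]")
  .enableHiveSupport()   // SQL + Hive through the same session
  .getOrCreate()

// The underlying SparkContext is still reachable when needed
val sc = spark.sparkContext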
SparkContext:
SparkContext is the entry gate to Apache Spark functionality. The most important step of any Spark driver application is to create a SparkContext. It allows your Spark application to access the cluster with the help of a resource manager (YARN/Mesos). To create a SparkContext, a SparkConf should be created first.
SparkConf holds the configuration parameters that the driver application passes to SparkContext, such as appName (to identify your Spark driver), the master, and the number of cores and the memory size of the executors running on worker nodes.
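A minimal sketch of the classic SparkConf -> SparkContext flow in Scala; the application name, master, and resource values are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyDriverApp")            // identifies your Spark driver
  .setMaster("yarn")                    // resource manager: yarn/mesos/local[*]
  .set("spark.executor.cores", "2")     // cores per executor on a worker node
  .set("spark.executor.memory", "2g")   // memory per executor

val sc = new SparkContext(conf)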
Once the SparkSession is instantiated, we can configure Spark's run-time configuration properties, as in the sketch below.
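For example, a run-time property can be set and read back through spark.conf (the value shown is illustrative):

spark.conf.set("spark.sql.shuffle.partitions", "50")
println(spark.conf.get("spark.sql.shuffle.partitions"))   // prints 50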
--driver-memory: The --driver-memory flag controls the amount of memory allocated to the driver. It is 1 GB by default and should be increased if you call a collect() or take(N) action on a large RDD inside your application.
Alternatively, the same setting (spark.driver.memory) can be placed in the properties file (default: $SPARK_HOME/conf/spark-defaults.conf).
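For instance, spark-defaults.conf entries are plain key-value properties; the values below are illustrative only.

spark.driver.memory    4g
spark.executor.memory  2g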
--- Job submit (spark-submit) example ---
spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
  --deploy-mode client --driver-memory 4g --num-executors 2 --executor-memory 2g \
  --executor-cores 2 /opt/apps/spark-1.6.0-bin-hadoop2.6/lib/spark-examples*.jar 10