Sunday, May 15, 2022

Introduction To Spark

Spark is a fast, general-purpose distributed data processing engine compatible with Hadoop data. On top of the Spark core engine sit libraries for SQL, machine learning, graph computation, and stream processing, which can be used together in a single application.

It uses in-memory caching and optimized query execution to run fast analytic queries against data of any size, which is why it is widely used for big data workloads.


Spark is written in Scala but provides rich APIs for Scala, Java, Python, and R. For in-memory workloads, Spark can be up to 100 times faster than Hadoop MapReduce, and about 10 times faster when processing data from disk. It integrates with Hadoop and can process existing Hadoop HDFS data.

Resilient Distributed Dataset – RDD

RDD stands for Resilient Distributed Dataset, the fundamental unit of data in Spark. An RDD is a collection of elements distributed across cluster nodes that supports parallel operations. RDDs are immutable: transforming an existing RDD produces a new one.


Lazy Evaluation

Spark's lazy evaluation means the data inside RDDs is not computed immediately. The computation runs only when an action is triggered, which lets Spark limit how much work it actually has to do.
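The same behavior can be mimicked in plain Python with generators (an analogy only, not Spark code): chained "transformations" build a pipeline, and nothing executes until an "action" consumes it.

```python
# Plain-Python analogy for Spark's lazy evaluation (not actual Spark code).
# "Transformations" build a lazy pipeline; the "action" triggers all work.

log = []  # records when each element is actually processed

def numbers():
    for n in range(1, 6):
        log.append(f"read {n}")
        yield n

# Transformations: nothing is computed yet, just generator objects chained.
doubled = (n * 2 for n in numbers())
evens_over_4 = (n for n in doubled if n > 4)

assert log == []  # no element has been read yet -- evaluation is lazy

# Action: consuming the pipeline finally drives the computation.
result = list(evens_over_4)
print(result)    # [6, 8, 10]
print(len(log))  # 5 -- elements were read only when the action ran
```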



Apache Spark is an open-source, distributed data processing framework designed for processing large-scale, fault-tolerant, and high-performance data workloads.

It is widely used in modern data platforms for:

  • Batch Processing – large volumes of historical data
    → Using Spark Core and Spark SQL
  • Stream Processing – real-time data streams
    → Using Structured Streaming (built on Spark SQL)
  • ETL (Extract, Transform, Load) – data preparation and transformation
    → Using Spark SQL, DataFrame API, and Datasets
  • Machine Learning – scalable ML pipelines
    → Using MLlib
  • Analytics – interactive SQL and reporting
    → Using Spark SQL
  • Graph Processing – relationship and network analysis
    → Using GraphX

What are the main components of Spark?

Answer:
The core components of Apache Spark are:

1. Spark Core – basic functionality and task scheduling
2. Spark SQL – structured data processing
3. Spark Streaming – real-time data processing
4. MLlib – machine learning library
5. GraphX – graph processing

 

1. Batch Processing

Batch processing collects data over a period of time and processes it all at once as a single "batch." This usually happens on a schedule (e.g., every night at 2 AM).

  • How it works: Data is collected, stored, and then processed in bulk.
  • Latency: High (minutes to days).
  • Best for: Massive volumes of data where immediate results aren't needed.
  • Example: Calculating monthly payroll, generating daily sales reports, or clearing bank transactions at the end of the day.
  • Tools: Apache Spark (RDD/DataFrames), AWS Glue, Snowflake, dbt, Airflow.

2. Streaming (Near Real-Time)

Streaming processes data records one by one as they arrive. It’s designed for continuous flow.

  • How it works: Instead of waiting for a "bucket" to fill up, the system processes each "drop" of data as it falls.
  • Latency: Low (seconds to milliseconds).
  • Best for: Monitoring systems where a delay of a few seconds is acceptable.
  • Example: Tracking GPS locations of delivery drivers, updating a live "trending topics" list on social media, or monitoring website traffic logs.
  • Tools: Apache Kafka, Spark Streaming, Amazon Kinesis, Google Pub/Sub.


 

Data Skew

Data skew occurs when one partition holds significantly more data than the others, leading to performance bottlenecks during shuffle operations such as joins and aggregations.

Where Data Skew Commonly Happens

  • groupByKey()
  • reduceByKey()
  • join()
  • repartition()
  • Aggregations
  • Window functions

Basically → any shuffle operation.

Types of Data Skew

1. Key Skew – one key dominates the dataset.
2. Partition Skew – partition sizes are uneven.
3. Join Skew – a single join key produces a disproportionate number of output rows.


 

How to Fix Data Skew:

  • AQE → enable spark.sql.adaptive.skewJoin.enabled (automatically splits skewed partitions in Spark 3+).
  • Broadcast Join → broadcast(small_df) to avoid a shuffle.
  • Salting → add a random suffix to the heavy join key to spread the data.
  • Isolate Skew → process heavy/NULL keys separately, then union the results.
  • Repartition → df.repartition(n, "col") to balance partitions.
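The salting idea can be sketched in plain Python (a toy model, not PySpark; the `partition` function is a hypothetical stand-in for Spark's hash partitioner). In a real skewed join you would also replicate the small side across all salt values:

```python
import random
from collections import Counter

# Plain-Python sketch of "salting" (not PySpark code): a hot join key is
# rewritten as key + random suffix so its rows spread over several
# partitions instead of overloading one.

NUM_SALTS = 4
rows = [("hot_key", i) for i in range(100)]

def partition(key):
    # Toy deterministic partitioner (stand-in for Spark's hash partitioner).
    return sum(key.encode()) % NUM_SALTS

# Without salting: every row of the hot key lands in the same partition.
before = Counter(partition(k) for k, _ in rows)
assert max(before.values()) == 100

# With salting: append a random suffix 0..NUM_SALTS-1 to the key.
random.seed(42)  # fixed seed so the example is reproducible
after = Counter(partition(f"{k}_{random.randrange(NUM_SALTS)}") for k, _ in rows)
print(dict(after))
assert len(after) > 1 and max(after.values()) < 100  # load is now spread out
```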

 


 

Typical Architecture:

1. Data Ingestion – Kafka / API / Files
2. Processing – Spark SQL (batch), Structured Streaming (real-time)
3. Storage – Data Lake (Parquet, Delta), Data Warehouse
4. Orchestration – Airflow

 

Why Spark is Important for Data Engineers

✅ Distributed & Scalable
✅ In-memory processing (faster than Hadoop MapReduce)
✅ Unified engine (Batch + Streaming + ML)
✅ Strong ecosystem support
✅ Cloud-native compatibility

Partitions

A Partition is a logical chunk of your data.

DE Perspective: This is the most critical component for a DE. You use repartition() or coalesce() to control how data is spread across the cluster. Correct partitioning prevents Data Skew.
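As a rough plain-Python sketch (not Spark code, and `make_partitions` is a hypothetical helper), a partition is just a chunk of the dataset that an independent task can process in parallel:

```python
# Plain-Python sketch (not Spark) of what a "partition" is: the dataset is
# cut into chunks, and each chunk can be processed by a separate task.

def make_partitions(data, num_partitions):
    # Hypothetical helper: round-robin rows into num_partitions chunks,
    # roughly what an even repartitioning achieves.
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(data):
        parts[i % num_partitions].append(row)
    return parts

data = list(range(10))
parts = make_partitions(data, 4)
print([len(p) for p in parts])  # [3, 3, 2, 2] -- roughly even chunks

# Each partition is processed independently (and, in Spark, in parallel):
totals = [sum(p) for p in parts]
assert sum(totals) == sum(data)
```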


 

Speculative Execution (The Backup Plan)

In a large cluster, sometimes a single machine (Worker Node) becomes slow because of hardware issues, network congestion, or a noisy neighbor. This is called a Straggler.

How it works: If Spark detects that a Task is taking much longer than the average time of other tasks in the same stage, it launches a "speculative copy" of that same task on a different executor.

The Race: Spark now has two versions of the same task running. Whichever one finishes first, Spark accepts that result and kills the other one.

Settings: You turn this on by setting spark.speculation=true.
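As a config sketch, speculation is typically enabled at submit time (the quantile and multiplier values below are illustrative defaults; the property names are standard Spark settings):

```shell
# Enable speculative execution when submitting a job.
# spark.speculation.quantile: fraction of tasks that must finish before
#   speculation kicks in.
# spark.speculation.multiplier: how much slower than the median a task
#   must be to count as a straggler.
spark-submit \
  --conf spark.speculation=true \
  --conf spark.speculation.quantile=0.75 \
  --conf spark.speculation.multiplier=1.5 \
  my_job.py
```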

What is Checkpointing?

Checkpointing stores intermediate results to reliable storage such as:

  • HDFS
  • Amazon S3

It is used for:

  • Streaming
  • Fault tolerance


 

Spark Core (Foundation / Execution Engine)

Spark Core is the base engine responsible for:

  • Distributed task scheduling
  • Memory management
  • Fault tolerance
  • I/O operations
  • Interaction with storage systems (HDFS, S3, ADLS)

Key Internal Components:

Driver

  • Entry point of the application
  • Builds the execution plan (DAG)
  • Sends tasks to executors

Executors

  • Run tasks in parallel
  • Store cached/shuffled data
  • Return results to the driver

Cluster Manager

  • Allocates CPU & memory resources
  • Examples: YARN, Kubernetes, Standalone

Data Engineering Importance:

  • Enables parallel ETL processing
  • Handles large-scale batch workloads
  • Ensures fault tolerance using lineage



 

Spark SQL (Structured Data Processing Layer)

Spark SQL processes structured and semi-structured data using:

  • SQL queries
  • DataFrame API
  • Dataset API

Key Features:

  • Catalyst Optimizer (query optimization)
  • Tungsten Engine (memory & CPU optimization)
  • Hive integration

Data Engineering Use Cases:

  • Data cleaning and transformation
  • Schema enforcement
  • Working with JSON, Parquet, ORC, Avro
  • Building curated datasets for analytics

Example:
Raw logs → Transform → Store as Parquet in Data Lake


3. Structured Streaming (Real-Time Processing)

Spark's real-time processing engine, built on Spark SQL.

Features:

  • Micro-batch processing
  • Event-time processing
  • Watermarking (handles late data)
  • Exactly-once guarantees

Data Engineering Use Cases:

  • Kafka stream processing
  • Real-time dashboards
  • Fraud detection
  • IoT event processing

Example pipeline:
Kafka → Spark Structured Streaming → Delta Lake


4. MLlib (Machine Learning Library)

A distributed machine learning library inside Spark.

From a Data Engineering View:

  • Build feature engineering pipelines
  • Prepare large-scale training datasets
  • Support ML teams with scalable preprocessing

Typically, data engineers focus more on data preparation than model building.


5. GraphX (Graph Processing Engine)

Used for graph-based computations.

Use Cases:

  • Social network analysis
  • Recommendation engines
  • Fraud network detection

Not common in standard ETL, but useful in advanced analytics.


6. Deployment & Cluster Management

Spark can run on:

  • Hadoop YARN
  • Kubernetes
  • Standalone clusters
  • Cloud platforms (EMR, Databricks)

Why Important for Data Engineers:

  • Resource tuning (memory, partitions, cores)
  • Cost optimization
  • Scalability planning
  • Production deployment management

Q: What is the difference between a Transformation and an Action?

  • Transformations: Functions that create a new RDD/DataFrame from an existing one (e.g., map, filter, join). They are lazy: Spark only records them in a DAG.
  • Actions: Operations that trigger computation and return results to the driver or write to storage (e.g., count, collect, write).

Q: What is a DAG (Directed Acyclic Graph)?

It is a logical representation of the sequence of transformations applied to the data. It allows Spark to optimize the execution plan (e.g., pipelining transformations) before actually running the tasks.


 

2. Data Abstractions: RDD vs. DataFrame vs. Dataset

Q: Why choose DataFrames over RDDs?

  • Optimization: DataFrames use the Catalyst Optimizer and Tungsten execution engine, which optimize query plans and memory usage. RDDs have no built-in optimization.
  • Structure: DataFrames provide a schema-based view (columns/rows), making them easier to use for SQL-like operations.

Q: When would you still use an RDD?

  • When you need low-level control over data distribution or physical placement.
  • When dealing with unstructured data that doesn't fit a schema.
  • When performing complex functional programming that isn't easily expressed in SQL/DSL.

3. Performance Tuning & Optimization (Critical for Data Engineers)

Q: How do you handle Data Skew? Data skew occurs when one partition has significantly more data than others, causing one task to run for hours while others finish in seconds.

  • Salting: Add a random "salt" key to the join key to redistribute the data.
  • Broadcast Join: If one table is small, broadcast it to all nodes to avoid shuffles.
  • Iterative Filtering: Filter out the skewed keys and process them separately.

 

Q: What is the difference between repartition() and coalesce()?

  • repartition(): Can increase or decrease partitions. It triggers a full shuffle, ensuring data is distributed uniformly.
  • coalesce(): Only used to decrease partitions. It avoids a full shuffle by merging existing partitions, making it much more efficient for reducing file counts after a filter.
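The contrast can be sketched in plain Python (a toy model, not Spark's actual implementation; real coalesce also considers data locality, but the cost difference is the point): coalesce merges whole partitions, repartition reshuffles every row.

```python
# Plain-Python sketch (not Spark) contrasting the two operations on a list
# of partitions.

parts = [[1], [2, 3], [4], [5, 6, 7], [8]]  # 5 uneven input partitions

def coalesce(partitions, n):
    # Merge whole partitions into n groups -- no row-level shuffle, so the
    # result can stay uneven, but it is cheap.
    groups = [[] for _ in range(n)]
    for i, p in enumerate(partitions):
        groups[i % n].extend(p)
    return groups

def repartition(partitions, n):
    # Redistribute every row individually (a full shuffle) so the result
    # is uniform, at the cost of moving all the data.
    rows = [r for p in partitions for r in p]
    groups = [[] for _ in range(n)]
    for i, r in enumerate(rows):
        groups[i % n].append(r)
    return groups

print([len(p) for p in coalesce(parts, 2)])     # [3, 5] -- still uneven
print([len(p) for p in repartition(parts, 2)])  # [4, 4] -- balanced
```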

Q: Explain the Catalyst Optimizer. It is the engine behind Spark SQL/DataFrames. It works in four phases:

1. Analysis: Resolving table/column names.

2. Logical Optimization: Applying rules like Predicate Pushdown (filtering data at the source) and Projection Pruning (selecting only needed columns).

3. Physical Planning: Generating multiple physical plans and picking the one with the lowest cost.

4. Code Generation: Generating Java bytecode for execution.
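Predicate Pushdown in particular can be illustrated in plain Python (a toy model, not Catalyst itself): applying the filter below an expensive step means the heavy work touches far fewer rows.

```python
# Plain-Python illustration (not Catalyst) of why Predicate Pushdown
# matters: filtering before an expensive step touches far fewer rows.

rows = [{"id": i, "country": "DE" if i % 10 == 0 else "US"} for i in range(1000)]

work_done = {"naive": 0, "pushed_down": 0}

def expensive_transform(row, counter_key):
    work_done[counter_key] += 1  # count how many rows the heavy step sees
    return {**row, "enriched": True}

# Naive plan: transform everything, then filter.
naive = [r for r in (expensive_transform(r, "naive") for r in rows)
         if r["country"] == "DE"]

# Optimized plan: push the filter below the transform.
pushed = [expensive_transform(r, "pushed_down") for r in rows
          if r["country"] == "DE"]

assert len(naive) == len(pushed) == 100  # same result either way
print(work_done)  # {'naive': 1000, 'pushed_down': 100}
```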


4. Storage & Memory Management

Q: Cache vs. Persist?

  • cache(): Shortcut for persist(StorageLevel.MEMORY_ONLY).
  • persist(): Allows you to specify the storage level (e.g., MEMORY_AND_DISK, DISK_ONLY, or MEMORY_ONLY_SER to save space via serialization).


 

Q: What are Broadcast Variables and Accumulators?

  • Broadcast Variables: Read-only variables cached on every machine rather than being sent with every task. Use this for small lookup tables in joins.
  • Accumulators: Write-only variables used for counters or sums across the cluster (e.g., counting corrupted records). Only the Driver can read the final value.
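A plain-Python analogy (not Spark's actual API; the names here are illustrative): a shared lookup dict plays the broadcast variable, and a write-only counter plays the accumulator.

```python
# Plain-Python analogy (not Spark's API): a broadcast variable is a
# read-only lookup shared by every task; an accumulator collects counts
# that only the "driver" reads at the end.

country_names = {"DE": "Germany", "US": "United States"}  # "broadcast" side
bad_records = 0                                           # "accumulator"

def process_partition(codes, lookup):
    global bad_records
    out = []
    for code in codes:
        if code not in lookup:
            bad_records += 1  # a task only adds; it never reads the total
        else:
            out.append(lookup[code])
    return out

partitions = [["DE", "US"], ["US", "XX"], ["DE"]]
results = [process_partition(p, country_names) for p in partitions]

# Only the driver inspects the accumulator's final value:
print(bad_records)  # 1
print(results)      # [['Germany', 'United States'], ['United States'], ['Germany']]
```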

5. Structured Streaming

Q: What is Watermarking? Watermarking is a threshold used in streaming to handle late data. It tells Spark how long to wait for late-arriving events before dropping them from the state. For example, a 10-minute watermark means data arriving more than 10 minutes late will be ignored.
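The watermark rule can be sketched in plain Python (a simplification of Structured Streaming's stateful logic; event times below are minutes, chosen for illustration):

```python
# Plain-Python sketch of the watermark rule (not Structured Streaming):
# with a 10-minute watermark, an event is kept only if its event time is
# no more than 10 minutes behind the latest event time seen so far.

WATERMARK_MIN = 10
max_event_time = 0   # latest event time observed (minutes since start)
kept, dropped = [], []

# (event_id, event_time_in_minutes), listed in processing order
events = [("a", 5), ("b", 20), ("c", 12), ("d", 8), ("e", 25)]

for event_id, t in events:
    max_event_time = max(max_event_time, t)
    if t >= max_event_time - WATERMARK_MIN:
        kept.append(event_id)     # on time, or tolerably late
    else:
        dropped.append(event_id)  # later than the watermark allows

print(kept)     # ['a', 'b', 'c', 'e']
print(dropped)  # ['d'] -- 12 minutes behind the max event time seen
```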

Q: Explain Checkpointing. Checkpointing saves the current state (offsets, metadata) to a fault-tolerant storage like S3/HDFS. If the stream fails, it can restart from the exact point it left off, ensuring exactly-once processing.


6. Real-world Scenario / Coding

Q: "My Spark job is slow. How do you debug it?"

1. Open the Spark UI: Look for the longest-running Stage.

2. Check for Skew: If one task is taking much longer than the others, it's a skew issue.

3. Check for Spill: If you see "Disk Spill" in the UI, your executors don't have enough RAM for the operation, and data is being moved to disk (very slow).

4. Check Shuffle Read/Write: High shuffle indicates a need for Broadcast Joins or better partitioning.

7. My job failed with an OutOfMemory (OOM) error. How do you fix it?

Case A: Driver OOM

  • Cause: You probably ran .collect() on a massive dataset, which sends all the data from the workers to the single Driver machine.
  • The Fix: Don't use .collect(). Write the data to a file (S3/HDFS) instead, or increase spark.driver.memory.

Case B: Executor OOM

  • Cause: Your partitions are too big (Data Skew), or too many concurrent tasks are running on one executor.
  • The Fix:

1. Increase Partitions: Use .repartition() to make the data chunks smaller.

2. Increase Memory: Adjust --executor-memory.

3. Reduce Cores: If you have 5 cores per executor, 5 tasks are fighting for the same RAM. Reducing cores to 2 or 3 gives each task more "breathing room."
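The executor-side fixes above translate into submit-time settings like this (a config sketch; all values are illustrative, not recommendations, and spark.sql.shuffle.partitions is the standard knob for shuffle partition count):

```shell
# Example resource tuning after an executor OOM (illustrative values).
spark-submit \
  --executor-memory 8g \
  --executor-cores 3 \
  --conf spark.sql.shuffle.partitions=400 \
  my_job.py
```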






