Spark is a fast, general-purpose distributed data processing engine that is compatible with Hadoop data. On top of the Spark core engine sit libraries for SQL, machine learning, graph computation, and stream processing, which can be combined in a single application.
It uses in-memory caching and optimized query execution to run fast analytic queries against data of any size.
That is why it is widely used for big data workloads.
Spark is written in Scala but provides rich APIs for Scala, Java, Python, and R. It can run up to 100 times faster than Hadoop MapReduce when data fits in memory, and up to 10 times faster when processing data from disk. It integrates with Hadoop and can process existing Hadoop HDFS data.
Resilient Distributed Dataset – RDD:
RDD stands for Resilient Distributed Dataset, the fundamental unit of data in Spark. It is a distributed collection of elements spread across cluster nodes that supports parallel operations. Spark RDDs are immutable, but transforming an existing RDD generates a new RDD.
Spark Lazy Evaluation means the data inside RDDs is not evaluated immediately. All computation is performed only after an action triggers it, which limits how much work Spark actually has to do.
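The same lazy-pipeline idea can be sketched with plain Python generators (this is an analogy, not Spark itself): transformations only describe a pipeline, and nothing runs until an "action" consumes the result.

```python
# Conceptual sketch of lazy evaluation using Python generators:
# transformations are recorded, not executed, until an action runs.
data = range(1, 6)                   # source: 1..5
doubled = (x * 2 for x in data)      # "transformation" - recorded, not run
big = (x for x in doubled if x > 4)  # another transformation, still lazy
result = list(big)                   # "action" - triggers the whole chain
print(result)  # [6, 8, 10]
```

Nothing is computed until `list()` pulls values through the chain, just as Spark defers work until an action like `count()` or `collect()`.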
Apache Spark is an open-source, distributed data processing framework designed for large-scale, fault-tolerant, high-performance data workloads. It is widely used in modern data platforms for:
- Batch Processing – large volumes of historical data → using Spark Core and Spark SQL
- Stream Processing – real-time data streams → using Structured Streaming (built on Spark SQL)
- ETL (Extract, Transform, Load) – data preparation and transformation → using Spark SQL, DataFrame API, and Datasets
- Machine Learning – scalable ML pipelines → using MLlib
- Analytics – interactive SQL and reporting → using Spark SQL
- Graph Processing – relationship and network analysis → using GraphX
What are the main components of Spark?
Answer:
The core components of Apache Spark are:
1. Spark Core – basic functionality and task scheduling
2. Spark SQL – structured data processing
3. Spark Streaming – real-time data processing
4. MLlib – machine learning library
5. GraphX – graph processing
1. Batch Processing
Batch processing collects data over a period of time and processes it all at once as a single "batch." This usually happens on a schedule (e.g., every night at 2 AM).
· How it works: Data is collected, stored, and then processed in bulk.
· Latency: High (minutes to days).
· Best for: Massive volumes of data where immediate results aren't needed.
· Example: Calculating monthly payroll, generating daily sales reports, or clearing bank transactions at the end of the day.
· Tools: Apache Spark (RDD/DataFrames), AWS Glue, Snowflake, dbt, Airflow.
2. Streaming (Near Real-Time)
Streaming processes data records one by one as they arrive. It's designed for continuous flow.
· How it works: Instead of waiting for a "bucket" to fill up, the system processes each "drop" of data as it falls.
· Latency: Low (seconds to milliseconds).
· Best for: Monitoring systems where a delay of a few seconds is acceptable.
· Example: Tracking GPS locations of delivery drivers, updating a live "trending topics" list on social media, or monitoring website traffic logs.
· Tools: Apache Kafka, Spark Streaming, Amazon Kinesis, Google Pub/Sub.
Data skew:
Data skew occurs when one partition has significantly more data than others, leading to performance bottlenecks during shuffle operations like joins and aggregations.
Where Data Skew Commonly Happens:
groupByKey()
reduceByKey()
join()
repartition()
Aggregations
Window functions
Basically → any shuffle operation.
Types of Data Skew
1. Key Skew – one key dominates the dataset.
2. Partition Skew – uneven partition sizes.
3. Join Skew – one key in a join causes an explosion of matched rows.
How to Fix Data Skew:
AQE → enable spark.sql.adaptive.skewJoin.enabled (auto-splits skewed partitions in Spark 3+).
Broadcast Join → broadcast(small_df) to avoid a shuffle.
Salting → add a random suffix to the heavy join key to spread the data.
Isolate Skew → process heavy/NULL keys separately, then union.
Repartition → df.repartition(n, "col") to balance partitions.
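Salting can be sketched in plain Python (not the PySpark API). A hot key — the made-up "user_1" below — is rewritten into several sub-keys so its rows spread over several partitions instead of piling into one:

```python
import random
from collections import Counter

# Conceptual sketch of salting: append a random bucket suffix to a hot key
# so its rows no longer all hash to the same partition.
SALT_BUCKETS = 4
rows = [("user_1", i) for i in range(100)] + [("user_2", i) for i in range(5)]

salted = [(f"{k}_{random.randrange(SALT_BUCKETS)}", v) for k, v in rows]

# The 100 "user_1" rows are now spread across up to 4 distinct salted keys.
counts = Counter(k for k, _ in salted)
```

In a real salted join, the small side must be exploded with all salt values so every salted key still finds its match.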
Typical Architecture:
1. Data Ingestion
o Kafka / API / Files
2. Processing
o Spark SQL (batch)
o Structured Streaming (real-time)
3. Storage
o Data Lake (Parquet, Delta)
o Data Warehouse
4. Orchestration
o Airflow
Why Spark is Important for Data Engineers
✅ Distributed & Scalable
✅ In-memory processing (faster than Hadoop MapReduce)
✅ Unified engine (Batch + Streaming + ML)
✅ Strong ecosystem support
✅ Cloud-native compatibility
Partitions
A Partition is a logical chunk of your data.
DE Perspective: This is the most critical component for a DE. You use repartition() or coalesce() to control how data is spread across the cluster. Correct partitioning prevents Data Skew.
Speculative Execution (The Backup Plan)
In a large cluster, sometimes a single machine (Worker Node) becomes slow because of hardware issues, network congestion, or a noisy neighbor. This is called a Straggler.
How it works: If Spark detects that a Task is taking much longer than the average time of other tasks in the same stage, it launches a "speculative copy" of that same task on a different executor.
The Race: Spark now has two versions of the same task running. Whichever one finishes first, Spark accepts that result and kills the other one.
Settings: You turn this on using spark.speculation = true.
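The race can be illustrated with Python threads (a toy analogy, not Spark's scheduler): launch a copy of the slow task and accept whichever attempt finishes first.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# Conceptual sketch of the speculative "race": the winner's result is
# accepted, and the losing attempt is cancelled (best effort).
def attempt(delay, label):
    time.sleep(delay)   # simulate task duration
    return label

with ThreadPoolExecutor(max_workers=2) as pool:
    original = pool.submit(attempt, 0.5, "original (straggler)")
    copy = pool.submit(attempt, 0.05, "speculative copy")
    done, not_done = wait([original, copy], return_when=FIRST_COMPLETED)
    winner = next(iter(done)).result()
    for f in not_done:
        f.cancel()   # stand-in for Spark killing the losing attempt

print(winner)  # "speculative copy"
```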
What is checkpointing?
Checkpointing stores intermediate results to reliable storage such as:
· HDFS
· Amazon S3
It is used in streaming to provide fault tolerance.
1. Spark Core (Foundation / Execution Engine)
Spark Core is the base engine responsible for:
· Distributed task scheduling
· Memory management
· Fault tolerance
· I/O operations
· Interaction with storage systems (HDFS, S3, ADLS)
Key Internal Components:
Driver
· Entry point of the application
· Builds the execution plan (DAG)
· Sends tasks to executors
Executors
· Run tasks in parallel
· Store cached/shuffled data
· Return results to the driver
Cluster Manager
· Allocates CPU & memory resources
· Examples: YARN, Kubernetes, Standalone
Data Engineering Importance:
· Enables parallel ETL processing
· Handles large-scale batch workloads
· Ensures fault tolerance using lineage
2. Spark SQL (Structured Data Processing Layer)
Spark SQL processes structured and semi-structured data using:
· SQL queries
· DataFrame API
· Dataset API
Key Features:
· Catalyst Optimizer (query optimization)
· Tungsten Engine (memory & CPU optimization)
· Hive integration
Data Engineering Use Cases:
· Data cleaning and transformation
· Schema enforcement
· Working with JSON, Parquet, ORC, Avro
· Building curated datasets for analytics
Example:
Raw logs → Transform → Store as Parquet in Data Lake
3. Structured Streaming (Real-Time Processing)
Spark's real-time processing engine, built on Spark SQL.
Features:
· Micro-batch processing
· Event-time processing
· Watermarking (handles late data)
· Exactly-once guarantees
Data Engineering Use Cases:
· Kafka stream processing
· Real-time dashboards
· Fraud detection
· IoT event processing
Example pipeline:
Kafka → Spark Structured Streaming → Delta Lake
4. MLlib (Machine Learning Library)
Distributed ML library inside Spark.
From a Data Engineering View:
· Build feature engineering pipelines
· Prepare large-scale training datasets
· Support ML teams with scalable preprocessing
Typically, data engineers focus more on data preparation than model building.
5. GraphX (Graph Processing Engine)
Used for graph-based computations.
Use Cases:
· Social network analysis
· Recommendation engines
· Fraud network detection
Not common in standard ETL, but useful in advanced analytics.
6. Deployment & Cluster Management
Spark can run on:
· Hadoop YARN
· Kubernetes
· Standalone clusters
· Cloud platforms (EMR, Databricks)
Why Important for Data Engineers:
· Resource tuning (memory, partitions, cores)
· Cost optimization
· Scalability planning
· Production deployment management
Q: What is the difference between a Transformation and an Action?
· Transformations: Functions that create a new RDD/DataFrame from an existing one (e.g., map, filter, join). They are Lazy: Spark only records them in a DAG.
· Actions: Operations that trigger computation and return results to the driver or write to storage (e.g., count, collect, write).
Q: What is a DAG (Directed Acyclic Graph)?
It is a logical representation of the sequence of transformations applied to the data. It allows Spark to optimize the execution plan (e.g., pipelining transformations) before actually running the tasks.
2. Data Abstractions: RDD vs. DataFrame vs. Dataset
Q: Why choose DataFrames over RDDs?
- Optimization: DataFrames use the Catalyst Optimizer and Tungsten execution engine, which optimize query plans and memory usage. RDDs have no built-in optimization.
- Structure: DataFrames provide a schema-based view (columns/rows), making them easier to use for SQL-like operations.
Q: When would you still use an RDD?
- When you need low-level control over data distribution or physical placement.
- When dealing with unstructured data that doesn't fit a schema.
- When performing complex functional programming that isn't easily expressed in SQL/DSL.
3. Performance Tuning & Optimization (Critical for Data Engineers)
Q: How do you handle Data Skew?
Data skew occurs when one partition has significantly more data than others, causing one task to run for hours while others finish in seconds.
- Salting: Add a random "salt" key to the join key to redistribute the data.
- Broadcast Join: If one table is small, broadcast it to all nodes to avoid shuffles.
- Iterative Filtering: Filter out the skewed keys and process them separately.
Q: What is the difference between repartition() and coalesce()?
- repartition(): Can increase or decrease partitions. It triggers a full shuffle, ensuring data is distributed uniformly.
- coalesce(): Only used to decrease partitions. It avoids a full shuffle by merging existing partitions, making it much more efficient for reducing file counts after a filter.
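Why coalesce() can skip the shuffle can be sketched in plain Python (a simplified model, not Spark's actual partitioner): it only folds existing partitions together, so it can never increase the count or rebalance rows across the whole cluster.

```python
# Conceptual sketch: coalesce merges existing partitions in place,
# with no re-keying of individual rows (i.e., no shuffle).
def coalesce(partitions, n):
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)  # whole partitions are folded together
    return merged

parts = [[1, 2], [3], [4, 5], [6]]   # 4 input partitions
smaller = coalesce(parts, 2)
print(smaller)  # [[1, 2, 4, 5], [3, 6]]
```

Note the result can be unbalanced — which is exactly why repartition(), with its full shuffle, is the tool for uniform distribution.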
Q: Explain the Catalyst Optimizer.
It is the engine behind Spark SQL/DataFrames. It works in four phases:
1. Analysis: Resolving table/column names.
2. Logical Optimization: Applying rules like Predicate Pushdown (filtering data at the source) and Projection Pruning (selecting only needed columns).
3. Physical Planning: Generating multiple physical plans and picking the one with the lowest cost.
4. Code Generation: Generating Java bytecode for execution.
4. Storage & Memory Management
Q: Cache vs. Persist?
- cache(): Shortcut for persist(StorageLevel.MEMORY_ONLY).
- persist(): Allows you to specify the storage level (e.g., MEMORY_AND_DISK, DISK_ONLY, or MEMORY_ONLY_SER to save space via serialization).
Q5: What are Broadcast Variables and Accumulators?
- Broadcast Variables: Read-only variables cached on every machine rather than being sent with every task. Use this for small lookup tables in joins.
- Accumulators: Write-only variables used for counters or sums across the cluster (e.g., counting corrupted records). Only the Driver can read the final value.
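The two roles can be sketched in plain Python (an analogy, not the PySpark API): a shared read-only lookup dict plays the broadcast variable, and a counter the tasks only write to plays the accumulator.

```python
# Conceptual sketch: "broadcast" a small lookup table to all tasks and
# count bad records in an "accumulator" that only the driver reads.
lookup = {"US": "United States", "IN": "India"}   # small broadcast table
corrupted = 0                                     # accumulator

records = ["US", "IN", "XX", "US"]                # "XX" is a bad record
resolved = []
for code in records:
    if code in lookup:
        resolved.append(lookup[code])
    else:
        corrupted += 1   # tasks only add; the final value is read at the end

print(resolved, corrupted)  # ['United States', 'India', 'United States'] 1
```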
Structured Streaming
Q: What is Watermarking?
Watermarking is a threshold used in streaming to handle late data. It tells Spark how long to wait for late-arriving events before dropping them from the state. For example, a 10-minute watermark means data arriving more than 10 minutes late will be ignored.
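The 10-minute example can be sketched in plain Python (a simplified model, not the withWatermark API): track the maximum event time seen so far and drop any event older than that maximum minus 10 minutes.

```python
from datetime import datetime, timedelta

# Conceptual sketch of a 10-minute watermark: events more than 10 minutes
# behind the max event time seen so far are dropped.
WATERMARK = timedelta(minutes=10)
max_event_time = datetime.min
kept, dropped = [], []

def receive(event_time, value):
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    if event_time < max_event_time - WATERMARK:
        dropped.append(value)   # beyond the watermark: ignored
    else:
        kept.append(value)

t0 = datetime(2024, 1, 1, 12, 0)
receive(t0, "a")                           # on time
receive(t0 + timedelta(minutes=30), "b")   # advances the watermark
receive(t0 + timedelta(minutes=5), "c")    # 25 minutes late -> dropped
print(kept, dropped)  # ['a', 'b'] ['c']
```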
Q: Explain Checkpointing.
Checkpointing saves the current state (offsets, metadata) to fault-tolerant storage like S3/HDFS. If the stream fails, it can restart from the exact point it left off, ensuring exactly-once processing.
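Offset checkpointing can be sketched in plain Python (a toy model; a temp file stands in for S3/HDFS): after each record, persist the next offset, so a restarted run resumes exactly where the stream stopped.

```python
import json, os, tempfile

# Conceptual sketch: commit the next offset to durable storage after each
# record, so a restart skips everything already processed.
ckpt = os.path.join(tempfile.mkdtemp(), "offsets.json")

def load_offset():
    if os.path.exists(ckpt):
        with open(ckpt) as f:
            return json.load(f)["offset"]
    return 0   # first run: start from the beginning

def process(events):
    for i in range(load_offset(), len(events)):
        # ... handle events[i] here ...
        with open(ckpt, "w") as f:
            json.dump({"offset": i + 1}, f)   # commit progress

events = ["e1", "e2", "e3"]
process(events)          # first run handles all three events
print(load_offset())     # 3 - a restarted run would skip all of them
```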
6. Real-world Scenario / Coding
Q: "My Spark job is slow. How do you debug it?"
1. Open Spark UI: Look for the longest-running Stage.
2. Check for Skew: If one task is taking much longer than others, it's a skew issue.
3. Check for Spill: If you see "Disk Spill" in the UI, it means your executors don't have enough RAM for the operation, and data is being moved to disk (very slow).
4. Check Shuffle Read/Write: High shuffle indicates a need for Broadcast Joins or better partitioning.
7. My job failed with an OutOfMemory (OOM) error. How do you fix it?
Case A: Driver OOM
· Cause: You probably ran .collect() on a massive dataset. This sends all the data from the workers to the single Driver machine.
· The Fix: Don't use .collect(). Write the data to a file (S3/HDFS) instead, or increase spark.driver.memory.
Case B: Executor OOM
· Cause: Your partitions are too big (Data Skew), or you have too many concurrent tasks running on one executor.
· The Fix:
1. Increase Partitions: Use .repartition() to make the data chunks smaller.
2. Increase Memory: Adjust --executor-memory.
3. Reduce Cores: If you have 5 cores per executor, 5 tasks are fighting for the same RAM. Reducing cores to 2 or 3 gives each task more "breathing room."
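A possible shape for those executor-side fixes on the command line (all values and the job name my_job.py are illustrative placeholders, not recommendations for any particular cluster):

```shell
# Hypothetical tuning for an executor OOM: more shuffle partitions,
# more executor memory, fewer concurrent tasks per executor.
spark-submit \
  --executor-memory 8g \
  --executor-cores 2 \
  --conf spark.sql.shuffle.partitions=400 \
  my_job.py
```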