Data skew occurs when data is unevenly distributed across the partitions of a distributed computing system such as Spark. A few oversized partitions force their tasks to process far more data than the rest, so those straggler tasks dominate job runtime and can cause executor memory pressure.
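A common mitigation for join skew is key salting. Below is a minimal PySpark sketch under assumed names (the `events` and `customers` tables and the `customer_id` column are hypothetical): the skewed side gets a random salt, and the other side is replicated once per salt value, so every row still finds its match while the hot key is spread across partitions.

```python
# Minimal key-salting sketch; table/column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting-demo").getOrCreate()

NUM_SALTS = 8  # tuning knob: more salts spread hot keys finer, at the cost of duplication

events = spark.createDataFrame(
    [(1, "click"), (1, "click"), (1, "view"), (2, "view")],
    ["customer_id", "action"],
)
customers = spark.createDataFrame(
    [(1, "Acme"), (2, "Globex")],
    ["customer_id", "name"],
)

# Spread the hot keys: append a random salt to the skewed side...
salted_events = events.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# ...and replicate the other side once per salt value so every
# (customer_id, salt) pair on the skewed side has a matching row.
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
salted_customers = customers.crossJoin(salts)

joined = salted_events.join(salted_customers, on=["customer_id", "salt"]).drop("salt")
joined.show()
```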
Common Spark performance optimizations include partitioning data to speed up reads and writes, caching frequently used datasets, using broadcast joins when one side is a small table, and reducing shuffles by tuning partition sizes and join strategies.
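As one illustration of the broadcast-join point, here is a minimal PySpark sketch; the `orders` and `countries` tables and their columns are made up for the example. The `broadcast()` hint itself is Spark's standard API for this.

```python
# Minimal broadcast-join sketch; table/column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

orders = spark.createDataFrame(
    [(101, "US", 30.0), (102, "DE", 12.5), (103, "US", 7.0)],
    ["order_id", "country_code", "amount"],
)
countries = spark.createDataFrame(
    [("US", "United States"), ("DE", "Germany")],
    ["country_code", "country_name"],
)

# broadcast() hints Spark to ship the small table to every executor,
# replacing a shuffle-based join with a map-side hash join.
joined = orders.join(broadcast(countries), on="country_code")
joined.explain()  # physical plan should show BroadcastHashJoin
```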
A Directed Acyclic Graph (DAG) represents the steps of a computation and the dependencies between them. In Spark, the scheduler builds a DAG from the transformations applied to a dataset, uses it to group tasks into stages and determine execution order, and relies on the recorded lineage to recompute lost partitions for fault tolerance.
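One way to see the DAG Spark builds is the RDD API's `toDebugString`, which prints the lineage used for scheduling and recovery. A small sketch (the computation itself is arbitrary):

```python
# Inspecting Spark's lineage/DAG via toDebugString; the workload is arbitrary.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()
sc = spark.sparkContext

rdd = (
    sc.parallelize(range(100), 4)
      .map(lambda x: (x % 10, x))        # narrow dependency: stays in one stage
      .reduceByKey(lambda a, b: a + b)   # wide dependency: introduces a shuffle
)

# Prints the lineage graph Spark uses to schedule stages and recompute
# lost partitions; indentation marks the stage boundary at the shuffle.
print(rdd.toDebugString().decode("utf-8"))
```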
Common transformations include data cleansing, filtering, normalization, feature engineering, and aggregations.
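A short PySpark sketch tying several of these together, with hypothetical column names and min-max scaling standing in as one simple normalization choice:

```python
# Cleansing, filtering, normalization, and aggregation in one small
# pipeline; data and column names are made up for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transforms-demo").getOrCreate()

raw = spark.createDataFrame(
    [("a", 10.0), ("a", None), ("b", 30.0), ("b", 50.0)],
    ["category", "value"],
)

cleaned = raw.dropna(subset=["value"])         # cleansing: drop null measurements
filtered = cleaned.filter(F.col("value") > 0)  # filtering: keep positive values

# Normalization: min-max scale `value` into [0, 1].
stats = filtered.agg(F.min("value").alias("lo"), F.max("value").alias("hi")).first()
normalized = filtered.withColumn(
    "value_norm",
    (F.col("value") - stats["lo"]) / (stats["hi"] - stats["lo"]),
)

# Aggregation: per-category average of the normalized value.
normalized.groupBy("category").agg(F.avg("value_norm").alias("avg_norm")).show()
```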