Data Science Interview Questions for Cognizant

15/11/2024

By Yogesh

What is data skew?

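Data skew occurs when data is not evenly distributed across partitions in a distributed computing system such as Spark. A few partitions end up holding far more records than the rest, so the tasks that process them run much longer and delay the whole stage.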
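A quick way to see skew is to count how many records each partition receives. The sketch below is plain Python, not Spark code: it assumes integer keys and uses key modulo partition count as a stand-in for a hash partitioner. The function names partition_sizes and skew_ratio are made up for illustration.

```python
def partition_sizes(keys, num_partitions):
    """Count the records that land in each partition.

    Assumption: keys are integers and the partitioner is key % num_partitions,
    a simplified stand-in for a real hash partitioner.
    """
    sizes = [0] * num_partitions
    for key in keys:
        sizes[key % num_partitions] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the average partition size (1.0 = balanced)."""
    average = sum(sizes) / len(sizes)
    return max(sizes) / average


# One "hot" key (7) accounts for 900 of the 1,000 records.
keys = [7] * 900 + list(range(100))
sizes = partition_sizes(keys, num_partitions=4)
print(sizes)              # [25, 25, 25, 925]
print(skew_ratio(sizes))  # about 3.7
```

Here one partition holds 925 of the 1,000 records, so the task that processes it does most of the work while the other three finish early.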

What optimization techniques are you using?

I partition data to optimize reads and writes, cache frequently used data, use broadcast joins when one side of a join is a small table, and reduce shuffles by controlling partition sizes and choosing join strategies carefully.
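The idea behind a broadcast join can be sketched in plain Python. This is a conceptual illustration, not Spark's implementation: the small table is turned into an in-memory lookup that every partition of the large dataset can consult locally, so no records of the large dataset need to move between partitions. The function name broadcast_join and the sample data are hypothetical.

```python
def broadcast_join(large_partitions, small_table):
    """Inner-join each partition of a large dataset against a small lookup table.

    large_partitions: list of partitions, each a list of (key, value) pairs.
    small_table: list of (key, attribute) pairs, small enough to copy everywhere.
    """
    lookup = dict(small_table)  # stands in for the copy sent to each worker
    joined = []
    for partition in large_partitions:
        # Each partition is joined locally; rows without a match are dropped.
        joined.append(
            [(key, value, lookup[key]) for key, value in partition if key in lookup]
        )
    return joined


# Hypothetical data: orders split across two partitions, plus a small customer table.
orders = [[(1, 20.0), (2, 5.0)], [(1, 7.5), (3, 1.0)]]
customers = [(1, "IN"), (2, "US")]

result = broadcast_join(orders, customers)
print(result)  # [[(1, 20.0, 'IN'), (2, 5.0, 'US')], [(1, 7.5, 'IN')]]
```

The partition layout of the large dataset is unchanged after the join, which is why this strategy avoids a shuffle when the small side fits in memory.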

What is a DAG?

A Directed Acyclic Graph (DAG) represents the operations in a computation and the dependencies between them. In Spark, the DAG is built from the transformations applied to the data. The scheduler splits it into stages at shuffle boundaries, runs tasks in dependency order, and supports fault tolerance by recomputing lost partitions from their lineage.
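To make the idea of "dependency order" concrete, here is a small sketch using Python's standard graphlib module (Python 3.9+). It is not how Spark schedules work internally. The step names are invented, and the graph maps each step to the steps it depends on.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
dependencies = {
    "read_orders": set(),
    "read_customers": set(),
    "filter_orders": {"read_orders"},
    "join": {"filter_orders", "read_customers"},
    "aggregate": {"join"},
}


def execution_order(deps):
    """Return one valid order in which every step runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """A graph with a cycle has no valid execution order, so it is not a DAG."""
    try:
        execution_order(deps)
        return False
    except CycleError:
        return True


order = execution_order(dependencies)
print(order)
print(has_cycle({"a": {"b"}, "b": {"a"}}))  # True
```

Several valid orders may exist (the two reads can run in either order, or in parallel), which is the freedom a scheduler uses when planning execution.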

What transformations on data have you done?

Common transformations include data cleansing, filtering, normalization, feature engineering, and aggregations.
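As a rough illustration, each of these transformations can be shown on a tiny in-memory dataset. The records, the field names, and the thresholds below are all made up for the example, and plain Python lists stand in for a distributed dataset.

```python
from collections import defaultdict

# Hypothetical raw records with messy strings, a missing value, and a tiny outlier.
raw = [
    {"city": " Pune ", "amount": "120.0"},
    {"city": "pune", "amount": "80.0"},
    {"city": "Chennai", "amount": None},
    {"city": "Chennai", "amount": "200.0"},
    {"city": "Delhi", "amount": "10.0"},
]

# Cleansing: drop rows with missing amounts, tidy city names, cast to float.
cleaned = [
    {"city": r["city"].strip().title(), "amount": float(r["amount"])}
    for r in raw
    if r["amount"] is not None
]

# Filtering: keep only amounts of at least 50 (an arbitrary threshold).
filtered = [r for r in cleaned if r["amount"] >= 50]

# Normalization: min-max scale the amounts into the range [0, 1].
low = min(r["amount"] for r in filtered)
high = max(r["amount"] for r in filtered)
span = (high - low) or 1.0  # guard against division by zero
for r in filtered:
    r["amount_scaled"] = (r["amount"] - low) / span

# Feature engineering: derive a boolean flag from an existing column.
for r in filtered:
    r["is_large"] = r["amount"] >= 100

# Aggregation: total amount per city.
totals = defaultdict(float)
for r in filtered:
    totals[r["city"]] += r["amount"]

print(dict(totals))  # {'Pune': 200.0, 'Chennai': 200.0}
```

In an interview answer, it helps to tie each step to a concrete dataset you worked on rather than listing the categories alone.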