Data skew occurs when data is not evenly distributed across the partitions of a distributed computing system such as Spark: some partitions hold far more data than others, so the tasks processing them take significantly longer while the rest finish quickly. Because one node ends up handling a disproportionate share of the data, skew leads to inefficiency and longer overall processing times. Techniques to address it include repartitioning, salting skewed keys, and adjusting join strategies.
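A minimal PySpark sketch of key salting, assuming a join that is skewed on a hypothetical customer_id column; the table names, S3 paths, and salt factor of 8 are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting-example").getOrCreate()

# Hypothetical inputs: `events` is heavily skewed on customer_id,
# `customers` is the smaller side of the join. Paths are placeholders.
events = spark.read.parquet("s3://my-bucket/events/")
customers = spark.read.parquet("s3://my-bucket/customers/")

SALT_BUCKETS = 8  # example value; size it to how concentrated the hot keys are

# Add a random salt to the skewed side so a single hot key is spread
# across several partitions instead of landing in one task.
events_salted = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate the other side once per salt value so every salted row finds a match.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
customers_salted = customers.crossJoin(salts)

# Joining on (customer_id, salt) distributes the hot key over SALT_BUCKETS tasks.
joined = events_salted.join(customers_salted, on=["customer_id", "salt"], how="inner")
```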
In Spark or other distributed systems, if a worker node dies, the job does not immediately fail. Instead, the job scheduler reschedules the tasks that were running on the failed node to other active nodes in the cluster. This helps maintain fault tolerance, though it may slow down the job completion. The system might also automatically replace the failed node if configured with fault tolerance policies, like using auto-scaling in AWS EMR or similar cluster management settings.
For a 1TB dataset, Spark configuration involves tuning several parameters to optimize performance; a configuration sketch follows the list. Key settings include:
- spark.executor.memory to allocate sufficient memory to each executor.
- spark.executor.instances to increase the number of executors handling the data.
- spark.sql.shuffle.partitions to control the number of partitions created during shuffles (higher values may be required to distribute data evenly).
- spark.driver.memory to allocate enough memory to the driver if large aggregations are performed.
- Other considerations include enabling dynamic allocation and compression to manage memory efficiently.
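As a rough illustration, those settings might come together in a SparkSession as sketched below; the specific values (memory sizes, executor counts, partition counts) are placeholder starting points rather than tuned recommendations:

```python
from pyspark.sql import SparkSession

# Starting-point values for a ~1TB batch job; tune against actual executor metrics.
spark = (
    SparkSession.builder
    .appName("one-tb-etl")
    .config("spark.executor.memory", "16g")          # memory per executor
    .config("spark.executor.cores", "4")             # cores per executor
    .config("spark.executor.instances", "50")        # initial executor count when dynamic allocation is on
    .config("spark.sql.shuffle.partitions", "2000")  # aim for roughly 100-200 MB per shuffle partition
    .config("spark.driver.memory", "8g")             # headroom for large aggregations on the driver
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true") # typically required for dynamic allocation on YARN
    .config("spark.sql.parquet.compression.codec", "snappy")  # compressed Parquet output
    .getOrCreate()
)
```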
AWS Glue is a serverless ETL service designed for easy data integration. It’s ideal for regular ETL tasks and integrates with AWS data lakes and data warehouses. Glue is cost-effective because you only pay for job execution time.
EMR (Elastic MapReduce) offers more flexibility and control over the environment. It is better for complex, custom ETL jobs, for managing big data frameworks such as Spark, Presto, or Hive, and for workloads that need fine-grained control over the cluster's infrastructure.
A common choice is YARN (Yet Another Resource Negotiator) in Hadoop ecosystems, as it supports job scheduling and resource management for Spark applications. Alternatively, Kubernetes is gaining popularity for its ability to manage containerized Spark jobs and provide scalability, ease of use, and high availability in production.
I have worked on various business problems, such as customer segmentation, fraud detection, recommendation engines, and operational reporting. For example, in customer segmentation, we used data processing to analyze customer behavior and preferences, enabling targeted marketing and personalization strategies.
BI teams typically request data aggregation, filtering, and data cleansing for reporting dashboards. Data scientists often ask for feature engineering, data extraction, and formatting for model training. These teams may also request enriched datasets for analysis, custom transformations, or optimized data pipelines for faster insights.
Our database might contain transactional records, customer profiles, product catalogs, and logs. Columns could include IDs, timestamps, names, categories, numerical metrics, and other business-specific attributes.
We follow agile practices with sprint cycles, where each sprint typically lasts 1–2 weeks. Our team holds sprint planning, daily stand-ups, and retrospectives. Tasks are broken into user stories, prioritized in a backlog, and delivered incrementally, allowing us to adapt quickly to new requirements and ensure continuous improvement.
Stored Procedures: Used for batch processing or repetitive database tasks that require complex logic.
Views: Used to simplify querying for BI reports by providing a specific perspective on the data.
Triggers: Utilized for enforcing data integrity and business rules, such as auditing changes in critical tables.
Processing 1GB of data can take anywhere from a few seconds to minutes, depending on the complexity of transformations and available cluster resources. Simple ETL jobs might complete quickly, while jobs involving joins or aggregations could take longer.
This varies, but a typical data engineering role may involve maintaining dozens to hundreds of jobs, depending on data source volume, frequency of updates, and complexity of transformations.
Cluster mode: The Spark driver runs on the cluster, making it more resilient and suitable for production jobs where long-running tasks are common.
Client mode: The driver runs on the machine that submitted the job, which is helpful for interactive analysis and debugging but less resilient for large, distributed jobs.
Techniques include:
- Data partitioning to optimize reads and writes.
- Caching frequently used data.
- Broadcast joins to optimize joins with small tables (see the sketch after this list).
- Reducing shuffles by controlling partition sizes and join strategies.
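A short PySpark sketch of a broadcast join; the table names, paths, and join column are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast-join-example").getOrCreate()

# Hypothetical inputs: a large fact table and a small lookup table (paths are placeholders).
orders = spark.read.parquet("s3://my-bucket/orders/")
countries = spark.read.parquet("s3://my-bucket/countries/")

# Broadcasting the small table ships a copy to every executor, so the join
# runs locally on each partition of the large table and avoids shuffling it.
joined = orders.join(F.broadcast(countries), on="country_code", how="left")
```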
Common errors include out-of-memory failures, schema mismatches, and S3 connectivity problems. These are often resolved by adjusting memory allocation, validating schemas, and correcting network configurations.
A Directed Acyclic Graph (DAG) represents the sequence of stages in a computation. In Spark, the DAG records the dependencies between transformations, which lets Spark optimize execution by ordering and pipelining tasks, splitting work into stages at shuffle boundaries, and recomputing lost partitions for fault tolerance.
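To make this concrete, the sketch below (with an illustrative path and columns) chains lazy transformations that only extend the DAG; explain() shows the plan Spark derives from it, and the action at the end triggers execution:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dag-example").getOrCreate()

# Each transformation extends the DAG without running anything yet.
events = spark.read.parquet("s3://my-bucket/events/")  # placeholder path
summary = (
    events
    .filter(F.col("status") == "ok")
    .groupBy("country")
    .count()
)

summary.explain()  # prints the plan built from the DAG; stages split at shuffle boundaries
summary.show(10)   # the action: the whole DAG is now scheduled and executed
```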
Yes, out-of-memory issues can occur with large datasets. Solutions include increasing executor memory, optimizing data partitioning, reducing shuffle data, and using off-heap storage if available.
This depends on data needs. In high-demand environments, it’s common to manage 10–20 jobs daily, including monitoring and troubleshooting.
Monthly data volumes can reach multiple terabytes, especially with weekly extractions from databases such as RDS or MySQL, depending on source data volume and ETL pipeline complexity.
Libraries include Pandas, Boto3, and PySpark. External libraries are packaged as .zip files and added to Glue jobs via S3 paths in the job configuration.
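As an illustration, a Glue job that pulls in zipped dependencies through the --extra-py-files argument might be defined with boto3 as below; the job name, IAM role ARN, and S3 paths are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical job definition: zipped Python libraries live in S3 and are handed
# to the job through the --extra-py-files default argument.
glue.create_job(
    Name="example-etl-job",
    Role="arn:aws:iam::123456789012:role/ExampleGlueRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/etl_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--extra-py-files": "s3://my-bucket/libs/dependencies.zip",
    },
    GlueVersion="4.0",
)
```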
Our team follows an agile workflow, collaborating closely to build and maintain data pipelines. My key role is developing ETL pipelines, optimizing data processing, and ensuring data quality and compliance.
Common transformations include data cleansing, filtering, normalization, feature engineering, and aggregations.
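A small PySpark sketch that walks through those transformation types on a hypothetical orders dataset; the columns and path are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transformations-example").getOrCreate()

orders = spark.read.parquet("s3://my-bucket/orders/")  # placeholder path

cleaned = (
    orders
    .dropDuplicates(["order_id"])                                  # cleansing: drop duplicate records
    .filter(F.col("amount") > 0)                                   # filtering: remove invalid rows
    .withColumn("amount_usd", F.col("amount") / F.col("fx_rate"))  # normalization to a common currency
    .withColumn("order_date", F.to_date("order_ts"))               # feature engineering: derive a date column
)

# Aggregation: daily totals for downstream reporting.
daily_totals = cleaned.groupBy("order_date").agg(F.sum("amount_usd").alias("total_usd"))
```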
Yes, scripts are usually created for each job. Testing involves unit tests, data validation, and dry runs to ensure logic is correct before deployment.
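A minimal example of how such a unit test might look using pytest and a local SparkSession; the transformation under test is a made-up helper:

```python
import pytest
from pyspark.sql import SparkSession, functions as F

# Hypothetical transformation under test: keep only rows with a positive amount.
def keep_positive_amounts(df):
    return df.filter(F.col("amount") > 0)

@pytest.fixture(scope="module")
def spark():
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

def test_keep_positive_amounts(spark):
    df = spark.createDataFrame([(1, 10.0), (2, -5.0)], ["id", "amount"])
    result = keep_positive_amounts(df).collect()
    assert [row.id for row in result] == [1]
```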
It depends on complexity: a dataset of several GB that goes through multiple transformations and validations could take anywhere from a few hours to a few days to process.
AWS Glue jobs are monitored using AWS CloudWatch, which provides logs, job statuses, and performance metrics, allowing for quick identification and resolution of issues.
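For example, recent run statuses can also be pulled programmatically with boto3 (the job name is a placeholder), while detailed driver and executor logs land in the job's CloudWatch log groups:

```python
import boto3

glue = boto3.client("glue")

# Check the most recent runs of a hypothetical job; by default, detailed logs go to
# the /aws-glue/jobs/output and /aws-glue/jobs/error CloudWatch log groups.
runs = glue.get_job_runs(JobName="example-etl-job", MaxResults=5)
for run in runs["JobRuns"]:
    print(run["Id"], run["JobRunState"], run.get("ErrorMessage", ""))
```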