PySpark Interview Questions

By Yogesh

14/11/2024

What is Apache Spark?

Apache Spark is an open-source, distributed computing system designed for fast processing of large datasets. PySpark is its Python API.

What are the key features of PySpark?

– Distributed computing for large-scale data processing.
– High-level APIs for data manipulation (DataFrame and SQL).
– Fault tolerance via RDDs.
(A small sketch of these APIs follows below.)
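A minimal sketch of the two API levels, assuming PySpark is installed and runnable locally (the sample data and column names are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataFrame API: high-level, SQL-like manipulation on distributed data
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()

# RDD API: lower-level; fault tolerant via lineage, so lost partitions can be recomputed
rdd = spark.sparkContext.parallelize(range(10))
print(rdd.map(lambda x: x * x).reduce(lambda a, b: a + b))
```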

What is a SparkSession in PySpark?

SparkSession is the entry point to PySpark functionality. It is responsible for creating DataFrames, executing SQL queries, and managing configurations. It replaces SQLContext and HiveContext in earlier versions of Spark.
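A short sketch of creating a session and using it for both DataFrame and SQL work; the app name, sample rows, and view name here are illustrative assumptions:

```python
from pyspark.sql import SparkSession

# Build (or reuse) the single entry point for DataFrame and SQL functionality
spark = (
    SparkSession.builder
    .appName("interview-prep")   # hypothetical app name
    .getOrCreate()
)

# Create a DataFrame and query it with SQL through the same session
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE id = 1").show()
```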

What is the difference between select() and selectExpr() in PySpark?

– select(): Used for selecting columns directly by name or by applying functions to columns.
– selectExpr(): Allows using SQL expressions to select and manipulate columns, providing more flexibility with complex expressions.
(See the comparison sketch below.)
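A sketch showing the same projection written both ways; the sample columns (name, salary) and the derived column names are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "Alice", 3000), (2, "Bob", 4000)],
    ["id", "name", "salary"],
)

# select(): pick columns by name or apply column functions
df.select("name", (F.col("salary") * 1.1).alias("salary_with_raise")).show()

# selectExpr(): the same projection expressed as SQL strings,
# which makes complex expressions (CASE, casts, etc.) more compact
df.selectExpr(
    "name",
    "salary * 1.1 AS salary_with_raise",
    "CASE WHEN salary > 3500 THEN 'high' ELSE 'low' END AS band",
).show()
```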