Apache Spark is an open-source, distributed computing system designed for fast processing of large datasets.
– Distributed computing for large-scale data processing.
– High-level APIs for data manipulation (DataFrame and SQL).
– Fault tolerance via RDDs (resilient distributed datasets); a short sketch follows below.
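A minimal sketch of the low-level RDD API to illustrate parallel processing and lineage-based fault tolerance; the local master, app name, and sample data are assumptions for this example, not from the original.

```python
from pyspark import SparkContext

# Local master and app name are illustrative assumptions.
sc = SparkContext(master="local[*]", appName="rdd-demo")

# Data is split into partitions and transformed in parallel; if a partition
# is lost, Spark recomputes it from the recorded lineage (parallelize -> map).
rdd = sc.parallelize(range(1000), numSlices=4)
squares = rdd.map(lambda x: x * x)
print(squares.sum())

sc.stop()
```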
– SparkSession is the entry point to PySpark functionality. It is responsible for creating DataFrames, executing SQL queries, and managing configurations. Introduced in Spark 2.0, it replaces the SQLContext and HiveContext used in earlier versions of Spark.
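A minimal sketch of creating a SparkSession and using it to build a DataFrame, run a SQL query, and set a configuration; the app name, master URL, config value, and sample rows are illustrative assumptions.

```python
from pyspark.sql import SparkSession

# Build (or reuse) the single entry point; the options below are examples only.
spark = (
    SparkSession.builder
    .appName("example-app")                        # hypothetical app name
    .master("local[*]")                            # run locally for the sketch
    .config("spark.sql.shuffle.partitions", "4")   # example configuration
    .getOrCreate()
)

# Create a DataFrame and run a SQL query through the same session.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE id = 1").show()

spark.stop()
```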
What is the difference between select() and selectExpr() in PySpark?
– select(): Used for selecting columns directly by name or by applying functions to columns.
– selectExpr(): Allows using SQL expression strings to select and manipulate columns, providing more flexibility with complex expressions.
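A short comparison sketch of the two; the sample DataFrame and the expressions are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("select-demo").getOrCreate()
df = spark.createDataFrame(
    [(1, "alice", 3000), (2, "bob", 4500)],
    ["id", "name", "salary"],
)

# select(): column names or Column expressions built with functions/operators.
df.select("name", (F.col("salary") * 1.1).alias("raised")).show()

# selectExpr(): the same projection written as SQL expression strings.
df.selectExpr("name", "salary * 1.1 AS raised", "upper(name) AS name_upper").show()

spark.stop()
```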