1. How do you deploy Python code on AWS?
Ans: The AWS SDK for Python (Boto3) enables your Python code to interact with AWS services such as Amazon S3. In practice, you package the code with its dependencies and run it on an AWS compute service (for example Lambda, EC2, or a Glue job), using Boto3 to call the other services it needs.
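For instance, once the code is running on AWS, a minimal Boto3 sketch for interacting with S3 might look like the following; the bucket and key names are placeholders, not real resources:

import boto3

# Create an S3 client (credentials come from the environment or an IAM role)
s3 = boto3.client("s3")

# Upload a local file to a bucket (bucket and key names are placeholders)
s3.upload_file("report.csv", "my-example-bucket", "raw/report.csv")

# List the objects under a prefix to confirm the upload
response = s3.list_objects_v2(Bucket="my-example-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])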
2. What is versioning in S3?
Ans: You can use S3 Versioning to keep multiple versions of an object in one bucket, which lets you restore objects that are accidentally deleted or overwritten. For example, if you delete an object, instead of removing it permanently, Amazon S3 inserts a delete marker, which becomes the current object version.
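A small Boto3 sketch for enabling versioning on a bucket and listing the versions of an object (the bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")

# Enable versioning on the bucket (placeholder name)
s3.put_bucket_versioning(
    Bucket="my-example-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# List all stored versions of a given key
versions = s3.list_object_versions(Bucket="my-example-bucket", Prefix="raw/report.csv")
for v in versions.get("Versions", []):
    print(v["Key"], v["VersionId"], v["IsLatest"])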
3. How do you create a crawler?
Ans: To create a crawler that reads files stored on Amazon S3: on the AWS Glue service console, choose Crawlers from the left-side menu, then on the Crawlers page choose Add crawler. This starts a series of pages that prompt you for the crawler details. In the Crawler name field, enter a name such as Flights Data Crawler, choose Next, fill in the remaining pages, and submit.
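The same crawler can also be created programmatically. A minimal Boto3 sketch, assuming an existing Glue service role and Data Catalog database (all names and paths are placeholders):

import boto3

glue = boto3.client("glue")

# Create a crawler that catalogs the files under an S3 prefix
glue.create_crawler(
    Name="flights-data-crawler",
    Role="AWSGlueServiceRole-example",   # existing IAM role for Glue (placeholder)
    DatabaseName="flights_db",           # target Glue Data Catalog database (placeholder)
    Targets={"S3Targets": [{"Path": "s3://my-example-bucket/flights/"}]},
)

# Run the crawler to populate the Data Catalog
glue.start_crawler(Name="flights-data-crawler")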
4. How do you create a cluster?
Ans: From the navigation bar, select the Region to use. In the navigation pane, choose Clusters. On the Clusters page, choose Create Cluster. For Select cluster compatibility, choose one of the following options and then choose Next Step.
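The console steps above match the Amazon ECS console; assuming an ECS cluster is what is meant, the equivalent API call in Boto3 is a one-liner (the cluster name is a placeholder):

import boto3

ecs = boto3.client("ecs")

# Create an empty ECS cluster; capacity (EC2 or Fargate) is attached separately
cluster = ecs.create_cluster(clusterName="my-example-cluster")
print(cluster["cluster"]["clusterArn"])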
5. What did you do in Athena?
Ans: Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3, such as CSV, JSON, or columnar formats like Apache Parquet and Apache ORC. You can use Athena to run ad-hoc queries using ANSI SQL without needing to aggregate or load the data into Athena first. In our project, we mainly used Athena for data validation.
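A typical validation query can be submitted from Python with Boto3; a minimal sketch, with placeholder database, table, and output location:

import boto3
import time

athena = boto3.client("athena")

# Submit an ad-hoc validation query (database, table, and output bucket are placeholders)
query = athena.start_query_execution(
    QueryString="SELECT COUNT(*) AS row_count FROM flights_db.flights",
    QueryExecutionContext={"Database": "flights_db"},
    ResultConfiguration={"OutputLocation": "s3://my-example-bucket/athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the result
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)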
6. What is ETL?
Ans: ETL stands for extract, transform, and load, and it is a traditionally accepted way for organizations to combine data from multiple systems into a single database, data store, data warehouse, or data lake.
OR
ETL ->
➢ Extraction: Data is taken from one or more sources or systems. The extraction locates and identifies relevant data, then prepares it for processing or transformation. Extraction allows many different kinds of data to be combined and ultimately mined for business intelligence.
➢ Transformation: Once the data has been successfully extracted, it is ready to be
refined. During the transformation phase, data is sorted, organized, and cleansed.
For example, duplicate entries will be deleted, missing values removed or enriched,
and audits will be performed to produce data that is reliable, consistent, and usable.
➢ Loading: The transformed, high quality data is then delivered to a single, unified
target location for storage and analysis.
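A minimal PySpark sketch of the three stages, with placeholder S3 paths and column names:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("simple_etl").getOrCreate()

# Extraction: read raw CSV data from a source location (placeholder path)
raw_df = spark.read.csv("s3://my-example-bucket/raw/orders/", header=True, inferSchema=True)

# Transformation: cleanse and standardize the data
clean_df = (
    raw_df
    .dropDuplicates()                                   # delete duplicate entries
    .na.drop(subset=["order_id"])                       # drop rows missing the key column
    .withColumn("order_date", F.to_date("order_date"))  # normalize the date column
)

# Loading: deliver the refined data to a single target location
clean_df.write.mode("overwrite").parquet("s3://my-example-bucket/curated/orders/")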
Databricks Interview Questions
1.What is Databricks, and how does it differ from other big data processing frameworks like Hadoop and Spark?
2.Can you walk us through the process of creating a new Databricks cluster and configuring it for your specific use case?
3.How do you optimize performance when working with large data sets in Databricks?
4.How do you handle data security in Databricks, especially when dealing with sensitive data?
5.What are some common data transformations and analyses you can perform using Databricks, and what are the advantages of using Databricks for these tasks?
6.Can you describe a time when you used Databricks to solve a challenging data problem, and how you went about tackling that problem?
7.How do you handle errors and debugging when working with Databricks notebooks or jobs?
8.How do you monitor and track usage and performance of your Databricks clusters and jobs?
9.Can you walk us through a typical workflow for developing and deploying a Databricks-based data pipeline?
10.What are some best practices for optimizing cost and resource utilization when working with Databricks clusters?
Real time interview questions
1. What are your data sources?
Ans: Our data sources include a data lake on S3, flat files such as CSV and Excel, and databases.
2. What is the latency of your data?
Ans: It depends on the business requirement; sometimes we run weekly jobs and sometimes monthly data pipelines.
3. What is the volume of your data on a daily basis?
Ans: Around 10 GB of data is processed daily.
4. How many tables do you have in your storage?
Ans: I haven't counted them exactly, but roughly 300 to 400, possibly more.
5. What transformations do you use on a daily basis?
Ans: We commonly use withColumn, distinct, joins, union, date formatting, dropDuplicates, and filter.
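A short PySpark sketch showing those operations together; orders_df, customers_df, and backfill_df are hypothetical input DataFrames and the column names are assumptions:

from pyspark.sql import functions as F

daily_df = (
    orders_df
    .filter(F.col("amount") > 0)                                           # filter
    .dropDuplicates(["order_id"])                                          # dropDuplicates
    .withColumn("batch_date", F.date_format("created_at", "yyyy-MM-dd"))   # withColumn + date formatting
    .join(customers_df, on="customer_id", how="left")                      # join
)

# distinct and union
all_regions_df = daily_df.select("region").distinct().union(backfill_df.select("region"))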
6. How do you handle incremental data in your project or pipeline?
Ans: In the pipeline we write data per batch date and overwrite only the new data into the final table.
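A hedged sketch of that pattern in PySpark, assuming an existing SparkSession named spark: with dynamic partition overwrite enabled, writing the current batch replaces only that batch date's partition in the final table (the path and column name are placeholders, and incremental_df is a hypothetical DataFrame for the current batch):

# Overwrite only the partitions present in the incoming batch, not the whole table
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(
    incremental_df
    .write
    .mode("overwrite")
    .partitionBy("batch_date")
    .parquet("s3://my-example-bucket/final/orders/")
)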
7. Where are you using partitioned tables?
Ans: We mostly use partitioned tables in the target, partitioned on the batch date. Partitioning a table is very important: it keeps queries simple and also helps Power BI process those queries faster.
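The benefit on the query side is partition pruning: filtering on the partition column means only the matching batch-date folders are scanned. A small sketch, assuming a SparkSession named spark and placeholder path and date:

from pyspark.sql import functions as F

# Only the batch_date=2024-01-15 partition is scanned, not the whole table
one_day_df = (
    spark.read.parquet("s3://my-example-bucket/final/orders/")
    .filter(F.col("batch_date") == "2024-01-15")
)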
8. What is your final file format, and why do you use the Parquet format?
Ans: We use the Parquet format. Since we use Spark, Parquet works very well with it; it has strong compression, supports nested structures, and stores data in a columnar format.
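A short sketch illustrating those points: compression is configurable at write time, and columnar reads only touch the selected columns (clean_df is a hypothetical DataFrame, spark an existing SparkSession, and the paths are placeholders):

# Write with explicit compression (Snappy is also the Parquet default)
clean_df.write.option("compression", "snappy").mode("overwrite").parquet(
    "s3://my-example-bucket/curated/orders/"
)

# Columnar format: reading only two columns avoids scanning the rest of the file
spark.read.parquet("s3://my-example-bucket/curated/orders/").select("order_id", "amount").show(5)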
9. How did you submit a Spark job?
Ans: Using the spark-submit script; for details see:
https://sparkbyexamples.com/spark/spark-submit-command/
or
https://spark.apache.org/docs/latest/submitting-applications.html#:~:text=The%20spark%2Dsubmit%20script%20in,application%20especially%20for%20each%20one.
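In short, a typical spark-submit invocation for a PySpark job might look like the following; the cluster manager, resource numbers, and file path are assumptions rather than fixed values:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 8g \
  s3://my-example-bucket/jobs/daily_pipeline.py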
10. How do you decide the parameters and resources to configure for a Spark job?
Ans: It depends on the file size. If the file being processed is large, we look at the number of executors and then at how we can increase executor cores and memory so the data pipeline executes faster; generally, though, there is a default set of parameters that we use.
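The same resources can also be set programmatically when building the session; a sketch with assumed values, not tuned recommendations:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily_pipeline")
    .config("spark.executor.instances", "4")   # number of executors
    .config("spark.executor.cores", "4")       # cores per executor
    .config("spark.executor.memory", "8g")     # memory per executor
    .getOrCreate()
)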
11. Have you ever used repartition?
Ans: Yes, but only a few times, because it is a very costly operation that shuffles data across partitions, so we do not use it on a daily basis.
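For illustration, repartition triggers a full shuffle, while coalesce only merges existing partitions; a small sketch with assumed numbers and a hypothetical daily_df DataFrame:

# Full shuffle into 200 partitions keyed by batch_date (expensive on large data)
repartitioned_df = daily_df.repartition(200, "batch_date")

# Cheaper alternative when only reducing the partition count before a write
compacted_df = daily_df.coalesce(10)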
12. What are the common errors you face while running a data pipeline?
Ans:
- Syntax errors
- Data type mismatches
- Missing values or corrupted data
- Lack of resources
- Connection issues
- Permission issues
13. How did you solve data pipeline issues?
Ans:
- Correct the syntax.
- Use data validation or data cleansing tools to correct data types and handle missing values.
- Optimize the performance of the pipeline by using efficient algorithms, reducing the size of the data, or scaling up compute resources; also monitor resource usage and adjust the pipeline accordingly.
- Configure retries or error-handling mechanisms in the pipeline to handle network or connection errors (see the sketch after this list).
- Ensure that the pipeline has the necessary permissions to access data and perform operations by configuring access control and security mechanisms.
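As an illustration of the retry point above, a generic Python sketch of retrying a flaky step with exponential backoff (extract_from_source is a hypothetical function, and the numbers are assumptions):

import time

def run_with_retries(step, max_attempts=3, base_delay=5):
    """Run a pipeline step, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except (ConnectionError, TimeoutError) as err:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            wait = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({err}); retrying in {wait}s")
            time.sleep(wait)

# Usage: wrap a flaky step, e.g. an extraction call (hypothetical)
# run_with_retries(lambda: extract_from_source("s3://my-example-bucket/raw/"))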