1. Why are you loading data into the Hive metastore?
Ans. For security purposes and to follow naming conventions as per the client's requirements.
No need to extract the data from the bucket again.
For data analysis purposes.
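A minimal sketch of what this looks like in practice (the bucket path, database, and table names are hypothetical): the raw files are read once from the bucket and registered as a managed table in the Hive metastore, so analysts can query it by name instead of re-extracting from storage.

```python
# Read the raw files from the bucket once (hypothetical path)
df = spark.read.parquet("s3://my-bucket/raw/orders/")

# Register the data as a managed table in the Hive metastore
(df.write
   .format("delta")
   .mode("overwrite")
   .saveAsTable("sales_db.orders"))

# Later analysis can query the table by name, no bucket access needed
spark.sql("SELECT COUNT(*) FROM sales_db.orders").show()
```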
2. How do you schedule a data pipeline?
Ans. In Databricks, at the time of creating a job we can schedule the pipeline using Workflows.
There is a direct scheduling option available in Workflows.
We can also schedule a notebook in the same way.
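A minimal sketch, assuming a hypothetical notebook path, cluster ID, and workspace URL, of creating such a scheduled job programmatically; the payload shape follows the Databricks Jobs API 2.1 create request, and the same schedule can be set through the Workflows UI.

```python
import requests

job_spec = {
    "name": "daily_sales_pipeline",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest_orders"},
            "existing_cluster_id": "1234-567890-abcde123",
        }
    ],
    # Quartz cron expression: run every day at 02:00 UTC
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
}

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",   # replace with your workspace URL
    headers={"Authorization": "Bearer <token>"},     # replace with a personal access token
    json=job_spec,
)
print(resp.json())
```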
3. If your job is aborted, what can you do?
Ans. Check the reason behind the failure: review the code, and check whether the cluster type and worker type are appropriate for the size of the data.
4. How do you select a cluster configuration?
Ans. We select the cluster configuration based on the data size.
The available options are all-purpose compute and job compute.
When testing a job I use job compute.
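As an illustration only, a minimal sketch of a job-cluster spec sized to the data volume; the field names follow the Databricks clusters/Jobs API, while the node type and worker counts are assumptions to be tuned per workload.

```python
# Hypothetical job-cluster configuration, sized by expected data volume
job_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",                     # worker type chosen for the data size
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "spark_conf": {
        "spark.sql.shuffle.partitions": "200"        # tune for the expected shuffle size
    },
}
```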
5. What complications did you face in your project?
Ans. In one scenario there was a large volume of data in batch processing, so the job aborted many times. We checked the cluster configuration and resized the nodes based on the data size, but it did not work. We then contacted the client, explained the complication, suggested stream processing, and converted the batch pipeline to stream processing.
Stream processing is a data management technique that involves ingesting a continuous data stream to quickly analyze, filter, transform or enhance the data in real time.
In another scenario, the job aborted due to a stage failure.
Reason: the total size of the serialized results of 171 tasks in the job was bigger than the driver's allowed result size.
Solution: we used the Spark configuration spark.driver.maxResultSize and raised it to 40g.
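A minimal sketch of applying that fix when building the session; the 40g value mirrors the solution above (on Databricks this is usually set in the cluster's Spark config rather than in code), and the app name is hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("batch_job")
    # Raise the limit that triggers "Total size of serialized results of N tasks
    # is bigger than spark.driver.maxResultSize" (default is 1g)
    .config("spark.driver.maxResultSize", "40g")
    .getOrCreate()
)
```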
6. Are there any layers?
Ans. Databricks proposes 3 layers of storage:
Bronze (raw data),
Silver (clean data),
and Gold (aggregated data).
It is clear what each of these storage layers is meant to store.
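A minimal sketch (assumed bucket path, table names, and columns) of moving data through those three layers with Delta tables.

```python
from pyspark.sql import functions as F

# Bronze: land the raw files as-is
raw = spark.read.json("s3://my-bucket/raw/orders/")            # hypothetical source
raw.write.format("delta").mode("append").saveAsTable("bronze.orders")

# Silver: clean and deduplicate the raw records
clean = (
    spark.table("bronze.orders")
    .dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
)
clean.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Gold: aggregate for reporting
daily = clean.groupBy(F.to_date("order_ts").alias("order_date")).agg(
    F.sum("amount").alias("daily_revenue")
)
daily.write.format("delta").mode("overwrite").saveAsTable("gold.daily_revenue")
```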
7. How do you orchestrate the code?
Ans. Orchestration is the automated configuration, management, and coordination of computer
systems, applications, and services stringing together multiple tasks in order to execute a
larger workflow or process. These processes can consist of multiple tasks that are automated
and can involve multiple systems.
8. Which tool do you use for scheduling?
Ans. In Databricks Workflows there is a scheduling function available, and that is the function I used for
scheduling.
For scheduled jobs there are two types of trigger:
Scheduled and continuous.
9. In which cases do you use Python and in which cases do you use PySpark?
Ans. In the framework there are different modules containing classes, and in those classes we created different
methods. In that framework we used OOP features such as inheritance and constructors, and at the
time of reading and writing I use try/except blocks, which are part of Python.
PySpark is used for reading and writing the files or data and for applying transformations,
for example adding a timestamp with withColumn, select, selectExpr, etc. (see the sketch below).
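A minimal sketch (the class, path, and column names are hypothetical) of how plain Python (a class with a constructor and try/except) and PySpark (read, withColumn, selectExpr) fit together in such a framework module.

```python
from pyspark.sql import SparkSession, functions as F

class OrdersLoader:
    """Plain-Python part: class, constructor, and error handling."""

    def __init__(self, spark: SparkSession, source_path: str):
        self.spark = spark
        self.source_path = source_path

    def load_and_transform(self):
        try:
            # PySpark part: read the data and apply transformations
            df = self.spark.read.parquet(self.source_path)
            return (
                df.withColumn("load_ts", F.current_timestamp())
                  .selectExpr("order_id", "amount", "load_ts")
            )
        except Exception as err:
            # Python try/except wraps the read/transform step
            print(f"Failed to load {self.source_path}: {err}")
            raise
```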
10. In which use cases or scenarios did you use pandas?
Ans. Pandas makes it simple to do many of the time-consuming, repetitive tasks associated with working
with data, including data cleansing, data normalization, merges and joins, and much more.
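A minimal sketch, using made-up data, of those repetitive tasks in pandas: cleansing, a simple normalization, and a merge/join.

```python
import pandas as pd

orders = pd.DataFrame(
    {"order_id": [1, 2, 2], "customer_id": [10, 20, 20], "amount": [100.0, None, 250.0]}
)
customers = pd.DataFrame({"customer_id": [10, 20], "name": ["Asha", "Ravi"]})

# Cleansing: drop duplicate orders and fill missing amounts
clean = orders.drop_duplicates(subset="order_id").fillna({"amount": 0.0})

# Normalization: z-score the amount column
clean["amount_norm"] = (clean["amount"] - clean["amount"].mean()) / clean["amount"].std()

# Merge/join: enrich orders with customer names
report = clean.merge(customers, on="customer_id", how="left")
print(report)
```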