1. How do you deploy Python code on AWS?
Ans: The AWS SDK for Python (Boto3) enables you to use Python code to interact with AWS services such as Amazon S3.
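For example, a minimal Boto3 sketch that uploads a script to S3 (the bucket and file names are hypothetical, and it assumes AWS credentials are already configured):

import boto3

# Create an S3 client (assumes credentials are set up, e.g. via `aws configure`)
s3 = boto3.client("s3")

# Upload a local file to a bucket ("my-bucket" is a placeholder name)
s3.upload_file("app.py", "my-bucket", "code/app.py")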
2. Explain the architecture of PySpark.
Ans: On the master node, you have the driver program, which drives your application.
Inside the driver program, the first thing you do is create a SparkContext. Think of the SparkContext as the gateway to all Spark functionality.
This SparkContext works with the cluster manager to manage various jobs. The driver program and SparkContext take care of job execution within the cluster. A job is split into multiple tasks, which are distributed over the worker nodes.
If you increase the number of workers, you can divide jobs into more partitions and execute them in parallel over multiple systems, which is much faster.
As the number of workers increases, the total memory available also increases, and you can cache data to speed up execution.
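As a minimal sketch of these pieces, here is how a driver program creates the entry point and splits a job into parallel tasks (the app name and partition count are arbitrary choices, not requirements):

from pyspark.sql import SparkSession

# The driver program creates a SparkSession; its sparkContext is the
# gateway to Spark functionality described above
spark = SparkSession.builder.appName("architecture-demo").master("local[4]").getOrCreate()
sc = spark.sparkContext

# A job: the data is split into 8 partitions so tasks run in parallel on workers
rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.map(lambda x: x * 2).sum())   # action that triggers the job

spark.stop()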
3. Write a program to find the second largest number in a list.
l = [10, 20, 30, 30, 30, 40, 4, 4, 4, 4]
max1 = float('-inf')   # largest value seen so far
smax = float('-inf')   # second largest value seen so far
for i in l:
    if i > max1:
        smax = max1
        max1 = i
    elif smax < i and i != max1:
        smax = i
print(smax)   # 30
OR
def secondmax(l):
    # keep everything smaller than the maximum, then take the maximum of the rest
    list1 = [i for i in l if i < max(l)]
    return max(list1)

print(secondmax([10, 20, 30, 30, 30, 4, 4, 4, 4, 4]))   # 20
4. Write a query to fetch details of employees whose EmpLname ends with the letter 'a' and contains five letters.
SELECT * FROM employee
WHERE EmpLname LIKE '%a' AND CHAR_LENGTH(EmpLname) = 5;
5. What is versioning in S3?
You can use S3 Versioning to keep multiple versions of an object in one bucket and enable you to restore objects that are accidentally deleted or overwritten. For example, if you delete an object, instead of removing it permanently, Amazon S3 inserts a delete marker, which becomes the current object version.
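A minimal Boto3 sketch of enabling versioning and inspecting object versions (the bucket name and prefix are placeholders):

import boto3

s3 = boto3.client("s3")

# Turn on versioning for a bucket ("my-bucket" is a hypothetical name)
s3.put_bucket_versioning(
    Bucket="my-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# List all versions (and delete markers) of objects under a prefix
versions = s3.list_object_versions(Bucket="my-bucket", Prefix="reports/")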
6. How do you create a crawler?
To create a crawler that reads files stored on Amazon S3:
On the AWS Glue service console, on the left-side menu, choose Crawlers. On the Crawlers page, choose Add crawler. This starts a series of pages that prompt you for the crawler details. In the Crawler name field, enter Flights Data Crawler, and choose Next.
Complete the remaining pages (data store, IAM role, schedule, and target database) and submit.
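The same crawler can also be created programmatically; a hedged Boto3 sketch, where the IAM role ARN, database name, and S3 path are placeholders you would replace:

import boto3

glue = boto3.client("glue")

# Create a crawler that reads files from S3 (role, database, and path are hypothetical)
glue.create_crawler(
    Name="Flights Data Crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="flights_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/flights/"}]},
)

# Run it once to populate the Data Catalog
glue.start_crawler(Name="Flights Data Crawler")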
7. How do you create a cluster?
From the navigation bar, select the Region to use.
In the navigation pane, choose Clusters.
On the Clusters page, choose Create Cluster.
For Select cluster compatibility, choose one of the following options, and then choose Next Step.
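Assuming the console flow above refers to Amazon ECS (where the "Select cluster compatibility" page appears), the equivalent Boto3 call is roughly:

import boto3

# Region and cluster name are placeholders
ecs = boto3.client("ecs", region_name="us-east-1")
response = ecs.create_cluster(clusterName="demo-cluster")
print(response["cluster"]["clusterArn"])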
8. How do you fetch even and odd records from a table?
Query to find even records:
SELECT * FROM EMPLOYEE
WHERE id IN (SELECT id FROM EMPLOYEE WHERE id % 2 = 0);
9. Query to find odd records:
SELECT * FROM EMPLOYEE
WHERE id IN (SELECT id FROM EMPLOYEE WHERE id % 2 <> 0);
10. Write a query to retrieve duplicate records from a table.
SELECT OrderID, COUNT(OrderID)
FROM Orders
GROUP BY OrderID
HAVING COUNT(OrderID) > 1;
11. What did you do in Athena?
Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3. Examples include CSV, JSON, or columnar data formats such as Apache Parquet and Apache ORC. You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena.
Basically, we perform data validation using Athena.
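For example, a validation query can be submitted from Python with Boto3; a sketch, assuming a hypothetical flights_db database, flights table, and S3 output location:

import boto3

athena = boto3.client("athena")

# Submit an ad-hoc validation query (database, table, and output bucket are placeholders)
response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM flights WHERE flight_id IS NULL",
    QueryExecutionContext={"Database": "flights_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])   # use this id to poll for the result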
12. What is ETL?
ETL stands for extract, transform, and load, and is a traditionally accepted way for organizations to combine data from multiple systems into a single database, data store, data warehouse, or data lake.
OR
ETL ->
➢ Extraction: Data is taken from one or more sources or systems. The extraction locates and identifies relevant data, then prepares it for processing or transformation. Extraction allows many different kinds of data to be combined and ultimately mined for business intelligence.
➢ Transformation: Once the data has been successfully extracted, it is ready to be refined. During the transformation phase, data is sorted, organized, and cleansed. For example, duplicate entries are deleted, missing values are removed or enriched, and audits are performed to produce data that is reliable, consistent, and usable.
➢ Loading: The transformed, high-quality data is then delivered to a single, unified target location for storage and analysis.
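A minimal PySpark sketch of the three phases (the file paths and column names are illustrative only):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw data from a source (path is hypothetical)
raw = spark.read.option("header", True).csv("s3://my-bucket/raw/orders.csv")

# Transform: deduplicate, drop rows with missing keys, fix a column's type
clean = (raw.dropDuplicates(["order_id"])
            .filter(F.col("order_id").isNotNull())
            .withColumn("amount", F.col("amount").cast("double")))

# Load: deliver the refined data to a single target location
clean.write.mode("overwrite").parquet("s3://my-bucket/curated/orders/")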
13. Fetch the 5th highest salary without using LIMIT or TOP.
SELECT ename, salary
FROM (SELECT ename, salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS r FROM Emp) t
WHERE r = 5;
14. Write a query to fetch records with duplicates dropped (only ids that appear exactly once).
SELECT id, COUNT(id)
FROM table_name
GROUP BY id
HAVING COUNT(id) = 1;
15. Now, assume that you have two tables: “employees” and “salaries”.
The employees table has basic information: ID, first name, last name, email address, address, etc.
The salaries table has employee-id and salary. The query to be executed is the same: “List names of all the employees whose salary is greater than or equal to Rita's salary”.
Ans: SELECT e.first_name
FROM employees e
INNER JOIN salaries s ON e.id = s.employee_id
WHERE s.salary >= (SELECT s2.salary
                   FROM employees e2
                   INNER JOIN salaries s2 ON e2.id = s2.employee_id
                   WHERE e2.first_name = 'Rita');
16. How will you convert a PySpark DataFrame into a pandas DataFrame?
pandasDF = pysparkDF.toPandas()
print(pandasDF)
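A self-contained version of the same conversion (note that toPandas() collects every row to the driver, so it is only safe for data that fits in driver memory; the example DataFrame here is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("to-pandas-demo").getOrCreate()

# A tiny example DataFrame (hypothetical data)
pysparkDF = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

# Convert to pandas; all rows are pulled onto the driver
pandasDF = pysparkDF.toPandas()
print(pandasDF)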