Optimizing slow queries can involve several techniques:

- Create indexes on the columns used in WHERE, JOIN, and GROUP BY clauses.
- Avoid SELECT *, which retrieves all columns and can slow down performance; select only the columns you need.
- Restructure complex subqueries as JOINs or Common Table Expressions (CTEs).
- Use the LIMIT clause to restrict the number of records retrieved.
- Use EXPLAIN to analyze how the database executes the query and identify potential bottlenecks.

For ETL pipelines in Python, libraries such as pandas and SQLAlchemy can be used for extraction, pandas can be used for the transformation step, and pyodbc or SQLAlchemy can be used for loading data into SQL databases. Example:
import pandas as pd
from sqlalchemy import create_engine
# Extract
data = pd.read_csv('data.csv')
# Transform
data['price'] = data['price'].fillna(data['price'].mean())
# Load
engine = create_engine('mysql://user:password@host/dbname')
data.to_sql('processed_data', engine, if_exists='replace')
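Returning to the query-optimization tips above, they can be sketched with Python's built-in sqlite3 module (the table, column, and index names here are hypothetical, and EXPLAIN syntax varies by database; SQLite uses EXPLAIN QUERY PLAN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
cur.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

# Index the column used in the WHERE clause
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# Select only needed columns, restrict rows with LIMIT, and inspect the plan
plan = cur.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id, total FROM orders WHERE customer_id = 42 LIMIT 10"
).fetchall()
print(plan)  # the plan detail should mention idx_orders_customer
```

If the index is being used, the plan's detail text reports a SEARCH using the index rather than a full-table SCAN.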
Missing values can be handled by dropping incomplete rows with dropna() or by imputing them with fillna(). For numerical data, a common choice is to fill with the column mean; for categorical data, you can replace missing values with the most frequent category. Example:
df['column'] = df['column'].fillna(df['column'].mean())
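Putting both cases together, a minimal sketch (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, None, 30.0, 20.0],
    "category": ["A", "B", None, "B"],
})

# Numerical column: fill missing values with the mean
df["price"] = df["price"].fillna(df["price"].mean())

# Categorical column: fill with the most frequent value (the mode)
df["category"] = df["category"].fillna(df["category"].mode()[0])

print(df)  # no missing values remain
```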
A LEFT JOIN returns every row from the left table along with the matching rows from the right table. Rows from the left table that have no match are still returned, and NULL values are used to fill in the columns from the right table. Example:
SELECT a.*, b.*
FROM TableA a
LEFT JOIN TableB b ON a.id = b.id;
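The same join can be reproduced in pandas with merge(how='left'), where unmatched rows get NaN (pandas' missing-value marker) instead of NULL. The frames below are illustrative:

```python
import pandas as pd

table_a = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cid"]})
table_b = pd.DataFrame({"id": [1, 3], "city": ["Paris", "Lyon"]})

# Equivalent of: SELECT a.*, b.* FROM TableA a LEFT JOIN TableB b ON a.id = b.id
result = table_a.merge(table_b, on="id", how="left")
print(result)  # id 2 has no match in table_b, so its city is NaN
```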
In pandas, use drop_duplicates() to remove duplicate rows:
df = df.drop_duplicates()
Alternatively, you can group records using groupby() and apply aggregate functions to consolidate duplicate entries based on specific columns.
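The groupby() approach might look like this (the key column and aggregation chosen here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 10.0, 5.0, 7.0, 5.0],
})

# Exact duplicate rows can simply be dropped
deduped = df.drop_duplicates()

# Or consolidate duplicates per key with an aggregate function
consolidated = df.groupby("order_id", as_index=False)["amount"].sum()
print(consolidated)
```

Which approach fits depends on whether duplicates are exact copies (drop them) or partial repeats that carry information worth combining (aggregate them).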
Libraries such as matplotlib, seaborn, or plotly can handle complex data visualizations; for interactive dashboards, use Power BI or Tableau. pandas also provides basic built-in plotting for quick exploration, and OpenCV can be used for image processing.

This provides an overview of technical concepts commonly tested in data analyst interviews at Capgemini. Each answer incorporates best practices and real-world applications.