Optimizing slow queries can involve several techniques:
- Add indexes on columns used in WHERE, JOIN, and GROUP BY clauses.
- Avoid SELECT *, which retrieves all columns and can slow down performance; select only the columns you need.
- Rewrite nested subqueries as JOINs or Common Table Expressions (CTEs).
- Use the LIMIT clause to restrict the number of records retrieved.
- Run EXPLAIN to analyze how the database executes the query and identify potential bottlenecks.

For ETL pipelines in Python, pandas and SQLAlchemy can be used for extraction, pandas for transformation, and pyodbc or SQLAlchemy for loading data into SQL databases. Example:
import pandas as pd
from sqlalchemy import create_engine
# Extract
data = pd.read_csv('data.csv')
# Transform
data['price'] = data['price'].fillna(data['price'].mean())
# Load
engine = create_engine('mysql://user:password@host/dbname')
data.to_sql('processed_data', engine, if_exists='replace', index=False)  # index=False avoids writing the DataFrame index as an extra column
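The EXPLAIN technique from the query-optimization tips above can be sketched with SQLite's built-in EXPLAIN QUERY PLAN; this is a minimal, self-contained illustration, and the table and index names here are hypothetical:

```python
import sqlite3

# In-memory database with an illustrative table and index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# EXPLAIN QUERY PLAN reports whether the index is used for the WHERE clause.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE customer_id = ?", (42,)
).fetchall()
for row in plan:
    print(row[-1])  # the plan detail should mention idx_orders_customer
conn.close()
```

The same idea applies to MySQL or PostgreSQL, where plain EXPLAIN (or EXPLAIN ANALYZE) shows whether a query performs an index lookup or a full table scan.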
Missing values can be removed with dropna() or imputed with fillna(). For numeric data, a common approach is to fill with the mean:

df['column'] = df['column'].fillna(df['column'].mean())

For categorical data, you can replace missing values with the most frequent category.
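Filling categorical gaps with the most frequent value can be sketched as follows; the column name and values are hypothetical:

```python
import pandas as pd

# Hypothetical categorical column with one missing value.
df = pd.DataFrame({"segment": ["retail", "wholesale", "retail", None]})

# mode() returns the most frequent value(s); take the first as the fill value.
most_frequent = df["segment"].mode()[0]
df["segment"] = df["segment"].fillna(most_frequent)
print(df["segment"].tolist())  # ['retail', 'wholesale', 'retail', 'retail']
```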
A LEFT JOIN returns all records from the left table along with the matching records from the right table; for rows with no match, the right table's columns are filled with NULL values. Example:

SELECT a.*, b.*
FROM TableA a
LEFT JOIN TableB b ON a.id = b.id;
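The same behavior can be reproduced in pandas with merge(how='left'); the small tables below are illustrative stand-ins for TableA and TableB, with NaN playing the role of SQL NULL:

```python
import pandas as pd

# Illustrative tables mirroring TableA and TableB above.
table_a = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
table_b = pd.DataFrame({"id": [1, 2], "score": [10, 20]})

# how='left' keeps every row of table_a; unmatched rows get NaN
# for table_b's columns.
joined = table_a.merge(table_b, on="id", how="left")
print(joined)
```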
In pandas, use drop_duplicates() to remove duplicate rows:

df = df.drop_duplicates()
Alternatively, you can group records using groupby() and apply aggregate functions to consolidate duplicate entries based on specific columns.
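A minimal sketch of that groupby() approach, using hypothetical customer records where duplicates are consolidated by summing:

```python
import pandas as pd

# Hypothetical records where the same customer appears more than once.
df = pd.DataFrame({
    "customer": ["alice", "alice", "bob"],
    "amount": [100, 50, 75],
})

# Consolidate duplicate customers by summing their amounts.
consolidated = df.groupby("customer", as_index=False)["amount"].sum()
print(consolidated)
```

Any aggregate can be swapped in for sum() (for example first(), max(), or mean()) depending on how duplicate entries should be resolved.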
Libraries such as matplotlib, seaborn, and plotly can handle complex data visualizations, while Power BI and Tableau are well suited to interactive dashboards. For tabular data manipulation, pandas remains the standard tool, and OpenCV can be used for image processing.

This provides an overview of technical concepts commonly tested in data analyst interviews at Capgemini. Each answer incorporates best practices and real-world applications.