
Data Warehouse Interview Questions and Answers for 2025

 

1. What is a Data Warehouse?

  • A Data Warehouse is a centralized repository of integrated data from multiple sources, designed for reporting, analysis, and decision-making.
  • It stores historical data optimized for analytical querying.
  • Difference from OLTP (Transactional Databases):
    • OLTP: Used for daily transactional processes (inserts, updates, deletes).
    • Data Warehouse: Optimized for complex queries and data analysis, with read-heavy operations.
 

2. What are the key components of a Data Warehouse architecture?

  • Data Sources: External systems providing raw data (databases, flat files, APIs).
  • ETL (Extract, Transform, Load): Process of extracting data, transforming it into a usable format, and loading it into the warehouse.
  • Data Staging Area: Temporary area for data cleansing and transformation before loading into the Data Warehouse.
  • Data Warehouse: Central repository storing integrated data.
  • OLAP Cubes: Pre-aggregated data structures for fast querying.
  • Front-End Tools: BI and reporting tools used to present data to users.
 

3. What is the difference between OLAP and OLTP?

  • OLAP (Online Analytical Processing):
    • Used for complex queries and multidimensional analysis.
    • Optimized for read-heavy operations (e.g., sales reporting).
  • OLTP (Online Transaction Processing):
    • Used for day-to-day operations and transactional data.
    • Optimized for write-heavy operations (e.g., order processing, banking transactions).
 

4. What is ETL? Explain its role in a Data Warehouse.

  • ETL (Extract, Transform, Load) is the process of:
    • Extracting data from source systems.
    • Transforming the data (cleaning, filtering, aggregating) into a usable format.
    • Loading the transformed data into the Data Warehouse for analysis.
  • Role in Data Warehouse: Ensures that data is integrated, accurate, and ready for business analysis.
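
Example: a minimal ETL sketch in Python using only the standard library (the source file, table, and column names below are hypothetical):

  import csv
  import sqlite3

  # Extract: read raw rows exported from a source system (hypothetical CSV).
  with open("orders_raw.csv", newline="") as f:
      raw_rows = list(csv.DictReader(f))

  # Transform: fix types, standardize codes, and drop unusable rows.
  cleaned = [
      (row["order_id"], row["customer_id"].strip().upper(), float(row["amount"]))
      for row in raw_rows
      if row["amount"]
  ]

  # Load: append the transformed rows into the warehouse fact table.
  conn = sqlite3.connect("warehouse.db")
  conn.execute("CREATE TABLE IF NOT EXISTS fact_orders "
               "(order_id TEXT, customer_id TEXT, amount REAL)")
  conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleaned)
  conn.commit()
  conn.close()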
 

5. What is a Star Schema?

  • A Star Schema is a Data Warehouse schema where a central Fact Table is surrounded by Dimension Tables.
  • The Fact Table contains measurable data (e.g., sales revenue).
  • Dimension Tables contain descriptive attributes (e.g., product, time, location).
  • It’s called a “star” because the diagram resembles a star, with the Fact Table at the center and the Dimension Tables radiating outward from it, as sketched below.
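
Example: a minimal star schema written as SQL DDL and run through Python’s sqlite3 module (table and column names are hypothetical):

  import sqlite3

  conn = sqlite3.connect(":memory:")

  # Dimension tables hold descriptive attributes.
  conn.execute("""CREATE TABLE dim_product (
      product_key INTEGER PRIMARY KEY,
      product_name TEXT,
      category TEXT)""")
  conn.execute("""CREATE TABLE dim_date (
      date_key INTEGER PRIMARY KEY,
      full_date TEXT,
      month TEXT,
      year INTEGER)""")

  # The fact table holds the measures plus a foreign key to each dimension.
  conn.execute("""CREATE TABLE fact_sales (
      product_key INTEGER REFERENCES dim_product(product_key),
      date_key INTEGER REFERENCES dim_date(date_key),
      units_sold INTEGER,
      revenue REAL)""")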
 

6. What is a Snowflake Schema?

  • A Snowflake Schema is similar to the Star Schema, but the Dimension Tables are normalized into multiple related tables.
  • For example, a Product Dimension could be split into Product Category, Product Sub-Category, and Product tables.
  • Snowflake schemas can reduce redundancy but may lead to slower queries due to additional joins.
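
Example: continuing the star-schema sketch above, a snowflaked Product dimension is normalized into separate category, sub-category, and product tables (hypothetical names):

  import sqlite3

  conn = sqlite3.connect(":memory:")

  # In a snowflake schema the dimension itself is normalized into a hierarchy.
  conn.execute("""CREATE TABLE dim_category (
      category_key INTEGER PRIMARY KEY,
      category_name TEXT)""")
  conn.execute("""CREATE TABLE dim_subcategory (
      subcategory_key INTEGER PRIMARY KEY,
      subcategory_name TEXT,
      category_key INTEGER REFERENCES dim_category(category_key))""")
  conn.execute("""CREATE TABLE dim_product (
      product_key INTEGER PRIMARY KEY,
      product_name TEXT,
      subcategory_key INTEGER REFERENCES dim_subcategory(subcategory_key))""")

A query that needs the category name now joins three tables instead of one, which is the extra-join cost mentioned above.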
 

7. What is a Fact Table?

  • A Fact Table is a central table in a Data Warehouse that stores quantitative data (measures or metrics) used for analysis.
  • It typically includes:
    • Foreign Keys: References to Dimension Tables.
    • Measures: Numeric data (e.g., sales, revenue).
 

8. What is a Dimension Table?

  • A Dimension Table contains descriptive attributes that provide context to the data in the Fact Table.
  • It often includes:
    • Attributes: Information about the entity (e.g., product name, customer demographics).
    • Primary Key: A unique identifier for each record.
 

9. What is a surrogate key, and why is it used in Data Warehousing?

  • A Surrogate Key is a system-generated unique identifier used in place of a natural key (e.g., customer ID).
  • It helps to handle changes in natural keys and ensures data consistency in the Data Warehouse.
  • Surrogate keys simplify relationships between tables and avoid issues with natural key changes.
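
Example: a surrogate key generated as an auto-incrementing integer during the load, kept separate from the natural key (sqlite3 sketch with hypothetical names):

  import sqlite3

  conn = sqlite3.connect(":memory:")

  # customer_sk is the surrogate key; customer_id is the natural (business) key.
  conn.execute("""CREATE TABLE dim_customer (
      customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,
      customer_id TEXT,
      customer_name TEXT)""")
  conn.executemany(
      "INSERT INTO dim_customer (customer_id, customer_name) VALUES (?, ?)",
      [("C-1001", "Alice"), ("C-1002", "Bob")])

  # Fact rows reference customer_sk, so they are unaffected if the business
  # ever renumbers or reformats its customer IDs.
  print(conn.execute("SELECT customer_sk, customer_id FROM dim_customer").fetchall())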
 

10. What is a Data Mart?

  • A Data Mart is a subset of a Data Warehouse, typically focused on a specific business area (e.g., sales, marketing).
  • It is tailored to meet the needs of a particular department or team.
  • Difference from Data Warehouse:
    • Data Warehouse: Integrates data from the entire organization.
    • Data Mart: Focuses on specific business areas.
 

11. What is normalization? How is it different from denormalization?

  • Normalization: The process of organizing data to reduce redundancy, using multiple related tables (based on normalization rules like 1NF, 2NF, 3NF).
  • Denormalization: The process of combining tables to reduce joins and improve query performance, often used in Data Warehousing to optimize for read-heavy operations.
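
Example: denormalization often amounts to pre-joining dimension attributes onto the fact rows so read queries avoid the join (a pandas sketch with hypothetical columns; assumes pandas is installed):

  import pandas as pd

  fact_sales = pd.DataFrame({"product_key": [1, 2, 1], "revenue": [100.0, 50.0, 75.0]})
  dim_product = pd.DataFrame({"product_key": [1, 2],
                              "product_name": ["Widget", "Gadget"],
                              "category": ["Tools", "Toys"]})

  # Denormalized (wide) table: dimension attributes are copied onto every fact row,
  # trading extra storage and redundancy for fewer joins at query time.
  denormalized = fact_sales.merge(dim_product, on="product_key", how="left")
  print(denormalized)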
 

12. What are the various types of Data Warehouse architectures?

  • Single-Tier Architecture: Data is stored in one layer. Rarely used in modern systems due to scalability issues.
  • Two-Tier Architecture: Data is stored in two layers, typically a staging area and the Data Warehouse.
  • Three-Tier Architecture: Most common, consisting of:
    1. Staging Layer: Temporary storage for raw data.
    2. Data Warehouse Layer: Centralized repository for integrated data.
    3. Presentation Layer: BI tools for reporting and analysis.
 

13. What is Slowly Changing Dimension (SCD)?

  • Slowly Changing Dimensions (SCDs) define how changes to dimension data are captured and stored over time.
  • Types of SCD:
    • SCD Type 1: Overwrites old data with new data (no history).
    • SCD Type 2: Adds new rows to keep track of historical data (e.g., changing customer address).
    • SCD Type 3: Tracks limited history, e.g., storing previous and current values.
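
Example: a simplified SCD Type 2 update in pandas (hypothetical columns). When a customer’s address changes, the current row is closed and a new current row is appended with effective dates:

  import pandas as pd

  dim_customer = pd.DataFrame([
      {"customer_sk": 1, "customer_id": "C-1001", "address": "12 Old Street",
       "valid_from": "2023-01-01", "valid_to": None, "is_current": True},
  ])

  def apply_scd2(dim, customer_id, new_address, change_date):
      """Close the customer's current row and append a new current row."""
      current = (dim["customer_id"] == customer_id) & (dim["is_current"])
      dim.loc[current, ["valid_to", "is_current"]] = [change_date, False]
      new_row = {"customer_sk": dim["customer_sk"].max() + 1,
                 "customer_id": customer_id,
                 "address": new_address,
                 "valid_from": change_date,
                 "valid_to": None,
                 "is_current": True}
      return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

  dim_customer = apply_scd2(dim_customer, "C-1001", "9 New Avenue", "2024-06-01")
  print(dim_customer)  # two rows: the closed historical row and the new current row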
 

14. What is a Factless Fact Table?

  • A Factless Fact Table is a fact table that does not contain any numeric measures. Instead, it tracks the occurrence of events.
  • Example: Tracking student attendance in classes (no numeric measure, only a record that the event occurred).
 

15. Explain the concept of the Conformed Dimension.

  • A Conformed Dimension is a dimension table that is consistent across multiple Data Marts or Data Warehouses.
  • Example: A Time Dimension or Customer Dimension used in both sales and marketing Data Marts.
 

16. What is a Junk Dimension?

  • A Junk Dimension is a dimension table that groups unrelated attributes (e.g., flags or status indicators) into a single table to avoid creating too many dimension tables.
  • Example: A Junk Dimension could contain flags like “email opt-in” or “promotion type”.
 

17. Explain some commonly used ETL tools.

  • Informatica PowerCenter: A widely used ETL tool for extracting, transforming, and loading data.
  • Microsoft SQL Server Integration Services (SSIS): A platform for building ETL processes in SQL Server environments.
  • Talend: A data integration and ETL platform available in both open-source and commercial editions.
  • Apache NiFi: A data integration tool for automating data flow between systems.
  • DataStage: IBM’s ETL tool used to integrate and transform data.
 

18. What is the difference between a Data Warehouse and a Data Lake?

  • A Data Warehouse is structured and optimized for fast querying, with data cleaned, processed, and organized into tables with defined relationships.
  • A Data Lake stores raw, unstructured, or semi-structured data, often in its native format. It is designed to handle large volumes of diverse data types (text, images, logs, etc.) and allows more flexible, exploratory analysis.
 

19. What is an OLAP Cube?

  • An OLAP (Online Analytical Processing) Cube is a multi-dimensional data structure used to store and analyze data in a Data Warehouse.
  • It allows for fast querying and reporting by pre-aggregating data along multiple dimensions (e.g., time, product, region).
  • OLAP cubes are used in multidimensional analysis, enabling users to perform operations like drill-down, roll-up, and slicing and dicing of data.
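
Example: the roll-up, slice, and pivot ideas can be sketched with pandas on a tiny dataset (hypothetical data; a real OLAP engine pre-aggregates these results inside the cube):

  import pandas as pd

  sales = pd.DataFrame({
      "year": [2024, 2024, 2024, 2025],
      "region": ["East", "West", "East", "West"],
      "product": ["A", "A", "B", "B"],
      "revenue": [100, 150, 80, 120],
  })

  # Cube-style aggregation along two dimensions (year x region).
  print(sales.pivot_table(index="year", columns="region", values="revenue", aggfunc="sum"))

  # Roll-up: aggregate the region dimension away, leaving yearly totals.
  print(sales.groupby("year")["revenue"].sum())

  # Slice: fix one dimension value (region = East), then analyse the rest.
  print(sales[sales["region"] == "East"].groupby("product")["revenue"].sum())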
 

20. What is Data Granularity?

  • Data Granularity refers to the level of detail at which data is stored in the Data Warehouse.
  • Higher granularity means more detailed data (e.g., transactions by minute), while lower granularity means aggregated data (e.g., monthly totals).
  • Choosing the right granularity is important for balancing performance and storage requirements.
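
Example: the same facts kept at transaction grain versus rolled up to monthly grain (a pandas sketch with hypothetical data):

  import pandas as pd

  # Fine grain: one row per transaction, with a full timestamp.
  tx = pd.DataFrame({
      "ts": pd.to_datetime(["2025-01-03 10:15", "2025-01-20 14:02", "2025-02-05 09:30"]),
      "amount": [120.0, 80.0, 200.0],
  })

  # Coarser grain: monthly totals - less detail and storage, faster summary queries.
  print(tx.set_index("ts")["amount"].resample("MS").sum())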
 

21. What is Data Partitioning?

  • Data Partitioning is the process of dividing a large database table into smaller, more manageable pieces, called partitions.
  • Each partition can be stored and processed separately, improving query performance and manageability.
  • Partitioning can be range-based (e.g., by date), list-based (e.g., by region), or hash-based, as sketched below.
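
Example: file-based warehouses commonly partition by a date-derived column; with pandas and pyarrow installed, partition_cols writes one directory per value (hypothetical columns):

  import pandas as pd

  sales = pd.DataFrame({
      "sale_date": ["2025-01-10", "2025-01-11", "2025-02-01"],
      "region": ["East", "West", "East"],
      "revenue": [100.0, 150.0, 80.0],
  })
  sales["sale_month"] = pd.to_datetime(sales["sale_date"]).dt.strftime("%Y-%m")

  # Each month lands in its own directory (sale_month=2025-01/, sale_month=2025-02/),
  # so queries filtering on sale_month read only the relevant partitions.
  sales.to_parquet("sales_partitioned", partition_cols=["sale_month"])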
 

22. What are the advantages and disadvantages of Data Warehousing?

Advantages:

  • Centralized repository for historical data and business intelligence.
  • Supports fast query performance for decision-making.
  • Data consistency and integrity across the organization.

Disadvantages:

  • High initial setup cost and resource-intensive.
  • Requires ongoing maintenance for data quality and ETL processes.
  • Can become complex to manage with a large volume of data.
 

23. What is the purpose of indexing in Data Warehousing?

  • Indexing in Data Warehousing improves query performance by providing quick access to data rows in large tables.
  • It reduces the time required for data retrieval by creating pointers to the data in the table, based on certain columns (e.g., product ID, customer name).
  • However, indexes come with additional storage costs and may slow down data load operations.
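
Example: creating and checking an index through Python’s sqlite3 module (hypothetical table); the index speeds up filtered reads at the cost of extra storage and slower loads:

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE fact_sales (product_id INTEGER, sale_date TEXT, revenue REAL)")
  conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                   [(i % 100, "2025-01-01", float(i)) for i in range(10000)])

  # Without the index this filter scans the whole table; with it, only matching rows are read.
  conn.execute("CREATE INDEX idx_fact_sales_product ON fact_sales(product_id)")
  plan = conn.execute(
      "EXPLAIN QUERY PLAN SELECT SUM(revenue) FROM fact_sales WHERE product_id = 42")
  print(plan.fetchall())  # shows the query using idx_fact_sales_product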
 

24. What is the role of metadata in Data Warehousing?

  • Metadata is data that describes other data in a Data Warehouse.
  • It provides information about the structure, sources, and transformations of the data stored in the Data Warehouse.
  • Examples of metadata include table names, column descriptions, data types, and ETL transformation rules.
  • Metadata helps with data governance, query optimization, and understanding data lineage.
 

25. What is an ETL Pipeline?

  • An ETL Pipeline is an automated workflow that moves data from source systems to a Data Warehouse by performing extraction, transformation, and loading operations.
  • ETL pipelines ensure that data is processed efficiently and consistently from multiple sources to the target Data Warehouse for analysis.
  • They can be scheduled to run at specified intervals or triggered by certain events.
 

26. What are the differences between a Full Load and Incremental Load in ETL?

  • Full Load: The entire dataset is loaded into the Data Warehouse, typically replacing any existing data.
    • Pros: Simple, ensures complete data accuracy.
    • Cons: Time-consuming, resource-intensive.
  • Incremental Load: Only new or updated data is loaded, typically using timestamps or change data capture techniques.
    • Pros: More efficient, uses fewer resources.
    • Cons: More complex to implement and manage, requires mechanisms to track changes.
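
Example: a simple timestamp-based incremental load in Python with sqlite3 (hypothetical tables; production systems often rely on change data capture instead):

  import sqlite3

  source = sqlite3.connect("source.db")        # assumed to contain an orders table
  warehouse = sqlite3.connect("warehouse.db")
  warehouse.execute("""CREATE TABLE IF NOT EXISTS fact_orders (
      order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)""")

  # High-water mark: the latest timestamp already loaded into the warehouse.
  last_loaded = warehouse.execute(
      "SELECT COALESCE(MAX(updated_at), '1970-01-01') FROM fact_orders").fetchone()[0]

  # Pull only rows changed since the last load, not the full table.
  new_rows = source.execute(
      "SELECT order_id, amount, updated_at FROM orders WHERE updated_at > ?",
      (last_loaded,)).fetchall()

  # Upsert so that updated source rows replace their earlier versions.
  warehouse.executemany(
      "INSERT OR REPLACE INTO fact_orders (order_id, amount, updated_at) VALUES (?, ?, ?)",
      new_rows)
  warehouse.commit()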
 

27. What is the role of a Data Warehouse Architect?

  • A Data Warehouse Architect designs and oversees the structure, architecture, and implementation of the Data Warehouse.
  • They are responsible for:
    • Defining data models (star schema, snowflake schema).
    • Ensuring proper ETL design and data integration.
    • Ensuring data quality, performance, and scalability of the system.
    • Collaborating with stakeholders to define reporting and analytical requirements.
