What is a data warehouse (and a lakehouse)?
A data warehouse is a central database optimized for analysis, where data from different operational systems is consolidated, cleansed and historized. A lakehouse combines the flexibility and low storage cost of a data lake with the structure and performance of a warehouse, providing a shared foundation for reporting, analytics and machine learning.
Also known as: DWH · enterprise data warehouse · EDW · data lakehouse · lakehouse
Where a data warehouse is used
A data warehouse separates the analytical world from operational systems. Instead of running reports directly against ERP, CRM or production databases, the relevant data is extracted regularly, unified and stored in a model optimized for queries. This creates a consistent, historized data basis on which reporting and analysis can run reliably and quickly.
The lakehouse is the more modern expression of this idea. Data first lands cheaply in a data lake (object storage) and is then given transactional safety, schema management and performance through open table formats such as Delta Lake. This allows structured tables for classic reporting and unstructured or semi-structured data for data science to live on the same platform.
A practical example
A typical scenario: order data sits in the ERP, customer data in the CRM, time tracking in Excel and sensor data in a separate database. A data warehouse or lakehouse brings these sources together, unifies keys and terms, and provides a clean layer that Power BI can build on directly.
At smiit, the dy Project AG data platform is a concrete example: for a large construction project worth over 1 billion CHF, data from SQL Server, Excel files and REST APIs was consolidated on Azure Databricks in a lakehouse and refined along a medallion architecture (bronze/silver/gold).
Benefits & typical use cases
A data warehouse or lakehouse pays off as soon as reporting spans several sources, large data volumes are involved, or a reliable history is required.
- A single source of truth for management reporting and controlling across system and department boundaries
- Historisation: metrics can be compared over time even when operational systems only hold the current state
- Fast queries without burdening the operational systems
- One platform that enables both reporting and machine learning on the same refined data (lakehouse)
How it differs from related terms
A data lake stores raw data without a fixed schema; it is cheap but hard to use analytically without refinement. A classic data warehouse is highly structured and optimized for SQL reporting but less flexible for unstructured data. The lakehouse bridges both worlds. It is loaded via ETL or ELT processes, structured through data modeling and approaches such as the medallion architecture. Power BI is the analytical surface that sits on top of the refined layer, not the warehouse itself.
How smiit works with it
smiit designs and builds data warehouses and lakehouses for mid-sized companies, preferably in the Microsoft and Azure ecosystem. From connecting source systems through modeling and refinement to governance, the result is a data platform that makes reporting and analytics robust instead of merely collecting data.
Common mistakes & misconceptions
- A data warehouse is not just a big database; it is optimized for analytical queries and integrates data from many sources into a consistent, historized model.
- Many think a data warehouse is meant for real-time transactions, but it is built for read and analytical workloads, not day-to-day operational processing (OLTP).
- A common mistake is to confuse a data warehouse with a data lake. The lake stores raw data in any format, while the warehouse holds structured, modeled data.
Frequently asked questions
What is the difference between a data warehouse and a data lake?
A data lake stores raw data cheaply and without a fixed schema, whereas a data warehouse contains structured data prepared for analysis. A lakehouse combines both approaches on one platform.
Do we as a mid-sized company even need a data warehouse?
As soon as reporting consolidates several source systems, large data volumes occur, or a reliable history is needed, a central data platform pays off. For very small data volumes, a direct connection may be sufficient at first.
Does a lakehouse only run in the cloud?
In practice a lakehouse is almost always run in the cloud, for example on Azure with Azure Databricks or Microsoft Fabric, because cheap object storage and scalable compute come together there.
What is the difference between ETL and ELT when loading?
With ETL, data is transformed first and then loaded; with ELT, it is loaded first and then transformed within the target platform. Modern lakehouses often use ELT because cheap storage and scalable compute make it feasible to land raw data first and refine it there.
How current is the data in a data warehouse?
It depends on the load interval. Many warehouses are refreshed nightly or several times a day (batch); for near-real-time data, more frequent or streaming loads are possible. The right cadence depends on business needs and cost.
Related terms
Sources & further reading
Want to put this topic to work in your company?
Updated · Back to the glossary