Put simply, data warehousing is the act of extracting transactional data from one or more operational databases (such as CRM and ERP), transforming the data into a more readable format and loading it into an “offline” database.
Why would you do this?
The main reason an organisation would think of creating a data warehouse is for the purpose of analysing and reporting on big data from one or more data sources easily (using visual analytical tools such as Tableau), without affecting performance on live systems.
Businesses with simpler, smaller systems may be able to achieve their business intelligence requirements by using a combination of well written views and stored procedures that combine and flatten data from multiple databases without too much of a noticeable performance hit, however there are always exceptions to this.
Performance can become an issue when database queries are run during peak hour (data) traffic or when databases start to exceed a certain size and flattening data in real time takes so long that accidently refreshing a query or closing it is enough to cause a mouse to fly across the room.
I can do a lot of cool things using combinations of SQL views, stored procedures and functions, however, when performance gets to a point as mentioned above, then setting up a data warehouse is the only way forward.
Benefits of a data warehouse
- Integrate data from multiple sources into a single database and single related table.
- Prevent database performance issues on live production databases caused by attempts to run large, long running analysis queries.
- Improvement of data quality by showing friendly descriptions instead of codes.
- Present information from multiple data sources consistently.
- Restructure the data so that it makes sense to the end users.
- Improve query performance by restructuring and flattening data.
Is the data in a data warehouse real-time?
It can be, however in each case that I’ve worked on a data warehouse the client has never had a need to have up to the minute reporting. The data is typically triggered to go through the ETL (Extract, Transform and Load) process daily, weekly or even monthly.
What triggers the ETL process?
If you have SQL Server 2005 or greater and it is either “Standard” or “Enterprise” edition, then SQL Server Integration Services (SSIS) is used to schedule and migrate the data from each source database. SSIS is a Visual Studio project using an SQL Server Data tools project template. This is how business logic can be built into the ETL.
There are workarounds for clients that are running MySQL, SQL Server Express or users of SaaS systems (such as NetSuite and Xero). In these instances, custom .NET applications can be developed that are configured to run as a Windows Scheduled Task and perform the extraction of data using ODBC or web services.