Data Warehouse Description
Since the early 1990s, data warehouses have been at the forefront of information technology applications as a way for organizations to effectively use digital information for business planning and decision making. As information professionals, we no doubt will encounter the data warehouse phenomenon if we have not already been exposed to it in our work. Hence, an understanding of data warehouse system architecture is or will be important in our roles and responsibilities in information management.
Data warehouses are computer based information systems that are home for "secondhand" data that originated from either another application or from an external system or source. Warehouses optimize database query and reporting tools because of their ability to analyze data, often from disparate databases and in interesting ways. They are a way for managers and decision makers to extract information quickly and easily in order to answer questions about their business. In other words, data warehouses are read-only, integrated databases designed to answer comparative and "what if" questions. Unlike operational databases that are set up to handle transactions and that are kept up to date as of the last transaction, data warehouses are analytical, subject-oriented and are structured to aggregate transactions as a snapshot in time.
The steps in planning a data warehouse are identical to the steps for any other type of computer application. Users must be involved to determine the scope of the warehouse and what business requirements need to be met. After selecting a focus area, for example, analyzing the use of state government records over time, a data warehouse team of business users and information professionals compiles a list of different types of data that should go into the warehouse. (See data acquisition/collection). After business requirements have been gathered and validated, data elements are organized into a conceptual data model. The conceptual model is used as a blueprint to develop a physical database design. As in all systems design projects, there are a number of iterations, prototypes, and technical decisions that need to be made between the steps of systems analysis, design, development, implementation, and support.
The data warehouse team must determine what data should go into the warehouse and where those particular pieces of information can be found. Some of the data will be internal to an organization. In other cases, it can be obtained from another source. Another team of analysts and programmers create extraction programs to collect data from the various databases, files, and legacy systems that have been identified, copying certain data to a staging area outside of the warehouse. At this point, they ensure that the data has no errors (cleansing), and then copy it all into the data warehouse. This source data extraction, selection, and transformation process is unique to data warehousing. Source data analysis and the efficient and accurate movement of source data into the warehouse environment are critical to the success of a data warehouse project.
Good metadata is essential to the effective operation of a data warehouse and it is used in data acquisition/collection, data transformation, and data access. Acquisition metadata maps the translation of information from the operational system to the analytical system. This includes an extract history describing data origins, updates, algorithms used to summarize data, and frequency of extractions from operational systems. Transformation metadata includes a history of data transformations, changes in names, and other physical characteristics. Access metadata provides navigation and graphical user interfaces that allow non-technical business users to interact intuitively with the contents of the warehouse. And on top of these three types of metadata, a warehouse needs basic operational metadata, such as procedures on how a data warehouse is used and accessed, procedures on monitoring the growth of the data warehouse relative to the available storage space, and authorizations on who is responsible for and who has access to the data in the data warehouse and data in the operational system.
Data in a data warehouse should be reasonably current, but not necessarily up to the minute, although developments in the data warehouse industry have made frequent and incremental data dumps more feasible. Data marts are smaller than data warehouses and generally contain information from a single department of a business or organization. The current trend in data warehousing is to develop a data warehouse with several smaller related data marts for specific kinds of queries and reports.
As with any information system, trustworthiness of data is determined by the trustworthiness of the hardware, software, and the procedures that created them. The reliability and authenticity of the data and information extracted from the warehouse will be a function of the reliability and authenticity of the warehouse and the various source systems that it encompasses. In data warehouse environments specifically, there needs to be a means to ensure the integrity of data first by having procedures to control the movement of data to the warehouse from operational systems and second by having controls to protect warehouse data from unauthorized changes. Data warehouse trustworthiness and security are contingent upon acquisition, transformation and access metadata and systems documentation.
May 20, 2002