Big Data
A new buzzword that has been capturing the attention of businesses lately is big data. The term refers to such massively large data sets that conventional database tools do not have the processing power to analyze them. For example, Walmart must process over one million customer transactions every hour. Storing and analyzing that much data is beyond the power of traditional database-management tools. Understanding the best tools and techniques to manage and analyze these large data sets is a problem that governments and businesses alike are trying to solve.
Metadata
The term metadata can be understood as “data about data.” For example, when looking at one of the values of Year of Birth in the Students table, the data itself may be “1992″. The metadata about that value would be the field name Year of Birth, the time it was last updated, and the data type (integer).
Another example of metadata could be for an MP3 music file, like the one shown in the image below; information such as the length of the song, the artist, the album, the file size, and even the album cover art, are classified as metadata. When a database is being designed, a “data dictionary” is created to hold the metadata, defining the fields and structure of the database.
Data Warehouse
As organizations have begun to utilize databases as the centerpiece of their operations, the need to fully understand and leverage the data they are collecting has become more and more apparent. However, directly analyzing the data that is needed for day-to-day operations is not a good idea; we do not want to tax the operations of the company more than we need to. Further, organizations also want to analyze data in a historical sense: How does the data we have today compare with the same set of data this time last month, or last year? From these needs arose the concept of the data warehouse.
The concept of the data warehouse is simple: extract data from one or more of the organization’s databases and load it into the data warehouse (which is itself another database) for storage and analysis.
However, the execution of this concept is not that simple. A data warehouse should be designed so that it meets the following criteria:
• It uses non-operational data. This means that the data warehouse is using a copy of data from the active databases that the company uses in its day-to-day operations, so the data warehouse must pull data from the existing databases on a regular, scheduled basis.
• The data is time-variant. This means that whenever data is loaded into the data warehouse, it receives a time stamp, which allows for comparisons between different time periods.
• The data is standardized. Because the data in a data warehouse usually comes from several different sources, it is possible that the data does not use the same definitions or units.
Benefits of Data Warehouses
Organizations find data warehouses quite beneficial for a number of reasons:
• The process of developing a data warehouse forces an organization to better understand the data that it is currently collecting and, equally important, what data is not being collected.
• A data warehouse provides a centralized view of all data being collected across the enterprise and provides a means for determining data that is inconsistent.
• Once all data is identified as consistent, an organization can generate one version of the truth. This is important when the company wants to report consistent statistics about itself, such as revenue or number of employees.
• By having a data warehouse, snapshots of data can be taken over time. This creates a historical record of data, which allows for an analysis of trends.
• A data warehouse provides tools to combine data, which can provide new information and analysis.