The first lecture this week expanded on the theme established in week 2: the design of star and snowflake schemas using fact and dimension tables. For fact tables, we learned about the different types of facts (additive, semi-additive, and non-additive), and what I found really interesting was the concept of factless fact tables. As described by Datawarehouse Concepts (2012), a factless fact table records an event or condition rather than a numeric measure: it simply ties various dimension tables together, so it contains nothing but the foreign keys of those dimensions. Very interesting stuff!
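To make the idea concrete, here is a minimal sketch in Python (the tables, keys, and column names are my own invention, not from the lecture) of a factless fact table recording class attendance: it holds nothing but foreign keys into the dimension tables, and analysing it amounts to counting rows.

```python
from collections import Counter

# Invented, minimal "dimension tables".
date_dim = {1: "2014-02-03", 2: "2014-02-04"}             # date_key -> calendar date
student_dim = {10: "Mary Jones", 11: "Bob Smith"}         # student_key -> name
course_dim = {100: "Data Warehousing", 101: "Databases"}  # course_key -> title

# The factless fact table: no measures at all, just foreign keys recording
# the event "this student attended this course on this date".
attendance_fact = [
    {"date_key": 1, "student_key": 10, "course_key": 100},
    {"date_key": 1, "student_key": 11, "course_key": 100},
    {"date_key": 2, "student_key": 10, "course_key": 101},
]

# Analysing a factless fact table is typically just counting rows,
# e.g. attendances per course.
per_course = Counter(course_dim[row["course_key"]] for row in attendance_fact)
print(per_course)  # Counter({'Data Warehousing': 2, 'Databases': 1})
```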
With respect to dimensions, we learned about different types of dimensions used in data warehouses:
- degenerate dimensions
- role-playing dimensions
- junk dimensions
- slowly changing dimensions
Degenerate dimensions are interesting because they are dimensions that live in the fact table itself! While this may seem counter-intuitive, it makes sense when you think about it: these dimensions provide information about a particular transaction.
Degenerate dimensions commonly occur when the fact table’s grain is a single transaction (or transaction line). Transaction control header numbers assigned by the operational business process are typically degenerate dimensions, such as order, ticket, credit card transaction, or check numbers. These degenerate dimensions are natural keys of the “parents” of the line items. (Becker, 2003)
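As a rough illustration of that quote (the data and column names below are invented), the sketch shows an order number sitting directly in a line-item-grain fact table as a degenerate dimension: there is no separate order dimension to join to, yet the number still groups line items back into their parent order.

```python
# Invented line-item-grain sales fact rows. The order number is a degenerate
# dimension: it lives in the fact table itself and has no dimension table of
# its own, but it still ties line items back to their "parent" order.
sales_fact = [
    {"order_number": "ORD-1001", "product_key": 1, "date_key": 1, "quantity": 2, "amount": 19.98},
    {"order_number": "ORD-1001", "product_key": 2, "date_key": 1, "quantity": 1, "amount": 5.49},
    {"order_number": "ORD-1002", "product_key": 1, "date_key": 2, "quantity": 3, "amount": 29.97},
]

# Rolling up by the degenerate dimension reconstructs order-level totals.
order_totals = {}
for row in sales_fact:
    order_totals[row["order_number"]] = order_totals.get(row["order_number"], 0.0) + row["amount"]
print(order_totals)  # order-level revenue keyed by order number
```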
For me, one of the really interesting ideas was that of "slowly changing dimensions". I suppose this is because I currently work for a company that produces versioning software, so the idea of tracking how dimensions change over time is inherently interesting. The real challenge here is deciding which dimensions are worth versioning, how much history is worth keeping, and so on. Choosing incorrectly, especially failing to version a dimension, might prove problematic later, when someone decides that they do want a historical view of that dimension. Of course, versioning everything would be ideal, but is it realistic, especially given the storage requirements of versioned dimensions? Margy Ross (2013) outlines various techniques for dealing with different types of slowly changing dimensions, and in a real solution it is likely that a combination of these techniques would be used.
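To make one of those techniques concrete, here is a minimal sketch of a Type 2 change, one of the approaches Ross describes, where a changed attribute expires the current row and adds a new version instead of overwriting it. The column names, dates, and helper function are my own invention.

```python
from datetime import date

# A tiny, hypothetical Type 2 slowly changing dimension: instead of
# overwriting an attribute, we expire the current row and add a new one,
# so history is preserved.
customer_dim = [
    {"customer_sk": 1, "customer_id": "G100", "city": "Auckland",
     "valid_from": date(2012, 1, 1), "valid_to": None, "is_current": True},
]

def apply_type2_change(dim, customer_id, new_city, change_date):
    """Expire the current row for customer_id and append a new version."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = change_date
            row["is_current"] = False
    dim.append({
        "customer_sk": max(r["customer_sk"] for r in dim) + 1,  # new surrogate key
        "customer_id": customer_id,
        "city": new_city,
        "valid_from": change_date,
        "valid_to": None,
        "is_current": True,
    })

apply_type2_change(customer_dim, "G100", "Wellington", date(2014, 2, 9))
# customer_dim now holds both the historical Auckland row and the current
# Wellington row: same business key, different surrogate keys.
```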
In the first lecture, we also learned about surrogate keys, and how they are useful because, unlike natural (business) keys, they carry no "embedded intelligence". An example of embedding intelligence in a key can be seen in the Customer table below: while the CustomerID values are unique (and hence usable as primary keys), they have "G" or "S" embedded in them, which indicates the type of customer: "Gold" or "Silver".
| Customer ID (PK) | First Name | Last Name |
|------------------|------------|-----------|
| G100 | Mary | Jones |
| S100 | Bob | Smith |
| G101 | Yvette | Lancaster |
| S101 | Sonja | Spenser |
| G102 | Matt | Dawson |
| S102 | Larry | Melrose |
By adding another column holding a surrogate key, we decouple the embedded meaning of the business key from the key actually used to identify and join records. For example, introducing a surrogate key into the above table might result in something that looks like this:
| CID (PK) | Customer ID | First Name | Last Name |
|----------|-------------|------------|-----------|
| 0001 | G100 | Mary | Jones |
| 0002 | S100 | Bob | Smith |
| 0003 | G101 | Yvette | Lancaster |
| 0004 | S101 | Sonja | Spenser |
| 0005 | G102 | Matt | Dawson |
| 0006 | S102 | Larry | Melrose |
Other benefits of using surrogate keys outlined in the lecture are that they improve operational efficiency and reduce the impact of changes to the 'real' business key: if Matt Dawson in the example above is demoted to Silver, his "G"-prefixed Customer ID no longer makes sense and would have to change, but the surrogate key 0005, and anything that references it, stays the same.
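A rough sketch of how this decoupling might work during loading (illustrative only; the function and lookup structure are my own, not from the lecture): natural keys are mapped to meaningless, system-generated surrogate keys, so whatever is "embedded" in the business key never matters to the warehouse.

```python
import itertools

# Illustrative only: the warehouse hands out meaningless integer surrogate keys
# and keeps a lookup from the natural (business) key to the surrogate key.
surrogate_counter = itertools.count(1)
key_lookup = {}  # natural key -> surrogate key

def get_surrogate_key(natural_key):
    """Return the existing surrogate key for a natural key, or assign a new one."""
    if natural_key not in key_lookup:
        key_lookup[natural_key] = next(surrogate_counter)
    return key_lookup[natural_key]

print(get_surrogate_key("G100"))  # 1
print(get_surrogate_key("S100"))  # 2
print(get_surrogate_key("G100"))  # 1 again -- stable, and nothing is implied by "G" or "S"

# Fact rows store only the surrogate key, so a change to the business key
# (e.g. Matt Dawson losing his "G" prefix on demotion) is handled inside the
# dimension table and never touches the fact rows.
```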
The second lecture focussed on data quality, specifically on the use of data profiling to help determine data quality. The lecture outlined the data profiling process and the basic steps involved:
- Creating the Profiling Plan -- planning how the data will be analyzed: understanding the nature of the data (tables, columns) and determining how to check for primary keys, foreign keys, and business rule violations.
- Interpreting the Profiling Results -- determining whether the data is of high or low quality and what needs to be done with it (i.e. cleansing).
- Cleansing the Data -- preparing the data for ETL.
Here, I think it is easy to fall into the trap of becoming too reliant on the particular tools being used. It is better to keep the 'big picture' in mind: a combination of tools and techniques will likely be needed to profile the data properly. The process may also be iterative, in that the profiling results may lead one to re-run the profiling software (or use different software) on the datasets to look at the data from another vantage point.
The lecture then went on to discuss some examples of data profiling, demonstrating how one would go about identifying primary keys, spotting strange data that needs to be cleaned, and performing referential integrity and business rule checks. For the initial analysis, tools and automation will likely be needed, as the process can be quite time-consuming and error-prone when done by hand. The lecture also presented the Gartner Magic Quadrant analysis of various data profiling tools; each tool has strengths and weaknesses, so some investigation would be needed to determine which tool(s) would best suit a given analysis.
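To make those checks a little more tangible, here is a small sketch (plain Python, with invented data) of the kinds of things a profiling pass might compute: uniqueness of a candidate primary key, null counts per column, and a simple referential integrity check. A real profiling tool would do far more, but the idea is the same.

```python
# A toy profiling pass over an invented "orders" extract (names are made up).
orders = [
    {"order_id": "ORD-1001", "customer_id": "G100", "amount": 19.98},
    {"order_id": "ORD-1002", "customer_id": "S100", "amount": None},  # missing amount
    {"order_id": "ORD-1002", "customer_id": "X999", "amount": 5.00},  # duplicate id, unknown customer
]
known_customers = {"G100", "S100", "G101"}

# 1. Candidate primary key check: are order_id values unique?
ids = [row["order_id"] for row in orders]
print("order_id unique:", len(ids) == len(set(ids)))  # False -- ORD-1002 appears twice

# 2. Completeness check: count nulls per column.
for col in ("order_id", "customer_id", "amount"):
    nulls = sum(1 for row in orders if row[col] is None)
    print(col, "nulls:", nulls)

# 3. Referential integrity check: do all customer_ids exist in the customer data?
orphans = [row["order_id"] for row in orders if row["customer_id"] not in known_customers]
print("rows violating referential integrity:", orphans)  # ['ORD-1002'] (customer X999)
```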
One of the main takeaways from this week's materials is that the quality of the data is paramount. You need to be able to distinguish good data from bad data (and it's not always as easy as just looking for the one in the turtleneck)...
The primary reason that 40% of business initiatives fail is due to poor quality data. Data inconsistencies, lack of completeness, duplicate records, and incorrect business rules often result in inefficiencies, excessive costs, compliance risks and customer satisfaction issues. Therefore improving the quality of your enterprise data will have a huge impact on your business. (IBM Whitepaper, 2012)
The lecture concluded with a discussion of Master Data Management (MDM), and how data quality analysis fits into this larger process. MDM is defined as "the processes, governance, policies, standards and tools that consistently defines and manages the critical data of an organization to provide a single point of reference" (see http://en.wikipedia.org/wiki/Master_data_management). How an organization approaches MDM can vary greatly, as organizations of different sizes face different challenges (as described by Graham, 2010):
| Organization Size | Central Challenge |
|-------------------|-------------------|
| Small | Small amounts of master data. Data integration is not a top priority. |
| Mid-size | Data integration starts to become difficult for an organization. Data stewards can be clearly defined. |
| Large | Huge amounts of master data and system integration. Mostly homogeneous data silos with relatively consistent attributes. Data stewards may now have a full-time role. |
| Conglomerate | Many disparate businesses that may create many groups of data (i.e., multiple product lines, general ledgers, and so on). |
So one question that arises is how MDM fits into the "big picture" of information management within an organization. In his blog article, Weigel (2013) emphasizes Information Governance as a "discipline that oversees the management of your enterprise’s information". He goes on to describe how business goals and information management initiatives can be aligned under Information Governance:
Master Data Management is a key initiative for the success of overall information governance. Ultimately, the business needs to be able to rely on the data in order to make strategic decisions. If the data cannot be trusted, can the decisions based on that data be?
References
Becker, Bob. "Fact Table Core Concepts." Kimball Group, 3 June 2003. Accessed 9 Feb. 2014. <http://www.kimballgroup.com/2003/06/03/design-tip-46-another-look-at-degenerate-dimensions/>.
Datawarehouse Concepts. "What Is a Factless Fact Table? Where We Use Factless Fact." 4 Aug. 2012. Accessed 9 Feb. 2014. <http://dwhlaureate.blogspot.com/2012/08/factless-fact-table.html>.
"Garbage in, quality out. Now that's different." IBM, Oct. 2012. Accessed 9 Feb. 2014. <http://www-01.ibm.com/software/info/rte/bdig/ii-5-post.html>.
Graham, Tyler. "Organizational Approaches to Master Data Management." Microsoft, 1 Apr. 2010. Accessed 9 Feb. 2014. <http://msdn.microsoft.com/en-us/library/ff626496.aspx>.
Harris, Jim. "Adventures in Data Profiling." OCDQ Blog, 3 Aug. 2009. Accessed 9 Feb. 2014. <http://www.ocdqblog.com/adventures-in-data-profiling/>.
Ross, Margy. "Dimension Table Core Concepts." Kimball Group, 5 Feb. 2013. Accessed 9 Feb. 2014. <http://www.kimballgroup.com/2013/02/05/design-tip-152-slowly-changing-dimension-types-0-4-5-6-7/>.
Weigel, Niels. "Data Profiling and Data Cleansing - Use Cases and Solutions at SAP." SAP Community Network, 12 June 2013. Accessed 9 Feb. 2014. <http://scn.sap.com/community/enterprise-information-management/blog/2013/06/12/data-profiling-and-data-cleansing--use-cases-and-solutions-at-sap>.