Wednesday, March 07, 2007

Notes on Business Intelligence Solution

These are some practical tips taken from this book - Practical Business Intelligence with SQL Server 2005.

Suggested Approach: Build a consistent relational data warehouse with a dimensional schema optimized for queries.

Overall Solution Diagram:

Choose between Star or Snowflake for a Dimension? Answer: It Depends

  • Choose snowflake when the dimension's attributes come from different sources.
  • Snowflake when the dimension has a strong natural hierarchy. It is easier to manage the ETL process in this case.
Use a surrogate key for every dimension table. Surrogate keys are used as primary identifiers for all dimension tables in the data warehouse, and every fact table record that refers to a dimension always uses the surrogate key rather than the business key(primary key). All relationships in the data warehouse use the surrogate key, including the relationships between different dimension tables in a snowflake structure. Because the data warehouse uses surrogate keys and the source systems use business keys, this means that one important step in the ETL process is to translate the business keys in the incoming transaction records into data warehouse surrogate keys before inserting the new fact records.

Making a List of Candidate Attributes and Dimensions When you are reviewing the information you have collected, look for terms that represent different ways of looking at data. A useful rule of thumb is to look for words such as by(as in, "I need to see profitability by product category"). If you keep a list of all these candidate attributes when you find them, you can start to group them into probable dimensions such as Product or Customer.

One thing to be careful of is synonyms: People often have many different ways of naming the same thing, and it is rare that everyone will agree on the definition of every term. Similarly, people in different parts of the business could be using the same term to mean different things. An important job during the modeling process is to identify these synonyms and imprecise names and to drive the business users toward consensus on what terms will mean in the data warehouse. A useful by-product of this process can be a data dictionary that documents these decisions as they are made.

Making a List of Candidate Measures At the same time that you are recording the attributes that you have found, you will be looking for numeric measures. Many of the candidate measures that you find will turn out to be derived from a smaller set of basic measures, but you can keep track of all them because they might turn out to be useful calculations that you can add into the OLAP cube later. The best candidates for measures are additive and atomic. That is, they can be added up across all the dimensions, including time, and they are not composed from other measures.

Grouping the Measures with the Same Grain into Fact Tables

Figuring out how to group measures into fact tables is a much more structured process than grouping related attributes into dimension tables. The key concept that determines what measures end up on a fact table is that every fact table has only one grain. After you have your list of candidate measures, you can set up a spreadsheet as shown in the Table below with the candidate dimensions on the columns and the candidate measures on the rows.

Product Customer Date
Sales Amount SKU Customer Day
Quantity SKU Customer Day
Budget Amount Category N/A Month

For each measure, you need to figure out the grain or level of detail you have available. For example, for a specific sales amount from a sales transaction, you can figure out the customer that it was sold to, the product SKU that they bought, and the day that they made the purchase, so the granularity of the sales amount measure is Product SKU by Customer by Day. For budget amount, the business is only producing monthly budgets for each product category, so the granularity is Product Category by Month.

From the example in Table, we end up with two different fact tables. Because the Sales Amount and Quantity measures both have the same granularity, they will be on the Sales fact table, which will also include Product, Customer, and Date dimension keys. A separate Budget fact table will have Product Category and Date dimension keys and a Budget Amount measure.

Identifying the granularity of every measure sounds simple in principle but often turns out to be difficult in practice. So, how do you know when you have made a mistake? One common sign is when you end up with some records in the fact table with values for one set of measures and nulls for the remainder. Depending on how you load the data, you could also see that a given numeric quantity ends up being repeated on multiple records on a fact table. This usually occurs when you have a measure with a higher granularity (such as Product Category rather than Product SKU) than the fact table.

High Level Architecture

Most important dimensional modeling lesson: A single fact table must never contain measures at different levels of granularity.