Thursday, January 20, 2011

Dimensional Modeling

Q. What is dimensional modeling?
Ø  It is a logical design technique used for building data-warehouses. It uses the concept of facts and dimensions. Dimensional Modeling is intended to support end-user queries in a data-warehouse.
Dimensional modeling visualizes the data in terms of Cubes. For example, if we want to measure Sales by Products, Customers, Location, And Time of a company. We can visualize that the x, y a z-axis of the cube depicts products, customers, location and time, every point in the cube depict the sales. This is a very simple way of representing the business.
The Dimensional modeling is also known as Star Schema. The reason is that in Dimensional modeling we have a large central table with many dimension tables surrounding it.

Q. What are the various available schemas for dimensional modeling?


Ø  Star Schema
Ø  Snowflake Schema
Ø  Multistar Schema
Ø  Aggregate schema

Q. What is a Star Schema?


Ø  The star schema is the simplest data warehouse schema. It is called a star schema because the diagram of a star schema resembles a star, with points radiating from a centre. The centre of the star consists of one or more fact tables and the points of the star are the dimension tables. The main advantage to a star schema is optimized performance. A star schema keeps queries simple and provides fast response time because all the information about each level is stored in one row.

Q. What is a Snowflake Schema?

Ø  The snowflake schema is a complex data warehouse schema. A snowflake schema is a set of tables comprised of a single, central fact table surrounded by normalized dimension hierarchies. Each dimension level is represented in a table. Snowflake schema implements dimensional data structures with fully normalized dimensions.

Q. What is MultiStar Schema?
Ø  MultiStar Schema: It is various star schemas joined to create a Data Warehouse.

Q. What is a measure/fact?

Ø  A performance indicator that is quantifiable and used to determine how well a business is operating. For example, measures can be Revenue, Revenue/Employee, Profit margin % etc.

Ø  In relational modeling, this is also called a “fact”.
The different types of fact are:
Additive: Data that can be aggregated by using the sum function e.g. Sales
Semi additive: Data that cannot be aggregated directly over time. E.g. Inventory, account balances
Non-additive: Data that cannot be aggregated at all e.g. Time
Q. What are the different types of fact tables?
Ø  The different types of fact table can be listed as below:
a)     Cumulative Fact Table: This type of fact table describes what has happened over a period of time. For Example- This fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The above example given for “Additive Facts” is a Cumulative fact table.
b)    Snapshot Fact Table: This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. For Example - The table given above for Semi-additive fact is an example of Snapshot fact table.
Q. What is a Fact-less Fact table?
Ø  A fact-less fact is a fact table in which only primary keys of the dimension tables present as foreign keys but no individual fact columns.
                               
Q.    What is a dimension?

Ø  Dimensions are different perspectives through which a person can analyze the business measures.
Ø  Dimensions contain descriptive data of a business.
Example – Geography, Product, Company, Time

Q.    What are the types of dimensions?
Ø  The different types of dimensions can be listed as below:
a.     Conformed dimension - A dimension which can be shared by multiple fact tables is known as conformed dimension. It has exactly the same meaning and content when being referred from different fact tables.
b.    Junk dimension - A junk dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension. In simple terms, a randomly used dimension is junk dimension.
c.     Degenerated dimension – It is a dimension which has only a single attribute. This dimension is typically represented as a single field in fact table. These dimensions are used when fact tables represent transactional data.

Q.    What are Slowly Changing Dimensions?

Ø  Slowly Changing Dimensions are the dimensions in which attribute values keep on changing.
Example – Product price of Products, Address/Phone No. of Employees/Customers

Q.    What are the different types of SCD’s/Different methods to track changes in SCD?

Ø  The different methods/types of SCD’s can be listed as below:

a)     Type 1: Overwriting old value with new value.
In this type of SCD, the old value will be lost.
Example: The phone number of the customer “C1” is changed from “2341233” to 55210456”.

b)    Type 2: Recording new values as new records.
In this type, the new values will be stored into new records. So, Type 2 SCD records both new as well as old values.
Example: The new Phone number for the customer “C1” is recorded again in a new row.
Issue in Type 2 Tracking: In the above example, The Customer ID column has duplicate values and it loses the primary key properties. Here, Surrogate Key comes into picture. Surrogate Key acts as the primary key for the dimension table.

c)     Type 3: Old values are stored in new columns.
In this method, when the values are updated in dimension tables, a new column is created in which the old value gets stored.

Q.    What is a Surrogate key? Where is it used? Explain with example?

Ø  Surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can be used for the primary key to the table. The
only requirement for a surrogate primary key is that it is unique for each row in the table.
Data warehouses typically use a surrogate (also known as artificial or identity key) key for the dimension tables primary keys. It is useful because:
a)     The natural primary key (i.e. Customer Number in Customer table) can change.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users) but not only can these change; indexing on a numerical value is probably better and you could consider creating a surrogate key called say AIRPORT_ID. This would be internal to the system and as far as the client is concerned you may display only the AIRPORT_NAME.
b)    Another benefit you can get from surrogate keys (SID) is, the Tracking of SCD - Slowly Changing Dimension.
Example: As described above for Type 2 SCD, the phone number of a Customer “C1” changes and it is recorded in a new row; thus making the primary key, “Customer Id” duplicate. Here, a surrogate key would be useful in tracking the change in dimension.
Two new columns, Surrogate Key and Status would help in tracking the change with the status column indicating whether the phone no is current or expired.

 

Q. What are OLAP, MOLAP, ROLAP and HOLAP?

Ø  OLAP:  On-Line Analytical Processing. A category of applications and technologies for collecting, managing, processing and presenting multidimensional data for analysis and management purposes.
Ø  MOLAP: Generally when the OLAP is based on a Multi dimensional server than it is called as MOLAP. An example is Express objects of Oracle, Express Objects has a server, which contains the cubes, this server is then used for reporting, and the cubes contain the data.
·         High performance, multidimensional data storage format.
·         Data is stored on the OLAP server.
·         Gives the best query performance, for small to medium-sized data sets
                                            
Ø  ROLAP: With ROLAP data remains in the original relational tables. A separate set of relational tables is used to store and reference aggregation data. ROLAP is ideal for large databases or legacy data that is infrequently queried.
Example: Business Objects, the relations of the data Warehouse is stored in the Repository, which is actually a set of tables in a RDBMS.
                                              
Ø  HOLAP: It’s the combination of ROLAP and MOLAP.A example is HOLOS.

Q. What is aggregation in a Data Warehouse? What is aggregate navigation?


Ø  Usually in Data Warehouses, the facts store the data by a very low level grain, something like “I want to store the sales of each product by customer by day by store in the fact table”. If this is our grain statement, and suppose we have a base fact of 10 million records, then to calculate the sales of one customer for a period of 3 years would go through many records (actually all the records of that customer for that period). This whole process would be slow. To improve performance of such queries, we design special tables in the Data Warehouse which contain data at higher granularity, in our case we can design a table containing, the monthly sales of each customer by each product. So to calculate the total sales of a Customer for 3 years would become very easy if this table is accessed. Such tables are called aggregates, as data is aggregated.

The challenge comes in for the front-end tools to understand when to look in these tables and when to look in the basic fact table. This is known as aggregate navigation. When an OLAP tool has aggregate navigation feature, it can automatically select the right table depending upon the query fired by the User, without the user knowing anything about the background processes. For example, Business Objects handle aggregate navigation by something known as aggregate aware. This feature allows the Designer to refer to the aggregate table when a query of higher granularity than the base fact is fired.

Q. What are popular OLAP tools?



OLAP Tool
Company
Business objects
Business Objects
Powerplay
Cognos
SAS Software
SAS
Seagate Info
Crystal Decisions
Sagent Data Mart solutions
Sagent Technologies
Oracle Advanced Analytic Services
Oracle
MicroStrategy Intelligence Server
Microstrategy
Microsoft Analysis Services
Microsoft
Intelligent Decision Server
IBM
Hummingbird Pablo
Humming Bird

7 comments:

  1. Started reading your blog today and im loving it!! loved your explanation of the SCDs, surrogate keys and cubes specifically....u've explained stuff so clearly!!! many thanks from readers like me :)))

    ReplyDelete
  2. I thoroughly understood the concepts you explained. This is like a bullet point short notes for me!
    Thanks for all these compositions.

    ReplyDelete
  3. Excellent definitions, very easy to understand

    ReplyDelete
  4. http://chennaitraining.in/android-training-in-chennai/
    http://chennaitraining.in/c-c-plus-plus-training-in-chennai/
    http://chennaitraining.in/ios-training-in-chennai/
    http://chennaitraining.in/iot-training-in-chennai/
    http://chennaitraining.in/full-stack-developer-training-in-chennai/
    http://chennaitraining.in/devops-training-in-chennai/

    ReplyDelete