About six months ago I had a discussion with another guy about what my preferred data warehouse schema was: snowflake or star. Without hesitation I said snowflake. He looked at me with befuddlement and asked why. I told him that OLAP processes dimensions more efficiently against a snowflaked schema instead of a star. We had nearly a twenty minute discussion exactly why Analysis Services likes snowflakes better than stars but I failed to convince him. He firmly believed that the star schema was superior and anything short of me taking his firstborn hostage wouldn’t change his belief in that. Star vs snowflake usually initiates that type of steadfastness.
To back up my believe I put together a test. I created a dimension with three levels with each level having two attributes that were outside of the "Advertiser-Ad Campaign-Banner Ad" hierarchy. A total of nine attributes in the dimension.
1. Advertiser (15k rows)
- PaymentType (2 rows)
- Status (2 rows)
2. Sales Campaign (500k rows)
- Category(16 rows)
- TargetCountry (225 rows)
3. Banner Ad (12 million rows)
- Size (8 rows)
- AdType (4 rows)
An Advertiser has Sales Campaigns and a Sales Campaign has Banner Ads with Banner Ads being the attribute key.
-In star schema land this would all be put into a single table having nine columns across.
-In the snowflake world this ends up in nine different tables. There’d be an Advertiser table with three columns: Advertiser/PaymentType/Status, a PaymentType table with a single column PaymentType, a Status table with a single column Status, ect (generically speaking).
This dimension has more levels and more attributes but I decided to pare it down for simplicity. The test was run on an Intel Xeon 2.8 with 4GB of RAM with SQL Server RDBMS and Analysis Services on the same box. I created two different dimensions: one based off of a star schema and the other off of a snowflake schema (separate DSV’s). Each dimension was processed nine times (three Process Full, six Process Update) and the times averaged.
The star based dimension averaged a total of 8:35 per process vs 6:42 for the snowflaked based dimension. Why the big difference?
Each attribute runs a SELECT DISTINCT against the source dimensional table. Take the AdType attribute under the Banner Ad level. Against a star schema this SELECT DISTINCT query would execute against a table with 12 million rows, however, against a snowflake schema it would execute on a table with only four rows.
For smaller dimensions this doesn’t matter much, however if you have large dimensions and update quite frequently such as we do (hourly) a snowflake schema can make a world of difference.
Snowflakes are harder to read and tougher for the ETL guy to write, however dimensions process much faster against them. Also, if you’re building a dimension using a wizard (shame on you!) the wizard will be able to detect natural hierarchies whereas a star schema won’t.
Of course this test was done in an afternoon and not under the most scrutinizing conditions so I’m curious as to what others have experienced or think about the subject. What are your thoughts?
A Trio of Interesting Snowflakes by Ralph Kimball:
OLAP: How to Index Star/Snowflake Schema Data: