Star vs. Snowflake in OLAP Land

About six months ago I had a discussion with another guy about what my preferred data warehouse schema was: snowflake or star. Without hesitation I said snowflake. He looked at me with befuddlement and asked why. I told him that OLAP processes dimensions more efficiently against a snowflaked schema than against a star. We had nearly a twenty-minute discussion about exactly why Analysis Services likes snowflakes better than stars, but I failed to convince him. He firmly believed that the star schema was superior, and anything short of me taking his firstborn hostage wouldn't change his belief in that. Star vs. snowflake usually provokes that type of steadfastness.
 
To back up my belief I put together a test. I created a dimension with three levels, each level having two attributes outside of the "Advertiser-Sales Campaign-Banner Ad" hierarchy, for a total of nine attributes in the dimension.
 
1. Advertiser (15k rows)
  • PaymentType (2 rows)
  • Status (2 rows)
2. Sales Campaign (500k rows)
  • Category (16 rows)
  • TargetCountry (225 rows)
3. Banner Ad (12 million rows)
  • Size (8 rows)
  • AdType (4 rows)
An Advertiser has Sales Campaigns and a Sales Campaign has Banner Ads with Banner Ads being the attribute key. 
 
- In star schema land this would all be put into a single table nine columns across.
- In the snowflake world this ends up in nine different tables. There'd be an Advertiser table with three columns (Advertiser/PaymentType/Status), a PaymentType table with a single PaymentType column, a Status table with a single Status column, etc. (generically speaking).
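To make the two layouts concrete, here's a rough DDL sketch. All table and column names are my own shorthand for the structures described above, not actual generated schema:

```sql
-- Star: one wide dimension table at the Banner Ad grain (~12 million rows).
CREATE TABLE DimBannerAd_Star (
    BannerAd       varchar(100),  -- the attribute key
    Size           varchar(20),
    AdType         varchar(20),
    SalesCampaign  varchar(100),
    Category       varchar(50),
    TargetCountry  varchar(50),
    Advertiser     varchar(100),
    PaymentType    varchar(20),
    Status         varchar(10)
);

-- Snowflake: nine tables, one per attribute or level.
CREATE TABLE DimPaymentType (PaymentType varchar(20));  -- 2 rows
CREATE TABLE DimStatus      (Status varchar(10));       -- 2 rows
CREATE TABLE DimAdvertiser (                            -- 15k rows
    Advertiser  varchar(100),
    PaymentType varchar(20),
    Status      varchar(10)
);
-- ...and likewise DimCategory, DimTargetCountry, DimSalesCampaign,
-- DimSize, DimAdType, and DimBannerAd linking down the hierarchy.
```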

The real dimension has more levels and more attributes, but I pared it down for simplicity. The test was run on an Intel Xeon 2.8 with 4 GB of RAM, with the SQL Server RDBMS and Analysis Services on the same box. I created two different dimensions: one based off of a star schema and the other off of a snowflake schema (separate DSVs). Each dimension was processed nine times (three Process Full, six Process Update) and the times averaged.
 
The star-based dimension averaged 8:35 per process vs. 6:42 for the snowflake-based dimension. Why the big difference?
 
Each attribute runs a SELECT DISTINCT against its source dimension table. Take the AdType attribute under the Banner Ad level: against a star schema that SELECT DISTINCT executes against a table with 12 million rows; against a snowflake schema it executes on a table with only four rows.
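Reduced to its shape (table names here are hypothetical, and the queries Analysis Services actually generates include aliases and joins), the processing query for AdType differs like this:

```sql
-- Star schema: SELECT DISTINCT scans the 12-million-row dimension table.
SELECT DISTINCT AdType FROM DimBannerAd;

-- Snowflake schema: the same attribute comes from a 4-row lookup table.
SELECT DISTINCT AdType FROM DimAdType;
```

Same result set either way, but the star version pays for a scan (or at best an index read) over the full dimension grain on every process.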
 
For smaller dimensions this doesn't matter much; however, if you have large dimensions that you update quite frequently, as we do (hourly), a snowflake schema can make a world of difference.
 
Snowflakes are harder to read and tougher for the ETL guy to write, but dimensions process much faster against them. Also, if you're building a dimension using a wizard (shame on you!), the wizard can detect natural hierarchies from a snowflake, whereas it can't from a star schema.
 
Of course this test was done in an afternoon and not under the most scrutinizing conditions so I’m curious as to what others have experienced or think about the subject. What are your thoughts?
 
 
  
OLAP: How to Index Star/Snowflake Schema Data:
http://support.microsoft.com/kb/199132
 
 

5 Responses to Star vs. Snowflake in OLAP Land

  1. Kevin says:

    Did you try making indices at all on those columns for the star dimension? If you had done that, then I do not think it would be much different since it wouldn't be scanning 12 million rows anymore. Generally with star schemas, those indices are made because the data is not transactional and those SELECT DISTINCTs work well. Plus, sometimes loading the data using the 'by table' option can be faster too if there are a lot of different columns; however, 12 million rows sounds like an awful lot of permutations for a dimension.

  2. Unknown says:

    I agree with the other comment: if you are concerned about performance, creating indexes can solve your issue. The decision between snowflake and star schema should be related primarily to your specific requirements. In general, star schema is better because it solves a lot of design issues that emerge when you need to change the model or to handle SCD Type II.
    ETL is also easier to build and to maintain with star schemas.
    I prefer to consider snowflake schemas as exceptions, useful only when their use gives advantages that are not achievable by simply using other techniques like indexes.
     
    Marco

  3. Mosha says:

    For the star schema — have you tried setting ProcessingGroup = ByTable? It will do a single SELECT query for all attributes together instead of doing a SELECT DISTINCT for every attribute.

  4. Igor says:

    From SQL Server 2008 Books Online (February 2009):

    "Avoid setting the 'ProcessingGroup' dimension property to 'ByTable'. Best Practices Recommendations: In most cases, the default setting of ByAttribute will result in faster processing performance and better memory utilization. The ByTable setting is designed to boost performance in a limited set of scenarios by reducing the load on the relational database. However, the ByTable setting uses a lot of Analysis Services resources and can result in scalability issues at very large volumes."

  5. Pingback: Csillag vagy hópehely séma? ("Star or snowflake schema?") - Kővári Attila's professional blog - devPortal
