With the good weather subsiding here in Seattle for a few days I’m feeling motivated to post something today.
Recently there was a post on LinkedIn about large dimension updates (millions of members) in SSAS. The release of Analysis Services 2005 greatly improved the ability to process large dimensions compared to AS 2000. Dimension updates that weren’t possible in 2000 are now easily done in 2005.
I’ve dealt with somewhat large dimensions (five to twenty million members) so I thought I’d share some tips that have allowed me to perform daily updates in an acceptable manner.
When processing dimensions, it’s not the processing of the dimension members that takes a lot of time; it’s the rebuilding of the aggregates (assuming ProcessAffectedObjects is set to True).
One of our larger dimensions has ten million members, and when no data shifts, processing the dimension takes 15 minutes. If data does shift, the reprocessing takes three hours. We also have a dimension with only 30 members that takes 15 seconds to reprocess if there is no data movement, and two hours if there is. Data will move around if you have flexible attribute relationships.
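To make the ProcessUpdate/ProcessAffectedObjects relationship concrete, here is a rough sketch of the XMLA command the server receives for an in-place dimension update. The database and dimension IDs are placeholders; the element names (Batch, Process, ProcessUpdate, ProcessAffectedObjects) are the standard SSAS XMLA ones, but treat the exact shape as an illustration rather than a copy-paste script.

```python
def process_update_xmla(database_id, dimension_id, process_affected=True):
    """Build an XMLA Batch containing a ProcessUpdate command for one dimension.

    ProcessAffectedObjects="true" tells the server to rebuild any flexible
    aggregations and indexes invalidated by the dimension update -- this is
    the part that can take hours when data moves around.
    """
    flag = "true" if process_affected else "false"
    return f"""<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine"
       ProcessAffectedObjects="{flag}">
  <Process>
    <Object>
      <DatabaseID>{database_id}</DatabaseID>
      <DimensionID>{dimension_id}</DimensionID>
    </Object>
    <Type>ProcessUpdate</Type>
  </Process>
</Batch>"""
```

You would submit a string like this through AMO’s Server.Execute (or an XMLA connection); the Python here only builds the command text.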
Things to keep in mind:
1. Carefully examine all flexible relationships; don’t just set everything to flexible. This is controlled by the RelationshipType property on an attribute relationship.
2. Process all your dimensions in parallel so the aggs are rebuilt only once. If you process dimensions serially, this may happen: dimension A is processed and the aggs are rebuilt; dimension B is processed and data moves around; dimension C is processed and the aggs need to be rebuilt again. Always try to process dimensions in parallel. ALWAYS.
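XMLA makes the parallel version easy: wrap all the Process commands in a single Parallel element inside one Batch, so the server rebuilds affected aggregations once at the end rather than after each dimension. A hedged sketch (IDs are placeholders, and the namespace/element names are the standard SSAS ones):

```python
def parallel_process_xmla(database_id, dimension_ids):
    """Build one XMLA Batch that ProcessUpdates several dimensions inside
    a <Parallel> block, so flexible aggregations are rebuilt only once."""
    commands = "\n".join(
        f"""    <Process>
      <Object>
        <DatabaseID>{database_id}</DatabaseID>
        <DimensionID>{dim_id}</DimensionID>
      </Object>
      <Type>ProcessUpdate</Type>
    </Process>"""
        for dim_id in dimension_ids
    )
    return f"""<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine"
       ProcessAffectedObjects="true">
  <Parallel>
{commands}
  </Parallel>
</Batch>"""
```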
3. Be creative with your partitioning and aggregation designs. Say you have five years of data and you create a partition for each year. Chances are the most recent year or two will be queried differently than the data from four or five years ago. The partitions for years one and two should use a different aggregation design than years four and five. Rebuilding aggregates can take a while (especially with larger dimensions), and the less you have to rebuild, the faster the dimension update will complete.
For example: I have a cube (partitioned by day) with a rolling one year of data, with the majority of queries only being run against the last 90 days. After 90 days I assign a different aggregation design to any partition older than that and reprocess. The last 90 days use an aggregation design with eight aggs; partitions older than 90 days use a design with only three.
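The rotation logic itself is trivial; the only decision per partition is its age. A minimal sketch of that decision (the function and design names are hypothetical; in practice you’d assign the chosen design ID to the partition via AMO and then reprocess its indexes):

```python
from datetime import date, timedelta

def aggregation_design_for(partition_date, today,
                           hot_design="Agg_Recent_8",
                           cold_design="Agg_Historical_3",
                           hot_days=90):
    """Pick an aggregation design ID based on partition age.

    Partitions no older than hot_days get the heavier design (eight aggs
    in the example above); everything older gets the lighter one (three).
    Design names here are made up for illustration.
    """
    if today - partition_date <= timedelta(days=hot_days):
        return hot_design
    return cold_design
```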
4. Get the fastest box you can. From what I’ve experienced, the real limiting factor in fast dimension updates isn’t memory, it’s CPU.
5. Only process dimensions that need to be processed. It’s easy to be lazy and process all the dimensions in a cube every single time. However, if the data hasn’t changed, why update the dimension? I have a stored procedure that takes the last ETL load time for a table and compares it to the last process time for the corresponding SSAS dimension. If the ETL time is later than the cube process time, the SSAS dimension is processed. This greatly cuts down on processing time.
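The comparison at the heart of that stored procedure can be sketched in a few lines. This is not the actual procedure, just the timestamp logic, with hypothetical dimension names: given the last ETL load time per source table and the last process time per dimension, select only the dimensions whose source data is newer.

```python
from datetime import datetime

def dimensions_to_process(etl_load_times, last_process_times):
    """Return dimensions whose source table was loaded after the dimension
    was last processed. Both arguments map dimension name -> datetime.

    A dimension with no recorded process time is treated as never
    processed, so it is always selected.
    """
    return [
        dim for dim, loaded in etl_load_times.items()
        if loaded > last_process_times.get(dim, datetime.min)
    ]
```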
6. To implement most of these suggestions you’ll need to write AMO code. If you want to process your cubes efficiently, the built-in SSAS tasks within SSIS just don’t cut it. You’ll need custom code.