Analysis Services Distinct Count Optimization
Distinct count (such as unique visitor counts on a Web site) calculations provide valuable information but come with a number of performance challenges. This white paper describes tests that were performed to determine how best to optimize these calculations and includes best practices based on the test results.
While distinct count (such as unique user counts or visitor counts) calculations provide valuable information, these calculations are not easy to solve.
A simple example is a grocery "market basket" scenario where a single visitor can purchase various food items. While many items may have been purchased from your store, knowing whether one or many customers made these purchases has an impact on your marketing campaigns. In a more tech savvy Web analytics scenario, if each visitor has a high number of page views for each visitor on your site, you have a very “sticky” Web site with relatively loyal customers. Financial Web sites are an example of sites with this type of visitor profile where many transactions (research, purchases, sells, options, and so on) are performed by a set number of users. On the other hand, an automotive information Web site may have a relatively low page view/visitor ratio with a high number of visitors. These latter sites may have a lot of “non-sticky” visitors because users are visiting different sites looking for the latest information on their vehicle of choice. Business intelligence in the form of loyalty can be derived from these distinct count calculations where the former site has more loyal customers while the latter has fewer loyal customers. Ultimately, you want to adjust the targeting of your Web advertising campaigns to suit your visitor population. For more information, see Amir Netz’s excellent article Analysis Services: DISTINCT COUNT, Basket Analysis, and Solving the Multiple Selection of Members Problem; while it is specific to Microsoft® SQL Server™ 2000 Analysis Services, it provides useful information on the distinct count and grocery basket analysis problem.
Often distinct counts are calculated by custom applications and pipelines using a variety of parsing techniques to determine the distinct number of items. The distinct count measure was introduced in SQL Server 2000 Analysis Services, an enhancement that was extremely useful for business intelligence analysts. It was originally based on the dimension structure in Analysis Services and provided fast access to distinct count calculations for small-to-medium sized volumes of data. In SQL Server 2005 Analysis Services, there is more flexibility in how you structure data so that you now get even better performance.
While distinct count calculations are much faster in SQL Server 2005 Analysis Services, the purpose of this white paper is to provide a methodology for improving the performance of enterprise-size distinct counts. If you work with a relatively small set of data, you will see performance advantages when using these techniques, but you will not experience the same gain compared to those seen with enterprise-size data. In our case, enterprise size is loosely defined as greater than or equal to 20 million rows a day.
Table of Contents
Benefits of Analysis Services Distinct Count Calculation
Analysis Services Improvements
Distinct Count Optimization Techniques
Create Separate Measure Groups
Create Customized Aggregations
Define a Processing Plan
Create Partitions of Equal Size
Use Partitions Comprised of a Distinct Range of Integers
Distribute the Hash of Your UserIDs
Choose a Partitioning Strategy
Distinct Count Partitioning Strategy Analysis
All Single-User Queries Ranking
Time Set Queries Ranking
Concurrent Queries Ranking
Average of All Rankings
Analysis Services Distinct Count References