Welcome to MSDN Blogs Sign in | Join | Help

Data Mining in the PerformancePoint "MAP" Framework

For the last couple of weeks I’ve had some fun working on a rare (at least for me) data mining engagement. Such engagements are rare for me since the team I’m on focuses on PerformancePoint Server 2007 (PPS), which at this time does incorporate data mining in the PPS Planning application, but it certainly isn’t a big part of the overall “MAP” (Monitor, Analyze, Plan) framework. The data mining functionality of Analysis Services 2005 is still currently viewed as a fringe (or very advanced) component Microsoft BI Stack by many practitioners. That’s a shame since planning involves more than the business-centric forecasting and trending in the current PPS Planning application.

Data Mining offers for those with a strong business and accounting background, but not a strong IT background (such as accountants and especially the MBA types) who hope to consult on PPS implementations an angle by which they can provide high-end value. By “high-end value” I mean the sort of skills in the PPS world that take years to truly master such as the enterprise-class implementations of SQL Server 2005 Analysis Services (SSAS) and Microsoft Office SharePoint Server (MOSS) that will involve issues beyond what rote "best practices" white papers cover. (And I facetiously say, "For which you make the big bucks.") This is as opposed to the relatively easily obtained skill of learning how to build dashboards with the PPS Dashboard Designer or designing Excel BI reports.  Planning is about developing a strategy to resolve a problem.  It is about attempting to see into the future, which is tougher than analyzing the past.

The Monitoring and Analytics (M&A) pieces of the PPS MAP framework have existed for a few years in their former incarnations roughly as the “Business Scorecard Manager” and ProClarity. The Planning piece is brand new. Planning is the toughest (the "P" is MAP) piece to tackle from a software developer's point of view since planning (in the complete sense) is the most complex, variable activity. Therefore, the capabilities of the current PPS Planning application are kind of limited in scope to the semi-defined confines of financial budgeting and financial forecasting. 

Keeping an eye out for something that may be wrong (Monitoring) and then trying to figure out what is wrong (Analyzing) are simpler tasks than trying to figure out what to do about it (Plan). Figuring out what to do about a problem goes well beyond the financed-based capabilities of the PPS Planning application. For example, forecasting the sales of a newly developed product involves many steps and many insightful thought. Forecasting the sales of an existing product is feasible with enough history. But I would think a new product is unique in some way and what makes it new is based on a unique theory. One would start with determining the target customer segment using clustering, products that could induce similar behavior using the association algorithm, then using that information to infer a a sales forecast.

Planning can be for issues within closed or open systems. I bring up this topic because the engagement I was on involved a forecast in a rather open system, so I wish to offer this differentiation. Closed systems are much easier to work with as variables are tightly controlled. Closed systems are finite, therefore mapping relationships of aspects of the closed system is feasible. Examples of closed systems include manufacturing plants and most machinery such as cars. However, closed systems still don’t live in a vacuum, so they are generally encased within systems protecting the consistency of the variables.  Think about all the systems that protect a car's systems from the vagaries of the outside world giving the closed system the illusion that it is truly closed. For example, the radiator keeps the engine at a consistent temperature. 

Open systems include the weather, environmental impact, and worst of all, customer behavior (which is the open system related to this engagement of which I speak). These are systems out in the wild with no systems protecting any mechanisms. Rules revealed through data mining will probably be different under even slightly different circumstances. Relationships would be virtually impossible to map and manage as they would be too numerous and complex. (I often tell this story to illustrate the impossible complexity of the world: How Quickly Things Become Impossibly Complex)

Now think about how complex the forecasting for a new product would be since results are ultimately based on customers actually ending up buying a product. In other words, it's based on customer behavior. How many factors can you think of that affect customer behavior? How many steps are required from the release of an advertisement to the actual purchase?

The big difference a planner will face between open and closed systems is that planning for open systems involve a high degree of "art". By "art" I mean that there is a high degree of versatile human intelligence (as opposed to the deterministic intelligence of most current software) required to decipher a complex system. It's unfair to use the phrase, "more art than science", in the contemptuous manner in which I hear it uttered. One must take care to differentiate art from chaos or superstition.

"Art" is actually the epitome of human intelligence. Art should be thought of as something that deterministic cannot be automated by computer or robot. Putting Jackson Pollack aside, all works of art (in the usual sense of "art") require a great deal of technical skill.

Generally there is less art involved with closed systems. Planning within closed systems can generally follow well-known techniques. There may be things that are hard to see that data mining can reveal. However, once the rules are revealed, they change only when the closed system changes, which is usually in a planned manner.

Planning open systems where there is unimaginable complexity and a lack of control is where versatile intelligence is required. There are three major categories of skill one must develop in order to plan in open systems:

  1. Statistics. Statistics is the foundation of data mining. "Data Mining" are techniques based on statistics.
  2. The use of the data mining tools and techniques. When do you use clustering or decision trees? How can they be used in concert? Two good books are Data Mining with SQL Server 2000 Technical Reference and Data Mining with SQL Server 2005. The first book is still good even though it's using SQL Server 2000 as the analogies are good.
  3. Systems thinking. This is an about face from the compartmentatilized way that we normally attack a problem.

Systems thinking will probably be the toughest since the concepts are sort of fuzzy and not deterministic as we've come to expect from the "scientific" way of thinking. My personal blog site lists my favorite “non-traditional” performance management books that speak to this topic of “data mining is planning” and systems thinking: Non-Traditional PM Books If I had to pick one, I would pick The Fifth Discipline, by Peter M. Senge.

Ultimately, when the install and troubleshooting applications for SSAS and MOSS reach a higher level of maturity (the tasks can be handled in a rote, automated manner - a commoditized, outsourcable skill), data-mining-based planning skill will be what's left that would consist primarily of the versatility of human intelligence and thus reward the practitioner as a high-end skill. A good read on this topic is Super Crunchers.

Posted by EugeneA | 1 Comments

SCL (Soft-Coded Logic) "Where are they Now" Update

It's been over a year and a half since I posted any significant material on SCL. During that period I had actually given up on formally developing it. My "day job" is very much enough to keep me in a constant state of exhaustion. However, that didn't mean I wasn't continuing to develop the ideas. Over the past year and a half or so, I've compiled over 100,000 words of raw notes on thoughts during long walks, about a dozen semi-finished white papers, and a pretty good deal of work on the actual SCL code.

A couple of recent things have encouraged me to toy with breathing a bit of life back into SCL. One is the recent granting of a patent I filed (through Microsoft) way back in 2005 on what is called the "Trade-Off/Semantic Network". Although the TOSN is a different subject, it very much encouraged me to continue on with SCL. I got a very cool plaque from Microsoft. I'll write more about this in the near future as I dust off the UI I had written for it and begin to evangelize its uses.

The other thing is that since the beginning of the year, there has been a significant spike in the number of emails I receive on SCL. The inquiries are from large IT shops, ISVs, and just curious folks. They ask me to clarify whether SCL is a Microsoft product and either way, when will it be released? The answer is "no" and "no current plans", respectively. Some ask me what plans are in the works at Microsoft for similar products. I honestly don't know and probably couldn't and shouldn't say even if I did know. Some ask if there are similar production-worthy products on the market from other vendors. Yes. Some ask if I'm from another planet. Yes.

So it seems that there are definite signs of problems beginning to find this solution. But the rapidly growing BI world keeps me too busy (and that's a good thing) to bring the development of this hobby to full steam. I can maybe squeeze a few hours over the weekends and maybe an hour or two (total) over the weekdays. What I will do is finish the semi-finished white papers and publish them on www.softcodedlogic.com as I finish them. Most of these white papers aren't "how-to" papers with SCL, but more around use-cases and concepts. Additionally, I hope to have a refresh of the SCL SDK around mid summer.

I hope to publish the semi-finished papers at about one every third week or so. They comprise a coherent body of work I'm tentatively referring to as "Complex Systems Intelligence" (not AI), principles for building mainstream applications that execute actions in a more robust manner. These papers address most of the feedback I received early on. Last weekend, I did finish and post the first of those articles titled, Data to Information to Data to Information (DIDI). It's about using SCL to help users read highly complex "cockpit display". 

The next two will be:

  • SCL/RDB, techniques and some code for converting SCL into relational database tables for predicates with a large number of facts.
  • The use of MetaValue Database Columns in notification/alert systems to help cut down on information overload.

Back to a comment I made earlier in passing. It may seem odd to say that I honestly don't know what "SCL-like" efforts are going on at Microsoft or that I don't have much familiarity with products from other vendors that have popped up over the last couple of years. Two years ago when I was making a sincere effort to gain traction for SCL within Microsoft, I was more aware of whatever else was out there to ensure SCL did indeed offer something unique as differentiation is a key factor. With SCL demoted to "hobby" status, there's a huge backlog of upcoming BI-related technologies I must master ahead of SCL competition for which there isn't nearly enough "continuing ed" time.

That sincere effort towards gaining traction for SCL lasted a few months until I couldn't keep up with my full-time-plus "day job" tasks and the full time task of trying to "sell" a grandious product like SCL. As a friend of mine put it, selling a cutting/bleeding edge concept is like climbing a greased pole. You make slow, highly efforted progress, but slide back to the bottom if you slow down at all. The biggest lesson I learned is to have the greatest respect for anyone who builds a business from the ground up.

I certainly didn't give up because I stopped believing in SCL. On the contrary, once the visions of High-Tech Glory were removed from my psyche, SCL became quite a vehicle for pondering "the meaning of life" (I say semi-facetiously and for lack of a better phrase). Hundreds of hours of thinking during hikes on the mechanics of the eternal battle of "one-up-manship" between predator and prey (the philosophical foundation for SCL) lead to some pretty neat insights beyond the realm of IT shops. I'm very grateful to have such a powerful, tangible "model" in which I can ponder such things.

I think .NET 3.x technologies such as LINQ and Windows Worksflow comprise a couple of equivalent bricks in the SCL functionality space. Integrate some of the "Semantic Web" technologies such as OWL and RDF (both XML-based), and the .NET Framework is much closer.

But, I still believe in the foundational concepts upon which I designed SCL. It began with an understanding of what I was trying to accomplish. In a nutshell, I thought there should and could readily be a layer on top of Analysis Services Data Mining as this diagram shows:

 

SSAS Data Mining is very effective for bringing data mining to the masses (I say "for" since the masses do not yet care too much about Data Mining). Data mining is all about creating rules. But what happens when there are too many rules? We then become confused again. A layer sitting on top of these automatically generated rules with the job of inferring heads or tails out of it brings these rules back to a scale which our human brains can easily handle.

With that goal in mind, I borrowed a concept used by the architects of Delphi and Java (and C#). I took an old but well-loved language (Pascal and C++, respectively), stripped out the toughest concepts (such as pointers in C++), and added a few modern features. For SCL, I began with Prolog, one of the two primary AI languages of the 1980s. I stripped out tough concepts such as lists (for now) and added features for our current distributed world (the application world was still monolithic in the 1980s) such as MetaFacts and MetaRules. SCL also lives in a run-time which implements concepts such as "iterative recognition". 

A Prolog "program" is comprised of facts and rules. "Facts" implies data and rules implies business logic. For a Microsoftie such as I, that means SQL Server and the CLR, respectively. Thus, I've chosen to implement SCL as an ADO.NET provider. As an ADO.NET provider, it's easier to intergrate data from SQL Server, Analysis Services (OLAP and Data Mining) as well as living comfortably in a CLR application where it can effectively provide the service it's name suggests, soft-coded logic.

Of course, there are many places in Microsoft products that allow you to enter "rules". Examples are Outlook email rules, PPS business rules, Excel conditional formatting rules, Windows Workflow, BizTalk, etc, etc. One of the sub-themes of the "DIDI" article is that yes, "rules" are something that appear everywhere and there should and can be a single method for creating rules across a very broad spectrum of applications. As I previously mentioned, one of my next white papers is on SCL in notifications and alerts. This would be a nice thing to have in the .NET Framework. 

A friend of mine suggested that using Prolog as a base is bad marketing for SCL. The reasoning is that Prolog and LISP still leave a bad taste in the mouth of those disappointed with the outcome of the fantastic promises made of AI in the 1980s. But I nontheless believe Prolog, like C most elegantly expresses their respective concepts. 

So now that I'm toying with breathing life back into SCL, I suppose I'd better catch up with what else is happening. 

Posted by EugeneA | 0 Comments

Compress Many to Many Relationships (Matrix Solution) C# Application

The SQL Customer Advisory Team released a few items on Codeplex yesterday including a little ditty I wrote:

Compress Many-to-Many C# Utility - This project creates a utility that enables you to easily implement the many-to-many compression technique discussed in the Analysis Services Many-to-Many Dimensions: Query Performance Optimization Techniques best practices white paper. You can download this white paper from the SQL Server Best Practices web site at: http://technet.microsoft.com/en-us/sqlserver/bb671430.aspx

The symptoms for which this application may be useful is where there are many to many relationships in your cube resulting in significant if not downright terrible query performance. Such many to many relationships usually involve a very large intermediate dimension which is most likely a degenerate dimension. Meaning, the intermediate dimension is as large as the fact table. This can result in an intermediate measure group which will probably have more rows than the fact table.

There are a few known limitations that I do intend to address eventually (some are harder than others):

·         Inability to apply this technique to cascading many to many relationships.

·         The underlying data source must be SQL Server.

·         Inability to handle underlying tables from data sources other than the primary data source.

·         Inability to apply necessary schema changes to “intermediate fact tables” that are not real tables; ie views of the relational data source or named queries.

·         The key of the “Many” dimension must be of type int.

Even if this application of no use to you (either because you're not experiencing the problem or the known limitations are a roadblock), there are some crazy examples of AMO coding you may enjoy in this C# application.

 

Simple Analogies that Seemed to Flip the Switch in my SSAS Workshops

There are many mental blocks that attendees of my SQL Server Analysis Services (SSAS) who are new to OLAP concepts face during my workshops. It's usually worse for the advanced relational database folks who must fight their zeal for 3rd Normal Form perfection. Over the years I've tried many different analogies to help the aspiring SSAS developers to overcome these mental blocks. Some worked well for everyone, some succeeded for a few but left the others more confused, and some failed miserably. The analogies range from basic to advanced concepts. (The basic analogies are used in the version of my workshop taking the attendee from a solid Level 100/weak 200 to a solid 200/weak 300, and the advanced analogies are used in the version of my workshop taking the attendee from a solid Level 200/weak 300 to a solid 300 or weak 400 for the exceptional attendees.)

Since the number of SSAS developers is rapidly increasing due to the general growth of BI and the growing success of PerformancePoint Server 2007, I thought I'd create a new category, "Lightbulb BI Thoughts", on this blog site where I'll post these analogies I've used. Many of the posts in this category may be duh to many of my colleagues (so skip anything under this category) or perhaps the metaphors aren't 100%, but it will hopefully help to wire those last  few neuron gaps in the new SSAS developers. Therefore, posts under this category are intended for people in the process of learning OLAP, not as an introduction to OLAP nor to provide tips and tricks for advanced OLAP developers.

I'll begin this category with a graphic I bring up early on in the workshop (at least in the basic version). The mental block it targets is that OLAP data structures are exotic technologies stolen from stranded extra-terrestrial aliens. This mental block usually happens with advanced relational database folks. I can see the gears in their heads constantly trying to figure out how MDX works in terms of SQL.

In a relational database, we find data by sifting through a set of tables, joining them, sifting that result, until we end up with the end result of a table. In OLAP we find data by specifying coordinates much as how we specify the coordinates of a place on a mountain trail with longitude, latitude, and altitude (spatial dimensions). This is an easy way for people to find information. Using a spreadsheet as a two-dimensional example of an OLAP data source with cities on the X axis and dates in the Y axis, we can quickly find the sales amount of a particular city on a particular date by looking for the respective points on the axis of the city and date in question and tracing to the point of intersection.

Arranging data in this multi-dimensional manner makes it easy to find values so we use a language geared towards multi-dimensional access (MDX - Multiple Dimensional eXpressions) to talk to these data sources. If one were simply a consumer of OLAP data (a business analyst), this is all that would need to be said. However, for developers of OLAP cubes (modelers, DBAs), questions around configuration and optimization also arise. That means there are hundreds of questions around what all those dozens of properties in BIDS (the environment where SSAS cubes are developed) mean. Many of those properties make a lot more sense when it is realized that although one talks to the OLAP data source in multi-dimensional terms, the OLAP data source is persisted in a format that is pretty much the same as any relational data store. There are a few sort of exotic structures such as bitmaps and trees, but thought of in relational terms, it's easier to understand the "whys" of those dozens of properties. 

The reason we talk to OLAP data stores in multi-dimensional terms but store the OLAP data in a rather relational manner is due to the "sparsity" of multi-dimensional spaces. Sparsity means that most points in a multi-dimenional space have a value of null. A non-null value occurs at points where something has happened. For example, if there was a sale in Boise for a refrigirator by Brad, the point where Boise, refrigirator and Brad intersect has the value of the sale. For a cube to be "dense", there would need to be a fact for most or all combinations of points. For example, every salesman would need to sell every product every day to every customer in every city in every ... well, you get it. So, if there were 100 salesman, 1000 products, over 1000 days (3 years), to 100,000 customers, and 100 cities, the cube-space would have a quadrillion points, requiring at least four quadrillion bytes to hold a 32-bit int value. One or two dimensions more and the cube space eventually goes into the gazillions.

It's very fortunate that multi-dimensional data sources are sparse. If they weren't, we probably wouldn't have such a thing as OLAP. If I knew nothing about OLAP and someone approached me to develop a multi-dimensional data store, being very honest with myself, I suppose my first thought would be to store the structure as a multi-dimensional array, for example: private int[] cellvalues = new int[1000,1000,1000,100000,100]; If that's how cubes were stored, I certainly wouldn't have been able to create cubes on my old piddly 1 GB RAM, 80 GB hard drive laptop. Conversely, if cubes were dense, that n-dimensional array would then be the best way to store and retrieve OLAP data with little overhead and a direct way to access the cells.

So the multi-dimensional data stores only what exists (not a slot in a multi-dimensional array for every intersection) in structures somewhat similar to relational schemas. The figure below shows how a user requests a portion of the mostly sparse cube-space (a sub cube) in the MDX language and SSAS in turn queries its rather relational data stores in a way akin to SQL. (I stress that by "its rather relational data stores", I'm referring to the fact data, aggregations and variety of indexes SSAS generates when a cube is processed - at least as far as MOLAP is concerned - not the underlying data warehouse data on the far right of the picture.) Understanding this dichotomy of OLAP is one of the foundational steps to OLAP greatness.

Introducing this notion of speaking in multi-dimensional terms to a "virtual" multi-dimensional data source that really stores things in a rather relational format early in the workshop helps get people through the first three days. During Day 1, I take the class through building a very simple cube. What I do differently is that while processing the dimensions and cubes, I go over what the SQL queries to the underlying data source are doing and how that data is stored in the SSAS stores. Of course, this makes going to the point of building a cube with the Cube Design Wizard take the entire day. At that point, we see what a cube looks like and how it's browsed, but we really don't use much MDX. So, Day 1 talks about both the multi-dimensional nature of how we interact with cubes and the rather relational nature of how the OLAP data is stored. Read Chapter 20 - The Physical Data Model - of Microsoft SQL Server 2005 Analysis Services.

Day 2 is about MDX, the language and how to use it for querying, in the MDX Script, KPIs, Actions, and role-based security. We focus solely on the multi-dimensional aspects. In teaching MDX I ensure the attendee is fully cognizant the underlying mission of MDX to explore dimensional space. An empathy for the basic workings of the OLAP engine are stressed; being fully cognizant of what the OLAP engine generally does when recieving the query. At the highest level this means the notion of first determining what the boundaries of the sub-cube we're requesting are (what are the points on each axis), then filling up each point in that cube-space is stressed. Read Chapter 4 - MDX Query Context and Execution - of MDX Solutions.

Day 3 is about partitioning, aggregations and optimization - a complete 180 from Day 2, This means we talk more in terms of relational data stores and not very much in multi-dimensional terms. Measure Group partitions and aggregations are pretty much stored in structures kind of as they would be in a relational database (but more optimized since a relational database like SQL Server needs to be more "general"). So, many of the same relational database concepts apply similarly to optimizing an OLAP cube. For example, fewer joins, narrower tables, etc, are better.

Day 4 and 5 are generally geared towards building a quick proof-of-concept using the customer's data.

Again, this all may be obvious to most reading this post, but I've seen it snap a few synapses in place on more than one occassion.

I'll try to post more of these short insights as often as I can.

Book Review: Super Crunchers, by Ian Ayres

Over the Labor Day weekend, I read the very newly published “Super Crunchers” by Ian Ayres. To me, the theme in a sentence is: numbers don’t lie, but people do (maybe "lie" is too harsh). The author touts the superiority of super crunchers over domain experts. Essentially, he posits that human experts often aren’t as effective at predicting events in their chosen domain as empirical rules gleaned from a massive history of data have proven to be able to do. He suggests that a major if not primary reason statistics can make better decisions than experts is that experts as humans are invested in notions and have allegiances to others, which many times work to subconsciously skew what should otherwise be objective decision-making. This book is about the growing class of analysts skilled in statistics--the “Super Crunchers”--who can take advantage of the relatively recent (and growing) availability of massive amounts of data and the necessary computing power to do something with it. Hence the sub-title of the book: “Why Thinking by Numbers is the New Way to be Smart”.

 

Although an easily digestible and valuable read, this book does concern me in a number of ways. I had this feeling throughout that the author feels super crunching is good, and old-fashioned intuition (human expertise) is bad. The theme running through the author’s many examples is how modern methods of crunching numbers have debunked long running “myths” or outdated methods perpetuated by the so-called experts. 

 

As a BI practitioner, I clearly know it's not yet time to give experts' intuition short shrift. I worry that the author’s slant towards the machine side of the “man versus machine” theme, where man is quickly losing ground to the machines, could exacerbate unwarranted resistance towards data mining by the subjects of the data mining--one large group of its intended beneficiaries being end customers. I see a need for a sincere intent to convey the partnership between man and machine. 

 

Many smart folks who are not business people, such as my retired father-in-law, get a bit bent out of shape at the notion of organizations making heartless decisions based on historic data; whether or not the historic data yields better results. The value of this book is that it effectively articulates to non-BI folks why data warehousing and business intelligence is of great value. For BI practitioners such as myself, I can see borrowing much of his terminology in the field when addressing the non-technical BI audience. The author describes many real successes brought on by crunching huge amounts of data However, many non-BI folks are already expressing their resentment towards data crunching methods, such as with those loyalty cards being issued by so many stores. But these people don’t stop to consider too, that when then asked to recall the last time a product they needed was out of stock at their grocery store shelves, they can’t recall one.

 

As a review of the book in Discover Magazine states, “Although the techniques are well-known, Ayres argues, what’s changing now is that businesses, consumers, and governments are waking up to the power of analyzing enormous tracts of information …” This statement excited me because we in the BI field have been thinking that mainstream understanding was just around the corner for years, and now (hopefully) it’s finally really happening. Reflecting upon just the past three years of my experience delivering OLAP/BI services focused around the Microsoft BI platform, I see the changes:

 

·         Three years ago I would ask my OLAP customers if they would be interested in seeing a demo on data mining. They declined thinking it was too way out of scope for what they were doing. (Keep in mind that I wouldn’t have offered to demo it if I didn’t think it added value to their endeavor.)

·         Two years ago I would offer a data mining demo and the customer would agree to it. They would say it’s interesting, but too far ahead.

·         A year ago after offering and delivering a demo, customers would tell me they plan to implement it.

·         Today, most BI customers ask me to tell them about data mining.

 

If this trend continues, data mining will soon be mainstream, beyond its current use by the largest corporations--the Fortune 1000 sort--who touch all of us already through their extensive applications of data mining. We see the effects with the increasingly superior execution of their services. However, data mining becomes truly mainstream when a wide breadth of corporations incorporates it and the "person on the street" has some level of savvy to protect them from poor data interpretation.

 

There are many examples in the book I believe are unevenly articulated leaving readers with the wrong impression. I’ve made my recommendation to read this book (4 out of 5 stars if I posted this on Amazon), but with the caveat that I think it has a slant towards “machine, good, human intuition, bad”. This review is getting to be much longer than I intended, so I’ll discuss just one which appears towards the end of the book.

 

The theme of superiority of super crunchers versus domain experts seems to take a bit of a somewhat ironic turn in one of the examples. On page 196, Ayres begins a discussion on the effectiveness of Las Vegas sports bookies based on work by Justin Wolfers. In a nutshell, when the point spread was below 12 points, the chance of a team beating the spread seems to fall almost perfectly into the predicted bell curve.

 

However, if the point spread was more than 12, the chance a team would beat the spread fell slightly left of the mean. Justin Wolfers hypothesized that there is point shaving by bribed players. He investigated and found that the point spread was right on track until five minutes before the end of games and felt this could supports the hypothesis. With a large padding with only five minutes to go, a bribed player could shave points without risking a loss for the team.

 

This is a highly premature conclusion. Anyone who has watched team sports knows there are at least three major and legitimate reasons why point production would wane (or the opponent could surge) within the last minutes with a huge lead:

 

1.    A smart coach would not risk having the starters unnecessarily out there risking injury when the game is in the bag.

2.    The bench players can get some much needed field time.

3.    Rubbing your opponent’s face in their huge loss by racking up the points when they are down will make them angry and more determined the next time you meet.

 

The irony is that the super cruncher made a naive or premature hypothesis that would be readily evident to any expert, in this case a sports fan. I'm sure Mr. Wolfers did consider these possibilities as well, but not Mr. Ayres. This further illustrates my concern that this book could stir up unwarranted resistance towards data mining.

 

If you are interested in further exploring how super crunching fits into the whole scheme of things and you are not a BI practitioner, here is a sampling of books I recommend, not for opposing views, but to better place super crunching in its proper context:

 

·         The Social Life of Information – The human species is the ultimate “generalist”. We’re an adaptable species, even though most of our adaptability comes in the form of our vast technology, not primarily in our bodies as with most other creatures. We are also the dominant species on earth. As many other dominant entities, this means our society is mind-bogglingly complex, dynamic, and delicate. Those attributes mean one little change applied too quickly can shatter such a complex machine. Perhaps our innate resistance to change is a mechanism, an immune system, for protecting us from such quick changes. Technologies must be properly socialized.

 

·         Reshaping National Intelligence for an Age of Information, Gregory F. Treverton – This book shows that there is so much data out there that we can practically make a case for anything. Data exists to support any notion and we spin it to suit our needs. Anyone who has watched those shouting matches on Fox News or CNN knows that this is true. Because of this, it does mean there’s even more need for Super Crunching and Super Crunchers. This isn’t an easy read, but it does effectively describe the profoundly critical effects of information overload we must resolve.

 

·         Second Opinions, Jerome Groopman – The theme of this book is the battle between the young doctors armed with the latest and greatest in technique versus the older doctors with years of experience and a few outdated habits that are hard to shake. The young doctors, without a warehouse of stored experience and despite their “best practices” are easily tripped up by “rookie errors”. Older doctors aren’t as easily tripped up, but often fail when they cling to their old ideas that once worked.

 

If six billion people want to live comfortable lives, we need to restructure how systems such as how goods are effectively distributed and how decisions are made are done so in a more scalable manner. That restructuring will come with trade-offs; privacy vs. better, more efficient delivery of good and services. Sometimes each person may draw the short stick, but the overall quality of life increases. Super Crunching is the foundation of this restructuring. I also offer some of my own thoughts in other posts on this site and at my Soft-Coded Logic (SCL) site.

 

Speaking of SCL, here's something only SCL watchers (the two or three out there) may find interesting: In a few months, I plan to post a blog on a topic I call “data mining MetaFacts”, related to my SCL language. Without getting into SCL here, with the data mining MetaFact SCL (Prolog) rules can be added and updated by submitting DMX queries to Analysis Services 2005 data mining stores. For example, a decision tree model would provide the updated rules on an SCL rule defining a "soccer mom". This is completely statistics-based, up-to-date, and robustly integrated with other data and rolled-up facts. I think Ian Ayres will find it very intriguing.

Posted by EugeneA | 1 Comments

Picking Stocks with the Association Algorithm

Overview

 

The other day on a concall, someone said, “On Monday, assuming no one won the lottery, we’ll regroup and discuss …” It was of course in jest, but it was Friday and it got me day dreaming a bit. Well, I’ll never win the lottery. I’m never lucky in gambling. Too bad there’s no such thing as going short on my slot machine performance. But, since I have a very rare weekend off, I thought, “Why not put all of this BI expertise to work and really try to unlock the stock market?”

 

First, it’s not as if everyone else handy with databases and statistics who wouldn’t turn away an opportunity for an early retirement (or at least a sabbatical) isn’t already taking on the task. Or that the thousands of money managers and other pros don't spend tens of millions of dollars on such software or don’t spend every moment not eating or sleeping pondering the problem.

 

However, I've come to learn that pros in any dynamic field (the market, medicine, software, science) don't know everything. This isn't anything really disparaging towards pros in any field. It takes decades of incredible dedication to become really great in a field that doesn't change much such as mastering a musical instrument or sport. How much harder is it if your field keeps getting pulled up from under you (such as in the field of software development)? My point is that with some effort, a beginner in a dynamic field has a realistic chance of finding something great with their "new eyes".

 

It’s not as if I never gave it an aggressive shot before either. Back in the late 80s, I worked for someone who began playing the market and gave me free reign for a couple of months to develop tools to better play the very hot options market; mostly the OEX. I quickly realized math and logic are only a small part of the problem. It goes way beyond news events too.

 

Mostly, history just doesn’t mean too much as far as picking stocks is concerned. Meaning, data mining (in the Analysis Services sense) is limited in that world events complicate matters so much that what has happened in the past to move a stock won’t have sufficient probability to move the stock the same way today. More accurately, at least not enough of a probability for me to bet real money on it, even if the algorithm says the probability is 1 (complete certainty). Well gee, so far in my life the historical data says the probability that I will cross the street without being killed by a car is also 1.

 

Also, there are “games” played amongst the big boys and games that are big-boys versus the amateurs. The former game can be thought of as alpha males killing other alpha males and claiming their goodies as their own. The latter can be thought of as an NFL team playing a high school team. David did beat Goliath, but will David win a two-out-of-three falls match?

 

The main idea is that predator and prey constantly adapt to outwit each other which mean the rules are always changing. Thus, data mining algorithms, which help to expose rules of the game from large sets of historic results, will only be effective in such a dynamic environment with a whole lot more integration of data, mostly in real-time, from a wide variety of sources (a whole lot more work); which is pretty much how it's done at hedge funds.

 

So, after a few hours of playing around this weekend, I came to the conclusion that I’d still need way more than a few off weekends to accomplish the goal. (If I did find something that gave me an edge, I certainly wouldn’t be posting this.) However, some of my efforts provide a nice example for using the association rules data mining algorithm of Analysis Services 2005. Perhaps some of the many folks who have attended my BI workshops or who I’ve mentored over the past two years can run with this as an exercise. Of course, I’ll expect a small cut.

 

The Approach

 

What I want is to identify sets of conditions that are a leading indicator of a stock movement. For example, if my dog hears rattling of keys and sees me putting on my shoes, it’s a sign of the possibility of going on a walk. For experts in the stock market, this is quite duh. However, for everyday people like me (the amateurs), we tend to be more reactive. Meaning, we tend to become attracted to a stock after it starts rising. So, amateurs normally have one leading indicator and that is “a good stock to buy is one that is rising.” It’s rare that amateurs can take advantage of significant moves once it’s actually noticed.

 

I’m using the association rules algorithm from Analysis Services 2005 to figure out things like: On days that INTC rises significantly, MSFT often rises significantly too. However, I don’t intend to use that knowledge to simply buy MSFT when I notice INTC going up. There’s not much value in that in itself. The value arises in realizing there is a connection between the fortunes of MSFT and INTC. So, if there is a situation where INTC announces its quarterly earnings before MSFT, I’ll have a good clue that MSFT’s quarter will be similar. This won’t work well on stocks as well-known as these two. If INTC announced a great quarter, MSFT will immediately start rising. But perhaps there are pairs that go relatively unnoticed.

 

I can also create a web of relationships using these pairs. For example, hypothetically, if I see INTC and MSFT go up and down together very often, and I see that INTC and HP also go up and down together very often, there lies a small web: MSFT-->INTC-->HP. 

 

This is the market basket analysis approach. In the normal use of market basket analysis, analysts want to know which products are purchased together (ex: milk and cereal, pizza and beer, steak and Cabernet Sauvignon) in the same "shopping basket". That way, the products could be placed near each other or promotion campaigns could be engineered for the two products. For more on market basket analysis (using OLAP), see Amir Netz's classic article, Analysis Services: DISTINCT COUNT, Basket Analysis, and Solving the Multiple Selection of Members Problem.

 

Finding Stock Historical Data

 

So, I need data; free data since this is only a way to spend my Saturday. I found a free source of historical data somewhere I shouldn’t mention since I work for MSFT. Let’s just say I said something like “Yippee!” when I found the source of data. Unfortunately, the data only includes open, close, high, low, and volume. It’s missing among others the very important P/E. That’s OK, this is a good start.

 

I downloaded the histories of a few stocks I watch which I may or may not own and a few others that I suspected may confirm a thought I had. The histories are in CSV files and I named them STOCK.CSV (the stock symbol-dot-CSV). For example, MSFT is MSFT.CSV. To analyze this data, I need to get it into a database; a SQL Server database of course. I created a very simple SSIS package that loops through the directory of the CSV files and loads the data into a table named DailyFigures.

 

What sort of events would I like to “basket” by day? As I mentioned, it would be of value to know what stocks move significantly on the same day. That is, whether both are up, both are down, or one is up and the other down. A twist would also be, what stocks moved significantly a day after another stock moved significantly? Volume reflects interest in a stock, so it would be valuable to through into the basket stocks with significantly higher volume than the previous day.

 

To do this, I need to create a table where each row represents each of these events for each stock. This table will be used as a nested table in the mining structure. These are the items in the basket.

 

The event set has been limited to the 21st century, which includes 2000 in this case. This is because the world changed greatly around this time with remarkable events like the dot-com crash, 9/11, and maybe even the turning of the millennium itself.

 

Setting Up

 

You will need the following loaded to run this sample:

 

  • At least Windows XP.
  • SQL Server 2005 Relational Engine (At least the Developer edition).
  • Analysis Services 2005 (SSAS)
  • Integration Services 2005 (SSIS)
  • SQL Server 2005 Client tools; for the SQL Server Management Studio (SSMS) and the Business Intelligence Development Studio (BIDS).

 

For those who hardly know their way around SSMS and BIDS, this sample will be easy to install if you install under these conditions:

 

  • You are logged on as an Administrator of the computer you’re testing on.
  • You load this in the C:\ directory. The material is quite small.
  • SQL Server and Analysis Services are installed as default instances (localhost).

 

I think almost every developer is set up such that those requirements can be met. If any of those requirements aren’t met, you’ll need to tweak connection strings and directories of the data sources and connection managers respectively in the SSAS and SSIS solutions.

 

Download and Extract the Material

 

Download the small attached Zip file, StockSample.zip, and extract it to C:\. Be sure “Use Folder Names” is checked (for WinZip). You find the following:

 

  • CSV files containing history for selected stocks.
  • T-SQL scripts to generate the StockSample database.
  • An Integration Services project that will load the stock history contained in the CSV files into the StockSample database.
  • An Analysis Services project with an association data mining model.

 Create the StockSample Database

 

  1. Create the database. Open the SQL Server 2005 Management Studio. In the “Connect to Server” dialog box, be sure “Database Engine” is selected as the Server Type.
  2. Open a query window to execute the script that will generate the StockSample database. Click “New Query”.
  3. Create the StockSample database. Click the “Open File” icon and select “C:\StockSample\CreateStockSampleDatabase.sql”. Click “OK”.
  4. Click the “Execute” icon.
  5. Add the StockSample objects. Click the “Open File” icon and select “C:\StockSample\CreateStockSampleObjects.sql”. Click “OK”.
  6. Click the “Execute” icon.
  7. Four objects are created in the StockSample database:
    1. dbo.DailyFigures table – This table holds stock quotes for each day for each stock downloaded. I don’t mean a tick by tick account, but the high, low, open, close, and volume for the day.
    2. dbo.Dates table – This table holds one row for each date where there are stock quotes. I added this table because there may be other attributes of a day I’d like to add.
    3. dbo.vwDailyFigures – This view creates a list of events of the dbo.vwDailyFiguresFilter view that form the core of the “market basket analysis”. These events are what the data mining algorithm works on (not the quotes themselves). More on this view below.
    4. dbo.vwDailyFiguresFilter – This view filters the events from the dbo.DailyFigures table down to a specific date range. I downloaded pretty much the entire history for each stock. However, I want to focus on more recent history since relationships change over time. This is especially true since 2000 after significant events such as the dot-com crash and 9/11. I have this filter set to begin on 1/1/200.
  8. Close the SQL Server Management Studio.

 

Closer Look at the vwDailyFigures View

 

This view consists of a UNION of select statements, each of which creates events for a particular day. This view is what is referenced by the association data mining algorithm.

 

--All days for all stocks that rose more than 3%.

--This will create events with descriptions like "MSFT+3".

SELECT    

   DateID, Stock + '+3' AS Event

FROM dbo.vwDailyFiguresFilter AS DF

WHERE

   ([Close] > [Open] * 1.03)

UNION ALL

--All days for all stocks that fell more than 3%.

--This will create events with descriptions like "MSFT-3".

SELECT

   DateID, Stock + '-3' AS Event

FROM dbo.vwDailyFiguresFilter AS DF

WHERE

   ([Open] - [Close] > [Open] * .03)

UNION ALL

--All days for all stocks that will rise more than 3% on the next day.

--This will create events with ddescriptions like "MSFT+3 D1".

SELECT

   DF.DateID, DF.Stock + '+3 D1' AS Event

FROM

   dbo.vwDailyFiguresFilter AS DF INNER JOIN

   dbo.Dates AS D ON D.Date = DF.DateID INNER JOIN

   dbo.Dates AS D1 ON D1.DateFormatted = DATEADD(day, 1, D.DateFormatted) INNER JOIN

   dbo.vwDailyFiguresFilter AS DFD1 ON DFD1.DateID = D1.Date AND DFD1.Stock = DF.Stock

WHERE    

   (DFD1.[Close] > DFD1.[Open] * 1.03)

UNION ALL

--All days for all stocks that had more than 50% more volume than the previous day.

--This will create events with descriptions like "MSFT+V50".

SSELECT    

   DF.DateID, DF.Stock + '+V50' AS Event

FROM        

   dbo.vwDailyFiguresFilter AS DF INNER JOIN

   dbo.Dates AS D ON D.Date = DF.DateID INNER JOIN

   dbo.Dates AS D1 ON D1.DateFormatted = DATEADD(day, - 1, D.DateFormatted) INNER JOIN

   dbo.vwDailyFiguresFilter AS DFD1

      ON DFD1.DateID = D1.Date AND

         DFD1.Stock = DF.Stock AND DF.Volume > DFD1.Volume * 1.5

 

For example, the first SELECT statement lists all days for all stocks that closed more than 3% higher than its open price. The following SELECT is pretty much the same except it should close more than 3% lower.

 

There are really two main types of events. Those which happened on the same day and those which happen some time later. The third SELECT is an example of the latter. The third SELECT finds all days where a stock rose more than 3% one day AFTER a day. We could then see that if some stock experiences some event on a given day, the following day often shows a particular stock making a 3% gain.

 

How did I pick 3% or volumes of 50%? Just a shot in the dark. I needed events that were rare enough to show something special, but not so rare. The 3% and 50% figures were determined by trial and error. For example, I started volume with 10% rises. But it seems volume fluctuates significantly day to day, so there were too many cases coming up. 50% seemed about right.

 

Load the Stock History

 

  1. Open “C:\StockSample\LoadstockHistory\LoadStockHistory.sln”.
  2. In the Solution Explorer, open LoadDirectory.dtsx by double-clicking on it. This package will iterate through all of the .CSV files in the StockSample directory and load the data.
  3. Execute the Package. From the Menu Bar, select Debug\Start Debugging.
  4. Close this solution.

 

Open the Data Mining Model

 

  1. Open “C:\StockSample\StockAnalysis\StockAnalysis.sln”.
  2. In the Solution Explorer, expand the “Mining Structures” node.
  3. Open the Dates.dmm mining structure by double-clicking it.
  4. Select the “Mining Model Viewer” tab. BIDS will prompt you to deploy the project and process the mining model.
  5. When the processing completes, you will be in the “ItemSet” tab. This tab lists occurrences of pairs and triplets of events that occurred on the same day (the day is the basket) or simply the number of times an event occurred.

 

Observations

 

From this simple data, there are quite a few fun insights:

 

  • SIRI, XMSR, and AMD fell by three percent over 400 times each. These volatile stocks are good for a crazy day-trader. Contrast this to the relatively tame MSFT that has gone up 3% 94 time and down 84 times.
  • With 213 supported cases, we see that SIRI’s and XM’s fortunes fall together even though some analysts say their business models are different enough to be looked at separately
  • AMD rose 3% 66 times a day after SIRI dropped 3%. Interestingly, AMD is paired with SIRI and XRSR at 139 and 134 cases, respectively. Is there a connection?

 A surprise for me is that FDX and CVX aren’t joined at the hip as I suspected. I unfortunately thought FDX’s price was of late mostly a function of the price of gas. I sold my FDX a couple weeks ago thinking the price of gas was as low as it would was going to get, therefore FDX was at a high. I was wrong. But, what do I know? I’m no Jim Cramer. I still made a good profit, though.