Overview
The other day on a concall, someone said, “On Monday, assuming no one won the lottery, we’ll regroup and discuss …” It was of course in jest, but it was Friday and it got me day dreaming a bit. Well, I’ll never win the lottery. I’m never lucky in gambling. Too bad there’s no such thing as going short on my slot machine performance. But, since I have a very rare weekend off, I thought, “Why not put all of this BI expertise to work and really try to unlock the stock market?”
First, it’s not as if everyone else handy with databases and statistics who wouldn’t turn away an opportunity for an early retirement (or at least a sabbatical) isn’t already taking on the task. Or that the thousands of money managers and other pros don't spend tens of millions of dollars on such software or don’t spend every moment not eating or sleeping pondering the problem.
However, I've come to learn that pros in any dynamic field (the market, medicine, software, science) don't know everything. This isn't anything really disparaging towards pros in any field. It takes decades of incredible dedication to become really great in a field that doesn't change much such as mastering a musical instrument or sport. How much harder is it if your field keeps getting pulled up from under you (such as in the field of software development)? My point is that with some effort, a beginner in a dynamic field has a realistic chance of finding something great with their "new eyes".
It’s not as if I never gave it an aggressive shot before either. Back in the late 80s, I worked for someone who began playing the market and gave me free reign for a couple of months to develop tools to better play the very hot options market; mostly the OEX. I quickly realized math and logic are only a small part of the problem. It goes way beyond news events too.
Mostly, history just doesn’t mean too much as far as picking stocks is concerned. Meaning, data mining (in the Analysis Services sense) is limited in that world events complicate matters so much that what has happened in the past to move a stock won’t have sufficient probability to move the stock the same way today. More accurately, at least not enough of a probability for me to bet real money on it, even if the algorithm says the probability is 1 (complete certainty). Well gee, so far in my life the historical data says the probability that I will cross the street without being killed by a car is also 1.
Also, there are “games” played amongst the big boys and games that are big-boys versus the amateurs. The former game can be thought of as alpha males killing other alpha males and claiming their goodies as their own. The latter can be thought of as an NFL team playing a high school team. David did beat Goliath, but will David win a two-out-of-three falls match?
The main idea is that predator and prey constantly adapt to outwit each other which mean the rules are always changing. Thus, data mining algorithms, which help to expose rules of the game from large sets of historic results, will only be effective in such a dynamic environment with a whole lot more integration of data, mostly in real-time, from a wide variety of sources (a whole lot more work); which is pretty much how it's done at hedge funds.
So, after a few hours of playing around this weekend, I came to the conclusion that I’d still need way more than a few off weekends to accomplish the goal. (If I did find something that gave me an edge, I certainly wouldn’t be posting this.) However, some of my efforts provide a nice example for using the association rules data mining algorithm of Analysis Services 2005. Perhaps some of the many folks who have attended my BI workshops or who I’ve mentored over the past two years can run with this as an exercise. Of course, I’ll expect a small cut.
The Approach
What I want is to identify sets of conditions that are a leading indicator of a stock movement. For example, if my dog hears rattling of keys and sees me putting on my shoes, it’s a sign of the possibility of going on a walk. For experts in the stock market, this is quite duh. However, for everyday people like me (the amateurs), we tend to be more reactive. Meaning, we tend to become attracted to a stock after it starts rising. So, amateurs normally have one leading indicator and that is “a good stock to buy is one that is rising.” It’s rare that amateurs can take advantage of significant moves once it’s actually noticed.
I’m using the association rules algorithm from Analysis Services 2005 to figure out things like: On days that INTC rises significantly, MSFT often rises significantly too. However, I don’t intend to use that knowledge to simply buy MSFT when I notice INTC going up. There’s not much value in that in itself. The value arises in realizing there is a connection between the fortunes of MSFT and INTC. So, if there is a situation where INTC announces its quarterly earnings before MSFT, I’ll have a good clue that MSFT’s quarter will be similar. This won’t work well on stocks as well-known as these two. If INTC announced a great quarter, MSFT will immediately start rising. But perhaps there are pairs that go relatively unnoticed.
I can also create a web of relationships using these pairs. For example, hypothetically, if I see INTC and MSFT go up and down together very often, and I see that INTC and HP also go up and down together very often, there lies a small web: MSFT-->INTC-->HP.
This is the market basket analysis approach. In the normal use of market basket analysis, analysts want to know which products are purchased together (ex: milk and cereal, pizza and beer, steak and Cabernet Sauvignon) in the same "shopping basket". That way, the products could be placed near each other or promotion campaigns could be engineered for the two products. For more on market basket analysis (using OLAP), see Amir Netz's classic article, Analysis Services: DISTINCT COUNT, Basket Analysis, and Solving the Multiple Selection of Members Problem.
Finding Stock Historical Data
So, I need data; free data since this is only a way to spend my Saturday. I found a free source of historical data somewhere I shouldn’t mention since I work for MSFT. Let’s just say I said something like “Yippee!” when I found the source of data. Unfortunately, the data only includes open, close, high, low, and volume. It’s missing among others the very important P/E. That’s OK, this is a good start.
I downloaded the histories of a few stocks I watch which I may or may not own and a few others that I suspected may confirm a thought I had. The histories are in CSV files and I named them STOCK.CSV (the stock symbol-dot-CSV). For example, MSFT is MSFT.CSV. To analyze this data, I need to get it into a database; a SQL Server database of course. I created a very simple SSIS package that loops through the directory of the CSV files and loads the data into a table named DailyFigures.
What sort of events would I like to “basket” by day? As I mentioned, it would be of value to know what stocks move significantly on the same day. That is, whether both are up, both are down, or one is up and the other down. A twist would also be, what stocks moved significantly a day after another stock moved significantly? Volume reflects interest in a stock, so it would be valuable to through into the basket stocks with significantly higher volume than the previous day.
To do this, I need to create a table where each row represents each of these events for each stock. This table will be used as a nested table in the mining structure. These are the items in the basket.
The event set has been limited to the 21st century, which includes 2000 in this case. This is because the world changed greatly around this time with remarkable events like the dot-com crash, 9/11, and maybe even the turning of the millennium itself.
Setting Up
You will need the following loaded to run this sample:
- At least Windows XP.
- SQL Server 2005 Relational Engine (At least the Developer edition).
- Analysis Services 2005 (SSAS)
- Integration Services 2005 (SSIS)
- SQL Server 2005 Client tools; for the SQL Server Management Studio (SSMS) and the Business Intelligence Development Studio (BIDS).
For those who hardly know their way around SSMS and BIDS, this sample will be easy to install if you install under these conditions:
- You are logged on as an Administrator of the computer you’re testing on.
- You load this in the C:\ directory. The material is quite small.
- SQL Server and Analysis Services are installed as default instances (localhost).
I think almost every developer is set up such that those requirements can be met. If any of those requirements aren’t met, you’ll need to tweak connection strings and directories of the data sources and connection managers respectively in the SSAS and SSIS solutions.
Download and Extract the Material
Download the small attached Zip file, StockSample.zip, and extract it to C:\. Be sure “Use Folder Names” is checked (for WinZip). You find the following:
- CSV files containing history for selected stocks.
- T-SQL scripts to generate the StockSample database.
- An Integration Services project that will load the stock history contained in the CSV files into the StockSample database.
- An Analysis Services project with an association data mining model.
Create the StockSample Database
- Create the database. Open the SQL Server 2005 Management Studio. In the “Connect to Server” dialog box, be sure “Database Engine” is selected as the Server Type.
- Open a query window to execute the script that will generate the StockSample database. Click “New Query”.
- Create the StockSample database. Click the “Open File” icon and select “C:\StockSample\CreateStockSampleDatabase.sql”. Click “OK”.
- Click the “Execute” icon.
- Add the StockSample objects. Click the “Open File” icon and select “C:\StockSample\CreateStockSampleObjects.sql”. Click “OK”.
- Click the “Execute” icon.
- Four objects are created in the StockSample database:
- dbo.DailyFigures table – This table holds stock quotes for each day for each stock downloaded. I don’t mean a tick by tick account, but the high, low, open, close, and volume for the day.
- dbo.Dates table – This table holds one row for each date where there are stock quotes. I added this table because there may be other attributes of a day I’d like to add.
- dbo.vwDailyFigures – This view creates a list of events of the dbo.vwDailyFiguresFilter view that form the core of the “market basket analysis”. These events are what the data mining algorithm works on (not the quotes themselves). More on this view below.
- dbo.vwDailyFiguresFilter – This view filters the events from the dbo.DailyFigures table down to a specific date range. I downloaded pretty much the entire history for each stock. However, I want to focus on more recent history since relationships change over time. This is especially true since 2000 after significant events such as the dot-com crash and 9/11. I have this filter set to begin on 1/1/200.
- Close the SQL Server Management Studio.
Closer Look at the vwDailyFigures View
This view consists of a UNION of select statements, each of which creates events for a particular day. This view is what is referenced by the association data mining algorithm.
--All days for all stocks that rose more than 3%.
--This will create events with descriptions like "MSFT+3".
SELECT
DateID, Stock + '+3' AS Event
FROM dbo.vwDailyFiguresFilter AS DF
WHERE
([Close] > [Open] * 1.03)
UNION ALL
--All days for all stocks that fell more than 3%.
--This will create events with descriptions like "MSFT-3".
SELECT
DateID, Stock + '-3' AS Event
FROM dbo.vwDailyFiguresFilter AS DF
WHERE
([Open] - [Close] > [Open] * .03)
UNION ALL
--All days for all stocks that will rise more than 3% on the next day.
--This will create events with ddescriptions like "MSFT+3 D1".
SELECT
DF.DateID, DF.Stock + '+3 D1' AS Event
FROM
dbo.vwDailyFiguresFilter AS DF INNER JOIN
dbo.Dates AS D ON D.Date = DF.DateID INNER JOIN
dbo.Dates AS D1 ON D1.DateFormatted = DATEADD(day, 1, D.DateFormatted) INNER JOIN
dbo.vwDailyFiguresFilter AS DFD1 ON DFD1.DateID = D1.Date AND DFD1.Stock = DF.Stock
WHERE
(DFD1.[Close] > DFD1.[Open] * 1.03)
UNION ALL
--All days for all stocks that had more than 50% more volume than the previous day.
--This will create events with descriptions like "MSFT+V50".
SSELECT
DF.DateID, DF.Stock + '+V50' AS Event
FROM
dbo.vwDailyFiguresFilter AS DF INNER JOIN
dbo.Dates AS D ON D.Date = DF.DateID INNER JOIN
dbo.Dates AS D1 ON D1.DateFormatted = DATEADD(day, - 1, D.DateFormatted) INNER JOIN
dbo.vwDailyFiguresFilter AS DFD1
ON DFD1.DateID = D1.Date AND
DFD1.Stock = DF.Stock AND DF.Volume > DFD1.Volume * 1.5
For example, the first SELECT statement lists all days for all stocks that closed more than 3% higher than its open price. The following SELECT is pretty much the same except it should close more than 3% lower.
There are really two main types of events. Those which happened on the same day and those which happen some time later. The third SELECT is an example of the latter. The third SELECT finds all days where a stock rose more than 3% one day AFTER a day. We could then see that if some stock experiences some event on a given day, the following day often shows a particular stock making a 3% gain.
How did I pick 3% or volumes of 50%? Just a shot in the dark. I needed events that were rare enough to show something special, but not so rare. The 3% and 50% figures were determined by trial and error. For example, I started volume with 10% rises. But it seems volume fluctuates significantly day to day, so there were too many cases coming up. 50% seemed about right.
Load the Stock History
- Open “C:\StockSample\LoadstockHistory\LoadStockHistory.sln”.
- In the Solution Explorer, open LoadDirectory.dtsx by double-clicking on it. This package will iterate through all of the .CSV files in the StockSample directory and load the data.
- Execute the Package. From the Menu Bar, select Debug\Start Debugging.
- Close this solution.
Open the Data Mining Model
- Open “C:\StockSample\StockAnalysis\StockAnalysis.sln”.
- In the Solution Explorer, expand the “Mining Structures” node.
- Open the Dates.dmm mining structure by double-clicking it.
- Select the “Mining Model Viewer” tab. BIDS will prompt you to deploy the project and process the mining model.
- When the processing completes, you will be in the “ItemSet” tab. This tab lists occurrences of pairs and triplets of events that occurred on the same day (the day is the basket) or simply the number of times an event occurred.
Observations
From this simple data, there are quite a few fun insights:
- SIRI, XMSR, and AMD fell by three percent over 400 times each. These volatile stocks are good for a crazy day-trader. Contrast this to the relatively tame MSFT that has gone up 3% 94 time and down 84 times.
- With 213 supported cases, we see that SIRI’s and XM’s fortunes fall together even though some analysts say their business models are different enough to be looked at separately
- AMD rose 3% 66 times a day after SIRI dropped 3%. Interestingly, AMD is paired with SIRI and XRSR at 139 and 134 cases, respectively. Is there a connection?
A surprise for me is that FDX and CVX aren’t joined at the hip as I suspected. I unfortunately thought FDX’s price was of late mostly a function of the price of gas. I sold my FDX a couple weeks ago thinking the price of gas was as low as it would was going to get, therefore FDX was at a high. I was wrong. But, what do I know? I’m no Jim Cramer. I still made a good profit, though.