Ok, I’ve finally had some time to write something on how matrices are used in statistics. As an example, I’ll discuss a technique called ‘multiple regression.’ Multiple regression is a tool that is often used to model or predict outcomes based on a linear combination of variables. A biologist, for example, might use multiple regression to predict the number of birds living in various regions as a linear function of the number of predators in those regions and the temperature of those regions. An economist might use regression to predict the GNP of different nations based on employment rates and the number of exports. A Kirupa forum member might use regression to predict the number of posts of forum members from those members’ registration dates and birth years.
I’ll use this latter example to illustrate the way multiple regression works.
I collected the following information from the forum on April 17, 2003. The first variable (USER) is the user’s name, the second variable (POSTS) is the number of posts for forum members, the third variable (RD) is the date of registration (coded as the number of months since June 2002), and the fourth variable (DOB) is birth year.
USER        POSTS  RD   DOB
Pom          3976   0  1980
lostinbeta  12483   1  1984
ahmed         929   6  1987
Jubba        2900   0  1983
senocular    2017   6  1979
Guig0        3180   4  1976
david        2592   0  1970
lavaboy      1240  10  1982
Let’s assume we want to model or predict the number of posts made by forum members as a function of their registration date and year of birth. If we were to do so using linear regression, we would write out the following equation:
POSTS = b0 + b1*RD + b2*DOB
where b0 is the y-intercept in a linear equation. According to this equation, POSTS is a ‘weighted’ function of RD and DOB. If the weight for RD, b1, is large, then RD has a strong linear relationship to POSTS. If b1 is negative, then RD has an inverse relationship to POSTS (i.e., members with later registration dates have fewer posts). If the weight is 0, then there is no linear relationship between RD and POSTS.
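To make this concrete, the equation is easy to express as a function. Here’s a minimal ActionScript sketch (the function name predictPosts is my own; the weights are just placeholders until we estimate them below):

// The regression equation as a function. The weights b0, b1,
// and b2 are unknowns at this point; least squares (discussed
// below) gives us values for them.
function predictPosts(b0, b1, b2, rd, dob) {
    return b0 + b1*rd + b2*dob;
}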
In matrix form we would write the equation in the following manner:
POSTS = X * B
[ 3976]   [1  0 1980]
[12483]   [1  1 1984]
[  929]   [1  6 1987]   [b0]
[ 2900] = [1  0 1983] * [b1]
[ 2017]   [1  6 1979]   [b2]
[ 3180]   [1  4 1976]
[ 2592]   [1  0 1970]
[ 1240]   [1 10 1982]
where POSTS is a simple 8 x 1 matrix, B is a 3 x 1 matrix containing the weights or parameters for the equation (b0, b1, and b2), and X is an 8 x 3 matrix containing the predictor variables. (A column of 1’s is typically included as the first column of this matrix in order to capture the y-intercept information.)
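In ActionScript, one way to set these matrices up from the data above is as nested arrays. A sketch (the variable names are mine):

// The observed data from the table above.
var rd    = [0, 1, 6, 0, 6, 4, 0, 10];
var dob   = [1980, 1984, 1987, 1983, 1979, 1976, 1970, 1982];
var posts = [3976, 12483, 929, 2900, 2017, 3180, 2592, 1240];

// Build the 8 x 3 design matrix X: a 1 (for the intercept)
// followed by each member's RD and DOB values.
var X = [];
for (var i = 0; i < rd.length; i++) {
    X[i] = [1, rd[i], dob[i]];
}

// Store POSTS as an 8 x 1 matrix (a column of one-element
// rows), to match the matrix equation above.
var y = [];
for (var i = 0; i < posts.length; i++) {
    y[i] = [posts[i]];
}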
Once this general equation has been specified, the next step is to ‘estimate’ the weights in the equation (i.e., the values of B) based on empirical data. This is typically done using a technique called ‘least-squares estimation.’ In short, the goal of least squares estimation is to find values of b0, b1, and b2 that minimize the (average squared) differences between the values of POSTS implied by the model and the actual data (i.e., the true number of posts).
One way we could find the least squares values of B would be to simply plug in a variety of different numbers, solve the equation for POSTS, and continue this process until we find some combination of parameters that makes the difference between the predicted values of POSTS and the empirical values of POSTS as small as possible. Doing this, however, would be tedious. Fortunately, some smart statisticians have used multivariate calculus to derive rules for finding a set of weights that will minimize the difference between the predicted and observed values.
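The quantity being minimized is easy to write down in code. Here is a sketch of the ‘badness’ measure that least squares makes as small as possible over all candidate sets of weights (again, the function name is mine):

// Sum of squared differences between the observed values y
// (an 8 x 1 matrix, as built above) and the values predicted
// from the design matrix X using candidate weights b = [b0, b1, b2].
function sumSquaredError(X, y, b) {
    var sse = 0;
    for (var i = 0; i < X.length; i++) {
        var predicted = X[i][0]*b[0] + X[i][1]*b[1] + X[i][2]*b[2];
        var diff = y[i][0] - predicted;
        sse += diff*diff;
    }
    return sse;
}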
In matrix form, the least-squares solution for the matrix B is
B = Inverse(X’X) * X’ * POSTS.
Notice that this solution requires finding the inverse of the matrix that results from multiplying the transpose of X by X itself. This is one of the reasons I needed a way to calculate the inverse of a matrix (or multi-dimensional array) in ActionScript. Without the appropriate technique, estimating the least-squares weights in this kind of multiple regression problem would be extraordinarily complicated.
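Here is a sketch of how the whole solution might look in ActionScript. None of these functions are built into Flash; transpose, multiply, inverse, and leastSquares are names I’ve chosen, and the inverse here uses Gauss-Jordan elimination (other methods would work too):

// Transpose of a matrix stored as an array of row arrays.
function transpose(A) {
    var T = [];
    for (var j = 0; j < A[0].length; j++) {
        T[j] = [];
        for (var i = 0; i < A.length; i++) {
            T[j][i] = A[i][j];
        }
    }
    return T;
}

// Matrix product A*B (columns of A must equal rows of B).
function multiply(A, B) {
    var C = [];
    for (var i = 0; i < A.length; i++) {
        C[i] = [];
        for (var j = 0; j < B[0].length; j++) {
            var sum = 0;
            for (var k = 0; k < B.length; k++) {
                sum += A[i][k]*B[k][j];
            }
            C[i][j] = sum;
        }
    }
    return C;
}

// Inverse of a square matrix via Gauss-Jordan elimination
// with partial pivoting on the augmented matrix [A | I].
function inverse(A) {
    var n = A.length;
    var M = [];
    for (var i = 0; i < n; i++) {
        M[i] = [];
        for (var j = 0; j < n; j++) {
            M[i][j] = A[i][j];
            M[i][j+n] = (i == j) ? 1 : 0;
        }
    }
    for (var col = 0; col < n; col++) {
        // Swap in the row with the largest pivot to keep the
        // arithmetic stable.
        var pivot = col;
        for (var r = col+1; r < n; r++) {
            if (Math.abs(M[r][col]) > Math.abs(M[pivot][col])) {
                pivot = r;
            }
        }
        var tmp = M[col]; M[col] = M[pivot]; M[pivot] = tmp;
        var p = M[col][col];
        for (var j = 0; j < 2*n; j++) {
            M[col][j] /= p;
        }
        for (var r = 0; r < n; r++) {
            if (r != col) {
                var f = M[r][col];
                for (var j = 0; j < 2*n; j++) {
                    M[r][j] -= f*M[col][j];
                }
            }
        }
    }
    // The right-hand half of M is now the inverse of A.
    var inv = [];
    for (var i = 0; i < n; i++) {
        inv[i] = [];
        for (var j = 0; j < n; j++) {
            inv[i][j] = M[i][j+n];
        }
    }
    return inv;
}

// The least-squares solution: B = Inverse(X'X) * X' * y.
function leastSquares(X, y) {
    var Xt = transpose(X);
    return multiply(inverse(multiply(Xt, X)), multiply(Xt, y));
}

Calling leastSquares(X, y) with the matrices built earlier returns the 3 x 1 matrix B.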
To finish the example: If we apply this equation to the data I presented previously, the least-squares solution to B is:
B = [-455214.38]
    [   -542.71]
    [    232.66],
implying that the least-squares regression equation for predicting POSTS from RD and DOB is:
POSTS = -455214.38 - 542.71*RD + 232.66*DOB.
This equation suggests that, if we want to predict the number of POSTS a forum member has, we can begin with the number -455214, subtract 542 for each month that elapsed between June 2002 and his or her registration, and add 232 times his or her year of birth. In other words, there is a negative relationship between registration date and posts, such that the more recently a user has registered, the fewer posts he or she has. There is a positive relationship with year of birth, such that the younger a user is (i.e., the later he or she was born), the more posts he or she has (specifically, for each one-year increase in birth year we expect 232 extra posts).
We can use the equation to predict the number of posts a user should have. To predict my posts, for example, we would plug my RD (6) and DOB (1972) into the equation (using the rounded weights):

predicted POSTS = -455214 - (542 * 6) + (232 * 1972)
predicted POSTS = -962
According to this equation, I should have -962 posts. This is clearly off the mark, which raises a second important issue in multiple regression: how well does the model fit or explain the data? When we use least-squares regression to make predictions, we know we will find the optimal weights for the variables in question. That does not imply, however, that we’ll be able to predict the outcome well. (The optimal weights might not be good in an absolute sense, only a relative one.) Thus, researchers and statisticians often compute an index called r-squared. In short, r-squared is an index of how much of the variation between people in the predicted variable (POSTS in this example) can be explained by the variables under study (RD and DOB in this example). I won’t elaborate on how r-squared is computed (a quick sketch follows for the curious), but in this example the r-squared is 30%. That suggests that our model can explain some of the variation in POSTS (i.e., 30%), but not all of it. Hence, we might expect substantial differences between the actual number of posts and the model-implied number of posts.
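One common way to compute it is 1 - SSE/SST, where SSE is the sum of squared prediction errors and SST is the total squared variation of the outcome around its mean. A sketch (rSquared is my own function name):

// r-squared = 1 - SSE/SST: the proportion of the variation
// in y around its mean that the model accounts for. Both y
// and predicted are plain arrays of numbers here.
function rSquared(y, predicted) {
    var mean = 0;
    for (var i = 0; i < y.length; i++) {
        mean += y[i];
    }
    mean /= y.length;
    var sse = 0;
    var sst = 0;
    for (var i = 0; i < y.length; i++) {
        var e = y[i] - predicted[i];
        var d = y[i] - mean;
        sse += e*e;
        sst += d*d;
    }
    return 1 - sse/sst;
}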
Here are the predicted values for the data discussed previously:
USER        POSTS  Predicted POSTS
Pom          3976   5472
lostinbeta  12483   5860
ahmed         929   3844
Jubba        2900   6170
senocular    2017   1983
Guig0        3180   2370
david        2592   3145
lavaboy      1240    510
Notice that the equation ‘tends’ to make decent predictions, but never quite nails anything on the head. (Example: it under-predicts lostinbeta’s posts and over-predicts Jubba’s posts. It doesn’t do too badly for sen.) Statistically, we assume that the model is incomplete. In other words, there are probably other factors that we would need to take into account in order to fully model POSTS, such as Flash experience, time available to post on forums, Flash expertise, number of problems encountered in Flash projects, etc.
I’m currently writing a program that performs multiple regression analyses in Flash. A working draft of it is available via the link below. I’ll continue to update it as I have time. (I’d like to give the swf the ability to plot the data too.)
Link to Working Draft of Regression Program in Flash MX
Chris