Homework #3: Least squares (hipsters only)

Posted 3/17; Due 3/31

Background

From the Ooky-gooky-pedia article on least squares:

"The method of least squares, also known as regression analysis, is used to model numerical data obtained from observations by adjusting the parameters of a model so as to get an optimal fit of the data. The best fit is characterized by the sum of squared residuals having its least value, a residual being the difference between an observed value and the value given by the model. The method was first described by Carl Friedrich Gauss around 1794. Least squares corresponds to the maximum likelihood criterion if the experimental errors have a normal distribution. Regression analysis is available in most statistical software packages."

See also chapter 8 in the Burden and Faires textbook.

For information on the carbon dioxide concentration as measured on the peak at Mauna Loa, see this file.

Questions

The basics

By hand, compute the best linear and quadratic polynomials fitting the data (0,3), (1,4), (2,6), and (2,7). Compute the least squares errors, and make a rough graph of your results.
Write code that computes a least squares fit for a given set of data. You need only be able to fit polynomials. That is, given m data points (x[i], y[i]), output the coefficients of the degree n polynomial that best fits the data in a least squares sense. (NB: a polynomial of degree n has n+1 coefficients. Sounds silly, I know, but it's easy to forget.) Using Matlab or Octave will make things easier.
Be sure to handle the possibility that the linear system is singular (that is, it has no unique solution); in that case, return some sort of error state.
Fit the following (x, y) data using polynomials of degree 1, 2, and 3 and the least squares method:
1.0, 1.84
1.1, 1.96
1.3, 2.21
1.5, 2.45
1.9, 2.94
2.1, 3.18
Test your code on the Mauna Loa data. Using the yearly average data, fit linear and quadratic models to the data. (For this exercise use the last column, titled "Annual-fit", and only from 1959 on.) Under each model, what CO₂ concentration do you predict for 2050? For 2100?
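
For orientation, here is a minimal sketch of a polynomial least squares fit through the normal equations, with an error state for a singular system. It is written in plain Python rather than Matlab or Octave, and it is only one way to do it: the names polyfit_normal and sum_squared_residuals, the choice of None as the error state, and the pivot tolerance are all assumptions made for illustration, not part of the assignment.

```python
def polyfit_normal(xs, ys, degree, tol=1e-12):
    """Return [c0, c1, ..., cn] for p(x) = c0 + c1*x + ... + cn*x**n,
    or None if the normal equations are (numerically) singular."""
    n = degree + 1  # a degree-n polynomial has n+1 coefficients
    # Normal equations (A^T A) c = A^T y with A[i][j] = x_i**j.
    # Entry (j, k) of A^T A is the power sum of x_i**(j + k).
    power_sums = [sum(x ** p for x in xs) for p in range(2 * n - 1)]
    aug = [
        [power_sums[j + k] for k in range(n)]
        + [sum(y * x ** j for x, y in zip(xs, ys))]
        for j in range(n)
    ]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) <= tol:  # crude absolute test (assumption)
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution.
    coeffs = [0.0] * n
    for j in range(n - 1, -1, -1):
        tail = sum(aug[j][k] * coeffs[k] for k in range(j + 1, n))
        coeffs[j] = (aug[j][n] - tail) / aug[j][j]
    return coeffs


def sum_squared_residuals(coeffs, xs, ys):
    """Least squares error: sum of (observed - modeled)**2."""
    def p(x):
        return sum(c * x ** j for j, c in enumerate(coeffs))
    return sum((y - p(x)) ** 2 for x, y in zip(xs, ys))
```

A quick sanity check is to feed it data that lie exactly on a known polynomial (the coefficients should come back and the error should be essentially zero), and data whose x values are all equal (a degree 1 fit should report the error state).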

Things I'd like to see you do if you have time (Part I: Simple data, "bad" data)

Fit the six-point data set listed above (the one you fit with degrees 1, 2, and 3) using a polynomial of degree five. Plot the resulting fitted function. What does it look like? Taking the linear model as a baseline, plot the residuals of the fifth order polynomial; that is, plot the difference between the fifth order model and the first order one.
Fit the data (0,0), (1,0), ..., (16,0) using polynomials of orders 0 through 10, and plot the results. (With zero data you should get the zero polynomial in each instance. Boring, right?) Repeat using the data (0,0), ..., (7,0), (8,1), (9,0), ..., (16,0). That is, just move the point at x=8 from y=0 to y=1.
In the second example above, what happens to the low order models as you change from one data set to the other? What about the high order models? What's going on here? There's only a small (relative) change in the data between the two sets. What's Runge's phenomenon?
(NB: this has nothing to do with the basis for polynomials that we've chosen. That is, if we use a different basis, we'll still get the same polynomial (a different set of coefficients in the basis, for sure, but the same polynomial). Using exact arithmetic (versus approximate arithmetic like floating point) doesn't change the wiggly picture. Convince yourself this is the case -- or am I just messing with you?)
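
One way to poke at the NB above is to fit the same data in two different bases using exact rational arithmetic and compare the fitted values. The sketch below does that in plain Python with the standard fractions module. The helper names (solve_exact, fit_in_basis, evaluate), the choice of degree 4, and the shifted basis 1, (x-8), (x-8)^2, ... are all assumptions picked for illustration; nothing here is prescribed by the assignment.

```python
from fractions import Fraction


def solve_exact(matrix, rhs):
    """Solve a square linear system over the rationals (Gauss-Jordan)."""
    n = len(rhs)
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        lead = aug[col][col]
        aug[col] = [v / lead for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def fit_in_basis(xs, ys, basis):
    """Exact least squares coefficients with respect to the given basis."""
    rows = [[Fraction(phi(x)) for phi in basis] for x in xs]
    n = len(basis)
    ata = [[sum(r[j] * r[k] for r in rows) for k in range(n)] for j in range(n)]
    aty = [sum(r[j] * y for r, y in zip(rows, ys)) for j in range(n)]
    return solve_exact(ata, aty)


def evaluate(coeffs, basis, x):
    """Value of the fitted model at x."""
    return sum(c * phi(x) for c, phi in zip(coeffs, basis))


degree = 4  # arbitrary illustrative choice
monomials = [lambda x, j=j: Fraction(x) ** j for j in range(degree + 1)]
shifted = [lambda x, j=j: Fraction(x - 8) ** j for j in range(degree + 1)]

xs = list(range(17))
ys = [1 if x == 8 else 0 for x in xs]  # the second data set above

c_mono = fit_in_basis(xs, ys, monomials)
c_shift = fit_in_basis(xs, ys, shifted)

# Compare the two fitted models point by point, with no rounding involved.
same_values = all(
    evaluate(c_mono, monomials, x) == evaluate(c_shift, shifted, x) for x in xs
)
print("coefficients equal:", c_mono == c_shift)
print("fitted values equal:", same_values)
```

Changing degree and plotting the fitted values is left to you; the point of the sketch is only that the comparison can be made without any floating point in the way.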

Things I'd like to see you do if you have time (Part II: More complicated models and some numerical problems)

Test your code on the Mauna Loa data. This time, though, use the monthly data, and modify your code to fit a linear combination of polynomials and sines and cosines of period one year. (NB: Treat -99.99 as a missing value.)
Is Matlab complaining that the matrix for the normal equations is ill-conditioned? Good, because it is. But the source of this ill conditioning isn't entirely in the nature of the problem; the Vandermonde matrix is ill-conditioned, but not *that* ill-conditioned. Why does using "scaledMonthlyDates" and adjusting the model fix this problem? (For starters, how do you use the scaled data, and what adjustments do you need to make to the model?)
Another challenging question (that's unrelated): what if you pretend you don't know the frequency of the periodic variation? (That is, what if you didn't know the length of the year? Or didn't guess it was an annual variation?)
In the first question, why didn't I just ask you to fit a sine of period one year with an unknown phase? (Hint: the sum of a sine and a cosine of the same period can be rewritten as a single sine with a phase displacement.) What does this have to do with calculating the phase displacement directly as part of the least squares fit?
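
As a starting point for this part, here is a small sketch in plain Python of three pieces: dropping the -99.99 missing values, building one row of the design matrix for a polynomial plus an annual sine and cosine, and converting a fitted cosine/sine pair into the amplitude and phase from the hint. The function names, the assumption that time t is measured in years, and the sign convention R sin(wt + phi) are choices made for this illustration only.

```python
import math

MISSING = -99.99  # missing-value marker named in the assignment


def drop_missing(times, values, sentinel=MISSING):
    """Keep only the (time, value) pairs whose value is not the sentinel.
    Assumes the sentinel was parsed from text as exactly -99.99."""
    return [(t, v) for t, v in zip(times, values) if v != sentinel]


def design_row(t, degree):
    """Row of the design matrix for the model
    c0 + c1*t + ... + cn*t**n + a*cos(2*pi*t) + b*sin(2*pi*t),
    with t in years, so the sine and cosine have period one year."""
    w = 2.0 * math.pi
    return [t ** j for j in range(degree + 1)] + [math.cos(w * t), math.sin(w * t)]


def amplitude_phase(a, b):
    """Rewrite a*cos(wt) + b*sin(wt) as R*sin(wt + phi).
    Expanding R*sin(wt + phi) gives b = R*cos(phi) and a = R*sin(phi)."""
    return math.hypot(a, b), math.atan2(a, b)
```

For example, amplitude_phase(3, 4) gives an amplitude of 5, and you can check the identity numerically at a handful of t values. Note that the model is linear in a and b, which is what lets it slot into the same least squares machinery as the polynomial terms.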

Notes

As before, the first thing I'm gonna do is test your code. If it doesn't work, you better already know that and have told me in advance. In that case, you should at least:

identify that the code's output is incorrect; what about the output tells you it's incorrect?
figure out why it's incorrect
figure out a fix

Note that the last two are often different things.

Also:

If a question needs clarification, please let me know.
I'm willing to provide lotsa help and/or guidance. But you need to *ask*.
Here's code in Matlab to read in the Mauna Loa data and give you arrays.