Homework #3: Least squares (hipsters only)
Posted 3/17; Due 3/31
Background
From the Ooky-gooky-pedia article on least squares:
"The method of least squares, also known as regression analysis, is used to
model numerical data obtained from observations by adjusting the parameters of
a model so as to get an optimal fit of the data. The best fit is characterized
by the sum of squared residuals having its least value, a residual being the
difference between an observed value and the value given by the model. The
method was first described by Carl Friedrich Gauss around 1794. Least
squares corresponds to the maximum likelihood criterion if the experimental
errors have a normal distribution. Regression analysis is available in most
statistical software packages."
See also chapter 8 in the Burden and Faires textbook.
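In symbols, the criterion in the quote can be written as follows for the polynomial case used in the questions (m data points (x[i], y[i]), degree n). This is the usual normal-equations formulation, offered as a sketch and not necessarily the derivation the textbook follows; the names S, c_j, and A are introduced here for illustration only.

```latex
S(c_0,\dots,c_n) \;=\; \sum_{i=1}^{m} \Bigl( y_i - \sum_{j=0}^{n} c_j\, x_i^{\,j} \Bigr)^{2},
\qquad
A_{ij} = x_i^{\,j} \quad (i = 1,\dots,m;\; j = 0,\dots,n),
\qquad
A^{\mathsf T} A\, c \;=\; A^{\mathsf T} y .
```

The last equation comes from setting each partial derivative of S with respect to c_k to zero; its solution c, when unique, holds the n+1 coefficients of the best-fitting polynomial.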
For information on the carbon dioxide concentration as measured on the peak
at Mauna Loa, see this file.
Questions
The basics
- By hand, compute the best linear and quadratic polynomials fitting the data
(0,3), (1,4), (2,6), and (2,7). Compute the least squares errors, and make
a rough graph of your results.
- Write code that computes a least squares fit for a given set of data. You
need only be able to fit polynomials. That is, given m data points
(x[i], y[i]), output the coefficients of the
degree n polynomial that best fits the data in a least squares sense.
(NB: a polynomial of degree n has n+1 coefficients. Sounds
silly, I know, but it's easy to forget.) Using Matlab or Octave will make
things easier.
Be sure to handle the possibility that the linear system has no unique
solution; that is, return some sort of error state.
- Fit the following data using polynomials of degree 1, 2, and 3 and the least
squares method:
1.0, 1.84
1.1, 1.96
1.3, 2.21
1.5, 2.45
1.9, 2.94
2.1, 3.18
- Test your code on the Mauna Loa data. Using the yearly average data, fit
linear and quadratic models to the data. (For this exercise use the last
column titled "Annual-fit" and only from 1959 on.) Under each model, what
CO2 concentration do you predict for 2050? For 2100?
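One possible shape for the fitting routine, sketched in plain Python so it runs without any toolbox (a Matlab or Octave version would be much shorter). It builds the normal equations and solves them by Gaussian elimination with partial pivoting. The function name `polyfit_normal_equations`, the `None` error state, and the absolute pivot tolerance `1e-12` are all assumptions made for this illustration, not requirements of the assignment.

```python
def polyfit_normal_equations(xs, ys, degree, tol=1e-12):
    """Least squares polynomial fit via the normal equations.

    Returns degree+1 coefficients, lowest power first, or None as an
    error state when the input is inconsistent or the system looks singular.
    """
    if len(xs) != len(ys) or len(xs) == 0 or degree < 0:
        return None
    n1 = degree + 1  # a degree-n polynomial has n+1 coefficients

    # Augmented matrix [A^T A | A^T y], where A[i][j] = xs[i] ** j.
    aug = []
    for j in range(n1):
        row = [sum(x ** (j + k) for x in xs) for k in range(n1)]
        row.append(sum(y * x ** j for x, y in zip(xs, ys)))
        aug.append(row)

    # Forward elimination with partial pivoting.
    for col in range(n1):
        pivot = max(range(col, n1), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < tol:  # crude, scale-dependent check (assumption)
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n1):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n1 + 1):
                aug[r][k] -= factor * aug[col][k]

    # Back substitution.
    coeffs = [0.0] * n1
    for r in range(n1 - 1, -1, -1):
        known = sum(aug[r][k] * coeffs[k] for k in range(r + 1, n1))
        coeffs[r] = (aug[r][n1] - known) / aug[r][r]
    return coeffs


# Made-up data lying exactly on y = 1 + 2x, so the linear fit should recover it.
line_fit = polyfit_normal_equations([0, 1, 2, 3], [1, 3, 5, 7], 1)

# All x values equal: a line cannot be pinned down, so the error state is returned.
degenerate_fit = polyfit_normal_equations([1, 1, 1], [2, 3, 4], 1)
```

Returning `None` is only one choice of error state; raising an exception or returning a status flag alongside the coefficients would serve equally well.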
Things I'd like to see you do if you have time (Part I: Simple data, "bad" data)
- Fit the above listed data using a polynomial of degree five. Plot the
resulting fitted function. What does it look like? Taking the linear model
as a baseline, plot the residuals of the fifth-order polynomial; that is, plot
the difference between the fifth-order model and the first-order one.
- Fit the data (0,0), (1,0), ..., (16,0) using polynomials of orders 0 through
10, and plot the results. (With zero data you should get the zero polynomial
in each instance. Boring, right?) Repeat using the data (0,0), ..., (7,0),
(8,1), (9,0), ..., (16,0). That is, just move the point at x=8
from y=0 to y=1.
- In the second example above, what happens to the low-order models as you
change from one data set to the other? What about the high-order models?
What's going on here? There's only a small (relative) change in the data
between the two sets. What's Runge's phenomenon?
(NB: this has nothing to do with the basis for polynomials that we've chosen.
That is, if we use a
different basis, we'll still get the same polynomial (a different set of
coefficients in the basis, for sure, but the same polynomial). Using exact
arithmetic (versus approximate arithmetic like floating point) doesn't change
the wiggly picture. Convince yourself this is the case -- or am I just
messing with you?)
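One way to check the claim about exact arithmetic for yourself is to redo the fit with rational numbers instead of floating point. The sketch below uses Python's `fractions.Fraction` and Gauss-Jordan elimination on the normal equations; the names `exact_polyfit` and `evaluate` are hypothetical helpers made up for this illustration. It only prints the low-degree coefficients; plotting and the higher degrees are left to you.

```python
from fractions import Fraction


def exact_polyfit(xs, ys, degree):
    """Least squares polynomial fit in exact rational arithmetic.

    Returns degree+1 Fraction coefficients, lowest power first.
    Raises ValueError if the normal equations are singular.
    """
    n1 = degree + 1
    xs = [Fraction(x) for x in xs]
    ys = [Fraction(y) for y in ys]

    # Augmented normal equations [A^T A | A^T y] with A[i][j] = xs[i] ** j.
    aug = [
        [sum(x ** (j + k) for x in xs) for k in range(n1)]
        + [sum(y * x ** j for x, y in zip(xs, ys))]
        for j in range(n1)
    ]

    # Gauss-Jordan elimination; no rounding occurs, so any nonzero pivot will do.
    for col in range(n1):
        pivot = next((r for r in range(col, n1) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n1):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n1] for row in aug]


def evaluate(coeffs, x):
    """Evaluate the polynomial at x using Horner's rule."""
    result = Fraction(0)
    for c in reversed(coeffs):
        result = result * x + c
    return result


# The two data sets from the question: all zeros, and a single spike at x = 8.
xs = list(range(17))
zero_data = [0] * 17
spike_data = [1 if x == 8 else 0 for x in xs]

for degree in (0, 1, 2):
    coeffs = exact_polyfit(xs, spike_data, degree)
    print(degree, [str(c) for c in coeffs])
```

Comparing `evaluate` on a fine grid against the floating-point fit for the same degree is one way to see whether rounding error is responsible for what the plots show.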
Things I'd like to see you do if you have time (Part II: More complicated models and some numerical problems)
- Test your code on the Mauna Loa data. This time, though, use the monthly
data, and modify your code to fit a linear combination of polynomials and
sines and cosines of period one year. (NB: Treat -99.99 as a missing value.)
- Is Matlab complaining that the matrix for the normal equations is
ill-conditioned? Good, because it is. But the source of this
ill-conditioning isn't entirely in the nature of the problem; the Vandermonde
matrix is ill-conditioned, but not *that* ill-conditioned. Why does using
"scaledMonthlyDates" and adjusting the model fix this problem? (For starters,
how do you use the scaled data, and what adjustments do you need to make to
the model?)
- Another challenging question (that's unrelated): what if you pretend you don't
know the frequency of the periodic variation? (That is, what if you didn't
know the length of the year? Or didn't guess it was an annual variation?)
- In the first question, why didn't I just ask you to fit a sine of period one
year with an unknown phase? (Hint: the sum of a sine and a cosine of the same
period can be rewritten as a single sine with a phase displacement.) What
does this have to do with calculating the phase displacement directly as part
of the least squares fit?
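As a starting point for the seasonal model, here is a minimal Python sketch of three small pieces: dropping the -99.99 placeholders, building one row of basis functions (powers of t plus a sine and cosine of period one year), and converting a fitted sine/cosine pair into the amplitude and phase mentioned in the hint. It assumes time t is measured in years; the names `drop_missing`, `design_row`, and `amplitude_phase` are hypothetical, and the least squares solve itself is not shown.

```python
import math

MISSING = -99.99  # placeholder value the assignment says to treat as missing


def drop_missing(times, values, sentinel=MISSING):
    """Keep only the (time, value) pairs whose value is not the sentinel."""
    return [(t, v) for t, v in zip(times, values)
            if not math.isclose(v, sentinel)]


def design_row(t, degree):
    """Basis functions at time t (in years, an assumption):
    1, t, ..., t**degree, sin(2*pi*t), cos(2*pi*t)."""
    row = [t ** j for j in range(degree + 1)]
    row.append(math.sin(2 * math.pi * t))
    row.append(math.cos(2 * math.pi * t))
    return row


def amplitude_phase(a, b):
    """Rewrite a*sin(w*t) + b*cos(w*t) as R*sin(w*t + phi); return (R, phi).

    Expanding R*sin(w*t + phi) gives a = R*cos(phi) and b = R*sin(phi).
    """
    return math.hypot(a, b), math.atan2(b, a)


# Made-up numbers, purely to show the calls.
cleaned = drop_missing([1960.0, 1960.5, 1961.0], [316.0, MISSING, 317.0])
row = design_row(0.25, 1)
R, phi = amplitude_phase(3.0, 4.0)
```

Note that the coefficients a and b enter the model linearly, whereas phi sits inside the sine; that difference is worth keeping in mind when thinking about the last question.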
Notes
As before, the first thing I'm gonna do is test your code. If it doesn't
work, you better already know that and have told me in advance. In that
case, you should at least:
- identify that the code's output is incorrect; what about the output tells
you it's incorrect?
- figure out why it's incorrect
- figure out a fix
Note that the last two are often different things.
Also:
- If a question needs clarification, please let me know.
- I'm willing to provide lotsa help and/or guidance. But you need to *ask*.
- Here's code in Matlab to read in the Mauna Loa data and give you arrays.