
Gradient Descent that Increases Loss

Linear regression tutorials built on neural network packages are easy to find. However, few of them acknowledge that Gradient Descent can run into trouble even on univariate regression models. Here, I review an example that Gradient Descent fails to solve and work through a general solution.

Example

Consider a dataset commonly used in regression classes, the LifeCycleSavings dataset (details here). Fitting a basic regression that links population age structure to savings rates with Gradient Descent fails to converge.

Using Tensorflow to define the graph:

from pydataset import data
import tensorflow as tf
import numpy as np

savings = data("LifeCycleSavings")
n = np.shape(savings)[0]

# Placeholders for the feature column and the response.
inputs = tf.placeholder(tf.float32, shape=[n, 1])
outputs = tf.placeholder(tf.float32, shape=[n, 1])

# Intercept and slope, randomly initialized.
beta_coefs = tf.Variable(tf.random_normal((2, 1)))

# Design matrix: a column of ones for the intercept, then the feature.
Xmat = tf.concat([tf.ones([n, 1]), inputs], axis=1)
yhat = tf.matmul(Xmat, beta_coefs)

# Sum-of-squared-errors loss with a plain Gradient Descent step.
loss = tf.reduce_sum(tf.square(yhat - outputs))
sgd = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

init = tf.global_variables_initializer()

Conducting the gradient descent:

X = np.asarray(savings["pop15"]).reshape([n,1])
Y = np.asarray(savings["sr"]).reshape([n,1])

ses = tf.Session()
ses.run(init)

nsteps = 10
nreports = 1

for i in range(nsteps):
    op_out, loss_out = ses.run([sgd, loss], feed_dict={inputs: X, outputs: Y})

    if ((i + 1) % nreports) == 0:
        print("Loss =", loss_out)
print("Final coefficients",ses.run([beta_coefs]))

Output:

Loss = 26874.5
Loss = 426156130.0
Loss = 7249693500000.0
Loss = 1.2333102e+17
Loss = 2.0980958e+21
Loss = 3.5692594e+25
Loss = 6.0719852e+29
Loss = 1.0329596e+34
Loss = 1.7572603e+38
Loss = inf
Final coefficients [[2.3496313e+19]
[8.7939524e+20]]

What happened? The loss increased at every step, and within only ten iterations it overflowed to infinity, while the estimate of $\beta$ blew up toward infinity as well. The Gradient Descent algorithm effectively performed gradient ascent. Notably, related optimizers such as Adam also perform poorly on this dataset, converging very slowly at best.

The problem with Gradient Descent is not just the learning rate

The most direct way to address this problem is to adjust the learning rate, which was specified as 0.001. Indeed, with a learning rate of 0.00001 the optimization is better behaved. However, at that learning rate progress is extremely slow, taking over 100,000 iterations to converge. How can we do better?

A brief analysis shows that the gradient of the loss at any given $\beta$ is $G(\beta) = -2X^TY + 2X^TX\beta$. Gradient Descent iteratively updates an initial estimate $\beta_0$ via $\beta_{j+1} = \beta_j - \alpha G(\beta_j)$. Absorbing the constant factor of 2 from the gradient into the step size (so that $\alpha$ below denotes twice the learning rate), a little algebra shows that for any $j>0$

$$ \beta_j = (I - \alpha X^TX)^j \beta_0 + \sum_{i=0}^{j-1} \alpha (I - \alpha X^TX)^i X^T Y $$

Condensing this by writing $A = (I - \alpha X^TX)$ and $B = \alpha X^TY$, this becomes

$$ \beta_j = A^j \beta_0 + \sum_{i=0}^{j-1} A^i B $$
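To see where this comes from, write out the first few iterates of $\beta_{j+1} = A\beta_j + B$:

$$ \beta_1 = A\beta_0 + B, \qquad \beta_2 = A\beta_1 + B = A^2\beta_0 + AB + B, \qquad \beta_3 = A^3\beta_0 + A^2B + AB + B, $$

and the closed form above follows by induction.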

Therefore, if $\beta_j$ is to converge as $j$ increases, $A^j$ must approach the zero matrix. Analogous to real numbers, a matrix raised to successively higher powers converges to zero exactly when its eigenvalues are all less than 1 in absolute value.

When $\alpha = 0.001$, the eigenvalues of $A = I - \alpha X^TX$ are 0.9997 and $-5.5$, so Gradient Descent cannot converge at this learning rate. When $\alpha = 0.00001$, the eigenvalues are 0.99997 and 0.3, so Gradient Descent converges, but the leading eigenvalue is extremely close to 1, making that convergence slow.
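This condition is easy to check numerically. A minimal sketch, reusing `X`, `n`, and NumPy from the session above (the un-normalized feature):

# Build X^T X for the design matrix [1, pop15] and inspect the
# eigenvalue magnitudes of I - alpha * X^T X at both learning rates.
Xfull = np.hstack([np.ones([n, 1]), X])
XtX = Xfull.T @ Xfull

for alpha in (0.001, 0.00001):
    A = np.eye(2) - alpha * XtX
    print(alpha, np.abs(np.linalg.eigvals(A)))

Any magnitude above 1 signals divergence.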

A solution

If $\lambda$ is an eigenvalue of $X^TX$, then $1 - \alpha\lambda$ is an eigenvalue of $A$. $X^TX$ is positive semi-definite (or positive definite), hence its eigenvalues are all $\geq 0$. The Gershgorin Circle Theorem implies that the largest eigenvalue magnitude of a square matrix is at most the largest absolute row sum of that matrix. This bound lets us choose $\alpha$ (and scale the loss) so that every eigenvalue of $A$ stays below 1 in absolute value.
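To illustrate the theorem on this dataset, reusing `XtX` from the check above:

# Gershgorin: every eigenvalue magnitude is bounded by the largest
# absolute row sum of the matrix.
row_bound = np.abs(XtX).sum(axis=1).max()
top_eig = np.abs(np.linalg.eigvals(XtX)).max()
print(top_eig <= row_bound)  # True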

Standard practice for many ML problems is to normalize all numeric features by subtracting the mean and dividing by the standard deviation. It turns out this is good practice for this problem as well. For a numeric feature vector $x$, the normalized version is $\tilde{x} = (x - \bar{x})/sd(x)$, where $sd(x)=\sqrt{\frac{1}{n}\sum (x-\bar{x})^2}$. Note that $n$ is used instead of the more standard $n-1$ here for nicer formulas.
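A quick sanity check of these conventions, reusing `savings` from earlier:

# With the 1/n standard deviation, a normalized column has mean 0
# and sum of squares exactly n.
x = np.asarray(savings["pop15"], dtype=float)
x_t = (x - x.mean()) / np.sqrt(np.mean((x - x.mean())**2))
print(np.isclose(x_t.sum(), 0.0), np.isclose((x_t**2).sum(), len(x)))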

After normalizing all columns of $X$ to get $\tilde{X}$ (and appending a column of 1's to account for an intercept), the following inequalities hold.

$$ \tilde{X}^T\tilde{X} = \begin{bmatrix} n & \sum \tilde{x}_i \\ \sum \tilde{x}_i & n \end{bmatrix} $$

$$ \begin{bmatrix} -n & -n \\ -n & -n \end{bmatrix} \leq \begin{bmatrix} n & \sum \tilde{x}_i \\ \sum \tilde{x}_i & n \end{bmatrix} \leq \begin{bmatrix} n & n \\ n & n \end{bmatrix} $$

where the inequalities are taken element-wise.

Therefore, the eigenvalues of $A$ can be guaranteed to be well-behaved if the following loss function is used $$ \frac{1}{n p} || Y - \tilde{X} \beta ||^2 $$ where $p$ is the number of columns of $X$. Since $\tilde{X}$ is a linear transform of $X$, minimizing this loss is equivalent to solving the original linear regression problem.

Multiplying by $1/n$ is common practice; however, the $1/p$ multiplier needed here to satisfy a condition for convergence of Gradient Descent isn't commonly stated or used.
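Spelling out the bound: every entry of $\tilde{X}^T\tilde{X}$ is at most $n$ in absolute value (the diagonal entries equal $n$; the off-diagonal entries are bounded by Cauchy-Schwarz), and each row has $p$ entries, so Gershgorin gives

$$ \lambda_{\max}\left(\tilde{X}^T\tilde{X}\right) \leq \max_k \sum_{l} \left|\left(\tilde{X}^T\tilde{X}\right)_{kl}\right| \leq pn. $$

Every eigenvalue of $\frac{1}{np}\tilde{X}^T\tilde{X}$ therefore lies in $[0, 1]$, and the update matrix $I - \frac{2\alpha}{np}\tilde{X}^T\tilde{X}$ has eigenvalues in $[1 - 2\alpha, 1]$, which stay within $[-1, 1]$ for any learning rate $\alpha \leq 1$.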

In Practice

Adjustments to the graph:

n = np.shape(savings)[0]

inputs = tf.placeholder(tf.float32,shape=[n,1])
outputs = tf.placeholder(tf.float32,shape=[n,1])

beta_coefs = tf.Variable(tf.random_normal((2,1)))

Xmat = tf.concat([tf.ones([n,1]),inputs],axis=1)
yhat = tf.matmul(Xmat,beta_coefs)
p = tf.cast(tf.shape(Xmat)[1],tf.float32)

# Scale the loss by 1/(n*p) to keep the descent stable.
loss = tf.reduce_sum(tf.square(yhat - outputs)) / (n * p)
squared_error = loss * n * p  # recover the raw sum of squared errors

# Now that the eigenvalues of the update are under control, the
# learning rate can be set to one.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1).minimize(loss)
init = tf.global_variables_initializer()

Conducting the gradient descent, this time with normalized input columns:

ses = tf.Session()
ses.run(init)

## Normalize X
X = np.asarray(savings["pop15"]).reshape([n,1])
X = (X - np.mean(X))/np.sqrt(np.mean(np.square(X - np.mean(X))))
Y = np.asarray(savings["sr"]).reshape([n,1])

nsteps = 10
nreports = 1

for i in range(nsteps):
    op_out, sq_err = ses.run([optimizer,squared_error],feed_dict={inputs:X,outputs:Y})

    if (i % nreports) == 0:
        print(sq_err)
print("Final coefficients",ses.run([beta_coefs]))

Output:

6068.63
779.511
779.511
779.511
779.511
779.511
779.511
779.511
779.511
779.511
Final coefficients [array([[ 9.67099953],
       [-2.02048302]], dtype=float32)]

Now we get convergence very quickly. And we can verify using sklearn that this squared error is indeed the minimum squared error for this $X$ and $Y$!
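For completeness, one way that check might look, assuming scikit-learn is installed (this snippet is not part of the graph above; `X` is the normalized column):

# Closed-form least squares via sklearn; the residual sum of squares
# should match the 779.511 reached by Gradient Descent.
from sklearn.linear_model import LinearRegression

fit = LinearRegression().fit(X, Y)
resid = Y - fit.predict(X)
print("Minimum squared error:", np.sum(np.square(resid)))
print("Coefficients:", fit.intercept_, fit.coef_)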