前段时间,写了梯度下降的相关算法.其实最小化还有很多其他得方法.这里主要介绍Normal Equation和牛顿法(Newton’s method) $ $

Normal Equation

其实它就是一种解析解的形式,我们还记得那个用来拟合数据得方程

\[ h(x)=\theta^Tx \]

如果不使用$h$,直接使用$y$,以及我们使用的是所用得数据,而不是单个sample,那么新的表达式:

\[ X\theta=y \]

Andrew Ng的课程解释了一大堆,最后求出结果,这里介绍一个容易记忆得方法.上式左乘$X^T$.

\[ X^TX\theta=X^Ty \]

只要保证$X$的列无关,就可以保证$X^TX$的逆存在,因为$X$是我们选择得数据,因此$X$列无关是完全可以做到的.那么:

\[ \theta=(X^TX)^{-1}X^Ty \]

如此以来,代码就很简单了

def Normal_Equation(X,y):
	"""
    batch gradient descent
    :type X:    numpy.array
    :param X:   a dataset(m*n)
    :type y:    numpy.array
    :param y:   a labels(m*1)
    """
	#  insert one column with 1
    X=np.insert(X,0,1,axis=1)
    # ndarray 转 matrix
    X_mat=np.matrix(X)
    tmp=(X_mat.T*X_mat).I*X_mat.T
    tmp=np.array(tmp)
    theta=tmp.dot(y)
    return theta

Newton’s method

牛顿法其实是一种求零点的方法.假设有一函数$l(\theta)$,看一下它的实现过程.

下面写出其的迭代方程

\[ \theta:=\theta-\frac{f(\theta)}{f^{'}(\theta)} \]

不过,这和我们拟合得方程又有什么关系了?这里不妨回想一下梯度下降是怎么求$\theta$的,首先我们还是写出梯度下降得迭代方程

\[ \theta_{j}:=\theta_{j}-\alpha \frac{\partial }{\partial \theta_{j}}J(\theta) \]

我们知道$J(\theta)$是凸函数,上式其实就是在找$J^{'}(\theta)$趋向于0时$\theta$的值,那么我们对$J^{'}(\theta)$进行牛顿法,不也是可以求得$\theta$.接下来给出牛顿法的表达式

\[ \theta:=\theta-\frac{J^{'}(\theta)}{J^{''}(\theta)} \]

化简

\[ \theta:=\theta-H^{-1}\triangledown_{\theta}J(\theta)\\ H_{ij}=\frac{\partial^2 J(\theta)}{\partial \theta_{i}\partial \theta_{j}} \]

最终我们要写出代码,这里还是写出$J^{'}(\theta)$和$J^{''}(\theta)$的具体表达式

\[ J^{'}(\theta)=(h_{\theta}(x^{(k)})-y^{(k)})x_{i}^{k} \qquad (k\quad present\quad k\text{-}th\quad sample)\\ J^{''}(\theta)=x_{j}^{k}x_{i}^{k} \]

公式也上了一大堆了,我还是喜欢代码

def Newton_Method(X,y):
	"""
    batch gradient descent
    :type X:    numpy.array
    :param X:   a dataset(m*n)
    :type y:    numpy.array
    :param y:   a labels(m*1)
    """
    X=np.insert(X,0,1,axis=1)
    m=X.shape[0]
    n=X.shape[1]
    # init theta
    theta=np.zeros((n,1))
    max_iter=100
    # ndarray 转 matrix
    X=np.matrix(X)
    y=np.matrix(y)
    theta=np.matrix(theta)
    for num in range(0,max_iter):
        old_theta=theta
        grad=X.T*(X*theta-y)
        H=X.T*X
        theta=theta-H.I*grad
        diff_theta=theta-old_theta
        if(np.sqrt(diff_theta.T*diff_theta)<0.000001):
            break
    return theta