梯度下降在机器学习领域是用得比较多的算法.简单的说就是用一大堆数据去解一个未知的方程,亦或是用一大堆数据取拟合一个方程. $ $

基本概念

首先我们知道一堆数据,但是不知道他们的具体关系(为了方便起见,假设是线性关系). \[ x_{1}^{(1)},x_{2}^{(1)},\cdots,x_{n}^{(1)},y^{(1)}\\ \vdots\\ x_{1}^{(m)},x_{2}^{(m)},\cdots,x_{n}^{(m)},y^{(m)} \] 由于我们假设这些数据是线性关系,因此不妨设其方程为:

\[ h_{\theta}(x)=\theta_{0}*1+\theta_{1}x_{1}+\theta_{2}x_{2}+\dots+\theta_{n}x_{n} \]

这里的$h$其实代指的是$y$,只不过$y$在后面用到了,为了不重复,就用$h$了.我们把其写成向量的形式:

\[ h(x)=\sum_{i=0}^{n}\theta_{i}x_{i}=\theta ^{T}x \]

接下来讲一下,如何求得$\theta$,其实就是要这个方程能尽可能的表示这些数据.那么只要使得$y$和$h$之间的差距尽可能的小就好了,于是构建下面的损失函数(Cost Function)

\[ J(\theta)=\frac{1}{2}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2} \]

很显然$J(\theta)$是凸函数,有极小值.我们只要想办法使这个极小值趋向于0就好了.接下来就是梯度下降(gradient descent)的核心步骤了

\[ \theta_{j}:=\theta_{j}-\alpha\frac{\partial }{\partial \theta_{j}}J(\theta) \]

稍微解释一下,上式的意思就是使得$\theta_{j}$几乎不在变化(就是小于某个阀值).其中$\alpha$是学习率,是固定的.但是这个公式的严格证明小编不太会了.最后写出上述的最简表达式

具体的证明可以看一下Andrew Ng机器学习的课程

\[ \theta_{j}:=\theta_{j}+\alpha(y^{(i)}-h_{\theta}(x^{(i)}))x_{j}^{(i)} \]

批量梯度下降

批量梯度下降(bath gradient descent)顾名思义就是每一次运算都将所有的数据投入运算中.这里给出算法的公式

Repeat until convergence{

\[ \theta_{j}:=\theta_{j}+\alpha\sum_{i=1}^{m}(y^{(i)}-h_{\theta}(x^{(i)}))x_{j}^{(i)} \quad(for\quad every\quad j) \]

}
这些理论的公式,最终还是要写出算法,代码才是王道.

def Gradient_Descent_Batch(X,y):
    """
    batch gradient descent
    :type X:    numpy.array
    :param X:   dataset(m*n)
    :type y:    numpy.array
    :param y:   labels(m*1)
    """
    #  insert one column with 1(这里插入一列是因为x0也是数据的一部分)
    X=np.insert(X,0,1,axis=1)
    m=X.shape[0]
    n=X.shape[1]
    # learning rate(这个选择很重要,要不然无法收敛)
    alpha=0.0001
    max_interation=100
    # init theta
    theta=np.ones((n,1))
    for num in range(0,max_interation):
        grad=X.T.dot((X.dot(theta)-y))
        old_theta=theta
        theta=theta-alpha*grad
        diff_theta=theta-old_theta
        if(np.sqrt(diff_theta.T.dot(diff_theta))<0.00000001):
            break
    return theta

随机梯度下降

随机梯度下降(stochastic gradient descent)与批量梯度下降最大的区别就是每次计算只取一组数据.我们还是先给公式
Loop{
for i =1 to m {

\[ \theta_{j}:=\theta_{j}+\alpha(y^{(i)}-h_{\theta}(x^{(i)}))x_{j}^{(i)} \]

}
}
既然公式完了,那么我们就上代码

def Gradient_Descent_Rand(X,y):
    """
    stochastic gradient descent
    :type X:    numpy.array
    :param X:   a dataset(m*n)
    :type y:    numpy.array
    :param y:   a labels(m*1)
    """
    X=np.insert(X,0,1,axis=1)
    m=X.shape[0]
    n=X.shape[1]
    theta=np.zeros((n,1))
    # ndarray 转 matrix
    X=np.matrix(X)
    y=np.matrix(y)
    theta=np.matrix(theta)
    # 学习率
    alpha=0.000001
    max_interation=100000
    for num in range(0,max_interation):
        old_theta=theta
        # 随机指定一个样本
        j=randint(0,m-1)
        # 这里就体现了批量和随机的区别
        grad=X[j,:].T*(X[j,:]*theta-y[j,:])
        theta=theta-alpha*grad
        diff_theta=theta-old_theta
        if(np.sqrt(diff_theta.T*diff_theta)<0.00000001):
            break
    return theta