Deep Learning Study Notes

ZL Asica, 2023-11-18

These notes are based on https://www.coursera.org/specializations/deep-learning

LaTeX may not render correctly in some viewers.

1.1 Introduction to Deep Learning

1.1.1 Supervised Learning with Deep Learning

  • Structured data: tables of data (e.g., database records).
  • Unstructured data: audio, images, text.

1.1.2 Scale drives deep learning progress

  • As the amount of data grows, a larger neural network performs better and better compared to a smaller network or a traditional learning algorithm.
  • Switching from sigmoid to ReLU makes gradient descent much faster, since the gradient no longer goes to 0 over most of the input range.

1.2 Basics of Neural Network Programming

1.2.1 Binary Classification

  • Input: $x \in \mathbb{R}^{n_x}$
  • Output: $y \in \{0, 1\}$

1.2.2 Logistic Regression

  • Given $x$, want $\hat{y} = P(y=1 \mid x)$

  • Input: $x \in \mathbb{R}^{n_x}$

  • Parameters: $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$

  • Output: $\hat{y} = \sigma(w^T x + b)$

    • $\sigma(z) = \dfrac{1}{1+e^{-z}}$
    • If $z$ is large, $\sigma(z) \approx \dfrac{1}{1+0} \approx 1$
    • If $z$ is a large negative number, $\sigma(z) \approx \dfrac{1}{1+\text{(big number)}} \approx 0$
  • Loss (error) function:

    • $\hat{y} = \sigma(w^T x + b)$, where $\sigma(z) = \dfrac{1}{1+e^{-z}}$

      • $z^{(i)} = w^T x^{(i)} + b$
    • Want $\hat{y}^{(i)} \approx y^{(i)}$

    • $L(\hat{y}, y) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]$

      • If $y=1$: $L(\hat{y}, y) = -\log\hat{y}$ <- want $\log\hat{y}$ as large as possible, i.e., $\hat{y}$ large
      • If $y=0$: $L(\hat{y}, y) = -\log(1-\hat{y})$ <- want $\log(1-\hat{y})$ as large as possible, i.e., $\hat{y}$ small
  • Cost function

    • $J(w, b) = \dfrac{1}{m}\sum\limits_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = -\dfrac{1}{m}\sum\limits_{i=1}^{m}\left[y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})\right]$
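
A minimal numpy sketch of the cost above (the helper name `logistic_cost` is my own, not from the course):

```python
import numpy as np

def logistic_cost(w, b, X, Y):
    """Cost J(w, b) for logistic regression.
    X: (n_x, m) inputs; Y: (1, m) labels in {0, 1}; w: (n_x, 1); b: scalar."""
    Y_hat = 1 / (1 + np.exp(-(np.dot(w.T, X) + b)))  # sigmoid(w^T x + b), (1, m)
    # Average cross-entropy loss over the m examples
    return -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
```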

1.2.3 Gradient Descent

  • Repeat: $w := w - \alpha\dfrac{\partial J(w,b)}{\partial w}$; $b := b - \alpha\dfrac{\partial J(w,b)}{\partial b}$
    • $\alpha$: learning rate
    • On the right side of the minimum, $\dfrac{dJ(w)}{dw} > 0$; on the left side, $\dfrac{dJ(w)}{dw} < 0$
  • Logistic regression gradient descent
    • $x_1, x_2, w_1, w_2, b$
      • $z = w_1 x_1 + w_2 x_2 + b$ -> $a = \sigma(z)$ -> $L(a, y)$
      • $da = \dfrac{dL(a,y)}{da} = -\dfrac{y}{a} + \dfrac{1-y}{1-a}$
        • $\dfrac{dL(a,y)}{da} = \dfrac{d}{da}\left(-y\log(a) - (1-y)\log(1-a)\right)$
        • $\dfrac{d}{da}(-y\log(a)) = -\dfrac{y}{a}$
        • $\dfrac{d}{da}(-(1-y)\log(1-a)) = -\dfrac{1-y}{1-a} \times (-1) = \dfrac{1-y}{1-a}$
        • $= -\dfrac{y}{a} + \dfrac{1-y}{1-a}$
      • $dz = \dfrac{dL}{dz} = \dfrac{dL(a,y)}{dz} = a - y$
        • $= \dfrac{dL}{da}\cdot\dfrac{da}{dz}$, where $\dfrac{da}{dz} = a(1-a)$
      • $\dfrac{dL}{dw_1} = "dw_1" = x_1\, dz$
      • $\dfrac{dL}{dw_2} = "dw_2" = x_2\, dz$
      • $db = dz$
  • Gradient descent on $m$ examples (a runnable version follows this list)
    • $J(w, b) = \dfrac{1}{m}\sum\limits_{i=1}^{m} L(a^{(i)}, y^{(i)})$
    • $\dfrac{\partial}{\partial w_1}J(w,b) = \dfrac{1}{m}\sum\limits_{i=1}^{m}\dfrac{\partial}{\partial w_1}L(a^{(i)}, y^{(i)})$
    • Initialize $J=0;\ dw_1=0;\ dw_2=0;\ db=0$
      • for $i=1$ to $m$:
        • $z^{(i)} = w^T x^{(i)} + b$
        • $a^{(i)} = \sigma(z^{(i)})$
        • $J \mathrel{+}= -[y^{(i)}\log a^{(i)} + (1-y^{(i)})\log(1-a^{(i)})]$
        • $dz^{(i)} = a^{(i)} - y^{(i)}$
        • $dw_1 \mathrel{+}= x_1^{(i)} dz^{(i)}$ (for $n = 2$)
        • $dw_2 \mathrel{+}= x_2^{(i)} dz^{(i)}$ (for $n = 2$)
        • $db \mathrel{+}= dz^{(i)}$
      • $J /= m;\ dw_1 /= m;\ dw_2 /= m;\ db /= m$
      • $dw_1 = \dfrac{\partial J}{\partial w_1};\ dw_2 = \dfrac{\partial J}{\partial w_2}$
        • $w_1 := w_1 - \alpha\, dw_1$
        • $w_2 := w_2 - \alpha\, dw_2$
        • $b := b - \alpha\, db$
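
The same loop transcribed into runnable numpy for $n = 2$ features (a sketch; the function name and scaffolding are mine):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_step(w, b, X, Y, alpha):
    """One gradient-descent step over m examples, written as explicit loops.
    X: (2, m), Y: (m,), w: (2,), b: scalar."""
    m = X.shape[1]
    J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
    for i in range(m):
        z = w[0] * X[0, i] + w[1] * X[1, i] + b   # z(i) = w^T x(i) + b
        a = sigmoid(z)                            # a(i) = sigma(z(i))
        J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
        dz = a - Y[i]                             # dz(i) = a(i) - y(i)
        dw1 += X[0, i] * dz
        dw2 += X[1, i] * dz
        db += dz
    J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m
    w = w - alpha * np.array([dw1, dw2])          # w := w - alpha * dw
    b = b - alpha * db                            # b := b - alpha * db
    return w, b, J
```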

1.2.4 Computational Graph

  • $J(a,b,c) = 3(a + bc)$

    • $u = bc$
    • $v = a + u$
    • $J = 3v$
    • Computed left to right (forward pass).
  • Derivatives with a Computation Graph

    • $\dfrac{dJ}{dv} = 3$
      • $\dfrac{dv}{da} = 1$
      • Chain rule: $\dfrac{dJ}{da} = \dfrac{dJ}{dv}\cdot\dfrac{dv}{da} = 3$
      • $\dfrac{dJ}{du} = 3$; $\dfrac{du}{db} = c = 2$; so $\dfrac{dJ}{db} = \dfrac{dJ}{du}\cdot\dfrac{du}{db} = 6$
      • $\dfrac{du}{dc} = b = 3$; so $\dfrac{dJ}{dc} = \dfrac{dJ}{du}\cdot\dfrac{du}{dc} = 9$
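
A quick numeric check of the graph and the chain rule, using the lecture's example values $a=5, b=3, c=2$ (so $J = 33$):

```python
# Forward pass through the graph J = 3(a + bc), then a numerical check of dJ/db.
def J(a, b, c):
    u = b * c        # u = bc
    v = a + u        # v = a + u
    return 3 * v     # J = 3v

eps = 1e-6
dJ_db = (J(5, 3 + eps, 2) - J(5, 3 - eps, 2)) / (2 * eps)
print(J(5, 3, 2), round(dJ_db, 6))  # 33, 6.0  (matches dJ/db = 6)
```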

1.2.5 Vectorization

  • Avoid explicit for-loops.

  • $J=0;\ dw = np.zeros((n_x, 1));\ db=0$

    • for $i=1$ to $m$:
      • $z^{(i)} = w^T x^{(i)} + b$
      • $a^{(i)} = \sigma(z^{(i)})$
      • $J \mathrel{+}= -[y^{(i)}\log a^{(i)} + (1-y^{(i)})\log(1-a^{(i)})]$
      • $dz^{(i)} = a^{(i)} - y^{(i)}$
      • $dw \mathrel{+}= x^{(i)} dz^{(i)}$
      • $db \mathrel{+}= dz^{(i)}$
    • $J /= m;\ dw /= m;\ db /= m$
  • `Z = np.dot(w.T, X) + b`; $b$ is $(1,1)$ -> broadcasting

  • Vectorizing logistic regression (a full numpy version follows at the end of this section)

    • $dz^{(1)} = a^{(1)} - y^{(1)};\ dz^{(2)} = a^{(2)} - y^{(2)};\ \dots$
    • $dZ = [dz^{(1)}, dz^{(2)}, \dots, dz^{(m)}]$, a $1\times m$ row vector
    • $A = [a^{(1)}, a^{(2)}, \dots, a^{(m)}]$; $Y = [y^{(1)}, y^{(2)}, \dots, y^{(m)}]$
    • $dZ = A - Y = [a^{(1)} - y^{(1)}, a^{(2)} - y^{(2)}, \dots]$
    • Get rid of $db$ and $dw$ in the for loop:
      • $db = \dfrac{1}{m}\sum\limits_{i=1}^{m} dz^{(i)} = \dfrac{1}{m}\, np.sum(dZ)$
      • $dw = \dfrac{1}{m} X\, dZ^T = \dfrac{1}{m}[x^{(1)} \dots x^{(m)}]\begin{bmatrix}dz^{(1)} \\ \vdots \\ dz^{(m)}\end{bmatrix} = \dfrac{1}{m}[x^{(1)}dz^{(1)} + \dots + x^{(m)}dz^{(m)}]$, an $n\times 1$ vector
    • New form of logistic regression (one iteration):
      • $Z = w^T X + b = np.dot(w.T, X) + b$
      • $A = \sigma(Z)$
      • $dZ = A - Y$
      • $dw = \dfrac{1}{m} X\, dZ^T$
      • $db = \dfrac{1}{m}\, np.sum(dZ)$
      • $w := w - \alpha\, dw$
      • $b := b - \alpha\, db$
  • Broadcasting (same as bsxfun in MATLAB/Octave)

    • $(m,n)$ [+ - * /] $(1,n)$ -> $(m,n)$: the single row is copied to all $m$ rows.
    • $(m,n)$ [+ - * /] $(m,1)$ -> $(m,n)$: the single column is copied to all $n$ columns.
    • Don't use `a = np.random.randn(5)`; it gives `a.shape = (5,)`, a "rank 1 array".
    • Use `a = np.random.randn(5,1)` or `a = np.random.randn(1,5)` instead.
    • Check with `assert(a.shape == (5,1))`.
    • Fix a rank 1 array with `a = a.reshape((5,1))`.
  • Logistic regression cost function (justification)

    • Loss
      • $p(y|x) = \hat{y}^y(1-\hat{y})^{(1-y)}$
      • If $y=1$: $p(y|x) = \hat{y}$
      • If $y=0$: $p(y|x) = 1-\hat{y}$
      • $\log p(y|x) = \log\hat{y}^y(1-\hat{y})^{(1-y)} = y\log\hat{y} + (1-y)\log(1-\hat{y}) = -L(\hat{y}, y)$
    • Cost
      • $\log p(\text{labels in training set}) = \log\prod\limits_{i=1}^{m} p(y^{(i)}|x^{(i)})$
      • $\log p(\text{labels in training set}) = \sum\limits_{i=1}^{m}\log p(y^{(i)}|x^{(i)}) = -\sum\limits_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$
      • Use maximum likelihood estimation (MLE): maximizing the log-likelihood is the same as minimizing the summed loss.
      • Cost (minimize): $J(w,b) = \dfrac{1}{m}\sum\limits_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$
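
Putting 1.2.5 together: a minimal vectorized training loop (the function name, learning rate, and iteration count are my own illustrative choices; the scalar $b$ relies on broadcasting):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logistic(X, Y, alpha=0.1, iters=1000):
    """Vectorized logistic regression. X: (n_x, m), Y: (1, m)."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(iters):
        Z = np.dot(w.T, X) + b        # (1, m); scalar b broadcasts
        A = sigmoid(Z)                # (1, m)
        dZ = A - Y                    # (1, m)
        dw = np.dot(X, dZ.T) / m      # (n_x, 1) = (1/m) X dZ^T
        db = np.sum(dZ) / m
        w -= alpha * dw
        b -= alpha * db
    return w, b
```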

1.3 Shallow Neural Networks

1.3.1 Neural Network Representation

  • (Figure: deep-learning-notes_1-3-1)

  • Input layer, hidden layer, output layer

    • $a^{[0]} = x$ -> $a^{[1]} = [a^{[1]}_1, a^{[1]}_2, a^{[1]}_3, a^{[1]}_4]^T$ -> $a^{[2]}$
    • Layers are counted as # of hidden layers + the output layer (the input layer is not counted).
  • $x_1, x_2, x_3$ -> 4 hidden nodes -> output layer

    • First hidden node: $z^{[1]}_1 = w^{[1]T}_1 x + b^{[1]}_1$, $a^{[1]}_1 = \sigma(z^{[1]}_1)$
    • Second hidden node: $z^{[1]}_2 = w^{[1]T}_2 x + b^{[1]}_2$, $a^{[1]}_2 = \sigma(z^{[1]}_2)$
    • Third hidden node: $z^{[1]}_3 = w^{[1]T}_3 x + b^{[1]}_3$, $a^{[1]}_3 = \sigma(z^{[1]}_3)$
    • Fourth hidden node: $z^{[1]}_4 = w^{[1]T}_4 x + b^{[1]}_4$, $a^{[1]}_4 = \sigma(z^{[1]}_4)$
  • Vectorization

    • $W^{[1]} = \begin{bmatrix} -w^{[1]T}_1- \\ -w^{[1]T}_2- \\ -w^{[1]T}_3- \\ -w^{[1]T}_4- \end{bmatrix}$, a $(4,3)$ matrix
    • $z^{[1]} = \begin{bmatrix} -w^{[1]T}_1- \\ -w^{[1]T}_2- \\ -w^{[1]T}_3- \\ -w^{[1]T}_4- \end{bmatrix}\cdot\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \\ b^{[1]}_4 \end{bmatrix} = \begin{bmatrix} w^{[1]T}_1 x + b^{[1]}_1 \\ w^{[1]T}_2 x + b^{[1]}_2 \\ w^{[1]T}_3 x + b^{[1]}_3 \\ w^{[1]T}_4 x + b^{[1]}_4 \end{bmatrix} = \begin{bmatrix} z^{[1]}_1 \\ z^{[1]}_2 \\ z^{[1]}_3 \\ z^{[1]}_4 \end{bmatrix}$
    • $a^{[1]} = \begin{bmatrix} a^{[1]}_1 \\ a^{[1]}_2 \\ a^{[1]}_3 \\ a^{[1]}_4 \end{bmatrix} = \sigma(z^{[1]})$
    • $z^{[2]} = W^{[2]}\cdot a^{[1]} + b^{[2]}$, shapes: $(1,1) = (1,4)\cdot(4,1) + (1,1)$
    • $a^{[2]} = \sigma(z^{[2]})$, shapes: $(1,1), (1,1)$
    • $a^{[2](i)}$: layer $2$, example $i$
  • for $i=1$ to $m$:

    • $z^{[1](i)} = W^{[1]}\cdot x^{(i)} + b^{[1]}$
    • $a^{[1](i)} = \sigma(z^{[1](i)})$
    • $z^{[2](i)} = W^{[2]}\cdot a^{[1](i)} + b^{[2]}$
    • $a^{[2](i)} = \sigma(z^{[2](i)})$
  • Vectorizing the above for loop (see the numpy sketch after this section)

    • $X = \begin{bmatrix} | & | & & | \\ x^{(1)} & x^{(2)} & \dots & x^{(m)} \\ | & | & & | \end{bmatrix}$, an $(n_x, m)$ matrix
    • $Z^{[1]} = W^{[1]}\cdot X + b^{[1]}$
    • $A^{[1]} = \sigma(Z^{[1]})$
    • $Z^{[2]} = W^{[2]}\cdot A^{[1]} + b^{[2]}$
    • $A^{[2]} = \sigma(Z^{[2]})$
    • Horizontally: training examples; vertically: hidden units.
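
A numpy sketch of the vectorized forward pass above, with the shapes from this example ($n_x = 3$, 4 hidden units, 1 output; the random data is only for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

m = 5                                    # number of training examples
X = np.random.randn(3, m)                # A[0] = X, shape (n_x, m)
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4), np.zeros((1, 1))

Z1 = np.dot(W1, X) + b1                  # (4, m); b1 broadcasts across columns
A1 = sigmoid(Z1)                         # (4, m)
Z2 = np.dot(W2, A1) + b2                 # (1, m)
A2 = sigmoid(Z2)                         # (1, m): one prediction per example
```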

1.3.2 Activation Functions

  • $g^{[i]}$: activation function of layer $i$ (numpy implementations are sketched after this section)

    • Sigmoid: $a = \dfrac{1}{1+e^{-z}}$
    • Tanh: $a = \dfrac{e^z - e^{-z}}{e^z + e^{-z}}$
    • ReLU: $a = \max(0, z)$
    • Leaky ReLU: $a = \max(0.01z, z)$
  • Rules for choosing an activation function

    1. If the output is in $\{0, 1\}$ (binary classification), use sigmoid for the output layer.
    2. Otherwise, ReLU is the default choice.
  • Why a non-linear activation function is needed

    • With linear hidden layers, having multiple hidden layers is useless: the whole network collapses to $a = w'x + b'$, still a linear function of the input.
    • A linear activation is sometimes used at the output layer (e.g., for regression), but with non-linear activations in the hidden layers.
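
A sketch of the four activations in numpy, with a quick check of their ranges:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)                    # same as (e^z - e^-z) / (e^z + e^-z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))      # [0.119 0.5   0.881] -> squashes into (0, 1)
print(tanh(z))         # [-0.964 0.    0.964] -> squashes into (-1, 1)
print(relu(z))         # [0. 0. 2.]
print(leaky_relu(z))   # [-0.02  0.    2.  ]
```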

1.3.3 Forward and Backward Propagation

  • Derivatives of activation functions

    • Sigmoid: $g'(z) = \dfrac{d}{dz}g(z) = \dfrac{1}{1+e^{-z}}\left(1 - \dfrac{1}{1+e^{-z}}\right) = g(z)(1-g(z)) = a(1-a)$
    • Tanh: $g'(z) = \dfrac{d}{dz}g(z) = 1 - (\tanh(z))^2$
    • ReLU: $g'(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$ (technically undefined at $z = 0$; in practice either value works)
    • Leaky ReLU: $g'(z) = \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$
  • Gradient descent for neural networks

    • Parameters: $W^{[1]}: (n^{[1]}, n^{[0]}),\ b^{[1]}: (n^{[1]}, 1),\ W^{[2]}: (n^{[2]}, n^{[1]}),\ b^{[2]}: (n^{[2]}, 1)$
    • $n_x = n^{[0]}$ (input units), $n^{[1]}$ (hidden units), $n^{[2]} = 1$ (output unit)
    • Cost function: $J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \dfrac{1}{m}\sum\limits_{i=1}^{m} L(\hat{y}, y)$
  • Forward propagation:

    • $Z^{[1]} = W^{[1]}\cdot X + b^{[1]}$
    • $A^{[1]} = g^{[1]}(Z^{[1]})$
    • $Z^{[2]} = W^{[2]}\cdot A^{[1]} + b^{[2]}$
    • $A^{[2]} = g^{[2]}(Z^{[2]}) = \sigma(Z^{[2]})$
  • Backpropagation:

    • $dZ^{[2]} = A^{[2]} - Y$, where $Y = [y^{(1)}, y^{(2)}, \dots, y^{(m)}]$

    • $dW^{[2]} = \dfrac{1}{m} dZ^{[2]} A^{[1]T}$

    • $db^{[2]} = \dfrac{1}{m}\, np.sum(dZ^{[2]}, axis=1, keepdims=True)$

    • $dZ^{[1]} = W^{[2]T} dZ^{[2]} * g'^{[1]}(Z^{[1]})$

      • $(n^{[1]}, m)$ -> element-wise product -> $(n^{[1]}, m)$
    • $dW^{[1]} = \dfrac{1}{m} dZ^{[1]} X^T$

    • $db^{[1]} = \dfrac{1}{m}\, np.sum(dZ^{[1]}, axis=1, keepdims=True)$

  • Random initialization (a full training-step sketch follows this list)

    • $x_1, x_2$ -> $a^{[1]}_1, a^{[1]}_2$ -> $a^{[2]}_1$ -> $\hat{y}$
    • $W^{[1]} = np.random.randn(2, 2) * 0.01$ (small random values; all-zero initialization would make every hidden unit compute the same function, so the symmetry must be broken)
    • $b^{[1]} = np.zeros((2, 1))$
    • $W^{[2]} = np.random.randn(1, 2) * 0.01$
    • $b^{[2]} = 0$
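
Putting 1.3.3 together: one full training step for the 2-layer network (a sketch assuming a tanh hidden layer and sigmoid output, as in this chapter; the dict layout and function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def init_params(n_x, n_h):
    """Random init: small random W (breaks symmetry), zero b."""
    return {"W1": np.random.randn(n_h, n_x) * 0.01, "b1": np.zeros((n_h, 1)),
            "W2": np.random.randn(1, n_h) * 0.01, "b2": np.zeros((1, 1))}

def train_step(p, X, Y, alpha=1.0):
    """One forward/backward pass. X: (n_x, m), Y: (1, m)."""
    m = X.shape[1]
    Z1 = np.dot(p["W1"], X) + p["b1"]             # forward propagation
    A1 = np.tanh(Z1)
    Z2 = np.dot(p["W2"], A1) + p["b2"]
    A2 = sigmoid(Z2)
    dZ2 = A2 - Y                                  # backpropagation
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(p["W2"].T, dZ2) * (1 - A1 ** 2)  # g'[1] for tanh = 1 - a^2
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    for k, g in zip(("W1", "b1", "W2", "b2"), (dW1, db1, dW2, db2)):
        p[k] -= alpha * g                         # gradient-descent update
    return p
```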

1.4 Deep Neural Networks

1.4.1 Deep L-Layer Neural Network

  • Deep neural network notation
    • (Figure: deep-learning-notes_1-4-1)
    • $L = 4$ (# of layers)
    • $n^{[l]} =$ # of units in layer $l$
      • $n^{[1]} = 5,\ n^{[2]} = 5,\ n^{[3]} = 3,\ n^{[4]} = n^{[L]} = 1$
      • $n^{[0]} = n_x = 3$
    • $a^{[l]} =$ activations in layer $l$
    • $a^{[l]} = g^{[l]}(z^{[l]})$; $w^{[l]} =$ weights for $z^{[l]}$; $b^{[l]} =$ bias for $z^{[l]}$
    • $x = a^{[0]}$, $\hat{y} = a^{[L]}$

1.4.2 Forward Propagation in a Deep Network

  • General rule: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$, $A^{[l]} = g^{[l]}(Z^{[l]})$

    • $x$: $z^{[1]} = w^{[1]} a^{[0]} + b^{[1]}$, $a^{[1]} = g^{[1]}(z^{[1]})$, with $a^{[0]} = x$
    • $z^{[2]} = w^{[2]} a^{[1]} + b^{[2]}$, $a^{[2]} = g^{[2]}(z^{[2]})$
    • ...
    • $z^{[4]} = w^{[4]} a^{[3]} + b^{[4]}$, $a^{[4]} = g^{[4]}(z^{[4]}) = \hat{y}$
  • Vectorized (a loop-over-layers sketch follows at the end of this section):

    • $Z^{[1]} = W^{[1]} A^{[0]} + b^{[1]}$, $A^{[1]} = g^{[1]}(Z^{[1]})$, with $A^{[0]} = X$
    • $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$, $A^{[2]} = g^{[2]}(Z^{[2]})$
    • $\hat{Y} = g^{[4]}(Z^{[4]}) = A^{[4]}$
  • Matrix dimensions

    • (Figure: deep-learning-notes_1-4-2)
    • $z^{[1]} = w^{[1]}\cdot x + b^{[1]}$
    • $z^{[1]}: (3,1),\ w^{[1]}: (3,2),\ x: (2,1),\ b^{[1]}: (3,1)$
    • $z^{[1]}: (n^{[1]}, 1),\ w^{[1]}: (n^{[1]}, n^{[0]}),\ x: (n^{[0]}, 1),\ b^{[1]}: (n^{[1]}, 1)$
    • $w^{[l]}, dw^{[l]}: (n^{[l]}, n^{[l-1]});\ b^{[l]}, db^{[l]}: (n^{[l]}, 1)$
    • $z^{[l]}, a^{[l]}: (n^{[l]}, 1)$; vectorized: $Z^{[l]}, dZ^{[l]}, A^{[l]}, dA^{[l]}: (n^{[l]}, m)$; for $l=0$: $A^{[0]} = X: (n^{[0]}, m)$
  • Why deep representations?

    • Earlier layers learn simple features; later, deeper layers put them together to detect more complex things.
    • Circuit theory and deep learning, informally: there are functions you can compute with a "small" L-layer deep neural network that shallower networks require exponentially more hidden units to compute.
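
A sketch of the general rule above as a loop over layers (the parameter-dict layout and the ReLU-for-hidden-layers choice are my own illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def forward(X, params, L):
    """General forward pass: Z[l] = W[l] A[l-1] + b[l], A[l] = g[l](Z[l]).
    params holds W1..WL and b1..bL; ReLU hidden layers, sigmoid output."""
    A = X                                        # A[0] = X
    for l in range(1, L + 1):
        Z = np.dot(params["W" + str(l)], A) + params["b" + str(l)]
        A = sigmoid(Z) if l == L else relu(Z)    # g[L] = sigmoid, else ReLU
    return A                                     # A[L] = Y_hat, shape (1, m)
```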

1.4.3 Building Blocks of Deep Neural Networks

  • Forward and backward functions

    • (Figure: deep-learning-notes_1-4-3)
    • Layer $l$: $w^{[l]}, b^{[l]}$
    • Forward: input $a^{[l-1]}$, output $a^{[l]}$
      • $z^{[l]} = w^{[l]} a^{[l-1]} + b^{[l]}$, cache $z^{[l]}$
      • $a^{[l]} = g^{[l]}(z^{[l]})$
    • Backward: input $da^{[l]}$ and cache ($z^{[l]}$), output $da^{[l-1]}, dw^{[l]}, db^{[l]}$
  • One iteration of gradient descent for the neural network

    • (Figure: deep-learning-notes_1-4-3-2)
  • How to implement? (A paired-function sketch follows this list.)

    • Forward propagation for layer $l$

      • Input $a^{[l-1]}$; output $a^{[l]}$, cache ($z^{[l]}$)

        • $z^{[l]} = w^{[l]} a^{[l-1]} + b^{[l]}$
        • $a^{[l]} = g^{[l]}(z^{[l]})$
      • Vectorized:

        • $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$
        • $A^{[l]} = g^{[l]}(Z^{[l]})$
    • Backward propagation for layer $l$

      • Input $da^{[l]}$ and cache ($z^{[l]}$); output $da^{[l-1]}, dw^{[l]}, db^{[l]}$

        • $dz^{[l]} = da^{[l]} * g'^{[l]}(z^{[l]})$
        • $dw^{[l]} = dz^{[l]}\cdot a^{[l-1]T}$
        • $db^{[l]} = dz^{[l]}$
        • $da^{[l-1]} = w^{[l]T}\cdot dz^{[l]}$
      • Vectorized:

        • $dZ^{[l]} = dA^{[l]} * g'^{[l]}(Z^{[l]})$
        • $dW^{[l]} = \dfrac{1}{m} dZ^{[l]} A^{[l-1]T}$
        • $db^{[l]} = \dfrac{1}{m}\, np.sum(dZ^{[l]}, axis=1, keepdims=True)$
        • $dA^{[l-1]} = W^{[l]T}\cdot dZ^{[l]}$
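
One way to package the forward/backward pair with its cache (a sketch; ReLU is chosen for concreteness, and the cache contents are my own choice):

```python
import numpy as np

def layer_forward(A_prev, W, b):
    """Forward for layer l: returns A[l] (ReLU here) and the cache."""
    Z = np.dot(W, A_prev) + b
    A = np.maximum(0, Z)                 # g[l] = ReLU in this sketch
    return A, (A_prev, W, Z)             # cache what backward will need

def layer_backward(dA, cache):
    """Backward for layer l: input dA[l] + cache, output dA[l-1], dW, db."""
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = dA * (Z > 0)                    # dZ[l] = dA[l] * g'[l](Z[l]), ReLU'
    dW = np.dot(dZ, A_prev.T) / m        # dW[l] = (1/m) dZ[l] A[l-1]^T
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)            # dA[l-1] = W[l]^T dZ[l]
    return dA_prev, dW, db
```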

1.4.4 Parameters vs. Hyperparameters

  • Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, \dots$

  • Hyperparameters (they affect/control/determine the parameters):

    • Learning rate $\alpha$
    • # of iterations
    • # of hidden units $n^{[1]}, n^{[2]}, \dots$
    • # of hidden layers
    • Choice of activation function
  • Later: momentum, mini-batch size, regularization parameters, ...

2.1 Practical Aspects of Deep Learning

2.1.1 Train / Dev / Test sets

  • With big data, the dev/test sets may need only 1% of the data, or even less.
  • Mismatch: make sure the dev and test sets come from the same distribution.
  • Not having a test set might be okay (only a dev set).

2.1.2 Bias / Variance

(Figure: deep-learning-notes_2-1-2)

(Figure: deep-learning-notes_2-1-2-2)

  • Assume optimal (Bayes) error $\approx 0\%$
  • High bias (underfitting): the prediction fails to separate the classes as we want.
    • Training set error $15\%$, dev set error $16\%$.
  • "Just right": the prediction separates the classes as we want.
    • Training set error $0.5\%$, dev set error $1\%$.
  • High variance (overfitting): the prediction fits the training examples almost perfectly but does not generalize.
    • Training set error $1\%$, dev set error $11\%$.
  • High bias and high variance:
    • Training set error $15\%$, dev set error $30\%$.

2.1.3 Basic Recipe for Machine Learning

2.1.3.1 Basic Recipe
  • High bias (training data performance)?
    • Bigger network
    • Train longer
    • (NN architecture search)
  • High variance (dev set performance)?
    • More data
    • Regularization
    • (NN architecture search)
2.1.3.2 Regularization
  • Logistic regression: $\min\limits_{w,b} J(w,b)$
    • $w\in\mathbb{R}^{n_x}, b\in\mathbb{R}$
    • $\lambda$ = regularization parameter
    • $J(w,b) = \dfrac{1}{m}\sum\limits_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m}\|w\|_2^2$
    • L2 regularization: $\|w\|_2^2 = \sum\limits_{j=1}^{n_x} w_j^2 = w^T w$
    • L1 regularization: $\dfrac{\lambda}{2m}\sum\limits_{j=1}^{n_x}|w_j| = \dfrac{\lambda}{2m}\|w\|_1$
      • With L1, $w$ will be sparse (lots of 0s in it), which only helps a little.
  • Neural network
    • $J(w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]}) = \dfrac{1}{m}\sum\limits_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m}\sum\limits_{l=1}^{L}\|w^{[l]}\|_F^2$
    • $\|w^{[l]}\|_F^2 = \sum\limits_{i=1}^{n^{[l]}}\sum\limits_{j=1}^{n^{[l-1]}}(w_{ij}^{[l]})^2$, where $w^{[l]}$ is $(n^{[l]}, n^{[l-1]})$
      • Frobenius norm: the square root of the sum of the squares of all elements in a matrix.
    • $dw^{[l]} = (\text{from backprop}) + \dfrac{\lambda}{m}w^{[l]}$
      • $w^{[l]} := w^{[l]} - \alpha\, dw^{[l]}$ (the update rule stays the same)
      • Weight decay:
        • $w^{[l]} := w^{[l]} - \alpha\left[(\text{from backprop}) + \dfrac{\lambda}{m}w^{[l]}\right]$
        • $= w^{[l]} - \dfrac{\alpha\lambda}{m}w^{[l]} - \alpha(\text{from backprop})$
        • $= \left(1 - \dfrac{\alpha\lambda}{m}\right)w^{[l]} - \alpha(\text{from backprop})$
  • How regularization prevents overfitting: a bigger $\lambda$ makes $w^{[l]}$ smaller, hence $z^{[l]}$ smaller, which keeps the activation function in its nearly linear range (take tanh as an example). The network then behaves close to a linear network, which makes it hard to draw highly curved decision boundaries.
  • Dropout regularization (a runnable sketch follows at the end of this section)
    • (Figure: deep-learning-notes_2-1-3-2)
    • Implementing dropout ("inverted dropout")
      • Illustrate with layer $l=3$ and keep_prob = 0.8 (each neuron has a 0.2 chance of being dropped/zeroed out)
      • `d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob` # d3 is a matrix of the same shape as a3, with True (1) = keep and False (0) = drop
      • `a3 = np.multiply(a3, d3)` # a3 *= d3; zeroes out the dropped neurons
      • `a3 /= keep_prob` # inverted dropout: keeps the expected total activation the same before and after dropout
    • Why it works: a unit can't rely on any single feature, so it has to spread out the weights (shrinking them).
    • First make sure $J$ decreases over iterations with dropout off, then turn dropout on.
  • Data augmentation
    • Images: crop, flip, distort, ...
  • Early stopping
    • Stop where $\|w\|_F^2$ is mid-size, before the dev set error starts to rise.
    • Drawback: it couples optimizing the cost function with trying not to overfit, doing both at the same time.
  • Orthogonalization
    • Work on one goal at a time: first optimize the cost function, then separately address overfitting.
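
The three inverted-dropout lines above, wrapped into a function (a sketch; the function name and returning the mask for reuse in backprop are my own choices):

```python
import numpy as np

def inverted_dropout(A, keep_prob=0.8):
    """Inverted dropout for one layer's activations A (any shape).
    Each unit is kept with probability keep_prob; surviving activations
    are scaled up so the expected value of A is unchanged."""
    D = np.random.rand(*A.shape) < keep_prob  # boolean keep-mask
    A = A * D                                 # drop ~(1 - keep_prob) of units
    A = A / keep_prob                         # rescale to preserve expectation
    return A, D                               # keep D to reuse in backprop

# Example: the mean activation stays roughly the same after dropout.
A3 = np.random.rand(5, 1000)
A3_dropped, _ = inverted_dropout(A3, keep_prob=0.8)
print(A3.mean(), A3_dropped.mean())  # similar values on average
```
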
2.1.3.3 Setting up your optimization problem
  • Normalizing training sets
    • (Figure: deep-learning-notes_2-1-3-3)
    • $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$
    • Subtract the mean:
      • $\mu = \dfrac{1}{m}\sum\limits_{i=1}^{m} x^{(i)}$
      • $x := x - \mu$
    • Normalize the variance:
      • $\sigma^2 = \dfrac{1}{m}\sum\limits_{i=1}^{m} x^{(i)} ** 2$ ("**" is element-wise)
      • $x /= \sigma^2$
    • Use the same $\mu, \sigma^2$ to normalize the test set.
    • Why normalize inputs?
      • When inputs are on very different scales, normalizing helps performance a lot: the cost surface becomes more symmetric, so gradient descent can use a larger learning rate and converge faster.
      • (Figure: deep-learning-notes_2-1-3-3-2)
  • Vanishing/exploding gradients
    • $w^{[l]} > I$, even just slightly, makes the activations/gradients grow exponentially with depth (exploding).
    • $w^{[l]} < I$, even just slightly, makes the activations/gradients shrink exponentially with depth (vanishing).
  • Weight initialization (single neuron)
    • Large $n$ (number of input features) -> want smaller $w_i$
    • $Var(w_i) = \dfrac{1}{n}$ (sigmoid/tanh); for ReLU: $\dfrac{2}{n}$ (the variance scale can be treated as a hyperparameter, but DO NOT tune it first)
    • $w^{[l]} = np.random.randn(\text{shape}) * np.sqrt\left(\dfrac{1}{n^{[l-1]}}\right)$; for ReLU use $\dfrac{2}{n^{[l-1]}}$
    • Xavier initialization: $\sqrt{\dfrac{1}{n^{[l-1]}}}$; sometimes $\sqrt{\dfrac{2}{n^{[l-1]} + n^{[l]}}}$
  • Numerical approximation of gradients
    • $f'(\theta) \approx \dfrac{f(\theta+\epsilon) - f(\theta-\epsilon)}{2\epsilon}$
  • Gradient checking (grad check); a sketch follows this list
    • Take $W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}$ and reshape into a big vector $\theta$.
    • Take $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ and reshape into a big vector $d\theta$.
    • for each $i$:
      • $d\theta_{approx}[i] = \dfrac{J(\theta_1, \theta_2, \dots, \theta_i + \epsilon, \dots) - J(\theta_1, \theta_2, \dots, \theta_i - \epsilon, \dots)}{2\epsilon} \approx d\theta[i] = \dfrac{\partial J}{\partial \theta_i}$
      • Check the relative Euclidean distance $\dfrac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}$ ($\|\cdot\|_2$ is the Euclidean norm: the square root of the sum of the squares of all elements)
      • With $\epsilon = 10^{-7}$: if the distance above is $\approx 10^{-7}$ or smaller, great.
      • If it is $10^{-5}$ or bigger, it needs a check.
      • If it is $10^{-3}$ or bigger, worry; there is probably a bug. Check which components $i$ of the approximation differ from the real values.
    • Notes:
      • Don't use in training; only use it to debug.
      • If the algorithm fails grad check, look at the components to try to identify the bug.
      • Remember regularization (include the $\dfrac{\lambda}{2m}$ term in $J$).
      • Doesn't work with dropout (dropout is random; run grad check with dropout disabled).
      • Run at random initialization, and perhaps again after some training (some bugs only appear once $w, b$ move away from $\approx 0$).
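
A minimal grad-check sketch following the recipe above (the toy cost function at the bottom is my own example):

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta against a two-sided numerical
    estimate of dJ/dtheta. J maps a parameter vector to a scalar cost."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps                      # J(..., theta_i + eps, ...)
        minus[i] -= eps                     # J(..., theta_i - eps, ...)
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    # Relative Euclidean distance between the two gradients
    return (np.linalg.norm(approx - dtheta)
            / (np.linalg.norm(approx) + np.linalg.norm(dtheta)))

# Example: J(theta) = theta_0^2 + 3*theta_1 with its analytic gradient.
J = lambda t: t[0] ** 2 + 3 * t[1]
theta = np.array([2.0, 1.0])
dtheta = np.array([2 * theta[0], 3.0])      # [dJ/dtheta_0, dJ/dtheta_1]
print(grad_check(J, theta, dtheta))         # tiny, far below the 1e-7 bar
```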

2.2 Optimization Algorithms

2.2.1 Mini-batch gradient descent

  • Batch vs. mini-batch gradient descent
    • A normal (full) batch may contain a huge amount of data, like millions of examples.
      • Set $m = 5,000,000$
      • $X = [x^{(1)}, x^{(2)}, x^{(3)}, \dots, x^{(1000)}, x^{(1001)}, \dots, x^{(2000)}, \dots, x^{(m)}]$, shape $(n_x, m)$
      • $Y = [y^{(1)}, y^{(2)}, y^{(3)}, \dots, y^{(m)}]$, shape $(1, m)$
    • Make mini-batches of 1,000 examples ($x$) each.
      • Mini-batch number $t$: $X^{\{t\}}, Y^{\{t\}}$
        • Notation: $x^{(i)}$ is the $i$th example in the training set; $z^{[l]}$ is layer $l$ of the network; $X^{\{t\}}$ is mini-batch $t$.
      • $X = [X^{\{1\}}, X^{\{2\}}, \dots, X^{\{5000\}}]$
      • $Y = [Y^{\{1\}}, Y^{\{2\}}, \dots, Y^{\{5000\}}]$
  • Mini-batch gradient descent (a batching sketch follows below)
    • 1 step of gradient descent per mini-batch $X^{\{t\}}, Y^{\{t\}}$ (1,000 examples each)
      • 1 epoch: a single pass through the training set.
    • for $t = 1, \dots, 5000$:
      • Forward prop on $X^{\{t\}}$:
      • $Z^{[1]} = W^{[1]} X^{\{t\}} + b^{[1]}$
      • $A^{[1]} = g^{[1]}(Z^{[1]})$
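
A sketch of how the mini-batch split might be implemented (the function name is my own, and the shuffling step is my own addition, though it is standard practice):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=1000, seed=0):
    """Shuffle the m examples, then slice X (n_x, m) and Y (1, m) into
    mini-batches X{t}, Y{t} of batch_size columns each."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                 # shuffle example indices
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    batches = []
    for t in range(0, m, batch_size):         # the last batch may be smaller
        batches.append((X_shuf[:, t:t + batch_size],
                        Y_shuf[:, t:t + batch_size]))
    return batches

# One epoch = one pass over all mini-batches:
# for X_t, Y_t in random_mini_batches(X, Y):
#     forward prop on X_t, backprop, then one gradient-descent step
```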