These notes are based on https://www.coursera.org/specializations/deep-learning

I. Neural Networks and Deep Learning

1.1 Introduction to Deep Learning

1.1.1 Supervised Learning with Deep Learning

  • Structured Data: tabular/database data, where each feature has a well-defined meaning.
  • Unstructured Data: audio, images, text.

1.1.2 Scale drives deep learning progress

  • The larger the amount of data, the better large neural networks perform compared to smaller networks or traditional learning algorithms.
  • Switching from sigmoid to ReLU makes gradient descent run much faster, since the gradient no longer goes to 0 in the saturated regions.

1.2 Basics of Neural Network Programming

1.2.1 Binary Classification

  • Input: $x \in \mathbb{R}^{n_x}$
  • Output: $y \in \{0, 1\}$

1.2.2 Logistic Regression

  • Given $x$, want $\hat{y} = P(y=1|x)$

  • Input: $x \in \mathbb{R}^{n_x}$

  • Parameters: $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$

  • Output: $\hat{y} = \sigma(w^Tx + b)$

    • $\sigma(z)=\dfrac{1}{1+e^{-z}}$
    • If $z$ is large, $\sigma(z)\approx\dfrac{1}{1+0}\approx 1$
    • If $z$ is a large negative number, $\sigma(z)\approx\dfrac{1}{1+\text{Bignum}}\approx 0$
  • Loss (error) function:

    • $\hat{y} = \sigma(w^Tx + b)$, where $\sigma(z)=\dfrac{1}{1+e^{-z}}$

      • $z^{(i)}=w^Tx^{(i)}+b$
    • Want $\hat{y}^{(i)} \approx y^{(i)}$

    • $L(\hat{y}, y) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]$

      • If $y=1$: $L(\hat{y}, y)=-\log\hat{y}$ <- want $\log\hat{y}$ as large as possible, so want $\hat{y}$ large
      • If $y=0$: $L(\hat{y}, y)=-\log(1-\hat{y})$ <- want $\log(1-\hat{y})$ as large as possible, so want $\hat{y}$ small
  • Cost function

    • $J(w, b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})=-\dfrac{1}{m}\sum\limits_{i=1}^{m}[y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]$
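A minimal NumPy sketch of these definitions (helper names are mine, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    """X: (n_x, m) inputs, Y: (1, m) labels, w: (n_x, 1), b: scalar."""
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)  # y-hat for all m examples, shape (1, m)
    # average cross-entropy loss over the m examples
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
```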

1.2.3 Gradient Descent

  • Repeat: $w:=w-\alpha\dfrac{\partial J(w,b)}{\partial w}$; $b:=b-\alpha\dfrac{\partial J(w,b)}{\partial b}$
    • $\alpha$: learning rate
    • On the right side of the minimum, $\dfrac{dJ(w)}{dw} > 0$; on the left side of the minimum, $\dfrac{dJ(w)}{dw} < 0$
  • Logistic Regression Gradient Descent
    • $x_1,x_2,w_1,w_2,b$
      • $z=w_1x_1+w_2x_2+b$ --> $a=\sigma(z)$ --> $L(a,y)$
      • $da=\dfrac{dL(a,y)}{da}=-\dfrac{y}{a}+\dfrac{1-y}{1-a}$
        • $\dfrac{dL(a,y)}{da} = \dfrac{d}{da}(-y\log(a) - (1-y)\log(1-a))$
        • $\dfrac{d}{da}(-y\log(a)) = -\dfrac{y}{a}$
        • $\dfrac{d}{da}(-(1-y)\log(1-a)) = -\dfrac{1-y}{1-a} \times (-1) = \dfrac{1-y}{1-a}$
        • Together: $-\dfrac{y}{a} + \dfrac{1-y}{1-a}$
      • $dz=\dfrac{dL}{dz}=\dfrac{dL(a,y)}{dz}=a-y$
        • $=\dfrac{dL}{da}\cdot\dfrac{da}{dz}$ (with $\dfrac{da}{dz}=a(1-a)$)
      • $\dfrac{dL}{dw_1}=$ "$dw_1$" $=x_1\cdot dz$
      • $\dfrac{dL}{dw_2}=$ "$dw_2$" $=x_2\cdot dz$
      • $db=dz$
  • Gradient Descent on $m$ examples
    • $J(w, b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(a^{(i)},y^{(i)})$
    • $\dfrac{\partial}{\partial w_1}J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}\dfrac{\partial}{\partial w_1}L(a^{(i)},y^{(i)})$
    • $J=0; dw_1=0; dw_2=0; db=0$
      • for $i=1$ to $m$:
        • $z^{(i)}=w^Tx^{(i)}+b$
        • $a^{(i)}=\sigma(z^{(i)})$
        • $J += -[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]$
        • $dz^{(i)}=a^{(i)}-y^{(i)}$
        • $dw_1 += x_1^{(i)}dz^{(i)}$ (for $n=2$)
        • $dw_2 += x_2^{(i)}dz^{(i)}$ (for $n=2$)
        • $db += dz^{(i)}$
      • $J/=m; dw_1/=m; dw_2/=m; db/=m$
      • $dw_1=\dfrac{\partial J}{\partial w_1}; dw_2=\dfrac{\partial J}{\partial w_2}$
        • $w_1:=w_1-\alpha\,dw_1$
        • $w_2:=w_2-\alpha\,dw_2$
        • $b:=b-\alpha\,db$

1.2.4 Computational Graph

  • $J(a,b,c)=3(a+bc)$

    • $u=bc$
    • $v=a+u$
    • $J=3v$
    • Left-to-right computation
  • Derivatives with a Computation Graph (right to left; see the sketch after this list)

    • $\dfrac{dJ}{dv}=3$
      • $\dfrac{dJ}{da}=3$
      • $\dfrac{dv}{da}=1$
      • Chain Rule: $\dfrac{dJ}{da}=\dfrac{dJ}{dv}\cdot\dfrac{dv}{da}$
      • $\dfrac{dJ}{du}=3$; $\dfrac{du}{db}=2$ (the value of $c$); $\dfrac{dJ}{db}=6$
      • $\dfrac{du}{dc}=3$ (the value of $b$); $\dfrac{dJ}{dc}=9$
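A tiny numeric check of this graph. The values $b=3, c=2$ are implied by $\frac{du}{db}=2$ and $\frac{du}{dc}=3$ above; $a=5$ is an assumed example value:

```python
a, b, c = 5, 3, 2

# forward pass (left to right)
u = b * c          # 6
v = a + u          # 11
J = 3 * v          # 33

# backward pass (right to left, chain rule)
dJ_dv = 3
dJ_da = dJ_dv * 1  # dv/da = 1  -> 3
dJ_du = dJ_dv * 1  # dv/du = 1  -> 3
dJ_db = dJ_du * c  # du/db = c  -> 6
dJ_dc = dJ_du * b  # du/dc = b  -> 9
```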

1.2.5 Vectorization

  • avoid explicit for-loops.

  • $J=0$; $dw$ = np.zeros(($n_x$, 1)); $db=0$

    • for $i=1$ to $m$:
      • $z^{(i)}=w^Tx^{(i)}+b$
      • $a^{(i)}=\sigma(z^{(i)})$
      • $J += -[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]$
      • $dz^{(i)}=a^{(i)}-y^{(i)}$
      • $dw += x^{(i)}dz^{(i)}$
      • $db += dz^{(i)}$
    • $J/=m; dw/=m; db/=m$
  • $Z$ = np.dot(w.T, X) + b; $b$ is a real number ((1,1)) --> broadcasting

  • Vectorizing Logistic Regression

    • $dz^{(1)}=a^{(1)}-y^{(1)}; dz^{(2)}=a^{(2)}-y^{(2)}; ...$
    • $dZ=[dz^{(1)}, dz^{(2)},...,dz^{(m)}]$, a $1\times m$ row vector
    • $A=[a^{(1)}, a^{(2)}, ..., a^{(m)}]$, $Y=[y^{(1)}, y^{(2)}, ..., y^{(m)}]$
    • $dZ=A-Y=[a^{(1)}-y^{(1)}, a^{(2)}-y^{(2)}, ...]$
    • Get rid of $db$ and $dw$ in the for loop
      • $db=\dfrac{1}{m}\sum\limits_{i=1}^{m}dz^{(i)}=\dfrac{1}{m}$ np.sum($dZ$)
      • $dw=\dfrac{1}{m}X\,dZ^T=\dfrac{1}{m}[x^{(1)}...][dz^{(1)}...]^T=\dfrac{1}{m}[x^{(1)}dz^{(1)}+...+x^{(m)}dz^{(m)}]$, shape $n\times 1$
    • New form of logistic regression (one iteration; see the sketch after this list)
      • $Z=w^TX+b$ = np.dot(w.T, X) + b
      • $A=\sigma(Z)$
      • $dZ=A-Y$
      • $dw=\dfrac{1}{m}X\,dZ^T$
      • $db=\dfrac{1}{m}$ np.sum($dZ$)
      • $w:=w-\alpha\,dw$
      • $b:=b-\alpha\,db$
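The whole update as one hedged NumPy sketch (one gradient descent iteration; function and variable names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, Y, alpha=0.01):
    """One vectorized iteration. X: (n_x, m), Y: (1, m), w: (n_x, 1)."""
    m = X.shape[1]
    Z = np.dot(w.T, X) + b        # (1, m); scalar b broadcasts
    A = sigmoid(Z)
    dZ = A - Y                    # (1, m)
    dw = np.dot(X, dZ.T) / m      # (n_x, 1)
    db = np.sum(dZ) / m
    w = w - alpha * dw
    b = b - alpha * db
    return w, b
```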
  • Broadcasting (same as bsxfun in MATLAB/Octave)

    • $(m,n)$ +-*/ $(1,n)$ -> $(m,n)$: the $(1,n)$ row is duplicated $m$ times.
    • $(m,n)$ +-*/ $(m,1)$ -> $(m,n)$: the $(m,1)$ column is duplicated $n$ times.
    • Don't use a = np.random.randn(5); a.shape = (5,) is a "rank 1 array"
    • Use a = np.random.randn(5,1) or a = np.random.randn(1,5)
    • Check with assert(a.shape == (5,1))
    • Fix a rank 1 array with a = a.reshape((5,1))
  • Logistic Regression Cost Function

    • Loss
      • $p(y|x)=\hat{y}^y(1-\hat{y})^{(1-y)}$
      • If $y=1$: $p(y|x)=\hat{y}$
      • If $y=0$: $p(y|x)=1-\hat{y}$
      • $\log p(y|x)=\log \hat{y}^y(1-\hat{y})^{(1-y)}=y\log \hat{y}+(1-y)\log(1-\hat{y})=-L(\hat{y},y)$
    • Cost
      • $\log p(\text{labels in training set})=\log \prod\limits_{i=1}^{m}p(y^{(i)}|x^{(i)})$
      • $\log p(\text{labels in training set})=\sum\limits_{i=1}^m\log p(y^{(i)}|x^{(i)})=-\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})$
      • Use maximum likelihood estimation (MLE): maximizing the log-likelihood is the same as minimizing the sum of the losses.
      • Cost (minimize): $J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})$

1.3 Shallow Neural Networks

1.3.1 Neural Network Representation

  • deep-learning-notes_1-3-1

  • Input layer, hidden layer, output layer

    • $a^{[0]}=x$ -> $a^{[1]}=[a^{[1]}_1,a^{[1]}_2,a^{[1]}_3,a^{[1]}_4]^T$ -> $a^{[2]}$
    • Layers are counted as # of hidden layers + # of output layers (the input layer is not counted).
  • $x_1,x_2,x_3$ -> 4 hidden nodes -> output layer

    • First hidden node: $z^{[1]}_1=w^{[1]T}_1x+b^{[1]}_1$, $a^{[1]}_1=\sigma(z^{[1]}_1)$
    • Second hidden node: $z^{[1]}_2=w^{[1]T}_2x+b^{[1]}_2$, $a^{[1]}_2=\sigma(z^{[1]}_2)$
    • Third hidden node: $z^{[1]}_3=w^{[1]T}_3x+b^{[1]}_3$, $a^{[1]}_3=\sigma(z^{[1]}_3)$
    • Fourth hidden node: $z^{[1]}_4=w^{[1]T}_4x+b^{[1]}_4$, $a^{[1]}_4=\sigma(z^{[1]}_4)$
  • Vectorization

    • $W^{[1]}=\begin{bmatrix}-w^{[1]T}_1- \\ -w^{[1]T}_2- \\ -w^{[1]T}_3- \\ -w^{[1]T}_4- \end{bmatrix}$, a $(4,3)$ matrix
    • $z^{[1]}=\begin{bmatrix}-w^{[1]T}_1- \\ -w^{[1]T}_2- \\ -w^{[1]T}_3- \\ -w^{[1]T}_4- \end{bmatrix}\cdot \begin{bmatrix}x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix}b^{[1]}_1 \\ b^{[1]}_2 \\b^{[1]}_3 \\ b^{[1]}_4 \end{bmatrix} =\begin{bmatrix}w^{[1]T}_1x+b^{[1]}_1 \\ w^{[1]T}_2x+b^{[1]}_2 \\ w^{[1]T}_3x+b^{[1]}_3 \\ w^{[1]T}_4x+b^{[1]}_4 \end{bmatrix}=\begin{bmatrix}z^{[1]}_1 \\ z^{[1]}_2 \\z^{[1]}_3 \\ z^{[1]}_4 \end{bmatrix}$
    • $a^{[1]}=\begin{bmatrix}a^{[1]}_1 \\ a^{[1]}_2 \\a^{[1]}_3 \\ a^{[1]}_4 \end{bmatrix}=\sigma(z^{[1]})$
    • $z^{[2]}=W^{[2]}\cdot a^{[1]}+b^{[2]}$, shapes: $(1,1)=(1,4)(4,1)+(1,1)$
    • $a^{[2]}=\sigma(z^{[2]})$, shapes: $(1,1)=(1,1)$
    • $a^{[2](i)}$: layer $2$, example $i$
  • for $i=1$ to $m$:

    • $z^{[1](i)}=W^{[1]}x^{(i)}+b^{[1]}$
    • $a^{[1](i)}=\sigma(z^{[1](i)})$
    • $z^{[2](i)}=W^{[2]}a^{[1](i)}+b^{[2]}$
    • $a^{[2](i)}=\sigma(z^{[2](i)})$
  • Vectorizing the above for loop

    • $X=\begin{bmatrix}| & | & & | \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ | & | & & |\end{bmatrix}$, an $(n_x,m)$ matrix
    • $Z^{[1]}=W^{[1]}\cdot X+b^{[1]}$
    • $A^{[1]}=\sigma(Z^{[1]})$
    • $Z^{[2]}=W^{[2]}\cdot A^{[1]}+b^{[2]}$
    • $A^{[2]}=\sigma(Z^{[2]})$
    • Horizontally: training examples; vertically: hidden units

1.3.2 Activation Functions

  • $g^{[l]}$: activation function of layer $l$

    • Sigmoid: $a=\dfrac{1}{1+e^{-z}}$
    • Tanh: $a=\dfrac{e^z-e^{-z}}{e^z+e^{-z}}$
    • ReLU: $a=\max(0,z)$
    • Leaky ReLU: $a=\max(0.01z, z)$
  • Rules to choose an activation function

    1. If the output is between {0, 1} (binary classification), use sigmoid for the output layer.
    2. Otherwise, ReLU is the default choice.
  • Why a non-linear activation function is needed

    • With a linear activation in the hidden layers, having multiple hidden layers is useless: the whole network collapses to $a=w'x+b'$, still a linear function of the input.
    • A linear activation may sometimes be used at the output layer (e.g. for regression), but keep non-linear activations in the hidden layers.

1.3.3 Forward and Backward Propagation

  • Derivative of activation functions

    • Sigmoid: $g'(z)=\dfrac{d}{dz}g(z)=\dfrac{1}{1+e^{-z}}\left(1-\dfrac{1}{1+e^{-z}}\right)=g(z)(1-g(z))=a(1-a)$
    • Tanh: $g'(z)=\dfrac{d}{dz}g(z)=1-(\tanh(z))^2$
    • ReLU: $g'(z)=\left\{\begin{array}{lr}0&if \space z<0 \\1&if \space z\geq0\end{array}\right.$ (technically undefined at exactly $z=0$; in practice take 0 or 1)
    • Leaky ReLU: $g'(z)=\left\{\begin{array}{lr}0.01&if \space z<0 \\1&if \space z\geq0\end{array}\right.$
  • Gradient descent for neural networks

    • Parameters: $w^{[1]}\ (n^{[1]},n^{[0]})$, $b^{[1]}\ (n^{[1]},1)$, $w^{[2]}\ (n^{[2]},n^{[1]})$, $b^{[2]}\ (n^{[2]},1)$
    • Layer sizes: $n^{[0]}=n_x$, $n^{[1]}$ hidden units, $n^{[2]}=1$
    • Cost function: $J(w^{[1]}, b^{[1]},w^{[2]}, b^{[2]})=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})$
  • Forward propagation:

    • $Z^{[1]}=W^{[1]}\cdot X+b^{[1]}$
    • $A^{[1]}=g^{[1]}(Z^{[1]})$
    • $Z^{[2]}=W^{[2]}\cdot A^{[1]}+b^{[2]}$
    • $A^{[2]}=g^{[2]}(Z^{[2]})=\sigma(Z^{[2]})$
  • Backpropagation (see the sketch after this list):

    • $dZ^{[2]}=A^{[2]}-Y$, where $Y=[y^{(1)},y^{(2)},...,y^{(m)}]$

    • $dW^{[2]}=\dfrac{1}{m}dZ^{[2]}A^{[1]T}$

    • $db^{[2]}=\dfrac{1}{m}$ np.sum($dZ^{[2]}$, axis=1, keepdims=True)

    • $dZ^{[1]}=W^{[2]T}dZ^{[2]}*g'^{[1]}(Z^{[1]})$

      • $(n^{[1]},m)$ -> element-wise product -> $(n^{[1]},m)$
    • $dW^{[1]}=\dfrac{1}{m}dZ^{[1]}X^{T}$

    • $db^{[1]}=\dfrac{1}{m}$ np.sum($dZ^{[1]}$, axis=1, keepdims=True)
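Putting the forward and backward equations together, a sketch of one full training iteration for this 2-layer network. I assume a tanh hidden layer (so $g'^{[1]}(Z^{[1]})=1-A^{[1]2}$) and a sigmoid output; the `params` dict and function name are mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(params, X, Y, alpha=0.1):
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
    m = X.shape[1]
    # forward propagation
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)                          # g[1] = tanh
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)                          # g[2] = sigmoid
    # backpropagation
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)   # g'[1](Z1) = 1 - tanh(Z1)^2
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    # gradient descent update
    params["W1"] = W1 - alpha * dW1
    params["b1"] = b1 - alpha * db1
    params["W2"] = W2 - alpha * dW2
    params["b2"] = b2 - alpha * db2
    return params
```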

  • Random Initialization

    • $x_1,x_2$ -> $a_1^{[1]},a_2^{[1]}$ -> $a_1^{[2]}$ -> $\hat{y}$
    • $w^{[1]}$ = np.random.randn(2,2)*0.01 (small random values: zeros would make both hidden units compute identical functions, and large values would saturate sigmoid/tanh)
    • $b^{[1]}$ = np.zeros((2,1))
    • $w^{[2]}$ = np.random.randn(1,2)*0.01
    • $b^{[2]}=0$

1.4 Deep Neural Networks

1.4.1 Deep L-Layer Neural Network

  • Deep neural network notation
    • deep-learning-notes_1-4-1
    • $L=4$ (# of layers)
    • $n^{[l]}$ = # of units in layer $l$
      • $n^{[1]}=5, n^{[2]}=5, n^{[3]}=3, n^{[4]}=n^{[L]}=1$
      • $n^{[0]}=n_x=3$
    • $a^{[l]}$ = activations in layer $l$
    • $a^{[l]}=g^{[l]}(z^{[l]})$; $w^{[l]}$ = weights for $z^{[l]}$; $b^{[l]}$ = bias for $z^{[l]}$
    • $x=a^{[0]}$, $\hat{y}=a^{[L]}$

1.4.2 Forward Propagation in a Deep Network

  • General: $Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]}$, $A^{[l]}=g^{[l]}(Z^{[l]})$

    • $z^{[1]}=W^{[1]}a^{[0]}+b^{[1]}$, $a^{[1]}=g^{[1]}(z^{[1]})$, with $a^{[0]}=x$
    • $z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}$, $a^{[2]}=g^{[2]}(z^{[2]})$
    • ...
    • $z^{[4]}=W^{[4]}a^{[3]}+b^{[4]}$, $a^{[4]}=g^{[4]}(z^{[4]})=\hat{y}$
  • Vectorizing:

    • $Z^{[1]}=W^{[1]}A^{[0]}+b^{[1]}$, $A^{[1]}=g^{[1]}(Z^{[1]})$, with $A^{[0]}=X$
    • $Z^{[2]}=W^{[2]}A^{[1]}+b^{[2]}$, $A^{[2]}=g^{[2]}(Z^{[2]})$
    • $\hat{Y}=g^{[4]}(Z^{[4]})=A^{[4]}$
  • Matrix dimensions

    • deep-learning-notes_1-4-2
    • $z^{[1]}=w^{[1]}\cdot x+b^{[1]}$
    • $z^{[1]}=(3,1), w^{[1]}=(3,2), x=(2,1), b^{[1]}=(3,1)$
    • $z^{[1]}=(n^{[1]},1), w^{[1]}=(n^{[1]},n^{[0]}), x=(n^{[0]},1), b^{[1]}=(n^{[1]},1)$
    • $w^{[l]}/dw^{[l]}=(n^{[l]},n^{[l-1]})$, $b^{[l]}/db^{[l]}=(n^{[l]},1)$
    • $z^{[l]},a^{[l]}=(n^{[l]},1)$; vectorized: $Z^{[l]}/dZ^{[l]}, A^{[l]}/dA^{[l]}=(n^{[l]},m)$; for $l=0$: $A^{[0]}=X=(n^{[0]},m)$
  • Why deep representations?

    • Earlier layers learn simple features; later, deeper layers put them together to detect more complex things.
    • Circuit theory and deep learning: Informally, there are functions you can compute with a "small" L-layer deep neural network that shallower networks require exponentially more hidden units to compute.

1.4.3 Building Blocks of Deep Neural Networks

  • Forward and backward functions

    • deep-learning-notes_1-4-3
    • Layer $l$: $w^{[l]},b^{[l]}$
    • Forward: input $a^{[l-1]}$, output $a^{[l]}$
      • $z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}$; cache $z^{[l]}$
      • $a^{[l]}=g^{[l]}(z^{[l]})$
    • Backward: input $da^{[l]}$, cache ($z^{[l]}$); output $da^{[l-1]},dw^{[l]},db^{[l]}$
  • One iteration of gradient descent of neural network

    • deep-learning-notes_1-4-3-2
  • How to implement? (See the sketch after this list.)

    • Forward propagation for layer $l$

      • Input $a^{[l-1]}$; output $a^{[l]}$, cache ($z^{[l]}$)

        • $z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}$
        • $a^{[l]}=g^{[l]}(z^{[l]})$
      • Vectorized

        • $Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]}$
        • $A^{[l]}=g^{[l]}(Z^{[l]})$
    • Backward propagation for layer $l$

      • Input $da^{[l]}$, cache ($z^{[l]}$); output $da^{[l-1]},dw^{[l]},db^{[l]}$

        • $dz^{[l]}=da^{[l]}*g'^{[l]}(z^{[l]})$
        • $dw^{[l]}=dz^{[l]}\cdot a^{[l-1]T}$
        • $db^{[l]}=dz^{[l]}$
        • $da^{[l-1]}=w^{[l]T}\cdot dz^{[l]}$
      • Vectorized:

        • $dZ^{[l]}=dA^{[l]}*g'^{[l]}(Z^{[l]})$
        • $dW^{[l]}=\dfrac{1}{m}dZ^{[l]}A^{[l-1]T}$
        • $db^{[l]}=\dfrac{1}{m}$ np.sum($dZ^{[l]}$, axis=1, keepdims=True)
        • $dA^{[l-1]}=W^{[l]T}\cdot dZ^{[l]}$
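A hedged sketch of how these per-layer blocks compose into an L-layer forward pass with caches (ReLU hidden layers, sigmoid output; the list-based structure is my choice, not the course's starter code):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, Ws, bs):
    """Ws, bs: lists of the L weight matrices / bias vectors, layers 1..L."""
    A, caches = X, []
    L = len(Ws)
    for l in range(L):
        Z = np.dot(Ws[l], A) + bs[l]
        caches.append((A, Z))  # cache A[l-1] and Z[l] for the backward pass
        A = sigmoid(Z) if l == L - 1 else relu(Z)
    return A, caches
```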

1.4.4 Parameters vs. Hyperparameters

  • Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]},...$

  • Hyperparameters (they affect/control/determine the parameters):

    • learning rate $\alpha$
    • # of iterations
    • # of hidden units $n^{[1]},n^{[2]},...$
    • # of hidden layers
    • Choice of activation function
  • Later: momentum, mini-batch size, regularization parameters, ...

II. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

2.1 Practical Aspects of Deep Learning

2.1.1 Train / Dev / Test sets

  • With big data, the dev/test sets may need only 1% of the data or even less.
  • Mismatched data: make sure the dev and test sets come from the same distribution.
  • Not having a test set might be okay (only a dev set).

2.1.2 Bias / Variance

deep-learning-notes_2-1-2

deep-learning-notes_2-1-2-2

  • Assume optimal (Bayes) error $\approx0\%$
  • High bias (underfitting): the model cannot classify even the training data the way we want.
    • Training set error $15\%$, dev set error $16\%$.
  • "Just right": the model classifies about as well as we could hope.
    • Training set error $0.5\%$, dev set error $1\%$.
  • High variance (overfitting): the model (nearly) perfectly fits the training set but fails to generalize.
    • Training set error $1\%$, dev set error $11\%$.
  • High bias and high variance at the same time:
    • Training set error $15\%$, dev set error $30\%$.

2.1.3 Basic Recipe for Machine Learning

2.1.3.1 Basic Recipe
  • High bias (training data performance)
    • Bigger network
    • Train longer
    • (NN architecture search)
  • High variance (dev set performance)
    • More data
    • Regularization
    • (NN architecture search)
2.1.3.2 Regularization
  • Logistic regression: $\min\limits_{w,b}J(w,b)$
    • $w\in\mathbb{R}^{n_x}, b\in\mathbb{R}$
    • $\lambda$ = regularization parameter
    • $J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})+\dfrac{\lambda}{2m}||w||_2^2$
    • L2 regularization: $||w||_2^2=\sum\limits_{j=1}^{n_x}w_j^2=w^Tw$
    • L1 regularization: $\dfrac{\lambda}{2m}\sum\limits_{j=1}^{n_x}|w_j|=\dfrac{\lambda}{2m}||w||_1$
      • With L1, $w$ will be sparse (lots of 0s in it); this only helps a little bit (e.g. for compressing the model).
  • Neural network
    • $J(w^{[1]},b^{[1]},...,w^{[L]},b^{[L]})=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})+\dfrac{\lambda}{2m}\sum\limits_{l=1}^{L}||w^{[l]}||_F^2$
    • $||w^{[l]}||_F^2=\sum\limits_{i=1}^{n^{[l]}}\sum\limits_{j=1}^{n^{[l-1]}}(w_{ij}^{[l]})^2$, where $w^{[l]}$ has shape $(n^{[l]},n^{[l-1]})$
      • Frobenius norm: square root of the sum of the squares of all elements in a matrix.
    • $dw^{[l]}=(\text{from backprop})+\dfrac{\lambda}{m}w^{[l]}$
      • $w^{[l]}:=w^{[l]}-\alpha\,dw^{[l]}$ (update rule stays the same)
      • Weight decay
        • $w^{[l]}:=w^{[l]}-\alpha[(\text{from backprop})+\dfrac{\lambda}{m}w^{[l]}]$
        • $=w^{[l]}-\dfrac{\alpha\lambda}{m}w^{[l]}-\alpha(\text{from backprop})$
        • $=(1-\dfrac{\alpha\lambda}{m})w^{[l]}-\alpha(\text{from backprop})$
  • How does regularization prevent overfitting? A bigger $\lambda$ makes $w^{[l]}$ smaller, so $z^{[l]}$ is smaller, which keeps the activation function in its nearly linear region (take tanh as an example). A nearly linear network cannot draw complicated, curved decision boundaries, so it overfits less.
  • Dropout regularization
    • deep-learning-notes_2-1-3-2
    • Implementing dropout ("inverted dropout")
      • Illustrate with layer $l=3$ and keep_prob = 0.8 (each neuron has a 0.2 chance of being dropped/zeroed out)
      • d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob #Sets d3 to a matrix of the same shape as a3 with True (1) / False (0) values.
      • a3 = np.multiply(a3, d3) #a3 *= d3; zeroes out the dropped neurons.
      • a3 /= keep_prob #Inverted dropout: keeps the expected total activation the same before and after dropout.
    • Why it works: a unit can't rely on any one feature, so it has to spread out the weights (shrinking weights).
    • First make sure $J$ is decreasing over iterations with dropout off, then turn dropout on.
  • Data augmentation
    • Images: crop, flip, distort, ...
  • Early stopping
    • Stop while $||w||_F^2$ is mid-size.
    • Downside: it tries to optimize the cost function and to not overfit at the same time, so neither goal is addressed cleanly.
  • Orthogonalization
    • Work on one goal at a time: first optimize the cost function, then separately address overfitting.
2.1.3.3 Setting up your optimization problem
  • Normalizing training sets
    • deep-learning-notes_2-1-3-3
    • $x=\begin{bmatrix}x_1 \\ x_2\end{bmatrix}$
    • Subtract mean:
      • $\mu=\dfrac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}$
      • $x:=x-\mu$
    • Normalize variance:
      • $\sigma^2=\dfrac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}**2$ ("**" is element-wise)
      • $x/=\sigma$ (dividing by the standard deviation gives each feature unit variance)
    • Use the same $\mu,\sigma$ to normalize the test set.
    • Why normalize inputs?
      • When input features are on very different scales, normalizing makes the cost surface rounder, which helps performance and gradient descent a lot and allows a larger learning rate.
      • deep-learning-notes_2-1-3-3-2
  • Vanishing/exploding gradients
    • $w^{[l]}>I$ (just slightly) makes activations/gradients grow exponentially with depth (exploding).
    • $w^{[l]}<I$ (just slightly) makes activations/gradients shrink exponentially with depth (vanishing).
  • Weight initialization (single-neuron intuition; see the sketch after this list)
    • Large $n$ (number of input features) --> want smaller $w_i$
    • $Var(w_i)=\dfrac{1}{n}$ (sigmoid/tanh); for ReLU: $\dfrac{2}{n}$ (the variance scale can be treated as a hyperparameter, but DO NOT tune it first)
    • $w^{[l]}$ = np.random.randn(shape) * np.sqrt($\dfrac{1}{n^{[l-1]}}$); for ReLU use $\sqrt{\dfrac{2}{n^{[l-1]}}}$
    • Xavier initialization: $\sqrt{\dfrac{1}{n^{[l-1]}}}$; sometimes $\sqrt{\dfrac{2}{n^{[l-1]}+n^{[l]}}}$
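A small sketch of these initialization rules (function name and the activation switch are mine):

```python
import numpy as np

def init_layer(n_prev, n_curr, activation="relu"):
    """Variance 2/n_prev for ReLU ("He"), 1/n_prev for tanh/sigmoid ("Xavier")."""
    scale = 2.0 if activation == "relu" else 1.0
    W = np.random.randn(n_curr, n_prev) * np.sqrt(scale / n_prev)
    b = np.zeros((n_curr, 1))
    return W, b
```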
  • Numerical approximation of gradients
    • $f'(\theta)\approx\dfrac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}$
  • Gradient checking (grad check)
    • Take $W^{[1]},b^{[1]},...,W^{[L]},b^{[L]}$ and reshape into a big vector $\theta$.
    • Take $dW^{[1]},db^{[1]},...,dW^{[L]},db^{[L]}$ and reshape into a big vector $d\theta$.
    • for each $i$:
      • $d\theta_{approx}[i]=\dfrac{J(\theta_1,\theta_2,...,\theta_i+\epsilon,...)-J(\theta_1,\theta_2,...,\theta_i-\epsilon,...)}{2\epsilon}\approx d\theta[i]=\dfrac{\partial J}{\partial \theta_i}$
      • Check the relative distance $\dfrac{||d\theta_{approx}-d\theta||_2}{||d\theta_{approx}||_2+||d\theta||_2}$ ($||\cdot||_2$ is the Euclidean norm: the square root of the sum of the squared elements)
      • Take $\epsilon=10^{-7}$; if the distance above is $\approx10^{-7}$ or smaller, great.
      • If it is $10^{-5}$ or bigger, it may need a check.
      • If it is $10^{-3}$ or bigger, worry; there is probably a bug. Check which components $i$ of the approximation differ most from the real values.
    • Notes:
      • Don't use in training - only to debug.
      • If the algorithm fails grad check, look at the components to try to identify the bug.
      • Remember regularization (include the $\dfrac{\lambda}{2m}$ term in $J$).
      • Doesn't work with dropout (dropout is random; run grad check with dropout turned off).
      • Run at random initialization; perhaps again after some training (some bugs only show up once $w,b$ move away from $\approx0$).
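A minimal grad-check sketch over a flattened parameter vector (the callable-cost interface is my simplification):

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """J: cost as a function of the flat parameter vector theta.
    dtheta: gradient from backprop, same shape as theta."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)  # two-sided difference
    num = np.linalg.norm(approx - dtheta)
    denom = np.linalg.norm(approx) + np.linalg.norm(dtheta)
    return num / denom   # ~1e-7 great, ~1e-5 check, ~1e-3 worry
```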

2.2 Optimization Algorithms

2.2.1 Mini-batch gradient descent

  • Batch vs. mini-batch gradient descent
    • A normal batch may contain a huge amount of data, e.g. millions of examples.
      • Set $m=5{,}000{,}000$
      • $X=[x^{(1)},x^{(2)},x^{(3)},...,x^{(1000)},x^{(1001)},...,x^{(2000)},...,x^{(m)}]$ of shape $(n_x,m)$
      • $Y=[y^{(1)},y^{(2)},y^{(3)},...,y^{(m)}]$ of shape $(1,m)$
    • Make mini-batches of 1,000 examples each.
      • Mini-batch number $t$: $X^{\{t\}},Y^{\{t\}}$
        • Notation: $x^{(i)}$ = $i$-th example in the training set; $z^{[l]}$ = layer $l$ in the network; $X^{\{t\}}$ = mini-batch $t$
      • $X = [X^{\{1\}},X^{\{2\}},...,X^{\{5000\}}]$
      • $Y=[Y^{\{1\}},Y^{\{2\}},Y^{\{3\}},...,Y^{\{5000\}}]$
  • Mini-batch gradient descent
    • 1 step of gradient descent uses only $X^{\{t\}},Y^{\{t\}}$ (1,000 examples)
      • 1 epoch: a single pass through the training set.
    • for $t=1,...,5000$:
      • Forward prop on $X^{\{t\}}$
        • $Z^{[1]}=W^{[1]}X^{\{t\}}+b^{[1]}$
        • $A^{[1]}=g^{[1]}(Z^{[1]})$
        • ...
        • $A^{[L]}=g^{[L]}(Z^{[L]})$
      • Compute cost $J^{\{t\}}=\dfrac{1}{1000}\sum\limits_{i=1}^{1000}L(\hat{y}^{(i)},y^{(i)})+\dfrac{\lambda}{2\cdot1000}\sum\limits_{l=1}^{L}||w^{[l]}||_F^2$
        • $\hat{y}^{(i)},y^{(i)}$ --> from $X^{\{t\}},Y^{\{t\}}$
      • Backprop to compute the gradients of $J^{\{t\}}$ (using $(X^{\{t\}},Y^{\{t\}})$)
      • $w^{[l]}:=w^{[l]}-\alpha dw^{[l]}$, $b^{[l]}:=b^{[l]}-\alpha db^{[l]}$
  • Understanding mini-batch gradient descent
    • deep-learning-notes_2-2-1
    • If mini-batch size = $m$: batch gradient descent (too long per iteration). $(X^{\{t\}},Y^{\{t\}})=(X,Y)$
    • If mini-batch size = 1: stochastic gradient descent (noisy, doesn't fully converge, loses the speedup from vectorization). Every example is its own mini-batch.
    • In practice: pick a size in between 1 and $m$.
      • Get lots of vectorization.
      • Make progress without needing to wait for the entire training set.
  • Choosing mini-batch size
    • Not needed for a small training set ($m<2000$); just use batch gradient descent.
    • Typical mini-batch sizes: 64, 128, 256, 512 (use powers of 2).
    • Make sure one mini-batch fits in CPU/GPU memory.
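A small sketch of building shuffled mini-batches (function name is mine):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64):
    """X: (n_x, m), Y: (1, m). Shuffle the columns, then slice into batches."""
    m = X.shape[1]
    perm = np.random.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, t:t + batch_size], Y[:, t:t + batch_size])
            for t in range(0, m, batch_size)]  # the last batch may be smaller
```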

2.2.2 Exponentially weighted averages

  • $V_t = \beta V_{t-1} + (1 - \beta) \theta_t$

    • $V_t$ is the weighted average at time $t$.
    • $\theta_t$ is the actual observed value at time $t$.
    • $\beta$ is the decay rate (usually between 0 and 1).
    • $V_{t-1}$ is the weighted average at the previous time step.
  • Impact of the decay rate $\beta$: the value of $\beta$ significantly affects the smoothness of the weighted average curve:

    • A larger $\beta$ makes the curve smoother, as it gives more weight to past observations, thereby reducing the impact of recent changes on the weighted average.
    • A smaller $\beta$ makes the curve more responsive to recent changes, as it gives more weight to recent observations.
  • Interpretation of $(1-\epsilon)^{\frac{1}{\epsilon}} \approx \dfrac{1}{e}$

    • Defining $\epsilon$ as $1 - \beta$ provides insight into how the influence of past data gradually diminishes as $\beta$ approaches 1 (i.e., $\epsilon$ approaches 0): after about $\frac{1}{1-\beta}$ steps, an observation's weight has decayed to roughly $\frac{1}{e}$, so $V_t$ is approximately an average over the last $\frac{1}{1-\beta}$ observations.
    • As $\epsilon$ approaches 0, $(1-\epsilon)^{\frac{1}{\epsilon}}$ approaches $\dfrac{1}{e}$, indicating that even though past data is given more weight (high $\beta$), any individual observation's impact on the current value keeps decaying.
  • Implementation

    • $v_{\theta}:=0$
    • Repeat for each day:
      • Get the next $\theta_t$
      • $v_\theta:=\beta v_\theta+(1-\beta)\theta_t$
  • Bias correction in exponentially weighted averages

    • Bias correction is applied to counteract the initial bias in exponentially weighted averages, especially when the number of data points is small or at the start of the calculation.
    • $\dfrac{v_t}{1-\beta^t}$ Here, $v_t$ is the uncorrected exponentially weighted average at time $t$, and $\beta$ is the decay rate.
    • It ensures that the moving averages are not underestimated, particularly when $\beta$ is high and in the early stages of the iteration. As the iterations go on, the effect of this correction becomes smaller, since $\beta^t$ decays toward 0 and the denominator $1-\beta^t$ approaches 1.
  • Gradient descent with momentum (see the sketch after this list)

    • On iteration t:
      • Compute $dw, db$ on the current mini-batch (or the whole batch if not using mini-batches)
      • $v_{dw}=\beta v_{dw}+(1-\beta)dw$
      • $v_{db}=\beta v_{db}+(1-\beta)db$
      • $w:=w-\alpha v_{dw}$, $b:=b-\alpha v_{db}$
    • Initialize $v_{dw}=0$ and $v_{db}=0$
    • Smooths out gradient descent
      • The momentum term $v$ effectively provides a smoothing effect since it is an average of past gradients. This means that extreme gradient changes in a single iteration are averaged out, reducing the volatility of the update steps.
      • This smoothing effect is particularly useful on loss surfaces that are not flat or have many local minima.
    • A common choice is $\beta=0.9$ (roughly an average over the last 10 gradients). Think of $v_{dw}$ as velocity and $dw$ as acceleration: a larger $\beta$ gives more weight to the accumulated velocity $v_{dw}$; as $\beta$ decreases, the velocity builds up more slowly and the current gradient (the acceleration) matters more.
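A minimal momentum update sketch (the `v` dict and function name are mine):

```python
def momentum_update(w, b, dw, db, v, alpha=0.01, beta=0.9):
    """v holds the exponentially weighted averages v_dw, v_db (init to 0)."""
    v["dw"] = beta * v["dw"] + (1 - beta) * dw
    v["db"] = beta * v["db"] + (1 - beta) * db
    w = w - alpha * v["dw"]
    b = b - alpha * v["db"]
    return w, b, v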

2.2.3 RMSprop and Adam optimization

  • RMSprop (Root Mean Square Propagation)
    • On iteration t:
      • Compute $dw, db$ on the current mini-batch
      • $s_{dw}=\beta_2 s_{dw}+(1-\beta_2)dw^2$ (in the 2D picture, hoped to be relatively small)
      • $s_{db}=\beta_2 s_{db}+(1-\beta_2)db^2$ (in the 2D picture, hoped to be relatively large)
      • $w:=w-\alpha\dfrac{dw}{\sqrt{s_{dw}}+\epsilon}$, $b:=b-\alpha\dfrac{db}{\sqrt{s_{db}}+\epsilon}$, where $\epsilon$ is a small number ($10^{-8}$) to prevent the denominator from being 0.
    • Effect: slows updates down in the (oscillating) vertical direction and speeds them up in the horizontal direction.
  • Adam (Adaptive Moment Estimation) optimization algorithm (see the sketch after this list)
    • Initialize $v_{dw}=0, s_{dw}=0, v_{db}=0, s_{db}=0$
    • On iteration t:
      • Compute $dw, db$ using the current mini-batch
      • $v_{dw}=\beta_1v_{dw}+(1-\beta_1)dw$, $v_{db}=\beta_1v_{db}+(1-\beta_1)db$
      • $s_{dw}=\beta_2s_{dw}+(1-\beta_2)dw^2$, $s_{db}=\beta_2s_{db}+(1-\beta_2)db^2$
      • $v_{dw}^{corrected}=\dfrac{v_{dw}}{1-\beta_1^t}$, $v_{db}^{corrected}=\dfrac{v_{db}}{1-\beta_1^t}$
      • $s_{dw}^{corrected}=\dfrac{s_{dw}}{1-\beta_2^t}$, $s_{db}^{corrected}=\dfrac{s_{db}}{1-\beta_2^t}$
      • $w:=w-\alpha\dfrac{v_{dw}^{corrected}}{\sqrt{s_{dw}^{corrected}}+\epsilon}$, $b:=b-\alpha\dfrac{v_{db}^{corrected}}{\sqrt{s_{db}^{corrected}}+\epsilon}$
    • Hyperparameter choices:
      • $\alpha$: needs to be tuned
      • $\beta_1$: 0.9 (for $dw$) First moment
      • $\beta_2$: 0.999 (for $dw^2$) Second moment
      • $\epsilon$: $10^{-8}$ (does not noticeably affect performance)
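A sketch of the Adam update for one parameter (the `state` dict is my bookkeeping; the same function would be applied to $b$ with $db$):

```python
import numpy as np

def adam_update(w, dw, state, t, alpha=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """state holds v (1st moment) and s (2nd moment), both initialized to 0.
    t is the 1-based iteration count, used for bias correction."""
    state["v"] = beta1 * state["v"] + (1 - beta1) * dw
    state["s"] = beta2 * state["s"] + (1 - beta2) * dw ** 2
    v_corr = state["v"] / (1 - beta1 ** t)  # bias-corrected first moment
    s_corr = state["s"] / (1 - beta2 ** t)  # bias-corrected second moment
    return w - alpha * v_corr / (np.sqrt(s_corr) + eps), state
```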
  • Learning rate decay
    • 1 epoch = 1 pass through the data
    • $\alpha=\dfrac{1}{1+\text{decay\_rate}\cdot\text{epoch\_num}}\alpha_0$
    • Other methods
      • $\alpha=0.95^{\text{epoch\_num}}\cdot \alpha_0$ ---- exponential decay
      • $\alpha=\dfrac{k}{\sqrt{\text{epoch\_num}}}\cdot\alpha_0$ or $\dfrac{k}{\sqrt{t}}\cdot\alpha_0$; or a discrete staircase
      • Manual decay (works when training a small number of models)
  • The problem of local optima
    • Unlikely to get stuck in a bad local optimum: in the very high-dimensional spaces of deep learning, most points with zero gradient are saddle points, not local optima.
    • Saddle point ---- gradient = 0
    • Problem of plateaus: they make learning slow.

2.3 Hyperparameter Tuning, Batch Normalization, and Programming Frameworks

2.3.1 Tuning process

  • Hyperparameters
    • $\alpha$: learning rate (1st priority)
    • $\beta$: momentum (2nd)
    • $\beta_1, \beta_2, \epsilon$ (usually left at their defaults)
    • # of layers (3rd)
    • # of hidden units (2nd)
    • learning rate decay (3rd)
    • mini-batch size (2nd)
  • Try random values: Don't use a grid
  • Coarse to fine: Trying coarse random first, then fine in working well range.

2.3.2 Using an appropriate scale to pick hyperparameters

  • Learning rate: $\alpha \in [0.0001, 1]$

    • r = -4 * np.random.rand() ---- $r\in[-4,0]$
      • In general, $r\in[a,b]$
      • $a=\log_{10}0.0001 = -4$, $b=\log_{10}1 = 0$
    • $\alpha=10^r$ ----- $\alpha\in[10^{-4},10^0]$
  • Exponentially weighted averages decay rate: $\beta=0.9$ (last 10)$,...,0.999$ (last 1000)

    • $1-\beta=0.1,...,0.001$, so $r\in[-3,-1]$
      • Reason for sampling $1-\beta$ on a log scale instead of $\beta$ directly: $\beta$ is close to 1, where small changes have big effects.
    • $1-\beta=10^r$
    • $\beta=1-10^r$
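A sketch of both log-scale sampling rules above:

```python
import numpy as np

def sample_alpha():
    r = -4 * np.random.rand()      # r in [-4, 0]
    return 10 ** r                 # alpha in [1e-4, 1]

def sample_beta():
    r = np.random.uniform(-3, -1)  # 1 - beta in [1e-3, 1e-1]
    return 1 - 10 ** r             # beta in [0.9, 0.999]
```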
  • In practice:

    • Re-test/re-evaluate hyperparameters occasionally; intuitions go stale.
    • Babysitting one model (when you don't have enough training capacity) (Panda): one model at a time.
    • Training many models in parallel (Caviar): try many settings at the same time.

2.3.3 Batch Normalization

  • Implementing Batch Norm
    • Batch Norm: make sure the hidden units have a standardized mean and variance.
    • Given some intermediate values in the NN, $z^{(1)},...,z^{(m)}$ (really $z^{[l](i)}$: $l$ for some hidden layer, $i$ for 1 through $m$)
      • $\mu=\dfrac{1}{m}\sum\limits_{i}z^{(i)}$ (mean)
      • $\sigma^2=\dfrac{1}{m}\sum\limits_i(z^{(i)}-\mu)^2$ (variance)
      • $z_{norm}^{(i)}=\dfrac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$ (makes mean = 0, variance = 1; $\epsilon$ prevents the denominator from being 0)
      • $\widetilde{z}^{(i)}=\gamma z_{norm}^{(i)}+\beta$ ($\gamma, \beta$ are learnable parameters of the model)
        • If
          • $\gamma=\sqrt{\sigma^2+\epsilon}$
          • $\beta=\mu$
        • Then $\widetilde{z}^{(i)}=z^{(i)}$ (the normalization is exactly undone)
    • Use $\widetilde{z}^{[l](i)}$ instead of $z^{[l](i)}$
  • Adding Batch Norm to a network
    • deep-learning-notes_2-3-3
    • Parameters: $w^{[1]},b^{[1]},\beta^{[1]},\gamma^{[1]},...,w^{[L]},b^{[L]},\beta^{[L]},\gamma^{[L]}$
      • Can use gradient descent/Adam/momentum to update them, e.g. $\beta^{[l]}:=\beta^{[l]}-\alpha d\beta^{[l]}$
    • Working with mini-batches: works the same, but per mini-batch. No need for $b^{[l]}$: the mean subtraction cancels any constant added to $z^{[l]}$, and $\beta^{[l]}$ takes over its role. $\beta$ and $\gamma$ have the same dimension as $b$: $(n^{[l]},1)$.
  • Implementing gradient descent (works with momentum, RMSprop, Adam)
    • for $t=1$...numMiniBatches:
      • Compute forward prop on $X^{\{t\}}$.
        • In each hidden layer, use BN to replace $z^{[l]}$ with $\widetilde{z}^{[l]}$
      • Use backprop to compute $dw^{[l]},d\beta^{[l]},d\gamma^{[l]}$ (no $db^{[l]}$)
      • Update the parameters
        • $w^{[l]}:=w^{[l]}-\alpha dw^{[l]}$
        • $\beta^{[l]}:=\beta^{[l]}-\alpha d\beta^{[l]}$
        • $\gamma^{[l]}:=\gamma^{[l]}-\alpha d\gamma^{[l]}$
  • Why does Batch Norm work
    • Covariate shift: test data distributed differently from the training data (training on black cats but testing on cats of other colors).
      • Internal covariate shift: between different layers of the network, the distribution of inputs to each layer changes during training; recursively, a change in one layer changes the input of the later layers. This may lead to instability and reduced efficiency.
      • Batch norm reduces how much these input values shift around, making each layer's input more stable and letting the layers learn more independently.
    • Batch norm as regularization
      • With mini-batches, each batch is scaled by the mean/variance computed on just that mini-batch. This adds some noise to each hidden layer's activations (since the whole training set is not considered), similar to dropout.
      • This has a slight regularization effect. (Using a larger mini-batch size reduces the regularization.)
  • Batch Norm at test time (see the sketch after this list)
    • $\mu, \sigma^2$: estimate using an exponentially weighted average across mini-batches during training.
      • $\mu_{\text{global}} = \beta \mu_{\text{global}} + (1 - \beta) \mu$
      • $\sigma^2_{\text{global}} = \beta \sigma^2_{\text{global}} + (1 - \beta) \sigma^2$
    • During testing, use the global mean and variance estimates for normalization, instead of the statistics from the current test sample or mini-batch.
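A hedged batch-norm forward sketch covering both modes (here `momentum` is the running-average decay rate, renamed to avoid a clash with the learnable $\beta$):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, running, momentum=0.9,
                      eps=1e-8, training=True):
    """Z: (n, m). running: dict with global 'mu' and 'var' estimates."""
    if training:
        mu = Z.mean(axis=1, keepdims=True)
        var = Z.var(axis=1, keepdims=True)
        # update the exponentially weighted global estimates
        running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:  # test time: use the running estimates, not batch statistics
        mu, var = running["mu"], running["var"]
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta
```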

2.3.4 Multi-class classification

  • Softmax regression
    • $C$ = # of classes = 4 (0,...,3)
    • Output layer: 4 nodes, one per class. $\hat{y}$ is a (4,1) vector whose entries sum to 1.
    • $z^{[L]}=w^{[L]}a^{[L-1]}+b^{[L]}$, a (4,1) vector ($L$ denotes the output layer)
    • Activation function:
      • $t=e^{z^{[L]}}$ (element-wise, a (4,1) vector)
    • $a^{[L]}=\dfrac{e^{z^{[L]}}}{\sum\limits_{j=1}^4t_j}\space\space(4,1)$, $a^{[L]}_i=\dfrac{t_i}{\sum\limits_{j=1}^4t_j}$
  • Hardmax: set the biggest element to 1 and the rest to 0.
  • Training a softmax classifier (see the sketch after this list)
    • If $C=2$, softmax reduces to logistic regression.
    • Loss function: $L(\hat{y},y)=-\sum\limits_{j=1}^{4}y_j\log\hat{y}_j$
      • $z^{[L]}$ -> $a^{[L]}=\hat{y}$ -> $L(\hat{y},y)$ (4,1)
      • Backprop: $dz^{[L]}=\hat{y}-y$ (4,1), where $dz^{[L]}=\dfrac{\partial{J}}{\partial{z^{[L]}}}$
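A small softmax sketch, with the standard max-subtraction trick added for numerical stability (that trick is not in the notes above; it does not change the result because the shift cancels in the ratio):

```python
import numpy as np

def softmax(z):
    t = np.exp(z - z.max(axis=0, keepdims=True))  # shift for stability
    return t / t.sum(axis=0, keepdims=True)       # columns sum to 1

def softmax_loss(y_hat, y):
    """y_hat, y: (C, m); y is one-hot. Average cross-entropy over m examples."""
    return -np.sum(y * np.log(y_hat)) / y.shape[1]
```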
  • Deep Learning frameworks
    • TensorFlow
    • deep-learning-notes_2-3-4

III. Structuring Machine Learning Projects

3.1 ML Strategy (1)

3.1.1 Setting up your goal

  • Orthogonalization

    • Chain of assumptions in ML
      • Fit the training set well on the cost function: bigger network; Adam
      • Fit the dev set well on the cost function: regularization; bigger training set
      • Fit the test set well on the cost function: bigger dev set
      • Performs well in the real world: change the dev set or the cost function
  • Single number evaluation metric

    • Precision: of the examples recognized as positive, what percentage actually are true positives.
    • Recall: what percentage of the actual targets in the whole test set are correctly recognized.
    • F1 Score: harmonic mean of precision and recall, $\dfrac{2}{\dfrac{1}{P}+\dfrac{1}{R}}$
    • Dev set + single number evaluation metric: speeds up iteration
    • When evaluating on many classes at once, use the average error rate instead of the individual per-class error rates.
  • Satisficing and optimizing metrics

    • Consider classifiers by accuracy and running time.

      • Maximize accuracy subject to running time <= 100 ms
      • Accuracy: optimizing
      • Running time: satisficing
    • N metrics: 1 optimizing, N-1 satisficing

  • Train/dev/test distributions

    • Come from same distribution. (Use randomly shuffle)
    • Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.
  • Size of dev/test set

    • For a large data set, use 98% training, 1% dev, 1% test
    • Size of test set: set your test set to be big enough to give high confidence in the overall performance of your system.
    • Sometimes only train+dev are used, without a test set.
  • When to change dev/test sets and metrics

      • Example: keeping pornographic images out matters more than the raw error rate, so weight such mistakes heavily in the metric:

      • deep-learning-notes_3-1-1
    • Two steps

      1. Define a metric to evaluate classifiers.
      2. Separately, worry about how to do well on this metric.
    • If doing well on your metric + dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.

3.1.2 Comparing to human-level performance

  • Why human-level performance:
    • Bayes (optimal) error: the best possible error. It can never be surpassed.
    • Humans are quite good at a lot of tasks. So long as ML is worse than humans, you can:
      • Get labeled data from humans.
      • Gain insight from manual error analysis: Why did a person get this right?
      • Better analysis of bias/variance.

3.1.3 Analyzing bias and variance

  • Avoidable bias
    • If the training error is far from human error (a proxy for Bayes error), focus on bias (avoidable bias). If the training error is close to human error but far from the dev error, focus on variance.
    • Consider human-level error as a proxy for Bayes error (since human-level error is usually not too far from Bayes error).
  • Understanding human-level performance:
    • Which human-level error to use depends on the purpose of the analysis.
    • If humans perform really well on the task, we can use human-level error as a proxy for Bayes error.
  • Surpassing human-level performance
    • Not natural perception
    • Lots of data
  • Improving your model performance
    • The two fundamental assumptions of supervised learning
      • You can fit the training set pretty well. (Avoidable bias)
      • The training set performance generalizes pretty well to the dev/test set. (Variance)
    • Reducing (avoidable) bias and variance
      • Avoidable bias:
        • Train bigger model.
        • Train longer/better optimization, algorithms (momentum, RMSprop, Adam).
        • NN architecture/hyperparameters search (RNN, CNN).
      • Variance:
        • More data.
        • Regularization (L2, dropout, data augmentation).
        • NN architecture/hyperparameters search.

3.2 ML Strategy (2)

3.2.1 Error analysis

  • Carrying out error analysis
    • Error analysis (count the mislabeled category, e.g. dogs; its share of the error rate gives the ceiling on how much fixing it could help)
      • Get ~100 mislabeled dev set examples.
      • Count up how many are dogs.
    • Evaluate multiple ideas in parallel (ideas for cat detection)
      • Fix pictures of dogs being recognized as cats
      • Fix great cats (lions, panthers, etc.) being mis-recognized
      • Improve performance on blurry images
      • Check the details of mislabeled images (only few minutes/hours)
      • deep-learning-notes_3-2-1
  • Cleaning up incorrectly labeled data
    • DL algorithms are quite robust to random errors in the training set. (Random labeling errors will not affect the algorithm too much.)
    • DL algorithms are less robust to systematic errors.
    • When a high fraction of the mistakes is due to incorrect labels, it is worth spending time to fix them.
    • Correcting incorrect dev/test set examples
      • Apply same process to your dev and test sets to make sure they continue to come from the same distribution.
      • Consider examining examples your algorithm got right as well as ones it got wrong.
      • Train and dev/test data may now come from slightly different distributions.
  • Build your first system quickly, then iterate
    • Set up dev/test set and metric
    • Build initial system quickly
    • Use Bias/Variance analysis & Error analysis to prioritize next steps.
  • Training and testing on different distributions
    • 200,000 examples from high-quality webpages, 10,000 from a low-quality mobile app (but the latter is what we care about).
      • Option 1: shuffle all the data before splitting. (Not a good option: the influence of the data we care about becomes small.)
      • Option 2: use mobile app data as the dev/test sets, with only a really small part of the training set coming from the app. (This points the target at what we want.) E.g. half of the app data in training, a quarter in dev, and a quarter in test.

3.2.2 Mismatched training and dev/test set

  • Training-dev set: same distribution as the training set, but not used for training.
  • Training error - training-dev error - dev error
    • Human level - training set error: avoidable bias
    • Training error - training-dev error: variance
    • Training-dev error - dev error: data mismatch
    • Dev error - test error: degree of overfitting to the dev set
  • Addressing data mismatch
    • Carry out manual error analysis to try to understand difference between training and dev/test sets.
    • Make training data more similar; or collect more data similar to dev/test sets.
    • Artificial data synthesis:
      • Possible issue (overfitting): if the original data is 10,000 units but there is only 1 unit of synthesized noise, the model may overfit to that 1 unit.
  • Transfer learning
    • Pre-training / fine-tuning
    • Transfer from a task with relatively large data to one with relatively small data.
    • But if the target-task data is too small, transfer learning may not be suitable. (Depending on the outcome we want, it would be valuable to have more target data.)
    • When makes sense (transfer from A-> B):
      • Task A and B have the same input x.
      • You have a lot more data for Task A than Task B (want this one).
      • Low level features from A could be helpful for learning B.

3.2.3 Learning from multiple tasks

  • Loss function for multiple tasks
    • Loss: with $\hat{y}^{(i)}$ a $(4,1)$ vector, $J=\dfrac{1}{m}\sum\limits_{i=1}^m\sum\limits_{j=1}^4L(\hat{y}_j^{(i)},y_j^{(i)})$
    • Sum only over the values of $j$ with a 0/1 label (some examples may be labeled for only some features).
    • Unlike softmax regression: one image can have multiple labels.
  • When multi-task learning makes sense
    • Training on a set of tasks that could benefit from having shared lower-level features.
    • Usually: Amount of data you have for each task is quite similar.
    • Can train a big enough neural network to do well on all the tasks.

3.2.4 End-to-end deep learning

  • End-to-end learning needs lots of data to work well.
  • With smaller data, breaking the problem into separately learned stages usually gives better results.
  • Whether to use end-to-end learning
    • Pros:
      • Let the data speak.
      • Less hand-designing of components needed.
    • Cons:
      • May need large amount of data
      • Excludes potentially useful hand-designed components.
    • Key question: Do you have sufficient data to learn a function of the complexity needed to map x to y?
      • Use DL to learn individual components.
      • Carefully choose the X->Y mappings depending on what tasks you can get data for.

IV. Convolutional Neural Networks

4.1 Foundations of Convolutional Neural Networks

4.1.1 Convolutional operation

  • Vertical Edge Detection

    • Used to identify vertical edges in images, which is a crucial step in image analysis and understanding.
    • A small matrix, typically 3x3 or 5x5, is used as a convolution kernel to detect vertical edges.
    • The kernel slides over the image, moving one pixel at a time.
    • At each position, element-wise multiplication is performed between the kernel and the overlapping image area, followed by a sum to produce an output feature map.
    • High values in the output feature map indicate the presence of a vertical edge at that location.
    • $\begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix}$
    • The matrix example above detects edges that are lighter on the left and darker on the right.
  • Horizontal Edge Detection

    • Brighter on the top and darker on the bottom
    • $\begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix}$
    • TBC
  • Other Common Filters

    • Sobel filter

      • $\begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix}$
    • Scharr filter

      • $\begin{bmatrix} 3 & 0 & -3 \\ 10 & 0 & -10 \\ 3 & 0 & -3 \end{bmatrix}$
  • Padding

    • $n\times n$ image * $f\times f$ filter = $(n-f+1)\times(n-f+1)$ output

    • Problems of convolution:

      • Shrinking output
      • Throws away information from the edges.
    • Add a padding ($p$) of 0s

      • $(n+2p)\times(n+2p)$ * $f\times f$ = $(n+2p-f+1)\times(n+2p-f+1)$
    • Valid convolutions: no padding

    • Same convolutions: pad so that the output size is the same as the input size (padding is $\dfrac{f-1}{2}$)

    • $f$ is usually odd.

  • Strided convolution

    • Step $s$ pixels at a time instead of 1.
    • Output: $\left\lfloor\dfrac{n+2p-f}{s}+1\right\rfloor \times \left\lfloor\dfrac{n+2p-f}{s}+1\right\rfloor$ (if not an integer, round down to the nearest integer)
  • Cross-correlation is the real mathematical name of the operation DL calls convolution (a true convolution would flip the filter first).

  • Convolution over volume (see the sketch after this list)

    • Make the filter have the same number of channels as the input. (e.g. an RGB image with a 3x3x3 filter)
    • To look at only an individual channel, set the filter's other channels to all 0s.
    • If vertical and horizontal filters are applied separately, each outputs 4x4, and the results can be stacked together into a 4x4x2 volume.
    • $n\times n\times n_c * f\times f \times n_c$ -> $(n-f+1) \times (n-f+1) \times n_c^{'}$ ($n_c^{'}$ = # of filters)
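A naive, loop-based sketch of convolution over volume for a single example (a real implementation would vectorize this; names and the stride/pad parameters are mine):

```python
import numpy as np

def conv_forward(a_prev, filters, b, stride=1, pad=0):
    """a_prev: (n, n, n_c). filters: (f, f, n_c, n_c_out). b: (n_c_out,)."""
    f, _, _, n_c_out = filters.shape
    a_pad = np.pad(a_prev, ((pad, pad), (pad, pad), (0, 0)))
    n_out = (a_prev.shape[0] + 2 * pad - f) // stride + 1  # floor
    out = np.zeros((n_out, n_out, n_c_out))
    for i in range(n_out):
        for j in range(n_out):
            patch = a_pad[i*stride:i*stride+f, j*stride:j*stride+f, :]
            for c in range(n_c_out):
                # element-wise multiply and sum over the whole f x f x n_c volume
                out[i, j, c] = np.sum(patch * filters[..., c]) + b[c]
    return out
```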
  • One layer of a CNN

    • Each filter output gets a bias added and a non-linearity applied: ReLU(output + $b$). The stack of all these post-activation volumes plays the role of $a$ in $a=g(z)$.
    • The convolution output itself plays the role of $wa$ in $z=wa+b$.
    • Number of parameters in one layer: if you have 10 filters that are 3x3x3 in one layer of a neural network, how many parameters does that layer have? (Each filter has 3x3x3 + 1 bias = 28; 10 filters give 280 parameters.)
  • Summary of notation (if layer $l$ is a convolution layer)

    • $f^{[l]}=$ filter size (a 3x3 filter has $f=3$)
    • $p^{[l]}=$ padding
    • $s^{[l]}=$ stride
    • Input: $n_H^{[l-1]}\times n_W^{[l-1]}\times n_C^{[l-1]}$
    • Output: $n_H^{[l]}\times n_W^{[l]}\times n_C^{[l]}$
    • $n_{H/W}^{[l]}=\left\lfloor\dfrac{n_{H/W}^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1\right\rfloor$ (round down to the nearest integer)
    • Each filter is: $f^{[l]}\times f^{[l]}\times n_C^{[l-1]}$
    • Activations: $a^{[l]}$ -> $n_H^{[l]}\times n_W^{[l]}\times n_C^{[l]}$; with batch gradient descent, $A^{[l]}$ -> $m\times n_H^{[l]}\times n_W^{[l]}\times n_C^{[l]}$
    • Weights: $f^{[l]}\times f^{[l]}\times n_C^{[l-1]}\times n_C^{[l]}$ (the last factor is the # of filters in layer $l$)
    • Bias: $n_C^{[l]}$ - shape $(1,1,1,n_C^{[l]})$
  • A simple example ConvNet

    • deep-learning-notes_4-1-1
    • Take the final output (7x7x40), flatten it into a 1960-element vector, and pass it through logistic regression/softmax to get the actual final prediction.
  • Types of layer in a convolutional network

    • Convolution (CONV)
    • Pooling (POOL)
    • Fully connected (FC)

4.1.2 Pooling layers

  • No parameters to learn.
  • Max pooling (see the sketch after this list)
    • Consider a 4x4 input matrix and a 2x2 output matrix with f (filter) = 2, s (stride) = 2: take the max of each 2x2 region of the input and put it into one cell of the output matrix.
    • Hyperparameters: f (filter size) and s (stride).
    • deep-learning-notes_4-1-2
  • Average pooling
    • Instead of taking the maximum, take the average.
    • deep-learning-notes_4-1-2-2
  • Input: $n_H\times n_W\times n_C$
  • Output: $\left\lfloor\dfrac{n_H-f}{s}+1\right\rfloor\times \left\lfloor\dfrac{n_W-f}{s}+1\right\rfloor\times n_C$ (round down to the nearest integer)
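A loop-based max-pooling sketch for a single example (names are mine):

```python
import numpy as np

def max_pool(a_prev, f=2, stride=2):
    """a_prev: (n_H, n_W, n_c) -> pooled output, channel by channel."""
    n_h = (a_prev.shape[0] - f) // stride + 1  # floor
    n_w = (a_prev.shape[1] - f) // stride + 1
    out = np.zeros((n_h, n_w, a_prev.shape[2]))
    for i in range(n_h):
        for j in range(n_w):
            window = a_prev[i*stride:i*stride+f, j*stride:j*stride+f, :]
            out[i, j, :] = window.max(axis=(0, 1))  # max per channel
    return out
```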

4.1.3 CNN example

  • Fully Connected layer
    • After several convolutional and pooling layers, the high-level reasoning in the neural network is done via FC layers. The output of the last pooling or convolutional layer, which is typically a multi-dimensional array, is flattened into a single vector of values. This vector is then fed into one or more FC layers.
    • Role:
      • Integration of Learned Features: FC layers combine all the features learned by previous convolutional layers across the entire image. While convolutional layers are good at identifying features in local areas of the input image, FC layers help in learning global patterns in the data.
      • Dimensionality Reduction: FC layers can be seen as a form of dimensionality reduction, where the high-level, spatially hierarchical features extracted by the convolutional layers are compacted into a form where predictions can be made.
      • Classification or Regression: In classification tasks, the final FC layer typically has as many neurons as the number of classes, with a softmax activation function being applied to the output. For regression tasks, the final FC layer's output size and activation function are adjusted according to the specific requirements of the task.
    • Operation is similar to neurons in a standard neural network.
  • Example
    • deep-learning-notes_4-1-3
    • deep-learning-notes_4-1-3-2
  • Why convolutions?
    • Parameter sharing: A feature detector (such as a vertical edge detector) that's useful in one part of the image is probably useful in another part of the image.
    • Sparsity of connections: In each layer, each output value depends only on a small number of inputs.
  • Training set $(x^{(1)},y^{(1)})...(x^{(m)},y^{(m)})$
  • Cost $J=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})$

4.2 Deep Convolutional Models: Case Studies

4.2.1 Case studies (LeNet-5, AlexNet, VGG, ResNets)

  • Red notations in the images below mark parts of the original network designs that are no longer suitable nowadays.
4.2.1.1 LeNet-5
  • Pioneer in CNNs: One of the earliest Convolutional Neural Networks, primarily used for digit recognition tasks.
  • Architecture:
    • Consists of 7 layers (excluding input).
    • Includes convolutional layers, average pooling layers, and fully connected layers.
  • Activation Functions: Uses sigmoid and tanh activation functions in different layers. (Not used nowadays)
  • Local Receptive Fields: Utilizes 5x5 convolution filters to capture spatial features.
  • Subsampling Layers: Employs average pooling for subsampling. (Max pooling is used nowadays)
  • deep-learning-notes_4-1-1-1
4.2.1.2 AlexNet
  • Training on multiple GPUs as described in the paper is outdated today. LRN (Local Response Normalization) turned out not to be useful in later research.
  • Deeper Architecture: Contains 8 learned layers, 5 convolutional layers followed by 3 fully connected layers.
  • ReLU Activation: One of the first CNNs to use ReLU (Rectified Linear Unit) activation function for faster training.
  • Overlapping Pooling: Uses overlapping max pooling, reducing the network's size and overfitting.
  • Data Augmentation and Dropout: Employs data augmentation and dropout techniques for better generalization.
  • deep-learning-notes_4-1-1-2
4.2.1.3 VGG-16
  • Simplicity and Depth: Known for its simplicity and depth, with 16 learned layers.
  • Uniform Architecture: Features a very uniform architecture, using 3x3 convolution filters with stride and pad of 1, max pooling, and fully connected layers.
  • Convolutional Layers: Stacks convolutional layers (2-4 layers) before each max pooling layer.
  • Large Number of Parameters: Has a high number of parameters (around 138 million), making it computationally intensive.
  • Transfer Learning: Proved to be an excellent model for transfer learning due to its performance and simplicity.
  • deep-learning-notes_4-1-1-3
4.2.1.4 ResNets
  • Residual block

    • deep-learning-notes_4-2-1-4-1
    • Main path: $a^{[l]}$ --> Linear --> ReLU --> $a^{[l+1]}$ --> Linear --> ReLU --> $a^{[l+2]}$
      • $z^{[l+1]}=W^{[l+1]}a^{[l]}+b^{[l+1]}$, $a^{[l+1]}=g(z^{[l+1]})$, $z^{[l+2]}=W^{[l+2]}a^{[l+1]}+b^{[l+2]}$, $a^{[l+2]}=g(z^{[l+2]})$
    • Shortcut / skip connection: $a^{[l]}$ is added to $z^{[l+2]}$ before the second ReLU
      • $a^{[l+2]}=g(z^{[l+2]}+a^{[l]})$
  • In a plain network, the training error should in theory keep decreasing as the number of layers grows; in reality, it decreases and then increases again after a sweet spot. A ResNet keeps the training error decreasing as the number of layers increases, without it going back up.

  • Why do residual networks work?

    • deep-learning-notes_4-2-1-4-2

    • Residual networks introduce a shortcut or skip connection that allows the network to learn identity functions effectively.

    • This is crucial for training very deep networks by avoiding the vanishing gradient problem.

    • In a residual block:

    • XX -> BigNN -> a[l]a^{[l]} -> Residual block -> a[l+2]a^{[l+2]}

    • Input XX is passed through a standard neural network (BigNN) to obtain a[l]a^{[l]}, and then it goes through the residual block to produce a[l+2]a^{[l+2]}.

    • The formulation of a residual block can be represented as: a[l+2]=g(z[l+2]+a[l])=g(w[l+2]a[l+1]+b[l+2]+a[l])a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(w^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})

      • Here, gg is the activation function.
      • z[l+2]z^{[l+2]} is the output of the layer just before the activation function.
      • w[l+2]w^{[l+2]} and b[l+2]b^{[l+2]} are the weight and bias of the layer, respectively.
      • If w[l+2]=0w^{[l+2]} = 0 and b[l+2]=0b^{[l+2]} = 0, then a[l+2]=g(a[l])=a[l]a^{[l+2]} = g(a^{[l]}) = a^{[l]}, effectively allowing the network to learn the identity function.
    • In cases where the dimensions of a[l+2]a^{[l+2]} and a[l]a^{[l]} differ (e.g., a[l]R128a^{[l]} \in \mathbb{R}^{128} and a[l+2]R256a^{[l+2]} \in \mathbb{R}^{256}), a linear transformation wsw_s (e.g., wsR256×128w_s \in \mathbb{R}^{256 \times 128}) is applied to a[l]a^{[l]} to match the dimensions.

    • This architecture enables training deeper models without performance degradation, which was a significant challenge in deep learning before the development of ResNet.

    • Understanding through backprop (personal notes, not from the class content)

      • Consider the input as x and the residual block's computation as F(x); the identity mapping simply carries x forward and adds it to the block's output, which makes the final value $y=F(x)+x$
      • Backprop for this will be as follow
        • Gradient of the Residual Block's Output: $\dfrac{\partial y}{\partial w}$
          • This represents the gradient of the output yy with respect to the weights ww.
        • By chain rule: yw=yF(x)F(x)xxw+yxxw\dfrac{\partial y}{\partial w} = \dfrac{\partial y}{\partial F(x)}\dfrac{\partial F(x)}{\partial x}\dfrac{\partial x}{\partial w} + \dfrac{\partial y}{\partial x}\dfrac{\partial x}{\partial w}
        • Since y=F(x)+xy=F(x)+x, yF(x)\dfrac{\partial y}{\partial F(x)} and yx\dfrac{\partial y}{\partial x} should be 1
        • So the formula become yw=F(x)xxw+xw\dfrac{\partial y}{\partial w} = \dfrac{\partial F(x)}{\partial x}\dfrac{\partial x}{\partial w} + \dfrac{\partial x}{\partial w}
      • Compare this to the case without the identity mapping $x$: $\dfrac{\partial y}{\partial w} = \dfrac{\partial F(x)}{\partial x}\dfrac{\partial x}{\partial w}$, which is missing the $\dfrac{\partial x}{\partial w}$ term. Adding $x$ to $F(x)$ ensures the network will not produce worse results than before.
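
A minimal numpy sketch of a residual block's forward pass (personal illustration; the two-layer structure is from the equations above, but the names, shapes, and the choice of ReLU are assumptions):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """Plain two-layer residual block; input/output dimensions assumed to match."""
    z1 = W1 @ a_l + b1       # z[l+1]
    a1 = relu(z1)            # a[l+1] = g(z[l+1])
    z2 = W2 @ a1 + b2        # z[l+2]
    return relu(z2 + a_l)    # a[l+2] = g(z[l+2] + a[l])  <- skip connection
```

If W2 and b2 are driven to zero, the output reduces to relu(a_l) = a_l (a_l is non-negative if it came out of a ReLU), which is exactly the identity-function behavior argued above.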

4.2.2 Network in Network and 1 X 1 convolutions

  • 1x1 convolutions
    • Functionality of 1x1 Convolutions: A 1x1 convolution, despite its simplicity, acts as a fully connected layer applied to each pixel separately across depth. It's effectively used for channel-wise interactions and dimensionality reduction.
    • Increasing Network Depth: 1x1 convolutions can increase the depth of the network without a significant increase in computational complexity.
    • Dimensionality Reduction: They are often used for reducing the number of channels (depth) before applying expensive 3x3 or 5x5 convolutions, thus reducing the computational cost.
    • Feature Re-calibration: 1x1 convolutions can recalibrate the feature maps channel-wise, enhancing the representational power of the network.
  • Using 1x1 convolutions:
    • Reduce dimension: given a 28x28x192 input, a CONV 1x1 with 32 filters produces a 28x28x32 output.
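
A 1x1 convolution can be written as one matrix multiply applied at every pixel; below is a minimal numpy sketch of the 28x28x192 -> 28x28x32 reduction above (random values; shapes taken from the notes):

```python
import numpy as np

x = np.random.randn(28, 28, 192)   # H x W x n_c input
W = np.random.randn(192, 32)       # 32 filters, each of size 1x1x192
b = np.zeros(32)

# A 1x1 convolution acts as a fully connected layer across channels at each pixel:
out = np.einsum('hwc,cf->hwf', x, W) + b
print(out.shape)                   # (28, 28, 32): channels reduced 192 -> 32
```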

4.2.3 Inception network

  • Motivation for inception network
    • Input 28x28x192
      • Use 1x1x192 with 64 filters, output 28x28x64
      • Use 3x3x192 with "same" padding, output 28x28x128
      • Use 5x5x192 with "same" padding, output 28x28x32
      • Use "same" padding, s=1 max pooling (with a following 1x1 conv to reduce the 192 channels to 32), output 28x28x32
    • Final output 28x28x256.
    • The problem of computational cost (Consider 5x5x192)
      • 5x5x192x28x28x32 is really big, 120M.
      • Bottleneck layer (Using 1x1 convolution): shrink 28x28x192 --> CONV, 1x1, 16, 1x1x192 --> 28x28x16 (Bottleneck layer) --> CONV 5x5, 32, 5x5x16 --> 28x28x32
      • In total only 28x28x16x192 + 28x28x32x5x5x16 ≈ 12.4M multiplications (see the check at the end of this section).
  • Inception module
    • deep-learning-notes_4-2-1
    • deep-learning-notes_4-2-1-2
    • The softmax branches at intermediate positions are used for regularization, helping to avoid overfitting.
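
A quick multiply-count check of the bottleneck example above (pure arithmetic, no framework assumed):

```python
# 28x28x192 input, 28x28x32 output for the 5x5 branch of the inception module.
naive = 28 * 28 * 32 * 5 * 5 * 192           # direct 5x5 convolution
bottleneck = (28 * 28 * 16 * 1 * 1 * 192     # 1x1 conv down to 16 channels
              + 28 * 28 * 32 * 5 * 5 * 16)   # 5x5 conv on the bottleneck
print(f"{naive / 1e6:.1f}M vs {bottleneck / 1e6:.1f}M")   # ~120.4M vs ~12.4M
```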

4.2.4 MobileNet

  • Depthwise Separable Convolution

    • Depthwise Convolution
      • Computational cost = #filter params x #filter positions x #of filters
    • Pointwise Convolution
      • Computational cost = #filter params x #filter positions x # of filters
        • $n_c \times (n_{out} \times n_{out}) \times n_c'$ (filter params $\times$ filter positions $\times$ number of filters)
    • Cost of depthwise separable convolution / normal convolution
      • $\dfrac{1}{n_c'} + \dfrac{1}{f^2}$, where $n_c'$ is the number of filters and $f$ the filter size (see the cost sketch at the end of this section)
  • MobileNet v2 Bottleneck

    • Residual Connection

      • MobileNet v2 Bottleneck
      • Expansion
      • Depthwise
      • Pointwise (Projection)
    • Similar computational cost as v1

      • MobileNet V2 improves on V1 by introducing an inverted residual structure with linear bottlenecks, which improves feature extraction and information flow through the network. This lets V2 achieve better accuracy than V1 at a similar computational cost: the expansion/projection structure yields a richer feature representation within the same computational budget.
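
A small check of the depthwise-separable cost ratio above, using lecture-style toy sizes (f=3 filter, n_c=3 input channels, n_c'=5 filters, 4x4 output positions; all values assumed for illustration):

```python
f, n_c, n_cp, n_out = 3, 3, 5, 4           # filter size, in/out channels, output width

standard  = f * f * n_c * n_out * n_out * n_cp   # normal convolution
depthwise = f * f * n_c * n_out * n_out          # one filter per input channel
pointwise = n_c * n_out * n_out * n_cp           # 1x1 conv across channels

print((depthwise + pointwise) / standard)  # ~0.311
print(1 / n_cp + 1 / f**2)                 # same ratio: 1/n_c' + 1/f^2
```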

4.2.5 EfficientNet

  • EfficientNet is a series of deep learning models known for high efficiency and accuracy in image classification tasks.
  • Compound Scaling:
    • It introduces a novel compound scaling method, scaling network depth, width, and resolution uniformly with a set of fixed coefficients.
  • High Efficiency and Accuracy:
    • EfficientNets provide state-of-the-art accuracy for image classification while being more computationally efficient compared to other models.

4.2.6 Practical Advice for Using ConvNets

  • Transfer Learning
    • Small training set: Freeze all hidden layers (the frozen activations can be pre-computed and saved to disk), and train only the softmax unit.
    • Big training set: Freeze fewer hidden layers, train some of the later hidden layers (or add new hidden units), plus your own softmax unit.
    • Lots of data: Use the already-trained weights and biases only as initialization, and re-train the whole network, as well as the softmax unit.
  • Data augmentation
    • Common augmentation method: Mirroring, Random Cropping, (Rotation, Shearing, Local warping, ...)
    • Color shifting: add/subtract values from the R, G, B channels. Advanced: PCA color augmentation.
    • Implementing distortions during training: one CPU thread performs the augmentation while other threads or the GPU run the training at the same time.
  • State of CV
    • Data needed (little data to lots of data): Object detection < Image recognition < Speech recognition
    • Lots of data - Simpler algorithms (Less hand-engineering)
    • Little data - more hand-engineering ("hacks") - Transfer learning
    • Two sources of knowledge
      • Labeled data
      • Hand engineered features/network architecture/other components
    • Tips for doing well on benchmarks/winning competitions
      • Ensembling: Train several networks independently and average their outputs ($\hat{y}$); typically 1-2% better, using 3-15 networks.
      • Multi-crop at test time: Run the classifier on multiple versions of the test images and average the results. (10-crop: center plus four corners, and the same 5 crops on the mirrored image.)
    • Use open source code
      • Use architectures of networks published in the literature.
      • Use open source implementations if possible.
      • Use pretrained models and fine-tune on your dataset.

4.3 Object Detection

4.3.1 Object localization

  • Want to detect 4 classes: 1-pedestrian, 2-car, 3-motorcycle, 4-background.
  • Defining the target label y: need to output $b_x, b_y, b_h, b_w$ plus the class label (1-4). (In total 8 elements in the output vector, as below.)
  • y=[pc,bx,by,bh,bw,c1,c2,c3]y=[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]
    • There is an object y=[1,bx,by,bh,bw,c1,c2,c3]y=[1, b_x, b_y, b_h, b_w, c_1, c_2, c_3]
    • No object $y=[0, ?, ?, ?, ?, ?, ?, ?]$ ("don't care" for all the other components)
  • Loss function:
    • L(y^,y)=(y1^y1)2+(y2^y2)2+...+(y8^y8)2L(\hat{y}, y)=(\hat{y_1} - y_1)^2 + (\hat{y_2} - y_2)^2 + ... + (\hat{y_8} - y_8)^2 if y1=1y_1=1
    • L(y^,y)=(y1^y1)2L(\hat{y}, y)=(\hat{y_1} - y_1)^2 if y1=0y_1=0
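
A minimal sketch of building the target vector y and the loss above (the helper names and the 3-class setup are illustrative):

```python
import numpy as np

def target_vector(object_present, box=None, cls=None):
    """y = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3] (hypothetical helper)."""
    y = np.zeros(8)
    if object_present:
        y[0] = 1.0
        y[1:5] = box            # (b_x, b_y, b_h, b_w)
        y[5 + cls] = 1.0        # one-hot over c_1..c_3 (cls in {0, 1, 2})
    return y

def localization_loss(y_hat, y):
    if y[0] == 1:                       # object present: penalize all 8 components
        return np.sum((y_hat - y) ** 2)
    return (y_hat[0] - y[0]) ** 2       # no object: only p_c matters
```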

4.3.2 Landmark detection

  • Annotate key positions (points given as (x, y) coordinates) as landmarks.

4.3.3 Object detection

  • Object detection

    • Starts with closely cropped images.
    • A window slides from the top left to the bottom right, step by step. If no object is found, increase the window size and redo the sliding.
    • Run each individual window crop through the ConvNet.
  • Turning FC layer into convolutional layers

    • Instead of going directly into FC layers, implement them with convolutional filters.
  • Convolution implementation of sliding windows

    • Convolution implementation of sliding windows
    • Instead of running the ConvNet 4 separate times on 14x14x3 windows, the convolutional FC layers share computation across windows and produce the 2x2x4 output directly.
  • Output accurate bounding boxes

    • YOLO algorithm

      • Find the midpoint of each target object and assign the object to the grid cell that contains that point.
      • YOLO algorithm
      • YOLO algorithm-2
    • Intersection over union (IoU)

      • Use to check accuracy.
      • Size of intersection / size of union (normally "Correct" if IoU $\geq$ 0.5; see the sketch at the end of this section)
      • IoU
    • Non-max suppression

      • Keep the box with the maximum probability and suppress all remaining boxes that have a high IoU with it.
      • Non-max suppression-1
      • Non-max suppression-2
    • Anchor Boxes

      • Predefine anchor boxes, associate objects with anchor boxes.
      • If there are more objects than assigned anchor boxes, this scheme fails; likewise if two objects have the same anchor shape.
      • Anchor Boxes-1
      • Anchor Boxes-2
    • Training set

      • y is 3x3x2x8 (# of grid cells x # of anchors x (5 ($p_c, b_x, b_y, b_h, b_w$) + # of classes))
      • YOLO
  • Region Proposals

    • R-CNN: Propose regions. Classify proposed regions one at a time. Output label + bounding box.
    • Fast R-CNN: Propose regions. Use convolution implementation of sliding windows to classify all the proposed regions.
    • Faster R-CNN: Use convolutional network to propose regions.
  • Semantic Segmentation with U-Net

    • Per-pixel class labels

      • Per-pixel class labels
    • Deep Learning for Semantic Segmentation

      • Deep Learning for Semantic Segmentation
    • Transpose Convolution

      • Increase the image size.
      • Transpose Convolution - 1
      • Transpose Convolution - 2
    • U-Net Architecture

      • Skip Connections: the skip path from earlier layers carries fine-grained, low-level detail (color, texture, etc.), while the deep path carries the high-level spatial information needed to figure out where the object really is.
      • U-Net Architecture - Skip Connections
      • U-Net Architecture
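
Minimal numpy sketches of the IoU and non-max suppression steps described above (the box format and threshold value are assumptions):

```python
import numpy as np

def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2); IoU = intersection area / union area."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-probability box, drop overlapping ones, repeat."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```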

4.4 Special Applications: Face Recognition & Neural Style Transfer

4.4.1 Face recognition

  • Face verification vs. face recognition

    • verification vs recognition ---- 1:1 vs 1:K
    • Verification
      • Input image, name/ID.
      • Output whether the input image is that of the claimed person.
    • Recognition
      • Has a database of K persons
      • Get an input image
      • Output ID if the image is any of the K persons (or "not recognized")
  • One-shot learning

    • Learning from one example to recognize the person again.
    • Learning a "similarity" function
      • d(img1, img2) = degree of difference between images
      • If d(img1, img2) $\le \tau$: "same"; if $> \tau$: "different"
  • Siamese network

    • Siamese network
    • Input two different images into two identical CNNs and get their outputs.
    • For example, input $x^{(1)}, x^{(2)}$ separately into the two CNNs; the outputs are the encodings of each, $f(x^{(1)}), f(x^{(2)})$
    • Then compare the distance between them d(x(1),x(2))=f(x(1))f(x(2))22d(x^{(1)}, x^{(2)}) = ||f(x^{(1)}) - f(x^{(2)})||_2^2
    • Parameters of NN define an encoding f(x(i))f(x^{(i)})
    • Learn parameters so that:
      • If $x^{(i)}, x^{(j)}$ are the same person, $||f(x^{(i)}) - f(x^{(j)})||^2$ is small.
      • If $x^{(i)}, x^{(j)}$ are different persons, $||f(x^{(i)}) - f(x^{(j)})||^2$ is large.
  • Triplet Loss

    • Learning objective: (Anchor, Positive), (Anchor, Negative)

      • Want: $||f(A) - f(P)||^2 + \alpha \le ||f(A) - f(N)||^2$, where $\alpha$ is the margin (similar to the margin in SVMs)
      • f(A)f(P)2f(A)f(N)2+α0||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + \alpha \le 0
    • Loss function

      • Given 3 images A, P, N:
      • L(A,P,N)=max(f(A)f(P)2f(A)f(N)2+α,0)L(A, P, N) = max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + \alpha, 0)
      • J=i=1mL(A(i),P(i),N(i))J = \sum\limits_{i=1}^m L(A^{(i)},P^{(i)},N^{(i)})
    • If you have a training set of 10K pictures of 1K persons, form triplets (A, P, N) from those 10K pictures and feed them into the loss function.

    • Choosing the triplets A, P, N

      • During training, if A, P, N are chosen randomly, d(A,P)+αd(A,N)d(A,P) +\alpha \le d(A, N) is easily satisfied.
      • Choose triplets that are "hard" to train on (e.g., choose $d(A,P) \approx d(A,N)$).
    • Train with the triplet loss to make J smaller, which makes the distance d small for the same person and large for different persons (see the sketch at the end of this section).

  • Face Verification and Binary Classification

    • Learning the similarity function
    • y^=σ(k=1128wkf(x(i))kf(x(j))k+b)\hat{y} = \sigma (\sum\limits_{k=1}^{128}w_k|f(x^{(i)})_k-f(x^{(j)})_k| + b)
    • Only store the pre-computed $f(x^{(j)})$; this saves storage and computational resources.
    • Face verification treated as supervised learning.
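
A minimal numpy sketch of the triplet loss from this section (the margin value is an assumption; f_a, f_p, f_n are precomputed encodings):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """L(A, P, N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    d_pos = np.sum((f_a - f_p) ** 2)    # distance to the positive example
    d_neg = np.sum((f_a - f_n) ** 2)    # distance to the negative example
    return max(d_pos - d_neg + alpha, 0.0)
```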

4.4.2 Neural style transfer

  • What is it?
    • Neural style transfer
  • Cost Function
    • J(G)=αJcontent(C,G)+βJStyle(S,G)J(G) = \alpha J_{content}(C, G) + \beta J_{Style}(S, G)
    • Find the generated image G
      1. Initialize G randomly, e.g., G: 100x100x3
      2. Use gradient descent to minimize J(G) G:=GGJ(G)G:=G-\dfrac{\partial}{\partial G}J(G)
  • Content Cost Function
    • J(G)=αJcontent(C,G)+βJStyle(S,G)J(G) = \alpha J_{content}(C, G) + \beta J_{Style}(S, G)
    • Layer Selection
      • Shallow layers in the network capture basic features and structures.
      • Deeper layers capture more high-level details and textures.
      • Typically, a layer in the middle of the network is chosen to balance between basic structures and detailed features.
    • Use a pre-trained ConvNet (e.g., the VGG network).
    • Let $a^{[l](C)}$ and $a^{[l](G)}$ be the activations of layer $l$ on the two images.
    • If $a^{[l](C)}$ and $a^{[l](G)}$ are similar, both images have similar content.
    • $J_{content}(C, G) = \dfrac{1}{2}||a^{[l](C)} - a^{[l](G)}||^2$
  • Style Cost Function
    • Meaning of the "style" of an image: say you are using layer $l$'s activations to measure "style". Define style as the correlation between activations across channels. (A sketch of the style cost appears at the end of this section.)
    • Style matrix (G --> Gram matrix)
      • Let ai,j,k[l]=a_{i,j,k}^{[l]}= activation at (i,j,k)(i,j,k). G[l]G^{[l]} is nc[l]n_c^{[l]} x nc[l]n_c^{[l]}
        • Gkk[l](S)=i=1nH[l]j=1nW[l]aijk[l](S)aijk[l](S)G_{kk^{\prime}}^{[l](S)} = \sum\limits_{i=1}^{n_{H}^{[l]}}\sum\limits_{j=1}^{n_{W}^{[l]}} a_{ijk}^{[l](S)} a_{ijk\prime}^{[l](S)}
        • Gkk[l](G)=i=1nH[l]j=1nW[l]aijk[l](G)aijk[l](G)G_{kk^{\prime}}^{[l](G)} = \sum\limits_{i=1}^{n_{H}^{[l]}}\sum\limits_{j=1}^{n_{W}^{[l]}} a_{ijk}^{[l](G)} a_{ijk\prime}^{[l](G)}
      • Jstyle[l](S,G)=G[l](S)G[l](G)F2J_{style}^{[l]}(S,G)=||G^{[l](S)} - G^{[l](G)}||_F^2
        • =1(2nH[l]nW[l]nC[l])2kk(Gkk[l](S)Gkk[l](G))2=\dfrac{1}{(2n_{H}^{[l]} n_{W}^{[l]} n_{C}^{[l]})^2} \sum_k\sum_{k\prime}(G_{kk^{\prime}}^{[l](S)} - G_{kk^{\prime}}^{[l](G)})^2
      • JStyle(S,G)=lλ[l]JStyle[l](S,G)J_{Style}(S,G) = \sum\limits_l\lambda^{[l]}J_{Style}^{[l]}(S,G)
  • 1D and 3D Generalizations
    • 1D: 14 x 1(ncn_c Channels) * 5 x 1 --> 10x16 (16 filters)
    • 3D: 14x14x14 x 1(ncn_c Channels) * 5x5x5 x 1 --> 10x10x10 x 16 (16 filters)
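
A minimal numpy sketch of the layer-l style cost above, computing the two Gram matrices and the normalized squared Frobenius distance (activation shapes assumed as (n_H, n_W, n_C)):

```python
import numpy as np

def style_cost_layer(a_S, a_G):
    """a_S, a_G: layer-l activations of the style and generated images."""
    n_H, n_W, n_C = a_S.shape
    A_S = a_S.reshape(n_H * n_W, n_C)   # unroll all spatial positions
    A_G = a_G.reshape(n_H * n_W, n_C)
    G_S = A_S.T @ A_S                   # Gram matrix, n_C x n_C
    G_G = A_G.T @ A_G
    return np.sum((G_S - G_G) ** 2) / (2 * n_H * n_W * n_C) ** 2
```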

V. Sequence Models

5.1 Recurrent Neural Networks

5.1.1 RNN model

  • Notation
    • Input x: x<1>,x<2>,...,x<t>x^{<1>}, x^{<2>}, ..., x^{<t>}
    • Total Length of input: TxT_x
    • Output y: y<1>,y<2>,...,y<t>y^{<1>}, y^{<2>}, ..., y^{<t>}
    • Total Length of output: TyT_y
  • Recurrent Neural Network
    • $a^{<0>} = \vec{0}$ (zero vector)
    • $a^{<1>}=g_1(w_{aa}a^{<0>}+w_{ax}x^{<1>}+b_a)$ <-- tanh/ReLU
    • y^<1>=g2(wyaa<1>+by)\hat{y}^{<1>}=g_2(w_{ya}a^{<1>}+b_y) <-- Sigmoid
    • a<t>=g(waaa<t1>+waxx<t>+ba)=g(wa[a<t1>,x<t>]+ba)a^{<t>}=g(w_{aa}a^{<t-1>}+w_{ax}x^{<t>}+b_a)=g(w_a[a^{<t-1>},x^{<t>}]+b_a)​ where [waa,wax]=wa[w_{aa}, w_{ax}]=w_a
      • RNN-Simplify Notation
    • y^<t>=g2(wyaa<t>+by)=g(wya<t>+by)\hat{y}^{<t>}=g_2(w_{ya}a^{<t>}+b_y)=g(w_ya^{<t>}+b_y)
    • Recurrent Neural Network
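
A minimal numpy sketch of one forward step of the RNN cell above, using the simplified notation $w_a[a^{<t-1>}, x^{<t>}] + b_a$ (tanh for the hidden state, sigmoid for the output, as in the notes; shapes are assumptions):

```python
import numpy as np

def rnn_cell_forward(x_t, a_prev, Wa, ba, Wy, by):
    """a<t> = tanh(Wa [a<t-1>, x<t>] + ba); y<t> = sigmoid(Wy a<t> + by)."""
    concat = np.concatenate([a_prev, x_t])      # [a<t-1>, x<t>]
    a_t = np.tanh(Wa @ concat + ba)
    y_t = 1 / (1 + np.exp(-(Wy @ a_t + by)))    # sigmoid output
    return a_t, y_t
```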

5.1.2 Backpropagation through time(BPTT)

  • Total Loss for the Sequence: L(y^,y)=t=1TyL<t>(y^<t>,y<t>)L(\hat{y}, y) = \sum\limits^{T_y}_{t=1}L^{<t>}(\hat{y}^{<t>}, y^{<t>})

  • Loss at Each Timestep: L<t>(y^<t>,y<t>)=y<t>logy^<t>(1y<t>)log(1y^<t>)L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>}log\hat{y}^{<t>}-(1-y^{<t>})log(1 - \hat{y}^{<t>})

  • Gradient Calculations for BPTT:

    • Gradient of the Loss w.r.t. Output at Each Timestep: L<t>y^<t>=y<t>y^<t>+1y<t>1y^<t>\dfrac{\partial L^{<t>}}{\partial \hat{y}^{<t>}} = -\dfrac{y^{<t>}}{\hat{y}^{<t>}} + \dfrac{1 - y^{<t>}}{1 - \hat{y}^{<t>}}

    • Gradient of the Loss w.r.t. Weights at Each Timestep:

      LW=t=1TyL<t>y^<t>y^<t>W\dfrac{\partial L}{\partial W} = \sum_{t=1}^{T_y} \dfrac{\partial L^{<t>}}{\partial \hat{y}^{<t>}} \cdot \dfrac{\partial \hat{y}^{<t>}}{\partial W}

    • Gradient of the Loss w.r.t. Biases at Each Timestep: Lb=t=1TyL<t>y^<t>y^<t>b\dfrac{\partial L}{\partial b} = \sum_{t=1}^{T_y} \dfrac{\partial L^{<t>}}{\partial \hat{y}^{<t>}} \cdot \dfrac{\partial \hat{y}^{<t>}}{\partial b}

  • Weight Update in Gradient Descent:

    • Update Rule for Weights: W:=WαLWW := W - \alpha \dfrac{\partial L}{\partial W}
    • Update Rule for Biases: b:=bαLbb := b - \alpha \dfrac{\partial L}{\partial b}

5.1.3 Different types of RNNs

  • Types

    • One to one:
      • The standard form of a neural network where there is one input and one output.
      • Example: A basic classification or regression task where each input data point maps to a single output.
    • One to many:
      • One input leading to multiple outputs.
      • Example: Music generation, where a single starting note or chord can generate a sequence of musical notes.
    • Many to one:
      • Multiple inputs lead to one output.
      • Example: Sentiment analysis in text, where a sequence of words (input) is used to determine a sentiment label (output).
    • Many to many (Same length):
      • An architecture where the input and output sequences are of equal length.
      • Example: Video frame labeling, where each frame in a video (input) is labeled with a tag (output), or synchronous machine translation where each input word corresponds to an output word.
    • Many to many (Different length):
      • The input and output sequences can be of different lengths.
      • Example: Machine translation, where a sentence in one language (input) is translated into a sentence in another language (output). The lengths of the input and output sentences can vary.
    • RNN Types
  • Language Model and Sequence Generation:

    • A language model computes the probability of each next word given the words before it; speech recognition, for example, uses it to pick the most likely sentence.
      • Speech Recognition
  • Sampling Novel Sequences:

    • Generating Based on Previous Word:
      • In this process, a model generates a sequence (like a sentence) word by word.
      • Each new word is sampled based on the probabilities provided by the model, considering the previously generated words.
    • Conditional Probability: The model predicts each subsequent word based on the conditional probability given the sequence of words already generated.
    • Use of <EOS> Token:
      • Generation continues until the model produces a special end-of-sequence token, often denoted as <EOS>.
      • This token indicates that the model has completed the sequence, and no further words are needed.
    • Applications: This technique is commonly used in text generation tasks like story generation, chatbots, and language modeling.
  • Character-level language model:

    • Vocabulary: character-based (alphabet) vs. word-based.
  • Vanishing Gradients with RNNs:

    • Long-Term Dependencies: In RNNs, earlier words in a sequence can have significant effects on later words, such as determining singular or plural forms or setting the context for subsequent words.
    • Challenge of Long-Term Dependencies: RNNs often struggle with maintaining information from early inputs in a sequence when predicting later outputs. This is due to the vanishing gradient problem, where gradients become increasingly small as they are propagated back through time, making it difficult for early inputs to significantly influence the network's learning.
    • Local Influences: In many cases, the outputs of RNNs are heavily influenced by inputs that are temporally close. As a result, the network might fail to recognize and learn dependencies between elements that are far apart in the sequence.
  • Exploding Gradients in RNNs

    • Problem Description: RNNs can also experience exploding gradients, where gradients become excessively large. This can lead to unstable training processes and wildly oscillating network weights.
    • Gradient Clipping Solution: A common solution to the exploding gradients problem is gradient clipping. This involves setting a threshold value, and if the gradients exceed this threshold during backpropagation, they are "clipped" or rescaled to keep them within a manageable range.
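
A minimal sketch of gradient clipping by global norm as described above (the threshold value is an assumption):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale all gradients if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```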

5.1.4 Gated Recurrent Unit (GRU)

  • C = memory cell; $c^{<t>}=a^{<t>}$ (just for now; the LSTM differs)
    • In GRUs, the concept of a memory cell is simplified compared to LSTMs.
  • Candidate Memory: c~<t>=tanh(wc[c<t1>,x<t>]+bc)\tilde{c}^{<t>}=tanh(w_c [c^{<t-1>}, x^{<t>}] + b_c)​​ Previous memory and current input.
    • Sometimes written in papers as $\tilde{h}$.
    • This represents the candidate memory, computed as a combination of the previous memory and the current input.
  • Update Gate: Γu=σ(wu[c<t1>,x<t>]+bu)\Gamma_u=\sigma (w_u[c^{<t-1>}, x^{<t>}] + b_u)
    • Decides how much of the candidate memory should be used to update the current memory cell.
  • Relevance Gate: Γr=σ(wr[c<t1>,x<t>]+br)\Gamma_r=\sigma (w_r[c^{<t-1>}, x^{<t>}] + b_r)
    • It decides how much of the past information (from previous timesteps) needs to be forgotten.
  • c<t>=Γuc~<t>+(1Γu)c<t1>c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}​ (*​​ - element-wise)
    • Sometimes written in papers as $h$.
    • The current memory cell is updated by an element-wise multiplication of the update gate with the candidate memory, and the complementary part of the update gate with the previous memory.
    • This formula allows the GRU to retain or forget information effectively, balancing between the candidate memory and the previous memory.

GRU-Sample-1

GRU-Sample-2

Understanding Gated Recurrent Unit (GRU) |  Medium

My own understanding (based on the above image):

  1. Input and Hidden State: xtx_t and ht1h_{t-1}

    These will be used as inputs to both the reset gate (rtr_t) and update gate (ztz_t).

  2. Reset Gate (rtr_t): How much past information to forget

    rt=σ(Wrxt+Urht1)r_t = \sigma(W_r x_t + U_r h_{t-1})

    • Input xtx_t (blue line, bottom to top): multiplied with weight matrix WrW_r.
    • Hidden state ht1h_{t-1} (orange line, top to bottom): multiplied with weight matrix UrU_r.
    • The results are added, then passed through a sigmoid activation σ\sigma to compute rtr_t.
    • The closer rtr_t is to 0, the less past information is retained.
  3. Update Gate (ztz_t): How much of the new memory to keep

    zt=σ(Wzxt+Uzht1)z_t = \sigma(W_z x_t + U_z h_{t-1})

    • Input xtx_t: multiplied with weight matrix WzW_z.
    • Hidden state ht1h_{t-1}: multiplied with weight matrix UzU_z.
    • The two products are summed and passed through a sigmoid activation to get ztz_t.
    • The closer ztz_t is to 1, the more new information is incorporated.
  4. Candidate Hidden State (h~t\tilde{h}_t)

    h~t=tanh(Wxt+U(rtht1))\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))

    • Compute element-wise product of rtr_t and ht1h_{t-1}: rtht1r_t \odot h_{t-1}.
    • Multiply input xtx_t with weight matrix WW.
    • Multiply the reset-modified hidden state with weight matrix UU.
    • Sum the two results and apply tanh\tanh activation to get h~t\tilde{h}_t.
    • This is the "candidate" memory content to write.
  5. Final Hidden State (hth_t): Combine old and new memory

    ht=(1zt)ht1+zth~th_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

    • Left part: 1zt1 - z_t multiplied with ht1h_{t-1} (retain old memory).
    • Right part: ztz_t multiplied with h~t\tilde{h}_t (add new memory).
    • Combine both to get the final hidden state hth_t.
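
A minimal numpy sketch of one GRU step following the equations above (bias terms omitted for brevity; all shapes assumed):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Ur, Wz, Uz, W, U):
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    h_tilde = np.tanh(W @ x_t + U @ (r_t * h_prev))   # candidate hidden state
    return (1 - z_t) * h_prev + z_t * h_tilde         # final hidden state h_t
```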

5.1.5 Long Short Term Memory (LSTM)

  • c~<t>=tanh(wc[a<t1>,x<t>]+bc)\tilde{c}^{<t>}=tanh(w_c [a^{<t-1>}, x^{<t>}] + b_c)
  • Update Gate: Γu=σ(wu[a<t1>,x<t>]+bu)\Gamma_u=\sigma (w_u[a^{<t-1>}, x^{<t>}] + b_u)
  • Forget Gate: Γf=σ(wf[a<t1>,x<t>]+bf)\Gamma_f=\sigma (w_f[a^{<t-1>}, x^{<t>}] + b_f)
  • Output Gate: Γo=σ(wo[a<t1>,x<t>]+bo)\Gamma_o=\sigma (w_o[a^{<t-1>}, x^{<t>}] + b_o)
  • c<t>=Γuc~<t>+Γfc<t1>c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}
  • a<t>=a^{<t>}= Γotanh (c<t>)\Gamma_o * {tanh} \space (c^{<t>})

LSTM-1

LSTM-2

  • Peephole Connection
    • Add $c^{<t-1>}$ into the u/f/o gates. Here is an example for the output gate.
    • Γo=σ(wo[a<t1>,x<t>,c<t1>]+bo)\Gamma_o=\sigma (w_o[a^{<t-1>}, x^{<t>}, c^{<t-1>}] + b_o)

Long Short-Term Memory (LSTM) | Medium

My own understanding (based on the above image):

  1. Input, Hidden State, and Cell State: xtx_t, ht1h_{t-1}, and ct1c_{t-1}

    These are used together to compute various gates (forget gate, update gate, and output gate) in the LSTM.

  2. Forget Gate (ftf_t): How much past memory to forget

    ft=σ(Wfxt+Ufht1+bf)f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)

    • Input xtx_t: multiplied with weight matrix WfW_f.
    • Hidden state ht1h_{t-1}: multiplied with weight matrix UfU_f.
    • Add both results and apply sigmoid activation to get ftf_t.
    • Controls how much of the previous cell state ct1c_{t-1} to retain.
  3. Input Gate (iti_t): How much new information to write

    it=σ(Wixt+Uiht1+bi)i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)

    • Input xtx_t and hidden state ht1h_{t-1} processed with weights WiW_i and UiU_i.
    • Output is passed through sigmoid to get iti_t.
    • Controls how much new candidate memory to allow in.
  4. Candidate Cell State (c~t\tilde{c}_t)

    c~t=tanh(Wcxt+Ucht1+bc)\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)

    • Input xtx_t and hidden state ht1h_{t-1} transformed and summed.
    • Apply tanh to generate candidate memory c~t\tilde{c}_t.
  5. Update Cell State (ctc_t): Combine old and new memory

    ct=ftct1+itc~tc_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

    • Forget gate ftf_t decides how much of the old memory ct1c_{t-1} to keep.
    • Input gate iti_t decides how much of the candidate memory c~t\tilde{c}_t to write in.
    • Combine both to get updated cell state ctc_t.
  6. Output Gate (oto_t): What to output from the cell

    ot=σ(Woxt+Uoht1+bo)o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)

    • Controls how much of the current cell state to expose as output.
  7. Final Hidden State (hth_t): Compute output

    ht=ottanh(ct)h_t = o_t \odot \tanh(c_t)

    • Cell state ctc_t is passed through tanh, then modulated by output gate oto_t.
    • The result is the final hidden state hth_t, which can be passed to the next time step or used as output.

peephole connection LSTM | ResearchGate

Peephole Connection LSTM:

Add ct1c_{t-1} to gates, element-wise multiply. Peephole connections allow LSTM gates to directly access the cell state, enabling more precise timing and memory control in sequential tasks.

ft=σ(Wfxt+Ufht1+Vfct1+bf)f_t = \sigma(W_f x_t + U_f h_{t-1} + V_f \odot c_{t-1} + b_f)

it=σ(Wixt+Uiht1+Vict1+bi)i_t = \sigma(W_i x_t + U_i h_{t-1} + V_i \odot c_{t-1} + b_i)

ot=σ(Woxt+Uoht1+Voct+bo)o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o \odot c_t + b_o)
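
A minimal numpy sketch of one (non-peephole) LSTM step following the equations above; to get the peephole variant, add the $V \odot c$ terms inside the gate pre-activations as shown. Parameter names are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """p: dict of weights/biases Wf, Uf, bf, Wi, Ui, bi, Wo, Uo, bo, Wc, Uc, bc."""
    f_t = sigmoid(p['Wf'] @ x_t + p['Uf'] @ h_prev + p['bf'])   # forget gate
    i_t = sigmoid(p['Wi'] @ x_t + p['Ui'] @ h_prev + p['bi'])   # input gate
    o_t = sigmoid(p['Wo'] @ x_t + p['Uo'] @ h_prev + p['bo'])   # output gate
    c_tilde = np.tanh(p['Wc'] @ x_t + p['Uc'] @ h_prev + p['bc'])
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state
    h_t = o_t * np.tanh(c_t)                                    # new hidden state
    return h_t, c_t
```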

5.1.6 Bidirectional RNN (BRNN)

  • A normal (unidirectional) RNN cannot determine a word's meaning from the words or sentences that come after it.
  • Cons: For speech recognition, you need to wait for the person to finish the whole sentence before making a prediction.

BRNN

  1. Input and Hidden States: xtx_t, ht1\overrightarrow{h}_{t-1}, and ht+1\overleftarrow{h}_{t+1}

    In a BRNN, the input sequence is processed in both forward and backward directions:

    • The forward RNN computes hidden states from left to right: ht\overrightarrow{h}_t

    • The backward RNN computes hidden states from right to left: ht\overleftarrow{h}_t

    • The final hidden state at time step tt is typically the concatenation (or sometimes sum) of both:

      ht=[ht;ht]h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]

  2. Forward Hidden State (ht\overrightarrow{h}_t): Process input left to right

    ht=tanh(Wx(f)xt+Wh(f)ht1+b(f))\overrightarrow{h}_t = \tanh(W_x^{(f)} x_t + W_h^{(f)} \overrightarrow{h}_{t-1} + b^{(f)})

    • Processes input from t=1t = 1 to TT.
    • Uses forward weights Wx(f),Wh(f),b(f)W_x^{(f)}, W_h^{(f)}, b^{(f)}.
    • Computes hidden state based on current input and previous forward hidden state.
  3. Backward Hidden State (ht\overleftarrow{h}_t): Process input right to left

    ht=tanh(Wx(b)xt+Wh(b)ht+1+b(b))\overleftarrow{h}_t = \tanh(W_x^{(b)} x_t + W_h^{(b)} \overleftarrow{h}_{t+1} + b^{(b)})

    • Processes input from t=Tt = T to 11 (reversed).
    • Uses backward weights Wx(b),Wh(b),b(b)W_x^{(b)}, W_h^{(b)}, b^{(b)}.
    • Computes hidden state based on current input and next backward hidden state.
  4. Final Output (yty_t): Merge forward and backward context

    yt=g([ht;ht])y_t = g([\overrightarrow{h}_t ; \overleftarrow{h}_t])

    • Combines both directions by concatenating or summing ht\overrightarrow{h}_t and ht\overleftarrow{h}_t.
    • Function g()g(\cdot) can be:
      • A linear layer for classification
      • A projection layer
      • Or directly used as output in sequence models

5.1.7 Deep RNNs

Deep RNNs

Each layer is a separate RNN with its own parameters: ht(l)=tanh(Wx(l)xt(l)+Wh(l)ht1(l)+b(l))h_t^{(l)} = \tanh(W_x^{(l)} x_t^{(l)} + W_h^{(l)} h_{t-1}^{(l)} + b^{(l)})

  • xt(l)x_t^{(l)}: input at layer ll (either original input or lower-layer output)
  • ht1(l)h_{t-1}^{(l)}: previous time step’s hidden state in the same layer
  • Each layer has its own weights: Wx(l),Wh(l),b(l)W_x^{(l)}, W_h^{(l)}, b^{(l)}

5.2 Natural Language Processing & Word Embeddings

5.2.1 Word Representation

  • 1-hot representation (1-d vector)

    • $O_{5391}$ means a one-hot vector with a 1 at the 5391st position.
  • Featurized representation: word embedding

    • multiple dimensions (typically 300-dimensional)
    • Featurized representation
  • Visualizing word embeddings (t-SNE)

    • Similar words appear close together, forming clusters.
    • t-SNE
  • Using Word Embedding

    • Transfer learning and word embeddings

      1. Learn word embeddings from large text corpus. (1-100B words)

        (Or download pre-trained embedding online.)

      2. Transfer embedding to new task with smaller training set.

        (Say, 100k words)

      3. Optional: Continue to finetune the word embedding with new data.

  • Properties of word embeddings

    • Analogies using word vectors

      • Analogies using word vectors

      • Cosine similarity

      • sim(ew,ekingeman+ewoman)sim(e_w, e_{king} - e_{man}+e_{woman})

        • sim(u,v)=uTvu2v2sim(u,v)=\dfrac{u^Tv}{||u||_2||v||_2}
        Cosine similarity

5.2.2 Embedding matrix

Embedding matrix

5.2.3 Learning Word Embeddings: Word2vec

  • Skip-grams
    • Vocab size = 10,000
    • OcO_c -> EE -> ece_c -> osoftmaxo_{softmax} -> y^\hat{y}
      • OcO_c: 1-hot; EE: Embedding matrix; ece_c embedding vector
      • osoftmaxo_{softmax}: output of softmax layer (probability); y^\hat{y}: the output.
    • Softmax: p(tc)=eθtTecj=110,000eθjTecp(t|c) = \dfrac{e^{\theta_t^T e_c}}{\sum_{j=1}^{10,000}e^{\theta_j^T e_c}}
      • θt\theta_t​ = parameter associated with output t.
      • Given the context word $c$, this softmax gives the probability that $t$ is the target word.
    • L(y^,y)=i=110,000yilogyi^L(\hat{y},y)=-\sum_{i=1}^{10,000}y_i log\hat{y_i}
    • How to sample the context c?
      • Different heuristics can be used to balance sampling between common and less common words.
    • Hierarchical Softmax classifier: cost scales as $\log|v|$; common words are placed nearer the top of the tree.
    • Hierarchical Softmax
  • Negative Sampling: train the skip-gram model using only a few negative samples (plus the positive one) instead of a full softmax, avoiding computation over the whole vocabulary.

Skip Gram

My own understanding (Word2vec with Skip-gram):

  1. Input and Target Words: wtw_t (center word) and wt+jw_{t+j} (context word)

    • The Skip-gram model receives a single input word wtw_t (center word).
    • The training objective is to predict one or more context words wt+jw_{t+j} where j[c,c],j0j \in [-c, c], j \ne 0 and cc is the context window size.
  2. Input Word Representation: One-hot encoding and Embedding Lookup

    • The input word wtw_t is first represented as a one-hot vector xRVx \in \mathbb{R}^{V}, where VV is the vocabulary size.
    • This vector is multiplied by an embedding matrix ERV×dE \in \mathbb{R}^{V \times d} to retrieve the dense word vector: vwt=Ex=EwtRdv_{w_t} = E^\top x = E_{w_t} \in \mathbb{R}^{d}
    • vwtv_{w_t} is the embedding vector for the input word.
  3. Output Layer: Predicting context word probabilities

    • The model tries to compute a score for every word in the vocabulary as a possible context word.
    • This is done by multiplying the embedding vector with an output weight matrix URV×dU \in \mathbb{R}^{V \times d} and applying a softmax function: P(wcontextwt)=softmax(Uvwt)=exp(Uwvwt)wVexp(Uwvwt)P(w_{context} \mid w_t) = \text{softmax}(U v_{w_t}) = \frac{\exp(U_{w}^\top v_{w_t})}{\sum_{w' \in V} \exp(U_{w'}^\top v_{w_t})}
    • For each actual context word wt+jw_{t+j}, we maximize this probability.
  4. Loss Function: Cross-entropy over context words

    • The loss is the negative log-likelihood over the context window: L=cjc,  j0logP(wt+jwt)\mathcal{L} = - \sum_{-c \leq j \leq c,\; j \ne 0} \log P(w_{t+j} \mid w_t)
    • This encourages the model to place actual context words close to the center word in vector space.
  5. Optimization Tricks (for efficiency)

    • In practice, computing softmax over a large vocabulary is expensive.
    • Two common approximations are used:
      • Negative Sampling: Only update a small number of negative samples + one positive
      • Hierarchical Softmax: Replace softmax with a binary tree of decisions
  6. Output: Word Embedding Matrix

    • After training, the rows of the input embedding matrix EE (or average of input and output vectors) are used as final word vectors.
    • These vectors capture semantic and syntactic word relationships.
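
A minimal numpy sketch of the skip-gram softmax loss for one (center, context) pair, matching the steps above (E is the input embedding matrix, U the output matrix; both shapes assumed as V x d):

```python
import numpy as np

def skip_gram_loss(center_idx, context_idx, E, U):
    """-log P(context | center) under a full-softmax skip-gram model."""
    v_c = E[center_idx]                  # embedding of the center word
    scores = U @ v_c                     # one score per vocabulary word
    scores -= scores.max()               # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[context_idx])
```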

5.2.4 Learning Word Embedding: GloVe

GloVe.png

  1. Motivation: Combine global co-occurrence with local context

    • Word2Vec (Skip-gram) learns from local context windows.
    • GloVe is designed to use global word co-occurrence statistics across the entire corpus.
    • The key insight: ratios of co-occurrence probabilities capture semantic relationships.
  2. Co-occurrence Matrix: XijX_{ij}

    • Build a symmetric matrix XRV×VX \in \mathbb{R}^{V \times V} where:
      • XijX_{ij} = number of times word jj appears in the context of word ii
    • This matrix captures the global statistical information about the corpus.
  3. Learning Objective: Weighted least squares regression

    • Learn word vectors wi,wjw_i, w_j and biases bi,bjb_i, b_j for each word pair i,ji, j
    • The goal is to approximate: wiw~j+bi+b~jlog(Xij)w_i^\top \tilde{w}_j + b_i + \tilde{b}_j \approx \log(X_{ij})
    • Where:
      • wiw_i is the vector for the target word
      • w~j\tilde{w}_j is the vector for the context word
      • XijX_{ij} is the co-occurrence count
      • bib_i and b~j\tilde{b}_j are scalar biases
  4. Cost Function

    • Use a weighted squared error to train only on nonzero entries of XijX_{ij}: J=i,j=1Vf(Xij)(wiw~j+bi+b~jlog(Xij))2J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2

    • Weighting function:

      f(x)={(x/xmax)αif x<xmax1otherwisef(x) = \begin{cases} (x / x_{\text{max}})^\alpha & \text{if } x < x_{\text{max}} \\ 1 & \text{otherwise} \end{cases}
    • Typical values:

      • α=0.75\alpha = 0.75
      • xmax=100x_{\text{max}} = 100
  5. Output Word Vectors

    • Final word representation is typically: Embedding(wi)=wi+w~i\text{Embedding}(w_i) = w_i + \tilde{w}_i
    • Combines both target and context word embeddings.
  6. Key Properties

    • GloVe learns vectors such that vector differences encode meaningful relationships, e.g.:
      • kingman+womanqueen\text{king} - \text{man} + \text{woman} \approx \text{queen}
    • Works well for analogies and capturing linear semantic structure.
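
A minimal numpy sketch of the GloVe weighting function and cost above, looping over the nonzero co-occurrences only (matrix shapes assumed):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_cost(X, W, W_tilde, b, b_tilde):
    """X: V x V co-occurrence counts; W, W_tilde: V x d embeddings; b, b_tilde: V biases."""
    J = 0.0
    for i, j in zip(*np.nonzero(X)):     # sum over nonzero entries of X only
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        J += glove_weight(X[i, j]) * diff ** 2
    return J
```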

5.2.5 Applications Using Word Embeddings

  • Sentiment Classification

Sentiment-Classification-1.png

  • Debiasing Word Embeddings

    • Word embeddings can reflect gender, ethnicity, age, sexual orientation, and other biases of the text used to train the model.

      Man: King as Woman: Queen

      Man: Computer Programmer as Woman: Homemaker

      Father: Doctor as Mother: Nurse

    • Addressing bias in word embeddings:

      1. Identify bias direction: average word embedding of (he, she), (male, female).
      2. Neutralize: For every word that is not definitional, project to get rid of bias.
      3. Equalize pairs.

5.3 Sequence Models & Attention Mechanism

5.3.1 Sequence To Sequence Model

Seq2Seq.png

  1. Architecture Overview

    • A Seq2Seq model consists of two main components:
      • Encoder: processes the input sequence into a fixed-length vector.
      • Decoder: generates the output sequence based on the encoder output.
    • Both are typically implemented using RNNs, LSTMs, or GRUs.
  2. Encoder: Read the input sequence

    For an input sequence x=(x1,x2,,xT)x = (x_1, x_2, \dots, x_T), the encoder RNN updates its hidden state at each time step: ht(enc)=RNN(xt,ht1(enc))h_t^{(enc)} = \text{RNN}(x_t, h_{t-1}^{(enc)})

    • The final hidden state hT(enc)h_T^{(enc)} is used to represent the entire input sequence.
    • In LSTM-based encoders, both hidden state hT(enc)h_T^{(enc)} and cell state cT(enc)c_T^{(enc)} are passed to the decoder.
  3. Decoder: Generate the output sequence

    Given the target sequence y=(y1,y2,,yT)y = (y_1, y_2, \dots, y_{T'}), the decoder predicts each output token one at a time: ht(dec)=RNN(yt1,ht1(dec))h_t^{(dec)} = \text{RNN}(y_{t-1}, h_{t-1}^{(dec)})

    • The initial hidden state h0(dec)h_0^{(dec)} is set to the encoder's final state: h0(dec)=hT(enc)h_0^{(dec)} = h_T^{(enc)}
    • At each time step, the decoder predicts the next word: y^t=softmax(Woutht(dec)+bout)\hat{y}_t = \text{softmax}(W_{out} h_t^{(dec)} + b_{out})
  4. Training: Teacher Forcing

    • During training, the decoder receives the true previous token yt1y_{t-1} as input.
    • This technique is called teacher forcing, and it helps the model converge faster.
  5. Inference: Autoregressive Decoding

    • During inference (e.g. translation), the decoder only has access to its own previous prediction. Input to the decoder at time t:
      • Training: true y_{t-1}
      • Inference: predicted ŷ_{t-1}
    • Decoding stops when:
      • An end-of-sequence token is produced, or
      • A max length is reached
  6. Loss Function

    • Typically use cross-entropy loss over each predicted token: L=t=1TlogP(yty<t,x)\mathcal{L} = - \sum_{t=1}^{T'} \log P(y_t \mid y_{<t}, x)
    • The model learns to predict each output token given all previous ones and the encoded input.
  7. Enhancements

    • Bidirectional Encoder: Use BRNN to capture both past and future context in the input.
    • Attention Mechanism: Instead of using only the final encoder state, let the decoder attend to all encoder states dynamically.
    • Beam Search: Improves inference quality by exploring multiple decoding paths.
  8. Use Cases

    • Machine Translation (e.g., English → French)
    • Text Summarization
    • Dialogue Generation
    • Question Answering (closed-domain)

5.3.2 Beam Search

beam-search-1.png

beam-search-2.png

  1. Motivation: Why not greedy decoding?

    • Greedy decoding picks the word with the highest probability at each step.
    • However, this doesn't guarantee a globally optimal sequence.
    • It might choose a high-probability first word that leads to a poor overall sequence.
  2. Beam Search Idea

    Beam Search keeps track of the top-k most likely partial sequences at each time step, instead of just the single best one.

    • The number kk is called the beam width.
    • At each time step:
      • All possible continuations of each sequence in the beam are explored.
    • Only the top kk candidates (based on cumulative score) are kept.
  3. Beam Search Algorithm

    Suppose beam width = kk.

    1. Initialization:
      • Start with the initial token (e.g., <s><s>) in the beam.
      • Set initial score for all hypotheses = 0.
    2. At each decoding step:
      • For each sequence in the beam:
        • Expand it by all possible next words.
        • Compute the new cumulative log-probability: score=logP(y1)+logP(y2y1)++logP(yty<t)\text{score} = \log P(y_1) + \log P(y_2 \mid y_1) + \dots + \log P(y_t \mid y_{<t})
        • Retain the top kk scoring sequences.
    3. Termination:
      • Stop when all sequences in the beam have generated the </s></s> (end-of-sequence) token, or max length reached.
  4. Normalization

    • Longer sequences tend to have lower total log-probability.

    • To correct this, use length normalization: normalized score=1Tαt=1TlogP(yty<t)\text{normalized score} = \frac{1}{T^\alpha} \sum_{t=1}^{T} \log P(y_t \mid y_{<t})

    • Typical α\alpha values range from 0.6 to 1.0.

      $\alpha = 1$ is completely normalized by length; $\alpha = 0$ means no normalization at all.

  5. Comparison to Other Methods

    | Method | Description | Pros | Cons |
    | --- | --- | --- | --- |
    | Greedy Search | Choose top-1 token at each step | Fastest | Prone to local optima |
    | Beam Search | Keep top-k hypotheses | Better quality output | Slower than greedy |
    | Random Sampling | Sample from full probability dist. | More diverse outputs | Less stable / deterministic |
    | Top-k / Nucleus Sampling | Sample from top-k or p-mass | Good for open-gen | Less suitable for precise tasks |
  6. Tips

    • Beam width 1 = Greedy decoding
    • Beam width 5~10 is commonly used in practice
    • Larger beam widths can improve quality, but too large may lead to generic/boring outputs ("I don't know" problem)
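
A minimal sketch of beam search over a model exposed as a step function (the step_fn interface, eos handling, and the normalization exponent 0.7 are assumptions):

```python
import numpy as np

def beam_search(step_fn, beam_width=3, max_len=10, eos=0):
    """step_fn(seq) -> log-probability vector over the vocabulary."""
    beams = [([], 0.0)]                                  # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:                   # finished hypothesis: carry over
                candidates.append((seq, score))
                continue
            log_probs = step_fn(seq)
            top = np.argsort(log_probs)[-beam_width:]    # best next tokens for this beam
            candidates += [(seq + [int(w)], score + log_probs[w]) for w in top]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    # pick the winner with length normalization (alpha = 0.7)
    return max(beams, key=lambda c: c[1] / len(c[0]) ** 0.7)
```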

5.3.3 Bilingual Evaluation Understudy - BLEU Score

  1. Motivation: How to evaluate generated sequences?

    • We want to compare a generated sentence (candidate) against one or more reference sentences.
    • BLEU measures n-gram overlap between candidate and reference.
    • It's an automatic metric that correlates fairly well with human judgment.
  2. Input Definitions

    • Candidate sentence: the output generated by the model (e.g. machine translation).
    • Reference sentence(s): human-written correct outputs.
    • n-gram: contiguous sequence of n words.
  3. Core Idea

    Count how many n-grams in the candidate also appear in the reference(s).

    For each n-gram size (typically from 1 to 4):

    1. Count all n-grams in candidate sentence.
    2. Count how many appear in the reference(s).
    3. Clip the counts so repeats aren’t overcounted.
  4. Modified n-gram Precision

    For each nn-gram size: Pn=# matched n-grams# candidate n-gramsP_n = \frac{\text{\# matched n-grams}}{\text{\# candidate n-grams}}

    • This is called modified precision, because we "clip" the match count to the max times it appears in any reference.
  5. Brevity Penalty (BP)

    To penalize overly short outputs.

    Let:

    • cc = length of candidate
    • rr = effective reference length (usually closest to cc)

    Then:

    BP={1if c>rexp(1rc)if crBP = \begin{cases} 1 & \text{if } c > r \\ \exp(1 - \frac{r}{c}) & \text{if } c \leq r \end{cases}
  6. BLEU Score Formula

    With weights wnw_n for each n-gram (commonly uniform, e.g. 0.25 for n=1~4): BLEU=BPexp(n=1NwnlogPn)\text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log P_n \right)

    • Typically use up to 4-grams (N=4N = 4).
    • If any Pn=0P_n = 0, the BLEU becomes 0 (use smoothing in practice).
  7. Example (Simplified 2-gram BLEU)

    Reference: the cat is on the mat
    Candidate: the cat the cat on the mat

    • Unigram match: 5 / 7 = 0.714
    • Bigram match: 3 / 6 = 0.5
    • BLEU = BP0.7140.50.597BP \cdot \sqrt{0.714 \cdot 0.5} \approx 0.597
  8. Advantages

    • Automatic, fast, widely adopted
    • Works well with multiple references
    • Captures exact word-level overlap
  9. Limitations

    | Limitation | Description |
    | --- | --- |
    | Surface-level | Doesn't consider synonyms or paraphrasing |
    | Penalizes diversity | A model reusing the same phrases may get a higher score |
    | Word order strictness | Minor reorderings can lower the score a lot |
    | Better for machine translation | Less effective for open-ended generation (like chatbots) |
  10. Tips

    • BLEU is better with multiple references
    • Apply smoothing (e.g. BLEU+1) for short sentences
    • Combine with other metrics like ROUGE, METEOR, BERTScore for better evaluation
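
A minimal sketch of BLEU with clipped n-gram precision and brevity penalty, reproducing the 2-gram worked example above (single reference, no smoothing):

```python
from collections import Counter
import math

def modified_precision(candidate, reference, n):
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(c, ref[g]) for g, c in cand.items())   # clipped match counts
    return matched / max(sum(cand.values()), 1)

def bleu(candidate, reference, N=2):
    ps = [modified_precision(candidate, reference, n) for n in range(1, N + 1)]
    if min(ps) == 0:
        return 0.0                                           # would need smoothing
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)               # brevity penalty
    return bp * math.exp(sum(math.log(p) / N for p in ps))   # uniform weights 1/N

cand = "the cat the cat on the mat".split()
ref = "the cat is on the mat".split()
print(bleu(cand, ref))    # ~0.597, matching the worked example above
```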

5.3.4 Attention Model

attention-model-1

  1. Motivation: Why Attention?

    • In standard Seq2Seq models, the encoder compresses the entire input sequence into a fixed-length vector.
    • This creates a bottleneck, especially for long sequences, because the decoder has limited access to detailed information.
    • Attention allows the decoder to “look back” at all encoder hidden states, dynamically focusing on the most relevant parts of the input.
  2. Encoder Hidden States

    • The encoder processes an input sequence x=(x1,x2,,xT)x = (x_1, x_2, \dots, x_T)
    • At each time step, it produces a hidden state: ht(enc)=EncoderRNN(xt,ht1(enc))h_t^{(enc)} = \text{EncoderRNN}(x_t, h_{t-1}^{(enc)})
    • These hidden states are collected into a sequence: H=[h1(enc),h2(enc),,hT(enc)]H = [h_1^{(enc)}, h_2^{(enc)}, \dots, h_T^{(enc)}]
  3. Decoder Hidden State

    • At decoding step tt, the decoder has a current hidden state sts_t: st=DecoderRNN(yt1,st1,ct)s_t = \text{DecoderRNN}(y_{t-1}, s_{t-1}, c_{t})
    • Note: the decoder will compute a context vector ctc_t using attention over the encoder outputs.

    attention-model-2

  4. Attention Weights: αt,i\alpha_{t,i}

    • For each encoder hidden state hih_i, compute a relevance score with respect to the decoder state sts_t: et,i=score(st,hi)e_{t,i} = \text{score}(s_t, h_i)
    • Common scoring functions:
      • Dot product: et,i=sthie_{t,i} = s_t^\top h_i
      • Additive (Bahdanau): et,i=vatanh(Wsst+Whhi)e_{t,i} = v_a^\top \tanh(W_s s_t + W_h h_i)
      • Scaled dot (Transformer): $e_{t,i} = \frac{s_t^\top h_i}{\sqrt{d}}$
    • Convert scores into attention weights using softmax: αt,i=exp(et,i)j=1Texp(et,j)\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}
  5. Context Vector: ctc_t

    • The context vector is a weighted sum of encoder hidden states: ct=i=1Tαt,ihic_t = \sum_{i=1}^{T} \alpha_{t,i} \cdot h_i
    • It represents what the decoder “attends to” when generating the next token.
  6. Final Decoder Output

    • The context vector ctc_t is combined with the decoder state sts_t to generate the output: s~t=tanh(Wc[ct;st])\tilde{s}_t = \tanh(W_c [c_t ; s_t]) y^t=softmax(Wos~t+bo)\hat{y}_t = \text{softmax}(W_o \tilde{s}_t + b_o)
    • The output is a probability distribution over the vocabulary.
  7. Summary of Steps at Time Step tt

    1. Compute attention scores et,ie_{t,i} between sts_t and each hih_i
    2. Normalize to get weights αt,i\alpha_{t,i}
    3. Compute context vector ctc_t as weighted sum
    4. Combine ctc_t and sts_t to predict output y^t\hat{y}_t
  8. Benefits of Attention

    • Removes fixed-length bottleneck
    • Improves translation quality, especially for long inputs
    • Offers interpretability: shows which parts of input are attended to
  9. Variants

    • Bahdanau Attention (additive)

      attention-Bahdanau

    • Luong Attention (multiplicative / dot-product)

      attention-Luong

    • Multi-head Attention (in Transformers)

5.3.5 Attention Model Equations

Attention-Visualization.png

  1. Encoder: Process input sequence x=(x1,...,xT)x = (x_1, ..., x_T)

    At each time step tt:

    ht(enc)=RNNenc(xt,ht1(enc))h_t^{(enc)} = \text{RNN}_{enc}(x_t, h_{t-1}^{(enc)})
    • ht(enc)h_t^{(enc)}: the encoder's hidden state at time tt
    • Final encoder output: sequence of hidden states
    H=[h1(enc),h2(enc),...,hT(enc)]H = [h_1^{(enc)}, h_2^{(enc)}, ..., h_T^{(enc)}]
  2. Decoder Initialization

    At decoding time step tt:

    • sts_t: decoder hidden state at time step tt
    • Initial decoder state s0s_0 is usually set as:
    s0=tanh(WinithT(enc))s_0 = \tanh(W_{init} h_T^{(enc)})

    (Or simply s0=hT(enc)s_0 = h_T^{(enc)})

  3. Alignment score et,ie_{t,i}

    Compute how much attention the decoder should pay to each encoder state hih_i:

    et,i=vatanh(Wsst1+Whhi)e_{t,i} = v_a^\top \tanh(W_s s_{t-1} + W_h h_i)
    • st1s_{t-1}: decoder’s previous hidden state
    • hih_i: encoder hidden state at position ii
    • Ws,Wh,vaW_s, W_h, v_a: trainable weight matrices/vectors
  4. Attention weights αt,i\alpha_{t,i}

    Normalize the scores using softmax:

    αt,i=exp(et,i)j=1Texp(et,j)\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}
    • αt,i\alpha_{t,i} is the attention given to encoder step ii at decoder step tt
  5. Context vector ctc_t

    Use the attention weights to compute a weighted sum over all encoder hidden states:

    ct=i=1Tαt,ihic_t = \sum_{i=1}^{T} \alpha_{t,i} \cdot h_i
    • ctc_t is the attention-based summary of input relevant to the current decoding step
  6. Decoder hidden state sts_t

    Update the decoder's RNN using previous output and the current context:

    st=RNN([yt1;ct],  st1)s_t = \text{RNN}( [y_{t-1} ; c_t],\; s_{t-1} )
  7. Output prediction y^t\hat{y}_t

    Use sts_t and ctc_t to generate a prediction for the next word:

    s~t=tanh(Wc[st;ct])\tilde{s}_t = \tanh(W_c [s_t ; c_t]) y^t=softmax(Wos~t+bo)\hat{y}_t = \text{softmax}(W_o \tilde{s}_t + b_o)
    • y^t\hat{y}_t: probability distribution over vocabulary
  8. Overall Summary of Bahdanau Attention Steps

    At each decoder time step tt:

    1. Compute alignment scores for all encoder states: et,i=vatanh(Wsst1+Whhi)e_{t,i} = v_a^\top \tanh(W_s s_{t-1} + W_h h_i)
    2. Normalize to get attention weights: αt,i=softmax(et,i)\alpha_{t,i} = \text{softmax}(e_{t,i})
    3. Compute context vector: ct=αt,ihic_t = \sum \alpha_{t,i} \cdot h_i
    4. Update decoder hidden state: st=RNN(yt1,st1,ct)s_t = \text{RNN}(y_{t-1}, s_{t-1}, c_t)
    5. Generate next word distribution: y^t=softmax(Wotanh(Wc[st;ct])+bo)\hat{y}_t = \text{softmax}(W_o \tanh(W_c [s_t ; c_t]) + b_o)
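
A minimal numpy sketch of one Bahdanau attention step following the equations above (additive scoring; all weight shapes assumed):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bahdanau_attention(s_prev, H, Ws, Wh, v_a):
    """H: (T, d_enc) encoder states; s_prev: previous decoder hidden state."""
    scores = np.array([v_a @ np.tanh(Ws @ s_prev + Wh @ h_i) for h_i in H])  # e_{t,i}
    alpha = softmax(scores)          # attention weights over encoder positions
    c_t = alpha @ H                  # context vector: weighted sum of encoder states
    return c_t, alpha
```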

5.4 Transformer Network

5.4.1 Self-Attention

Encoder Self Attention

  1. Motivation: Why Self-Attention?

    • In regular attention, the decoder attends to the encoder outputs.
    • In Self-Attention, each word in a sequence attends to other words in the same sequence.
    • This allows the model to capture dependencies between all tokens — no matter how far apart.
  2. Inputs: Embedding Matrix XX

    • Input is a sequence of token embeddings:
    X=[x1,x2,,xn]Rn×dX = [x_1, x_2, \dots, x_n]^\top \in \mathbb{R}^{n \times d}
    • Each xix_i is a dd-dimensional vector, nn is sequence length.
  3. Linear Projections: Queries, Keys, and Values

    QKV Explained

    • Compute three matrices from XX using learnable weights:
    Q=XWQ,K=XWK,V=XWVQ = X W^Q, \quad K = X W^K, \quad V = X W^V
    • Dimensions:
      • WQ,WK,WVRd×dkW^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}
      • Q,K,VRn×dkQ, K, V \in \mathbb{R}^{n \times d_k}
  4. Scaled Dot-Product Attention

    QKV Attention Formula

    • For each token, compute attention score between its query and all keys:
    Attention(Q,K,V)=softmax(QKdk)V\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V
    • The matrix multiplication QKQ K^\top gives a n×nn \times n attention matrix.
    • The softmax ensures that attention weights over all positions sum to 1.
    • Then multiply by VV to get the weighted sum of value vectors.
  5. Output of Self-Attention

    • The final output is a new matrix:
    Z=Attention(Q,K,V)Z = \text{Attention}(Q, K, V)
    • Each row ziz_i is a new representation of token xix_i, enriched with contextual information from all other tokens.
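
A minimal numpy sketch of scaled dot-product self-attention as defined above (single head; shapes assumed):

```python
import numpy as np

def softmax_rows(Z):
    e = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, WQ, WK, WV):
    """X: (n, d) token embeddings; WQ, WK, WV: (d, d_k) projections."""
    Q, K, V = X @ WQ, X @ WK, X @ WV
    d_k = Q.shape[-1]
    A = softmax_rows(Q @ K.T / np.sqrt(d_k))   # (n, n) attention weights
    return A @ V                               # contextualized representations Z
```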

5.4.2 Multi-Head Attention

Multi Head Attention Animation

  1. Motivation: Why Multiple Heads?

    • A single self-attention head can only capture one type of similarity or relationship at a time.
    • Multi-Head Attention allows the model to attend to information from different representation subspaces and at different positions in parallel.
  2. Input: Embedding Matrix XRn×dX \in \mathbb{R}^{n \times d}

    • nn = sequence length, dd = embedding dimension.
    • From input XX, we will create multiple independent query, key, value projections.
  3. Linear Projections for Each Head

    For each of the hh attention heads, project input into:

    Qi=XWiQ,Ki=XWiK,Vi=XWiVQ_i = X W_i^Q, \quad K_i = X W_i^K, \quad V_i = X W_i^V
    • Where:
      • WiQ,WiK,WiVRd×dkW_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}
        • Typically, dk=d/hd_k = d / h so that concatenated heads still match original dimensionality.
  4. Scaled Dot-Product Attention (per head)

    Each head computes:

    headi=Attention(Qi,Ki,Vi)=softmax(QiKidk)Vi\text{head}_i = \text{Attention}(Q_i, K_i, V_i) = \text{softmax} \left( \frac{Q_i K_i^\top}{\sqrt{d_k}} \right) V_i
    • Resulting shape: headiRn×dk\text{head}_i \in \mathbb{R}^{n \times d_k}
  5. Concatenation of All Heads

    • All heads are concatenated along the feature dimension:
    MultiHead(X)=[head1;head2;;headh]WO\text{MultiHead}(X) = [\text{head}_1; \text{head}_2; \dots; \text{head}_h] W^O
    • Where:
      • [][\,\cdot\,] means concatenation along last dimension
      • WORd×dW^O \in \mathbb{R}^{d \times d} is a learnable projection matrix to map back to original dimension
  6. Summary of Complete Flow

    ```plaintext
    Input X ∈ ℝⁿˣᵈ
      ↓ Project to h different Q, K, V sets (each with dimension d_k = d/h)
      ↓ Apply scaled dot-product attention for each head
      ↓ Concatenate all head outputs → [head₁; head₂; ...; headₕ] ∈ ℝⁿˣᵈ
      ↓ Final linear layer: project to output dimension d
    ```
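
A minimal sketch of this flow built on the self_attention function from the 5.4.1 sketch (the per-head weight lists and output projection shape are assumptions):

```python
import numpy as np

def multi_head_attention(X, WQs, WKs, WVs, WO):
    """WQs/WKs/WVs: lists of h per-head (d, d_k) matrices; WO: (h*d_k, d)."""
    heads = [self_attention(X, WQ, WK, WV)       # one scaled dot-product attention per head
             for WQ, WK, WV in zip(WQs, WKs, WVs)]
    return np.concatenate(heads, axis=-1) @ WO   # concat heads, project back to d
```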

5.4.3 Transformer Network

Transformer Model

Transformer Details

Transformer = Stacked Self-Attention + Feedforward + Residual + LayerNorm

  1. Motivation: Why Transformer?

    • RNNs suffer from sequential bottlenecks and long-term dependency issues.
    • Transformer replaces recurrence entirely with self-attention + feed-forward layers, enabling:
      • Full parallelization
      • Long-range dependency modeling
      • Simpler architecture
  2. Overall Structure

    Transformer follows an encoder-decoder structure:

    ```plaintext
    Input → [Positional Encoding + Embedding] → Encoder → → → Context Vectors
                                                Decoder ← ← ← ← ← ← ← Output Sequence
    ```

    Each encoder/decoder is made up of multiple stacked identical blocks (usually 6 or 12).

    • Feed Forward Network (FFN)

    Transformer FFN

    • Residual Connections + Layer Norm

    Transformer Residual

    Transformer Layer Norm

  3. Encoder Block (each of N layers)

    Each encoder layer consists of:

    1. Multi-Head Self-Attention
    2. Feed-Forward Network (FFN)
    3. Residual Connections + LayerNorm

    Computation flow:

    ```plaintext
    Input → Multi-Head Self-Attention → Add & Norm → FFN → Add & Norm → Output
    ```
  4. Decoder Block (each of N layers)

    Each decoder layer consists of:

    1. Masked Multi-Head Self-Attention (prevents attending to future tokens)
    2. Encoder-Decoder Attention (attend over encoder outputs)
    3. Feed-Forward Network (FFN)
    4. Residual Connections + LayerNorm

    Computation flow:

    ```plaintext
    Target Input → Masked MHA → Add & Norm → MHA (Encoder Attention) → Add & Norm → FFN → Add & Norm → Output
    ```
  5. Positional Encoding

    Transformer-Positional Encoding

    Since Transformers lack recurrence, we inject position information:

    PE(pos,2i)=sin(pos100002i/d),PE_{(pos, 2i)} = \sin \left( \frac{pos}{10000^{2i / d}} \right), PE(pos,2i+1)=cos(pos100002i/d)PE_{(pos, 2i+1)} = \cos \left( \frac{pos}{10000^{2i / d}} \right)
    • This allows the model to distinguish token order. (A sketch of this encoding appears at the end of this section.)
  6. Multi-Head Attention (recap)

    For input matrix XX:

    • Compute: Q=XWQ,K=XWK,V=XWVQ = X W^Q, \quad K = X W^K, \quad V = X W^V
    • Attention per head: Attention(Q,K,V)=softmax(QKdk)V\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{Q K^\top}{\sqrt{d_k}} \right) V
    • Multi-head output: Concat(head1,...,headh)WO\text{Concat}(\text{head}_1, ..., \text{head}_h) W^O
  7. Feed-Forward Network (FFN)

    Each position passes through the same two-layer MLP:

    FFN(x)=max(0,xW1+b1)W2+b2\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2
    • Typically W1Rd×dffW_1 \in \mathbb{R}^{d \times d_{ff}} where dffdd_{ff} \gg d (e.g., 2048 vs 512)
  8. Masking

    • Padding Mask: prevent attention to padding tokens.
    • Look-ahead Mask (causal mask): ensure decoder can't peek at future tokens during training.
  9. Output Layer

    • Decoder output is passed to a linear layer + softmax to predict the next token: y^t=softmax(Wozt+bo)\hat{y}_t = \text{softmax}(W_o z_t + b_o)
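
A minimal numpy sketch of the sinusoidal positional encoding from step 5 (an even model dimension d is assumed):

```python
import numpy as np

def positional_encoding(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]     # (max_len, 1) positions
    i = np.arange(0, d, 2)[None, :]       # even feature indices 0, 2, 4, ...
    angles = pos / 10000 ** (i / d)
    PE = np.zeros((max_len, d))
    PE[:, 0::2] = np.sin(angles)          # even dimensions get sine
    PE[:, 1::2] = np.cos(angles)          # odd dimensions get cosine
    return PE
```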