These notes are based on https://www.coursera.org/specializations/deep-learning

LaTeX may not render correctly in some viewers.

I. Neural Networks and Deep Learning

1.1 Introduction to Deep Learning

1.1.1 Supervised Learning with Deep Learning

  • Structured Data: Tables (e.g., database columns).
  • Unstructured Data: Audio, Image, Text.

1.1.2 Scale drives deep learning progress

  • The more data is available, the bigger the performance advantage of a large neural network over a smaller network or a traditional learning algorithm.
  • Switching from sigmoid to ReLU makes gradient descent much faster, because the ReLU gradient does not shrink toward 0 the way the sigmoid gradient does for large |z|.

1.2 Basics of Neural Network Programming

1.2.1 Binary Classification

  • Input: x \in R^{n_x}
  • Output: y \in \{0, 1\}

1.2.2 Logistic Regression

  • Given x, want \hat{y} = P(y=1|x)
  • Input: x \in R^{n_x}
  • Parameters: w \in R^{n_x}, b \in R
  • Output \hat{y} = \sigma(w^Tx + b)
    • $$\sigma(z)=\dfrac{1}{1+e^{-z}}$$
    • If z is a large positive number, \sigma(z)\approx\dfrac{1}{1+0}\approx1
    • If z is a large negative number, \sigma(z)\approx\dfrac{1}{1+Bignum}\approx0
  • Loss (error) function:
    • $$\hat{y} = \sigma(w^Tx + b)$$, where $$\sigma(z)=\dfrac{1}{1+e^{-z}}$$
      • $$z^{(i)}=w^Tx^{(i)}+b$$
    • Want y^{(i)} \approx \hat{y}^{(i)}

    • $$L(\hat{y}, y) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]$$
      • If y=1: L(\hat{y}, y)=-\log{\hat{y}} <- want \log{\hat{y}} as large as possible, want \hat{y} large
      • If y=0: L(\hat{y}, y)=-\log{(1-\hat{y})} <- want \log{(1-\hat{y})} as large as possible, want \hat{y} small
  • Cost function

    • $$J(w, b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})=-\dfrac{1}{m}\sum\limits_{i=1}^{m}[y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]$$
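
As a rough illustration of the definitions above, the sigmoid, the per-example loss, and the cost J(w, b) could be sketched in NumPy as follows. The function names, the toy data, and the (n_x, m) column layout of X are assumptions made for this sketch.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z}), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def loss(y_hat, y):
    """Per-example cross-entropy loss L(y_hat, y)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(w, b, X, Y):
    """Average loss J(w, b) over the m columns of X.

    Assumed shapes: w (n_x, 1), X (n_x, m), Y (1, m), b scalar.
    """
    m = X.shape[1]
    Y_hat = sigmoid(np.dot(w.T, X) + b)  # shape (1, m)
    return float(np.sum(loss(Y_hat, Y)) / m)

# Toy data (made up): 2 features, 3 examples.
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.5, 2.0]])
Y = np.array([[1, 0, 1]])
w = np.zeros((2, 1))
print(cost(w, 0.0, X, Y))
```

With w = 0 and b = 0 every prediction is 0.5, so the printed cost equals log 2 regardless of the labels.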

1.2.3 Gradient Descent

  • Repeat w:=w-\alpha\dfrac{\partial J(w,b)}{\partial w}; b:=b-\alpha\dfrac{\partial J(w,b)}{\partial b}
    • $$\alpha$$: Learning rate
    • Right side of minimum, \dfrac{dJ(w)}{dw} > 0; Left side of minimum, \dfrac{dJ(w)}{dw}<0
  • Logistic Regression Gradient Descent
    • $$x_1,x_2,w_1,w_2,b$$
      • $$z=w_1x_1+w_2x_2+b$$ -->$$a=\sigma(z)$$ -->$$L(a,y)$$
      • $$da=\dfrac{dL(a,y)}{da}=-\dfrac{y}{a}+\dfrac{1-y}{1-a}$$
        • $$\dfrac{dL(a,y)}{da} = \dfrac{d}{da}(-y\log(a) - (1-y)\log(1-a))$$
        • $$\dfrac{d}{da} (-y\log(a)) = -\dfrac{y}{a}$$
        • $$\dfrac{d}{da} (-(1-y)\log(1-a)) = -\dfrac{1-y}{1-a} \times (-1) = \dfrac{1-y}{1-a} $$
        • $$=-\dfrac{y}{a} + \dfrac{1-y}{1-a} = -\dfrac{y}{a} - \dfrac{y-1}{1-a}$$
      • $$dz=\dfrac{dL}{dz}=\dfrac{dL(a,y)}{dz}=a-y$$
        • $$=\dfrac{dL}{da}\cdot\dfrac{da}{dz}$$ ($$\dfrac{da}{dz}=a(1-a)$$)
      • $$\dfrac{dL}{dw_1}="dw_1"=x_1\cdot dz$$
      • $$\dfrac{dL}{dw_2}="dw_2"=x_2\cdot dz$$
      • $$db=dz$$
  • Gradient Descent on m examples
    • $$J(w, b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(a^{(i)},y^{(i)})$$
    • $$\dfrac{\partial}{\partial w_1}J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}\dfrac{\partial}{\partial w_1}L(a^{(i)},y^{(i)})$$
    • $$J=0;dw_1=0;dw_2=0;db=0$$
      • for i=1 to m
        • $$z^{(i)}=w^Tx^{(i)}+b$$
        • $$a^{(i)}=\sigma (z^{(i)})$$
        • $$J+=-[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]$$
        • $$dz^{(i)}=a^{(i)}-y^{(i)}$$
        • $$dw_1+=x_1^{(i)}dz^{(i)}$$ (for n = 2)
        • $$dw_2+=x_2^{(i)}dz^{(i)}$$ (for n = 2)
        • $$db+=dz^{(i)}$$
      • $$J/=m;dw_1/=m;dw_2/=m;db/=m$$
      • $$dw_1=\dfrac{\partial J}{\partial w_1}; dw_2=\dfrac{\partial J}{\partial w_2}$$
        • $$w_1:=w_1-\alpha dw_1$$
        • $$w_2:=w_2-\alpha dw_2$$
        • $$b:=b-\alpha db$$
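
The loop above can be written out almost line by line in plain Python. This is only a sketch for the n = 2 case; the function name `gradient_step_loop` and the list-of-tuples data layout are assumptions, not part of the course material.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step_loop(w1, w2, b, xs, ys, alpha):
    """One gradient-descent step with an explicit loop over the m examples.

    xs is a list of (x1, x2) pairs and ys a list of 0/1 labels (assumed layout).
    Returns the updated w1, w2, b and the cost J before the update.
    """
    m = len(xs)
    J = dw1 = dw2 = db = 0.0
    for (x1, x2), y in zip(xs, ys):
        z = w1 * x1 + w2 * x2 + b
        a = sigmoid(z)
        J += -(y * math.log(a) + (1 - y) * math.log(1 - a))
        dz = a - y
        dw1 += x1 * dz
        dw2 += x2 * dz
        db += dz
    # Average the accumulated sums over the m examples.
    J /= m
    dw1 /= m
    dw2 /= m
    db /= m
    # Parameter update with learning rate alpha.
    return w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db, J

# Toy data (made up): two examples with labels 1 and 0.
w1, w2, b, J = gradient_step_loop(0.0, 0.0, 0.0, [(1.0, 3.0), (2.0, 4.0)], [1, 0], alpha=1.0)
print(w1, w2, b, J)
```

Starting from zero parameters, both examples get a = 0.5, so dw_1 = dw_2 = 0.25 and db = 0 for this toy data.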

1.2.4 Computational Graph

  • $$J(a,b,c)=3(a+bc)$$
    • $$u=bc$$
    • $$v=a+u$$
    • $$J=3v$$
    • Left to right computation
  • Derivatives with a Computation Graph
    • $$\dfrac{dJ}{dv}=3$$
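      • $$\dfrac{dJ}{da}=3$$
      • $$\dfrac{dv}{da}=1$$
      • Chain Rule: \dfrac{dJ}{da}=\dfrac{dJ}{dv}\cdot\dfrac{dv}{da}=3\cdot1=3
      • $$\dfrac{dJ}{du}=3; \dfrac{du}{db}=c=2; \dfrac{dJ}{db}=\dfrac{dJ}{du}\cdot\dfrac{du}{db}=6$$ (evaluated at b=3, c=2)
      • $$\dfrac{du}{dc}=b=3; \dfrac{dJ}{dc}=\dfrac{dJ}{du}\cdot\dfrac{du}{dc}=9$$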
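
The forward (left to right) and backward (right to left) passes over this graph can be checked with a few lines of Python. The values b = 3 and c = 2 follow from the derivatives listed above; a = 5 is an assumed value chosen for the sketch, and the function names are hypothetical.

```python
def forward(a, b, c):
    """Left-to-right pass through the graph J = 3(a + bc)."""
    u = b * c
    v = a + u
    J = 3 * v
    return u, v, J

def backward(a, b, c):
    """Right-to-left pass: apply the chain rule one node at a time."""
    dJ_dv = 3.0            # J = 3v
    dJ_da = dJ_dv * 1.0    # v = a + u, so dv/da = 1
    dJ_du = dJ_dv * 1.0    # v = a + u, so dv/du = 1
    dJ_db = dJ_du * c      # u = bc, so du/db = c
    dJ_dc = dJ_du * b      # u = bc, so du/dc = b
    return dJ_da, dJ_db, dJ_dc

print(forward(5, 3, 2))   # (6, 11, 33)
print(backward(5, 3, 2))  # (3.0, 6.0, 9.0)
```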

1.2.5 Vectorization

  • Avoid explicit for-loops whenever possible.
  • $$J=0;dw=np.zeros((n_x,1));db=0$$
    • for i=1 to m
      • $$z^{(i)}=w^Tx^{(i)}+b$$
      • $$a^{(i)}=\sigma (z^{(i)})$$
      • $$J+=-[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]$$
      • $$dz^{(i)}=a^{(i)}-y^{(i)}$$
      • $$dw+=x^{(i)}dz^{(i)}$$
      • $$db+=dz^{(i)}$$
    • $$J/=m;dw/=m;db/=m$$
  • $$Z=np.dot(w.T,X)+b$$ ; b is a scalar (1,1) that NumPy broadcasts across all m columns (Broadcasting)
  • Vectorization Logistic Regression
    • $$dz^{(1)}=a^{(1)}-y^{(1)}; dz^{(2)}=a^{(2)}-y^{(2)}...$$
    • $$dz=[dz^{(1)}, dz^{(2)},...,dz^{(m)}]$$ $$1\times m$$
    • $$A=[a^{(1)}, a^{(2)}, ..., a^{(m)}]$$ $$Y=[y^{(1)}, y^{(2)}, ..., y^{(m)}]$$
    • $$dz=A-Y=[a^{(1)}-y^{(1)}, a^{(2)}-y^{(2)}, ...]$$
    • Get rid of db and dw in for loop
      • $$db=\dfrac{1}{m}\sum\limits_{i=1}^{m}dz^{(i)}=\dfrac{1}{m} np.sum(dz)$$
      • $$dw=\dfrac{1}{m}\cdot X\cdot dz^T=\dfrac{1}{m}[x^{(1)}...][dz^{(1)}...]=\dfrac{1}{m}\cdot[x^{(1)}dz^{(1)}+...+x^{(m)}dz^{(m)}]$$ $$n\times 1$$
    • New Form of Logistic Regression
      • $$Z=w^TX+b=np.dot(w.T, X)+b$$
      • $$A=\sigma (Z)$$
      • $$dz=A-Y$$
      • $$dw=\dfrac{1}{m}\cdot X \cdot dz^T$$
      • $$db=\dfrac{1}{m}np.sum(dz)$$
      • $$w:=w-\alpha dw$$
      • $$b:=b-\alpha db$$
  • Broadcasting(same as bsxfun in Matlab/Octave)
    • $$(m,n)$$ +-*/ $$(1,n)$$ -> $$(m,n)$$: the single row is copied m times.
    • $$(m,n)$$ +-*/ $$(m,1)$$ -> $$(m,n)$$: the single column is copied n times.
    • Don't use a = np.random.randn(5): a.shape = (5,) is a "rank 1 array", which is neither a row vector nor a column vector.
    • Use a = np.random.randn(5,1) (column vector) or a = np.random.randn(1,5) (row vector)
    • Check by $$assert(a.shape == (5,1))$$
    • Fix rank 1 array by a = a.reshape((5,1))
  • Logistic Regression Cost Function
    • Loss
      • $$p(y|x)=\hat{y}^y(1-\hat{y})^{(1-y)}$$
      • If y=1: p(y|x)=\hat{y}
      • If y=0: p(y|x)=(1-\hat{y})
      • $$\log p(y|x)=\log \hat{y}^y(1-\hat{y})^{(1-y)}=y\log \hat{y}+(1-y)\log(1-\hat{y})=-L(\hat{y},y)$$
    • Cost
      • $$\log p(labels\space in\space training\space set)=\log \prod_{i=1}^{m}p(y^{(i)}|x^{(i)})$$ (assuming the examples are i.i.d.)
      • $$\log p(labels\space in\space training\space set)=\sum\limits_{i=1}^m\log p(y^{(i)}|x^{(i)})=-\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})$$
      • Use maximum likelihood estimation (MLE): maximizing this log-likelihood is the same as minimizing the summed loss.
      • Cost (minimize): J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})
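
The "New Form of Logistic Regression" listed earlier in this section removes the loop over examples entirely. A minimal NumPy sketch of one such update is given below; the function name `vectorized_step` and the toy data are assumptions, and the shape assertions follow the advice above about avoiding rank 1 arrays.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def vectorized_step(w, b, X, Y, alpha):
    """One gradient-descent step for logistic regression without a loop over examples.

    Assumed shapes: w (n_x, 1), X (n_x, m), Y (1, m), b scalar.
    """
    n_x, m = X.shape
    assert w.shape == (n_x, 1)    # column vector, not a rank 1 array
    assert Y.shape == (1, m)
    Z = np.dot(w.T, X) + b        # b is broadcast across the m columns
    A = sigmoid(Z)                # (1, m)
    dz = A - Y                    # (1, m)
    dw = np.dot(X, dz.T) / m      # (n_x, 1)
    db = np.sum(dz) / m           # scalar
    return w - alpha * dw, b - alpha * db, dw, db

# Toy data (made up): 2 features, 2 examples.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Y = np.array([[1, 0]])
w_new, b_new, dw, db = vectorized_step(np.zeros((2, 1)), 0.0, X, Y, alpha=1.0)
print(dw.ravel(), db)
```

A single iteration still needs no loop, but repeating the step for several iterations of gradient descent does require an outer loop.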

1.3 Shallow Neural Networks

1.3.1 Neural Network Representation

  • deep-learning-notes_1-3-1

  • Input layer, hidden layer, output layer

    • $$a^{[0]}=x$$ -> $$a^{[1]}=[[a^{[1]}_1,a^{[1]}_2,a^{[1]}_3,a^{[1]}_4]]$$ -> $$a^{[2]}$$
    • Layer count = # of hidden layers + 1 output layer (the input layer is not counted), so this is a 2-layer network.
  • $$x_1,x_2,x_3$$ -> $$4\space hidden\space nodes$$ -> $$Output\space layer$$
    • First hidden node: z^{[1]}_1=w^{[1]T}_1x+b^{[1]}_1, a^{[1]}_1=\sigma(z^{[1]}_1)
    • Second hidden node: z^{[1]}_2=w^{[1]T}_2x+b^{[1]}_2, a^{[1]}_2=\sigma(z^{[1]}_2)
    • Third hidden node: z^{[1]}_3=w^{[1]T}_3x+b^{[1]}_3, a^{[1]}_3=\sigma(z^{[1]}_3)
    • Fourth hidden node: z^{[1]}_4=w^{[1]T}_4x+b^{[1]}_4, a^{[1]}_4=\sigma(z^{[1]}_4)
  • Vectorization
    • $$W^{[1]}=\begin{gathered}\begin{bmatrix}-w^{[1]T}_1- \\ -w^{[1]T}_2- \\ -w^{[1]T}_3- \\ -w^{[1]T}_4- \end{bmatrix}\end{gathered} (4,3)\space matrix$$
    • $$z^{[1]}=\begin{gathered}\begin{bmatrix}-w^{[1]T}_1- \\ -w^{[1]T}_2- \\ -w^{[1]T}_3- \\ -w^{[1]T}_4- \end{bmatrix}\end{gathered}\cdot \begin{gathered}\begin{bmatrix}x_1 \\ x_2 \\ x_3 \end{bmatrix}\end{gathered} + \begin{gathered}\begin{bmatrix}b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \\ b^{[1]}_4 \end{bmatrix}\end{gathered} =\begin{gathered}\begin{bmatrix}w^{[1]T}_1\cdot x+b^{[1]}_1 \\ w^{[1]T}_2\cdot x+b^{[1]}_2 \\ w^{[1]T}_3\cdot x+b^{[1]}_3 \\ w^{[1]T}_4\cdot x+b^{[1]}_4 \end{bmatrix}\end{gathered}=\begin{gathered}\begin{bmatrix}z^{[1]}_1 \\ z^{[1]}_2 \\ z^{[1]}_3 \\ z^{[1]}_4 \end{bmatrix}\end{gathered}$$
    • $$a^{[1]}=\begin{gathered}\begin{bmatrix}a^{[1]}_1 \\ a^{[1]}_2 \\ a^{[1]}_3 \\ a^{[1]}_4 \end{bmatrix}\end{gathered}=\sigma(z^{[1]})$$
    • $$z^{[2]}=W^{[2]}\cdot a^{[1]}+b^{[2]}$$ $$(1, 1),(1, 4),(4, 1),(1, 1)$$
    • $$a^{[2]}=\sigma(z^{[2]})$$ $$(1,1),(1,1)$$
    • $$a^{[2](i)}$$: layer 2; example i
  • for i=1 to m:
    • $$z^{[1](i)}=W^{[1]}\cdot x^{(i)}+b^{[1]}$$
    • $$a^{[1](i)}=\sigma(z^{[1](i)})$$
    • $$z^{[2](i)}=W^{[2]}\cdot a^{[1](i)}+b^{[2]}$$
    • $$a^{[2](i)}=\sigma(z^{[2](i)})$$
  • Vectorizing of the above for loop
    • $$X=\begin{gathered}\begin{bmatrix}| & | & | & | \\ x^{(1)} & x^{(2)} & ... & x^{(m)} \\ | & | & | & |\end{bmatrix}\end{gathered} (n_x,m)\space matrix$$: one column per training example, one row per input feature
    • $$Z^{[1]}=W^{[1]}\cdot X+b^{[1]}$$
    • $$A^{[1]}=\sigma(Z^{[1]})$$
    • $$Z^{[2]}=W^{[2]}\cdot A^{[1]}+b^{[2]}$$
    • $$A^{[2]}=\sigma(Z^{[2]})$$
    • In Z and A, horizontally (columns): training examples; vertically (rows): hidden units
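
To make the equivalence between the per-example loop and the vectorized form concrete, here is a small NumPy sketch for the 3-input, 4-hidden-unit, 1-output network described above. The function names and the random test data are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_loop(W1, b1, W2, b2, X):
    """Forward pass computed one training example (column) at a time."""
    m = X.shape[1]
    A2 = np.zeros((W2.shape[0], m))
    for i in range(m):
        x = X[:, i:i + 1]                  # keep the column shape (n_x, 1)
        a1 = sigmoid(W1 @ x + b1)
        A2[:, i:i + 1] = sigmoid(W2 @ a1 + b2)
    return A2

def forward_vectorized(W1, b1, W2, b2, X):
    """Same forward pass on all m examples at once; b1 and b2 are broadcast."""
    A1 = sigmoid(W1 @ X + b1)              # (4, m)
    A2 = sigmoid(W2 @ A1 + b2)             # (1, m)
    return A1, A2

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
X = rng.standard_normal((3, 5))            # 3 features, m = 5 examples

A1, A2 = forward_vectorized(W1, b1, W2, b2, X)
print(A1.shape, A2.shape)                  # (4, 5) (1, 5)
```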

1.3.2 Activation Functions

  • $$g^{[i]}$$: activation function of layer $$i$$
    • Sigmoid: a=\dfrac{1}{1+e^{-z}}
    • Tanh: a=\dfrac{e^z-e^{-z}}{e^z+e^{-z}}
    • ReLU: a=max(0,z)
    • Leaky ReLU: a=max(0.01z, z)
  • Rules to choose activation function
    1. If the output is a binary label in {0, 1}, use sigmoid for the output layer.
    2. Otherwise, default to ReLU.
  • Why a non-linear activation function is needed
    • With linear activations, having multiple hidden layers is useless: the whole network collapses to a single linear map a=w'x+b'.
    • A linear activation is sometimes used at the output layer (e.g., for regression), but the hidden layers should still be non-linear.
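
The four activation functions, and the claim that stacked linear layers collapse into one linear map, can be checked numerically. The sketch below uses made-up random weights, and the names `W_prime` and `b_prime` simply mirror w' and b' above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

# Two stacked layers with a linear (identity) activation...
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
x = rng.standard_normal((3, 1))
two_linear_layers = W2 @ (W1 @ x + b1) + b2

# ...equal one linear layer with w' = W2 W1 and b' = W2 b1 + b2.
W_prime, b_prime = W2 @ W1, W2 @ b1 + b2
single_linear_layer = W_prime @ x + b_prime

print(np.allclose(two_linear_layers, single_linear_layer))  # True
```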

1.3.3 Forward and Backward Propagation

  • Derivative of activation function
    • Sigmoid: g'(z)=\dfrac{d}{dz}g(z)=\dfrac{1}{1+e^{-z}}(1-\dfrac{1}{1+e^{-z}})=g(z)(1-g(z))=a(1-a)
    • Tanh: g'(z)=\dfrac{d}{dz}g(z)=1-(tanh(z))^2
    • ReLU: g'(z)=\left\{\begin{array}{lr}0&if \space z<0 \\ 1&if \space z>0 \\ undefined&if \space z=0\end{array}\right. (in practice g'(0) is simply set to 1, i.e. use z\geq0)
    • Leaky ReLU: g'(z)=\left\{\begin{array}{lr}0.01&if \space z<0 \\ 1&if \space z\geq0\end{array}\right.
  • Gradient descent for neural networks
    • Parameters: w^{[1]}(n^{[1]},n^{[0]}), b^{[1]}(n^{[1]},1),w^{[2]}(n^{[2]},n^{[1]}), b^{[2]}(n^{[2]},1)
    • $$n_x=n^{[0]},n^{[1]},n^{[2]}=1$$
    • Cost function: J(w^{[1]}, b^{[1]},w^{[2]}, b^{[2]})=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})
  • Forward propagation:
    • $$Z^{[1]}=W^{[1]}\cdot X+b^{[1]}$$
    • $$A^{[1]}=g^{[1]}(Z^{[1]})$$
    • $$Z^{[2]}=W^{[2]}\cdot A^{[1]}+b^{[2]}$$
    • $$A^{[2]}=g^{[2]}(Z^{[2]})=\sigma(Z^{[2]})$$
  • Back Propagation:
    • $$dZ^{[2]}=A^{[2]}-Y$$ $$Y=[y^{(1)},y^{(2)},...,y^{(m)}]$$
    • $$dW^{[2]}=\dfrac{1}{m}dZ^{[2]}A^{[1]T}$$
    • $$db^{[2]}=\dfrac{1}{m}np.sum(dZ^{[2]},axis=1,keepdims=True)$$
    • $$dZ^{[1]}=W^{[2]T}dZ^{[2]}*g'^{[1]}(Z^{[1]})$$
      • $$(n^{[1]},m)->element-wise\space product->(n^{[1]},m)$$
    • $$dW^{[1]}=\dfrac{1}{m}dZ^{[1]}X^{T}$$

    • $$db^{[1]}=\dfrac{1}{m}np.sum(dZ^{[1]},axis=1,keepdims=True)$$
  • Random Initialization

    • $$x_1,x_2->a_1^{[1]},a_2^{[1]}->a_1^{[2]}->\hat{y}$$
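    • $$w^{[1]}=np.random.randn(2,2)*0.01$$ (small random values break the symmetry between hidden units; np.random.randn takes the dimensions as separate arguments)
    • $$b^{[1]}=np.zeros((2,1))$$
    • $$w^{[2]}=np.random.randn(1,2)*0.01$$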
    • $$b^{[2]}=0$$
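
Putting the forward propagation, back propagation, and random initialization formulas together gives the following NumPy sketch of a 2-layer network. It assumes tanh for g^{[1]} (so g'^{[1]}(Z^{[1]}) = 1 - A^{[1]2}) and sigmoid for the output; the function names and the dictionary layout for parameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_x, n_h, n_y, seed=0):
    """Small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * 0.01,
        "b1": np.zeros((n_h, 1)),
        "W2": rng.standard_normal((n_y, n_h)) * 0.01,
        "b2": np.zeros((n_y, 1)),
    }

def forward(p, X):
    A1 = np.tanh(p["W1"] @ X + p["b1"])      # assumed hidden activation: tanh
    A2 = sigmoid(p["W2"] @ A1 + p["b2"])
    return A1, A2

def cost(p, X, Y):
    _, A2 = forward(p, X)
    m = X.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

def backward(p, X, Y):
    """Gradients of the cost, following the back propagation equations above."""
    m = X.shape[1]
    A1, A2 = forward(p, X)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (p["W2"].T @ dZ2) * (1 - A1 ** 2)  # element-wise product with tanh'
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

def update(p, grads, alpha):
    return {k: p[k] - alpha * grads[k] for k in p}
```

A training loop under these assumptions would repeat `grads = backward(p, X, Y)` followed by `p = update(p, grads, alpha)`.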

1.4 Deep Neural Networks

1.4.1 Deep L-Layer Neural Network

  • Deep neural network notation
    • deep-learning-notes_1-4-1
    • $$L=4$$ (#layers)
    • $$n^{[l]}= #\space units\space in\space layer\space l $$
      • $$n^{[1]}=5,n^{[2]}=5,n^{[3]}=3,n^{[4]}=n^{[L]}=1$$
      • $$n^{[0]}=n_x=3$$
    • $$a^{[l]}=activations\space in\space layer\space l$$
    • $$a^{[l]}=g^{[l]}(z^{[l]}),\space w^{[l]}=weights\space for\space z^{[l]},\space b^{[l]}=bias\space for\space z^{[l]}$$
    • $$x=a^{[0]},\space \hat{y}=a^{[L]}$$

1.4.2 Forward Propagation in a Deep Network

  • General: Z^{[l]}=w^{[l]}A^{[l-1]}+b^{[l]}, A^{[l]}=g^{[l]}(Z^{[l]})
    • $$x: z^{[1]}=w^{[1]}a^{[0]}+b^{[1]}, a^{[1]}=g^{[1]}(z^{[1]})$$ $$a^{[0]}=X$$
    • $$z^{[2]}=w^{[2]}a^{[1]}+b^{[2]}, a^{[2]}=g^{[2]}(z^{[2]})$$
    • ...
    • $$z^{[4]}=w^{[4]}a^{[3]}+b^{[4]}, a^{[4]}=g^{[4]}(z^{[4]})=\hat{y}$$
  • Vectorizing:
    • $$Z^{[1]}=w^{[1]}A^{[0]}+b^{[1]}, A^{[1]}=g^{[1]}(Z^{[1]})$$ $$A^{[0]}=X$$
    • $$Z^{[2]}=w^{[2]}A^{[1]}+b^{[2]}, A^{[2]}=g^{[2]}(Z^{[2]})$$
    • $$\hat{Y}=g(Z^{[4]})=A^{[4]}$$
  • Matrix dimensions
    • deep-learning-notes_1-4-2
    • $$z^{[1]}=w^{[1]}\cdot x+b^{[1]}$$
    • $$z^{[1]}=(3,1),w^{[1]}=(3,2),x=(2,1),b^{[1]}=(3,1)$$
    • $$z^{[1]}=(n^{[1]},1),w^{[1]}=(n^{[1]},n^{[0]}),x=(n^{[0]},1),b^{[1]}=(n^{[1]},1)$$
    • $$w^{[l]}/dw^{[l]}=(n^{[l]},n^{[l-1]}),b^{[l]}/db^{[l]}=(n^{[l]},1)$$
    • $$z^{[l]},a^{[l]}=(n^{[l]},1),Z^{[l]}/dZ^{[l]},A^{[l]}/dA^{[l]}=(n^{[l]},m)$$ $$l=0, A^{[0]}=X=(n^{[0]},m)$$
  • Why deep representation?
    • Earlier layers learn simple features; later, deeper layers combine them to detect more complex things.
    • Circuit theory and deep learning: Informally: There are functions you can compute with a "small" L-layer deep neural network that shallower networks require exponentially more hidden units to compute.
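
The general forward rule and the matrix-dimension rules above can be combined into one loop over layers. The sketch below uses the layer sizes from the notation example (n^{[0]}=3, 5, 5, 3, 1); ReLU for hidden layers and sigmoid for the output layer are assumptions, as are the function names.

```python
import numpy as np

def init_deep(layer_dims, seed=0):
    """W[l] has shape (n[l], n[l-1]); b[l] has shape (n[l], 1)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

def forward_deep(params, X):
    """Vectorized forward pass: Z[l] = W[l] A[l-1] + b[l], A[l] = g[l](Z[l])."""
    L = len(params) // 2
    A = X                                        # A[0] = X, shape (n[0], m)
    for l in range(1, L + 1):
        W, b = params[f"W{l}"], params[f"b{l}"]
        assert W.shape[1] == A.shape[0]          # inner dimensions must match
        Z = W @ A + b
        # Assumed activations: ReLU in hidden layers, sigmoid at the output.
        A = np.maximum(0, Z) if l < L else 1.0 / (1.0 + np.exp(-Z))
        assert A.shape == (W.shape[0], X.shape[1])   # (n[l], m)
    return A

params = init_deep([3, 5, 5, 3, 1])
Y_hat = forward_deep(params, np.ones((3, 7)))    # m = 7 made-up examples
print(Y_hat.shape)                               # (1, 7)
```

The explicit loop over layers l = 1..L is expected here; only the loop over training examples is vectorized away.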

1.4.3 Building Blocks of Deep Neural Networks

  • Forward and backward functions
    • deep-learning-notes_1-4-3
    • Layer l:w^{[l]},b^{[l]}
    • Forward: Input a^{[l-1]}, output a^{[l]}
      • $$z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}$$ $$cache\space z^{[l]}$$
      • $$a^{[l]}=g^{[l]}(z^{[l]})$$
    • Backward: Input da^{[l]}, cache(z^{[l]}), output da^{[l-1]},dw^{[l]},db^{[l]}
  • One iteration of gradient descent of neural network
    • deep-learning-notes_1-4-3-2
  • How to implement?
    • Forward propagation for layer l
      • Input a^{[l-1]}, output a^{[l]},cache\space (z^{[l]})
        • $$z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}$$
        • $$a^{[l]}=g^{[l]}(z^{[l]})$$
      • Vectorized:
        • $$Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]}$$
        • $$A^{[l]}=g^{[l]}(Z^{[l]})$$
    • Backward propagation for layer l
      • Input da^{[l]}, cache(z^{[l]}), output da^{[l-1]},dw^{[l]},db^{[l]}
        • $$dz^{[l]}=da^{[l]}*g'^{[l]}(z^{[l]})$$
        • $$dw^{[l]}=dz^{[l]}\cdot a^{[l-1]T}$$
        • $$db^{[l]}=dz^{[l]}$$
        • $$da^{[l-1]}=w^{[l]T}\cdot dz^{[l]}$$
      • Vectorized:
        • $$dZ^{[l]}=dA^{[l]}*g'^{[l]}(Z^{[l]})$$
        • $$dW^{[l]}=\dfrac{1}{m}dZ^{[l]}A^{[l-1]T}$$
        • $$db^{[l]}=\dfrac{1}{m}np.sum(dZ^{[l]},axis=1,keepdims=True)$$
        • $$dA^{[l-1]}=W^{[l]T}\cdot dZ^{[l]}$$
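
A single layer's forward and backward functions, following the vectorized equations above, might look like this. The sketch assumes tanh as g^{[l]} and stores A^{[l-1]} and W^{[l]} in the cache alongside Z^{[l]}, since the backward formulas need them; the function names are hypothetical.

```python
import numpy as np

def layer_forward(A_prev, W, b):
    """Forward step for one layer: returns A[l] and a cache for the backward step."""
    Z = W @ A_prev + b
    A = np.tanh(Z)                 # assumed activation g[l] = tanh
    cache = (A_prev, W, Z)         # assumption: cache A[l-1] and W[l] as well as Z[l]
    return A, cache

def layer_backward(dA, cache):
    """Backward step for one layer: returns dA[l-1], dW[l], db[l]."""
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = dA * (1 - np.tanh(Z) ** 2)                 # element-wise product with g'(Z)
    dW = dZ @ A_prev.T / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

rng = np.random.default_rng(1)
A_prev = rng.standard_normal((3, 4))    # n[l-1] = 3 units, m = 4 examples (made up)
W = rng.standard_normal((2, 3))         # n[l] = 2 units
b = rng.standard_normal((2, 1))
A, cache = layer_forward(A_prev, W, b)
dA_prev, dW, db = layer_backward(np.ones_like(A), cache)
print(dA_prev.shape, dW.shape, db.shape)   # (3, 4) (2, 3) (2, 1)
```

Chaining `layer_forward` for l = 1..L and then `layer_backward` for l = L..1 gives one iteration of gradient descent.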

1.4.4 Parameters vs. Hyperparameters

  • Parameters: W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]},...
  • Hyperparameters (will affect/control/determine parameters):
    • learning rate \alpha
    • # iterations
    • # of hidden units n^{[1]},n^{[2]},...
    • # of hidden layers
    • Choice of activation function
  • Later: momentum, minibatch size, regularization parameters,...

II. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

2.1 Practical Aspects of Deep Learning

2.1.1 Train / Dev / Test sets

  • With very large datasets, the dev and test sets may need only 1% of the data or even less.
  • Mismatched: Make sure dev/test come from same distribution
  • Not having a test set might be okay. (Only dev set.)

2.1.2 Bias / Variance

deep-learning-notes_2-1-2

deep-learning-notes_2-1-2-2

  • Assume optimal (Bayes) error: \approx0\%
  • High bias (underfitting): the model fits even the training set poorly.
    • Training set error 15\%, Dev set error 16\%.
    • Training set error 15\%, Dev set error 30\% (high bias and high variance).
  • "Just right": low error on both the training set and the dev set.
    • Training set error 0.5\%, Dev set error 1\%.
  • High variance (overfitting): the model fits the training set well but generalizes poorly to the dev set.
    • Training set error 1\%, Dev set error 11\%.
    • Training set error 15\%, Dev set error 30\% (high bias and high variance).

2.1.3 Basic Recipe for Machine Learning

2.1.3.1 Basic Recipe
  • High bias(training data performance)
    • Bigger network
    • Train longer
    • (NN architecture search)
  • High variance (dev set performance)
    • More data
    • Regularization
    • (NN architecture search)
2.1.3.2 Regularization
  • Logistic regression. \min\limits_{w,b}J(w,b)
    • $$w\in\mathbb{R}^{n_x}, b\in\mathbb{R}$$
    • $$\lambda=regularization\space parameter$$
    • $$J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})+\dfrac{\lambda}{2m}||w||_2^2$$
    • L2 regularization ||w||_2^2=\sum\limits_{j=1}^{n_x}w_j^2=w^Tw
    • L1 regularization \dfrac{\lambda}{2m}\sum\limits_{j=1}^{n_x}|w_j|=\dfrac{\lambda}{2m}||w||_1
      • $$w$$ will be sparse (for L1): it will have lots of 0s in it, which helps only a little bit.
  • Neural network
    • $$J(w^{[1]},b^{[1]},...,w^{[L]},b^{[L]})=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})+\dfrac{\lambda}{2m}\sum\limits_{l=1}^{L}||w^{[l]}||_F^2$$
    • $$||w^{[l]}||_F^2=\sum\limits_{i=1}^{n^{[l]}}\sum\limits_{j=1}^{n^{[l-1]}}(w_{ij}^{[l]})^2$$ $$w^{[l]}: (n^{[l]},n^{[l-1]})$$
      • Frobenius norm: square root of the sum of squares of all elements in a matrix (the penalty uses its square).
    • $$dw^{[l]}=(from\space backprop)+\dfrac{\lambda}{m}w^{[l]}$$
      • $$w^{[l]}:=w^{[l]}-\alpha dw^{[l]}$$ (keep the same)
      • Weight decay
        • $$w^{[l]}:=w^{[l]}-\alpha[(from\space backprop)+\dfrac{\lambda}{m}w^{[l]}]$$
        • =w^{[l]}-\dfrac{\alpha\lambda}{m}w^{[l]}-\alpha(from\space backprop)
        • =(1-\dfrac{\alpha\lambda}{m})w^{[l]}-\alpha(from\space backprop)
  • How does regularization prevent overfitting: a bigger \lambda makes w^{[l]} smaller, so z^{[l]} is smaller and stays in the nearly linear region of the activation function (take tanh as an example). Each layer is then close to linear, which makes it hard for the network to draw a highly curved decision boundary.
  • Dropout regularization
    • deep-learning-notes_2-1-3-2
    • Implementing dropout("Inverted dropout")
      • Illustrate with layer l=3, keep_prob=0.8 (each unit has a 0.2 chance of being dropped, i.e. zeroed out)
      • $$d3 = np.random.rand(a3.shape[0],a3.shape[1]) < keep_prob$$ #d3 is a matrix with the same shape as a3 holding True (1) for kept units and False (0) for dropped units.
      • $$a3 = np.multiply(a3, d3)$$ #a3*=d3; this zeroes out the dropped neurons
      • $$a3/=keep_prob$$ #inverted dropout: keeps the expected value of the activations the same before and after dropout.
    • Why it works: a unit can't rely on any one feature, so it has to spread out its weights (which shrinks the weights).
    • First make sure J is decreasing over the iterations with dropout turned off, then turn on dropout.
  • Data augmentation
    • Image: crop, flip, distort...
  • Early stopping
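    • Stops training at a mid-size ||w||_F^2.
    • Downside: it tries to optimize the cost function and avoid overfitting at the same time, so the two tasks can no longer be handled independently.
  • Orthogonalization
    • Work on only one task at a time: either optimize the cost function or reduce overfitting.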
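
The inverted dropout steps listed above can be wrapped in a small helper. This is a sketch only: the function name `inverted_dropout` and the use of NumPy's `default_rng` generator (instead of `np.random.rand`) are choices made here, not part of the notes.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Apply inverted dropout to the activations a of one layer.

    Returns the scaled activations and the boolean mask d, which the backward
    pass would need in order to drop the same units.
    """
    d = rng.random(a.shape) < keep_prob   # True = keep the unit, False = drop it
    a = a * d                             # zero out the dropped units
    a = a / keep_prob                     # rescale so the expected value is unchanged
    return a, d

rng = np.random.default_rng(0)
a3 = np.ones((1000, 100))                 # made-up activations, all equal to 1
a3_dropped, d3 = inverted_dropout(a3, keep_prob=0.8, rng=rng)
print(a3_dropped.mean())                  # close to 1.0, the mean before dropout
```

Each surviving entry becomes 1 / 0.8 = 1.25 and roughly 20% of the entries become 0, which is why the mean stays close to 1.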
2.1.3.3 Setting up your optimization problem
  • Normalizing training sets
    • deep-learning-notes_2-1-3-3
    • $$x=\begin{gathered}\begin{bmatrix}x_1 \\ x_2\end{bmatrix}\end{gathered}$$
    • Subtract mean:
      • $$\mu=\dfrac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}$$
      • $$x:=x-\mu$$
    • Normalize variance:
      • $$\sigma^2=\dfrac{1}{m}\sum\limits_{i=1}^{m}(x^{(i)})^2$$ (element-wise square, computed after subtracting the mean)
      • $$x/=\sigma$$
    • Use the same \mu,\sigma (computed on the training set) to normalize the test set.
    • Why normalize inputs?
      • Helps a lot when the input features are on very different scales: the cost surface becomes more symmetric, so gradient descent can use a larger learning rate and converge faster.
      • deep-learning-notes_2-1-3-3-2
  • Vanishing/exploding gradients
    • $$w^{[l]}>I$$ (even slightly): activations and gradients grow exponentially with the number of layers (exploding).
    • $$w^{[l]}<I$$ (even slightly): activations and gradients shrink exponentially with the number of layers (vanishing).
  • Weight initialization (Single neuron)
    • large n (number of input features) --> smaller w_i
    • $$Variance(w_i)=\dfrac{1}{n}$$ (sigmoid/tanh) ReLU: $$\dfrac{2}{n}$$ (the variance could be tuned as a hyperparameter, but this is low priority and usually not worth doing)
    • $$w^{[l]}=np.random.randn(shapeOfMatrix)*np.sqrt(\dfrac{1}{n^{[l-1]}})$$ ReLU: $$np.sqrt(\dfrac{2}{n^{[l-1]}})$$
    • Xavier initialization: \sqrt{\dfrac{1}{n^{[l-1]}}} Sometimes \sqrt{\dfrac{2}{n^{[l-1]}+n^{[l]}}}
  • Numerical approximation of gradients
    • $$\dfrac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}$$
  • Gradient checking (Grad check)
    • Take W^{[1]},b^{[1]},...,W^{[L]},b^{[L]} and reshape into a big vector \theta.
    • Take dW^{[1]},db^{[1]},...,dW^{[L]},db^{[L]} and reshape into a big vector d\theta.
    • for each i:
      • $$d\theta_{approx}[i]=\dfrac{J(\theta_1,\theta_2,...,\theta_i+\epsilon,...)-J(\theta_1,\theta_2,...,\theta_i-\epsilon,...)}{2\epsilon}\approx d\theta[i]=\dfrac{\partial J}{\partial \theta_i}$$
      • Check the relative difference \dfrac{||d\theta_{approx}-d\theta||_2}{||d\theta_{approx}||_2+||d\theta||_2} (||.||_2 is the Euclidean norm: the square root of the sum of squared elements)
      • Take \epsilon=10^{-7}; if the relative difference above is \approx10^{-7} or smaller, that is great.
      • If it is around 10^{-5}, take a closer look.
      • If it is 10^{-3} or bigger, worry: there may be a bug. Check which components i have d\theta_{approx}[i] far from d\theta[i].
    • notes:
      • Don't use in training - only to debug.
      • If algorithm fails grad check, look at components to try to identify bug.
      • Remember regularization. (include the \dfrac{\lambda}{2m} term in J and its gradient)
      • Doesn't work with dropout. (since it is random, run the check with dropout turned off)
      • Run at random initialization; perhaps again after some training. (an implementation may be correct only while w,b\approx0)
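
The gradient checking procedure above can be sketched as a generic function over a flattened parameter vector. The cost J(θ) = Σθ_i² used in the usage lines is a made-up stand-in for a real network cost, and the function name `grad_check` is hypothetical.

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Compare an analytic gradient dtheta with a two-sided numerical estimate.

    J maps a 1-D float parameter vector to a scalar cost. Returns the relative
    difference ||approx - dtheta|| / (||approx|| + ||dtheta||).
    """
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    numerator = np.linalg.norm(approx - dtheta)
    denominator = np.linalg.norm(approx) + np.linalg.norm(dtheta)
    return numerator / denominator

# Stand-in cost (made up): J(theta) = sum(theta^2), whose gradient is 2 * theta.
J = lambda t: float(np.sum(t ** 2))
theta = np.array([1.0, -2.0, 3.0])

good = grad_check(J, theta, 2 * theta)   # correct gradient: tiny difference
bad = grad_check(J, theta, theta)        # deliberately wrong gradient: large difference
print(good, bad)
```

For a real network, theta would be the concatenation of all reshaped W^{[l]} and b^{[l]}, and dtheta the matching concatenation of dW^{[l]} and db^{[l]}.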

2.2 Optimization Algorithms

2.2.1 Mini-batch gradient descent

2.2.2 Exponentially weighted averages

2.2.3 RMSprop and Adam optimization

2.3 Hyperparameter Tuning, Batch Normalization, and Programming Frameworks

2.3.1 Tuning process

2.3.2 Using an appropriate scale to pick hyperparameters

2.3.3 Batch Normalization

2.3.4 Multi-class classification

III. Structuring Machine Learning Projects

IV. Convolutional Neural Networks

V. Sequence Models