These notes are based on https://www.coursera.org/specializations/deep-learning
The LaTeX may not render correctly in some viewers.
1. Neural Networks and Deep Learning
1.1 Introduction to Deep Learning
1.1.1 Supervised Learning with Deep Learning
- Structured data: database-style tables, where each feature has a well-defined meaning.
- Unstructured data: audio, images, text.
1.1.2 Scale drives deep learning progress
- With larger amounts of data, a large neural network keeps improving, while smaller networks and traditional learning algorithms plateau.
- Switching from sigmoid to ReLU makes gradient descent much faster, because the gradient no longer saturates toward 0 for large inputs.
1.2 Basics of Neural Network Programming
1.2.1 Binary Classification
- Input: x \in \mathbb{R}^{n_x}
- Output: y \in \{0, 1\}
1.2.2 Logistic Regression
- Given x, want \hat{y} = P(y=1|x)
- Input: x \in R^{n_x}
- Parameters: w \in R^{n_x}, b \in R
- Output \hat{y} = \sigma(w^Tx + b)
- $$\sigma(z)=\dfrac{1}{1+e^{-z}}$$
- If z is large, \sigma(z)\approx\dfrac{1}{1+0}\approx1
- If z is a large negative number, \sigma(z)\approx\dfrac{1}{1+\text{(big number)}}\approx0
- Loss (error) function:
- $$\hat{y} = \sigma(w^Tx + b)$$, where $$\sigma(z)=\dfrac{1}{1+e^{-z}}$$
- $$z^{(i)}=w^Tx^{(i)}+b$$
- Want y^{(i)} \approx \hat{y}^{(i)}
- $$L(y, \hat{y}) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]$$
- If y=1: L(\hat{y}, y)=-\log{\hat{y}} <- want \log{\hat{y}} as large as possible, i.e. \hat{y} large (close to 1)
- If y=0: L(\hat{y}, y)=-\log{(1-\hat{y})} <- want \log{(1-\hat{y})} as large as possible, i.e. \hat{y} small (close to 0)
Cost function
- $$J(w, b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})=-\dfrac{1}{m}\sum\limits_{i=1}^{m}[y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]$$
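- A minimal numpy sketch of the loss/cost above (the `sigmoid` and `cost` names are mine, not from the course):
```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    # X: (n_x, m) inputs, Y: (1, m) labels, w: (n_x, 1), b: scalar
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)   # (1, m) predictions y_hat
    # J = -(1/m) * sum[ y*log(a) + (1-y)*log(1-a) ]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
```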
1.2.3 Gradient Descent
- Repeat: $$w:=w-\alpha\dfrac{\partial J(w,b)}{\partial w}$$; $$b:=b-\alpha\dfrac{\partial J(w,b)}{\partial b}$$
- $$\alpha$$: Learning rate
- To the right of the minimum, \dfrac{dJ(w)}{dw} > 0; to the left, \dfrac{dJ(w)}{dw}<0, so the update moves $$w$$ toward the minimum in both cases.
- Logistic Regression Gradient Descent
- $$x_1,x_2,w_1,w_2,b$$
- $$z=w_1x_1+w_2x_2+b$$ -->$$a=\sigma(z)$$ -->$$L(a,y)$$
- $$da=\dfrac{dL(a,y)}{da}=-\dfrac{y}{a}+\dfrac{1-y}{1-a}$$
- $$\dfrac{dL(y,a)}{da} = \dfrac{d}{da}(-y\log(a) - (1-y)\log(1-a))$$
- $$\dfrac{d}{da} (-y\log(a)) = -\dfrac{y}{a}$$
- $$\dfrac{d}{da} (-(1-y)\log(1-a)) = -\dfrac{1-y}{1-a} \times (-1) = \dfrac{1-y}{1-a} $$
- $$=-\dfrac{y}{a} + \dfrac{1-y}{1-a} = -\dfrac{y}{a} - \dfrac{y-1}{1-a}$$
- $$dz=\dfrac{dL}{dz}=\dfrac{dL(a,y)}{dz}=a-y$$
- $$=\dfrac{dL}{da}\cdot\dfrac{da}{dz}$$ ($$\dfrac{da}{dz}=a(1-a)$$)
- $$\dfrac{dL}{dw_1}="dw_1"=x_1\cdot dz$$
- $$\dfrac{dL}{dw_2}="dw_2"=x_2\cdot dz$$
- $$db=dz$$
- Gradient Descent on m examples
- $$J(w, b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(a^{(i)},y^{(i)})$$
- $$\dfrac{\partial}{\partial w_1}J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}\dfrac{\partial}{\partial w_1}L(a^{(i)},y^{(i)})$$
- $$J=0;dw_1=0;dw_2=0;db=0$$
- for i=1 to m
- $$z^{(i)}=w^Tx^{(i)}+b$$
- $$a^{(i)}=\sigma (z^{(i)})$$
- $$J+=-[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]$$
- $$dz^{(i)}=a^{(i)}-y^{(i)}$$
- $$dw_1+=x_1^{(i)}dz^{(i)}$$ (for n = 2)
- $$dw_2+=x_2^{(i)}dz^{(i)}$$ (for n = 2)
- $$db+=dz^{(i)}$$
- $$J/=m;dw_1/=m;dw_2/=m;db/=m$$
- $$dw_1=\dfrac{\partial J}{\partial w_1}; dw_2=\dfrac{\partial J}{\partial w_2}$$
- $$w_1:=w_1-\alpha dw_1$$
- $$w_2:=w_2-\alpha dw_2$$
- $$b:=b-\alpha db$$
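- A runnable sketch of one gradient-descent step with the explicit loop above (function name and the `alpha` default are my own choices):
```python
import numpy as np

def one_step_for_loop(w1, w2, b, X, Y, alpha=0.01):
    """One gradient-descent step with explicit loops, mirroring the
    pseudocode above (n = 2 features; X is (2, m), Y is (1, m))."""
    m = X.shape[1]
    J = dw1 = dw2 = db = 0.0
    for i in range(m):
        z = w1 * X[0, i] + w2 * X[1, i] + b
        a = 1.0 / (1.0 + np.exp(-z))                  # sigmoid
        J += -(Y[0, i] * np.log(a) + (1 - Y[0, i]) * np.log(1 - a))
        dz = a - Y[0, i]
        dw1 += X[0, i] * dz
        dw2 += X[1, i] * dz
        db += dz
    J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m
    # parameter update
    w1 -= alpha * dw1
    w2 -= alpha * dw2
    b -= alpha * db
    return w1, w2, b, J
```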
1.2.4 Computational Graph
- $$J(a,b,c)=3(a+bc)$$
- $$u=bc$$
- $$v=a+u$$
- $$J=3v$$
- Forward computation goes left to right; derivatives are computed right to left.
- Derivatives with a Computation Graph
- $$\dfrac{dJ}{dv}=3$$
- $$\dfrac{dJ}{da}=3$$
- $$\dfrac{dv}{da}=1$$
- Chain Rule: \dfrac{dJ}{da}=\dfrac{dJ}{dv}\cdot\dfrac{dv}{da}
- $$\dfrac{dJ}{du}=3; \dfrac{du}{db}=c=2; \dfrac{dJ}{db}=6$$ (using the example values $$b=3, c=2$$)
- $$\dfrac{du}{dc}=b=3; \dfrac{dJ}{dc}=9$$
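- A small sketch of the forward and backward passes through this graph (the derivative values above are consistent with the course example a=5, b=3, c=2):
```python
def forward_backward(a, b, c):
    """Forward and backward pass through the graph J = 3(a + b*c).
    With a=5, b=3, c=2 this reproduces dJ/da=3, dJ/db=6, dJ/dc=9."""
    # forward (left to right)
    u = b * c
    v = a + u
    J = 3 * v
    # backward (right to left, chain rule)
    dJ_dv = 3
    dJ_da = dJ_dv * 1          # dv/da = 1
    dJ_du = dJ_dv * 1          # dv/du = 1
    dJ_db = dJ_du * c          # du/db = c
    dJ_dc = dJ_du * b          # du/dc = b
    return J, dJ_da, dJ_db, dJ_dc
```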
1.2.5 Vectorization
- Avoid explicit for-loops; use vectorized numpy operations instead.
- $$J=0;dw=np.zeros((n_x,1));db=0$$
- for i=1 to m
- $$z^{(i)}=w^Tx^{(i)}+b$$
- $$a^{(i)}=\sigma (z^{(i)})$$
- $$J+=-[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]$$
- $$dz^{(i)}=a^{(i)}-y^{(i)}$$
- $$dw+=x^{(i)}dz^{(i)}$$
- $$db+=dz^{(i)}$$
- $$J/=m;dw/=m;db/=m$$
- $$Z=np.dot(w.T, X)+b$$ ; $$b$$ is a $$(1,1)$$ real number that gets broadcast across the $$1\times m$$ row.
- Vectorization Logistic Regression
- $$dz^{(1)}=a^{(1)}-y^{(1)}; dz^{(2)}=a^{(2)}-y^{(2)}...$$
- $$dZ=[dz^{(1)}, dz^{(2)},...,dz^{(m)}]$$, a $$1\times m$$ row vector
- $$A=[a^{(1)}, a^{(2)}, ..., a^{(m)}]$$ $$Y=[y^{(1)}, y^{(2)}, ..., y^{(m)}]$$
- $$dZ=A-Y=[a^{(1)}-y^{(1)}, a^{(2)}-y^{(2)}, ...]$$
- Get rid of db and dw in for loop
- $$db=\dfrac{1}{m}\sum\limits_{i=1}^{m}dz^{(i)}=\dfrac{1}{m}np.sum(dZ)$$
- $$dw=\dfrac{1}{m}\cdot X\cdot dZ^T=\dfrac{1}{m}[x^{(1)}...x^{(m)}][dz^{(1)}...dz^{(m)}]^T=\dfrac{1}{m}[x^{(1)}dz^{(1)}+...+x^{(m)}dz^{(m)}]$$, an $$n_x\times 1$$ vector
- New Form of Logistic Regression
- $$Z=w^TX+b=np.dot(w.T, X)+b$$
- $$A=\sigma (Z)$$
- $$dZ=A-Y$$
- $$dw=\dfrac{1}{m}\cdot X \cdot dZ^T$$
- $$db=\dfrac{1}{m}np.sum(dZ)$$
- $$w:=w-\alpha dw$$
- $$b:=b-\alpha db$$
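- Putting the vectorized equations together, a minimal training loop might look like this (function name, `alpha`, and `iters` are illustrative defaults):
```python
import numpy as np

def train_logistic(X, Y, alpha=0.01, iters=1000):
    """Vectorized logistic regression (no loop over examples).
    X: (n_x, m), Y: (1, m)."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(iters):
        Z = np.dot(w.T, X) + b          # (1, m)
        A = 1.0 / (1.0 + np.exp(-Z))    # sigmoid, (1, m)
        dZ = A - Y                      # (1, m)
        dw = np.dot(X, dZ.T) / m        # (n_x, 1)
        db = np.sum(dZ) / m
        w -= alpha * dw
        b -= alpha * db
    return w, b
```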
- Broadcasting(same as bsxfun in Matlab/Octave)
- $$(m,n)$$ +-*/ $$(1,n)$$ -> $$(m,n)$$: the $$(1,n)$$ row is copied $$m$$ times.
- $$(m,n)$$ +-*/ $$(m,1)$$ -> $$(m,n)$$: the $$(m,1)$$ column is copied $$n$$ times.
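- A quick numpy demonstration of both broadcasting rules:
```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])            # shape (2, 3)
row = np.array([[10., 20., 30.]])       # shape (1, 3): broadcast down the rows
col = np.array([[100.], [200.]])        # shape (2, 1): broadcast across the columns
print(A + row)   # [[11. 22. 33.] [14. 25. 36.]]
print(A * col)   # [[100. 200. 300.] [800. 1000. 1200.]]
```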
- Don't use a = np.random.randn(5): a.shape = (5,) is a "rank 1 array" and behaves inconsistently (it is neither a row nor a column vector).
- Use a = np.random.randn(5,1) or a = np.random.randn(1,5)
- Check with $$assert(a.shape == (5,1))$$
- Fix rank 1 array by a = a.reshape((5,1))
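- A short demonstration of why rank-1 arrays are error-prone:
```python
import numpy as np

a = np.random.randn(5)        # rank-1 array, shape (5,)
print(a.shape)                # (5,)
print(np.dot(a, a.T))         # a scalar, not the (5, 5) outer product one might expect

b = np.random.randn(5, 1)     # proper column vector
assert b.shape == (5, 1)
print(np.dot(b, b.T).shape)   # (5, 5) outer product, as expected

a = a.reshape((5, 1))         # fix a rank-1 array
assert a.shape == (5, 1)
```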
- Logistic Regression Cost Function
- Loss
- $$p(y|x)=\hat{y}^y(1-\hat{y})^{(1-y)}$$
- If y=1: p(y|x)=\hat{y}
- If y=0: p(y|x)=(1-\hat{y})
- $$\log p(y|x)=\log \hat{y}^y(1-\hat{y})^{(1-y)}=y\log \hat{y}+(1-y)\log(1-\hat{y})=-L(\hat{y},y)$$
- Cost
- $$\log p(labels\space in\space training\space set)=\log \prod_{i=1}^{m}p(y^{(i)}|x^{(i)})$$ (assuming i.i.d. training examples)
- $$\log p(labels\space in\space training\space set)=\sum\limits_{i=1}^m\log p(y^{(i)}|x^{(i)})=-\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})$$
- Use maximum likelihood estimation (MLE): maximizing the log-likelihood is equivalent to minimizing the sum of the losses.
- Cost (minimize): J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)}) (the \dfrac{1}{m} factor just rescales)
1.3 Shallow Neural Networks
1.3.1 Neural Network Representation
Input layer, hidden layer, output layer
- $$a^{[0]}=x$$ -> $$a^{[1]}=[a^{[1]}_1,a^{[1]}_2,a^{[1]}_3,a^{[1]}_4]^T$$ -> $$a^{[2]}$$
- Layers are counted as the number of hidden layers plus the output layer (the input layer is not counted).
- $$x_1,x_2,x_3$$ -> $$4\space hidden\space nodes$$ -> $$Output\space layer$$
- First hidden node: z^{[1]}_1=w^{[1]T}_1x+b^{[1]}_1, a^{[1]}_1=\sigma(z^{[1]}_1)
- Second hidden node: z^{[1]}_2=w^{[1]T}_2x+b^{[1]}_2, a^{[1]}_2=\sigma(z^{[1]}_2)
- Third hidden node: z^{[1]}_3=w^{[1]T}_3x+b^{[1]}_3, a^{[1]}_3=\sigma(z^{[1]}_3)
- Fourth hidden node: z^{[1]}_4=w^{[1]T}_4x+b^{[1]}_4, a^{[1]}_4=\sigma(z^{[1]}_4)
- Vectorization
- $$W^{[1]}=\begin{bmatrix}-w^{[1]T}_1- \\ -w^{[1]T}_2- \\ -w^{[1]T}_3- \\ -w^{[1]T}_4-\end{bmatrix}$$, a $$(4,3)$$ matrix
- $$z^{[1]}=\begin{bmatrix}-w^{[1]T}_1- \\ -w^{[1]T}_2- \\ -w^{[1]T}_3- \\ -w^{[1]T}_4-\end{bmatrix}\cdot\begin{bmatrix}x_1 \\ x_2 \\ x_3\end{bmatrix}+\begin{bmatrix}b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \\ b^{[1]}_4\end{bmatrix}=\begin{bmatrix}w^{[1]T}_1\cdot x+b^{[1]}_1 \\ w^{[1]T}_2\cdot x+b^{[1]}_2 \\ w^{[1]T}_3\cdot x+b^{[1]}_3 \\ w^{[1]T}_4\cdot x+b^{[1]}_4\end{bmatrix}=\begin{bmatrix}z^{[1]}_1 \\ z^{[1]}_2 \\ z^{[1]}_3 \\ z^{[1]}_4\end{bmatrix}$$
- $$a^{[1]}=\begin{bmatrix}a^{[1]}_1 \\ a^{[1]}_2 \\ a^{[1]}_3 \\ a^{[1]}_4\end{bmatrix}=\sigma(z^{[1]})$$
- $$z^{[2]}=W^{[2]}\cdot a^{[1]}+b^{[2]}$$ $$(1, 1),(1, 4),(4, 1),(1, 1)$$
- $$a^{[2]}=\sigma(z^{[2]})$$ $$(1,1),(1,1)$$
- Notation: $$a^{[2](i)}$$ means layer 2, training example $$i$$.
- for i=1 to m: $$z^{[1](i)}=W^{[1]}x^{(i)}+b^{[1]},\space a^{[1](i)}=\sigma(z^{[1](i)}),\space z^{[2](i)}=W^{[2]}a^{[1](i)}+b^{[2]},\space a^{[2](i)}=\sigma(z^{[2](i)})$$
- Vectorizing of the above for loop
- $$X=\begin{bmatrix}| & | &  & | \\ x^{(1)} & x^{(2)} & ... & x^{(m)} \\ | & | &  & |\end{bmatrix}$$, an $$(n_x,m)$$ matrix: each column is one training example, each row one input feature.
- $$Z^{[1]}=W^{[1]}\cdot X+b^{[1]}$$
- $$A^{[1]}=\sigma(Z^{[1]})$$
- $$Z^{[2]}=W^{[2]}\cdot A^{[1]}+b^{[2]}$$
- $$A^{[2]}=\sigma(Z^{[2]})$$
- Horizontally: training examples; vertically: hidden units.
1.3.2 Activation Functions
- $$g^{[i]}$$: activation function of layer $$i$$
- Sigmoid: a=\dfrac{1}{1+e^{-z}}
- Tanh: a=\dfrac{e^z-e^{-z}}{e^z+e^{-z}}
- ReLU: a=\max(0,z)
- Leaky ReLU: a=\max(0.01z, z)
- Rules for choosing an activation function
- If the output is binary (y \in \{0, 1\}), use sigmoid for the output layer.
- Otherwise, ReLU is the default choice.
- Why a non-linear activation function is needed
- With linear activations in the hidden layers, stacking layers is useless: the whole network collapses to a single linear function a=w'x+b'.
- A linear activation is sometimes used at the output layer (e.g. regression), with non-linear activations in the hidden layers.
1.3.3 Forward and Backward Propagation
- Derivative of activation function
- Sigmoid: g'(z)=\dfrac{d}{dz}g(z)=\dfrac{1}{1+e^{-z}}\left(1-\dfrac{1}{1+e^{-z}}\right)=g(z)(1-g(z))=a(1-a)
- Tanh: g'(z)=\dfrac{d}{dz}g(z)=1-(\tanh(z))^2
- ReLU: $$g'(z)=\begin{cases}0 & if\space z<0 \\ 1 & if\space z>0 \\ undefined & if\space z=0\space (take\space 0\space or\space 1\space in\space practice)\end{cases}$$
- Leaky ReLU: $$g'(z)=\begin{cases}0.01 & if\space z<0 \\ 1 & if\space z\geq0\end{cases}$$
- Gradient descent for neural networks
- Parameters: w^{[1]}: (n^{[1]},n^{[0]}), b^{[1]}: (n^{[1]},1), w^{[2]}: (n^{[2]},n^{[1]}), b^{[2]}: (n^{[2]},1)
- Layer sizes: $$n^{[0]}=n_x,\space n^{[1]}=\#\space hidden\space units,\space n^{[2]}=1$$
- Cost function: J(w^{[1]}, b^{[1]},w^{[2]}, b^{[2]})=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})
- Forward propagation:
- $$Z^{[1]}=W^{[1]}\cdot X+b^{[1]}$$
- $$A^{[1]}=g^{[1]}(Z^{[1]})$$
- $$Z^{[2]}=W^{[2]}\cdot A^{[1]}+b^{[2]}$$
- $$A^{[2]}=g^{[2]}(Z^{[2]})=\sigma(Z^{[2]})$$
- Backpropagation:
- $$dZ^{[2]}=A^{[2]}-Y$$ $$Y=[y^{(1)},y^{(2)},...,y^{(m)}]$$
- $$dW^{[2]}=\dfrac{1}{m}dZ^{[2]}A^{[1]T}$$
- $$db^{[2]}=\dfrac{1}{m}np.sum(dZ^{[2]},axis=1,keepdims=True)$$
- $$dZ^{[1]}=W^{[2]T}dZ^{[2]}*g'^{[1]}(Z^{[1]})$$
- $$(n^{[1]},m)->element-wise\space product->(n^{[1]},m)$$
- $$dW^{[1]}=\dfrac{1}{m}dZ^{[1]}X^{T}$$
- $$db^{[1]}=\dfrac{1}{m}np.sum(dZ^{[1]},axis=1,keepdims=True)$$
Random Initialization (initializing all weights to zero would make every hidden unit compute the same function; small random values break the symmetry)
- $$x_1,x_2->a_1^{[1]},a_2^{[1]}->a_1^{[2]}->\hat{y}$$
- $$w^{[1]}=np.random.randn(2,2)*0.01$$ (the factor 0.01 keeps $$z$$ small so sigmoid/tanh do not saturate)
- $$b^{[1]}=np.zeros((2,1))$$
- $$w^{[2]}=np.random.randn(1,2)*0.01$$
- $$b^{[2]}=0$$
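- Putting 1.3 together, a minimal sketch of training a one-hidden-layer network with tanh hidden units and a sigmoid output (all names are mine; `n_h`, `alpha`, and `iters` are arbitrary defaults):
```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def train_shallow_nn(X, Y, n_h=4, alpha=0.1, iters=1000):
    """One-hidden-layer network following the forward/backward equations above.
    X: (n_x, m), Y: (1, m)."""
    n_x, m = X.shape
    # random initialization (zeros would make all hidden units identical)
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(1, n_h) * 0.01
    b2 = np.zeros((1, 1))
    for _ in range(iters):
        # forward propagation
        Z1 = np.dot(W1, X) + b1
        A1 = np.tanh(Z1)
        Z2 = np.dot(W2, A1) + b2
        A2 = sigmoid(Z2)
        # backward propagation
        dZ2 = A2 - Y
        dW2 = np.dot(dZ2, A1.T) / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)       # tanh'(Z1) = 1 - A1^2
        dW1 = np.dot(dZ1, X.T) / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        # gradient-descent update
        W1 -= alpha * dW1; b1 -= alpha * db1
        W2 -= alpha * dW2; b2 -= alpha * db2
    return W1, b1, W2, b2
```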
1.4 Deep Neural Networks
1.4.1 Deep L-Layer Neural Network
- Deep neural network notation
- $$L=4$$ (#layers)
- $$n^{[l]}= \#\space units\space in\space layer\space l$$
- $$n^{[1]}=5,n^{[2]}=5,n^{[3]}=3,n^{[4]}=n^{[L]}=1$$
- $$n^{[0]}=n_x=3$$
- $$a^{[l]}=activations\space in\space layer\space l$$
- $$a^{[l]}=g^{[l]}(z^{[l]}),\space w^{[l]}=weights\space for\space z^{[l]},\space b^{[l]}=bias\space for\space z^{[l]}$$
- $$x=a^{[0]},\space \hat{y}=a^{[L]}$$
1.4.2 Forward Propagation in a Deep Network
- General: Z^{[l]}=w^{[l]}A^{[l-1]}+b^{[l]}, A^{[l]}=g^{[l]}(Z^{[l]})
- $$x: z^{[1]}=w^{[1]}a^{[0]}+b^{[1]}, a^{[1]}=g^{[1]}(z^{[1]})$$ $$a^{[0]}=X$$
- $$z^{[2]}=w^{[2]}a^{[1]}+b^{[2]}, a^{[2]}=g^{[2]}(z^{[2]})$$
- ...
- $$z^{[4]}=w^{[4]}a^{[3]}+b^{[4]}, a^{[4]}=g^{[4]}(z^{[4]})=\hat{y}$$
- Vectorizing:
- $$Z^{[1]}=w^{[1]}A^{[0]}+b^{[1]}, A^{[1]}=g^{[1]}(Z^{[1]})$$ $$A^{[0]}=X$$
- $$Z^{[2]}=w^{[2]}A^{[1]}+b^{[2]}, A^{[2]}=g^{[2]}(Z^{[2]})$$
- $$\hat{Y}=g(Z^{[4]})=A^{[4]}$$
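- A sketch of the forward pass as a loop over layers, assuming ReLU hidden layers and a sigmoid output (the list-of-matrices representation and names are my own):
```python
import numpy as np

def l_layer_forward(X, Ws, bs):
    """Forward propagation through L layers.
    Ws[l] has shape (n^[l+1], n^[l]); bs[l] has shape (n^[l+1], 1)."""
    A = X                                   # A^[0] = X
    caches = []
    L = len(Ws)
    for l in range(L):
        Z = np.dot(Ws[l], A) + bs[l]        # Z^[l] = W^[l] A^[l-1] + b^[l]
        caches.append((A, Z))               # cache for backprop
        if l == L - 1:
            A = 1.0 / (1.0 + np.exp(-Z))    # sigmoid at the output layer
        else:
            A = np.maximum(0, Z)            # ReLU in hidden layers
    return A, caches
```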
- Matrix dimensions
- $$z^{[1]}=w^{[1]}\cdot x+b^{[1]}$$
- $$z^{[1]}=(3,1),w^{[1]}=(3,2),x=(2,1),b^{[1]}=(3,1)$$
- $$z^{[1]}=(n^{[1]},1),w^{[1]}=(n^{[1]},n^{[0]}),x=(n^{[0]},1),b^{[1]}=(n^{[1]},1)$$
- $$w^{[l]}/dw^{[l]}=(n^{[l]},n^{[l-1]}),b^{[l]}/db^{[l]}=(n^{[l]},1)$$
- $$z^{[l]},a^{[l]}: (n^{[l]},1)$$; vectorized: $$Z^{[l]}/dZ^{[l]},A^{[l]}/dA^{[l]}: (n^{[l]},m)$$; for $$l=0$$, $$A^{[0]}=X=(n^{[0]},m)$$
- Earlier layers learn simple features; deeper layers compose them to detect more complex patterns.
- Circuit theory and deep learning: Informally: There are functions you can compute with a "small" L-layer deep neural network that shallower networks require exponentially more hidden units to compute.
1.4.3 Building Blocks of Deep Neural Networks
- Forward and backward functions
- Layer l:w^{[l]},b^{[l]}
- Forward: Input a^{[l-1]}, output a^{[l]}
- $$z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}$$ $$cache\space z^{[l]}$$
- $$a^{[l]}=g^{[l]}(z^{[l]})$$
- Backward: Input da^{[l]}, cache(z^{[l]}), output da^{[l-1]},dw^{[l]},db^{[l]}
- Forward propagation for layer l
- Input a^{[l-1]}, output a^{[l]},cache\space (z^{[l]})
- $$z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}$$
- $$a^{[l]}=g^{[l]}(z^{[l]})$$
- Vectorized:
- $$Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]}$$
- $$A^{[l]}=g^{[l]}(Z^{[l]})$$
- Input a^{[l-1]}, output a^{[l]},cache\space (z^{[l]})
- Backward propagation for layer l
- Input da^{[l]}, cache(z^{[l]}), output da^{[l-1]},dw^{[l]},db^{[l]}
- $$dz^{[l]}=da^{[l]}*g'^{[l]}(z^{[l]})$$
- $$dw^{[l]}=dz^{[l]}\cdot a^{[l-1]T}$$
- $$db^{[l]}=dz^{[l]}$$
- $$da^{[l-1]}=w^{[l]T}\cdot dz^{[l]}$$
- Vectorized:
- $$dZ^{[l]}=dA^{[l]}*g'^{[l]}(Z^{[l]})$$
- $$dW^{[l]}=\dfrac{1}{m}dZ^{[l]}A^{[l-1]T}$$
- $$db^{[l]}=\dfrac{1}{m}np.sum(dZ^{[l]},axis=1,keepdims=True)$$
- $$dA^{[l-1]}=W^{[l]T}\cdot dZ^{[l]}$$
- Input da^{[l]}, cache(z^{[l]}), output da^{[l-1]},dw^{[l]},db^{[l]}
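- A sketch of the backward step for one layer, assuming the forward pass cached $$(A^{[l-1]}, W^{[l]}, Z^{[l]})$$ (the cache layout and names are my own, not the course's assignment code):
```python
import numpy as np

def layer_backward(dA, cache, activation_prime):
    """Backward step for one layer, following the vectorized equations above.
    cache = (A_prev, W, Z) stored during the forward pass."""
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = dA * activation_prime(Z)                 # dZ^[l] = dA^[l] * g'^[l](Z^[l])
    dW = np.dot(dZ, A_prev.T) / m                 # dW^[l]
    db = np.sum(dZ, axis=1, keepdims=True) / m    # db^[l]
    dA_prev = np.dot(W.T, dZ)                     # dA^[l-1]
    return dA_prev, dW, db
```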
1.4.4 Parameters vs. Hyperparameters
- Parameters: W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]},...
- Hyperparameters (will affect/control/determine parameters):
- learning rate \alpha
- # iterations
- # of hidden units n^{[1]},n^{[2]},...
- # of hidden layers
- Choice of activation function
- Later: momentum, minibatch size, regularization parameters,...
2. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization
2.1 Practical Aspects of Deep Learning
2.1.1 Train / Dev / Test sets
- With very large datasets, the dev/test sets may only need 1% of the data or even less.
- Mismatch: make sure the dev and test sets come from the same distribution.
- Not having a test set might be okay (only a dev set).
2.1.2 Bias / Variance
- Assume optimal (Bayes) error: \approx0\%
- High bias (underfitting): the model does not fit even the training set well.
- Training set error 15\%, Dev set error 16\%.
- Training set error 15\%, Dev set error 30\% (high bias and high variance).
- "Just right": low training error, with dev error close to it.
- Training set error 0.5\%, Dev set error 1\%.
- High variance (overfitting): the model fits the training set far better than the dev set.
- Training set error 1\%, Dev set error 11\%.
- Training set error 15\%, Dev set error 30\% (high bias and high variance).
2.1.3 Basic Recipe for Machine Learning
2.1.3.1 Basic Recipe
- High bias? (training set performance)
- Bigger network
- Train longer
- (NN architecture search)
- High variance (dev set performance)
- More data
- Regularization
- (NN architecture search)
2.1.3.2 Regularization
- Logistic regression. \min\limits_{w,b}J(w,b)
- $$w\in\mathbb{R}^{n_x}, b\in\mathbb{R}$$
- $$\lambda=regularization\space parameter$$
- $$J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})+\dfrac{\lambda}{2m}||w||_2^2$$
- L2 regularization: $$||w||_2^2=\sum\limits_{j=1}^{n_x}w_j^2=w^Tw$$
- L1 regularization \dfrac{\lambda}{2m}\sum\limits_{j=1}^{n_x}|w_j|=\dfrac{\lambda}{2m}||w||_1
- With L1 regularization, $$w$$ ends up sparse (many zeros in it); this only helps a little.
- Neural network
- $$J(w^{[1]},b^{[1]},...,w^{[L]},b^{[L]})=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})+\dfrac{\lambda}{2m}\sum\limits_{l=1}^{L}||w^{[l]}||_F^2$$
- $$||w^{[l]}||_F^2=\sum\limits_{i=1}^{n^{[l]}}\sum\limits_{j=1}^{n^{[l-1]}}(w_{ij}^{[l]})^2$$, where $$w^{[l]}$$ has shape $$(n^{[l]},n^{[l-1]})$$
- Frobenius norm: the square root of the sum of the squares of all elements of a matrix.
- $$dw^{[l]}=(from\space backprop)+\dfrac{\lambda}{m}w^{[l]}$$
- $$w^{[l]}:=w^{[l]}-\alpha dw^{[l]}$$ (the update rule itself is unchanged)
- Weight decay
- $$w^{[l]}:=w^{[l]}-\alpha[(from\space backprop)+\dfrac{\lambda}{m}w^{[l]}]$$
- =w^{[l]}-\dfrac{\alpha\lambda}{m}w^{[l]}-\alpha(from\space backprop)
- =(1-\dfrac{\alpha\lambda}{m})w^{[l]}-\alpha(from\space backprop)
- How regularization prevents overfitting: a bigger $$\lambda$$ makes $$w^{[l]}$$ smaller, so $$z^{[l]}$$ is smaller and stays in the nearly linear region of the activation function (take tanh as an example). A nearly linear network has a hard time drawing curved decision boundaries.
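- A sketch of L2 regularization in code: the cost with the Frobenius penalty and the weight-decay update (function names are illustrative):
```python
import numpy as np

def cost_with_l2(AL, Y, Ws, lambd):
    """Cross-entropy cost plus the L2 (Frobenius) penalty; Ws is a list of
    weight matrices, AL the output activations, Y the labels (1, m)."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    l2 = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in Ws)
    return cross_entropy + l2

def update_with_weight_decay(W, dW_backprop, alpha, lambd, m):
    # dW = (from backprop) + (lambda/m) * W, then the usual update,
    # which is equivalent to shrinking W by (1 - alpha*lambda/m) first.
    dW = dW_backprop + (lambd / m) * W
    return W - alpha * dW
```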
- Dropout regularization
- Implementing dropout("Inverted dropout")
- Illustrate with layer l=3, keep_prob = 0.8 (each unit has a 0.2 chance of being dropped / zeroed out)
- $$d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep\_prob$$ # d3 is a matrix with the same shape as a3, with True (1) / False (0) entries
- $$a3 = np.multiply(a3, d3)$$ # a3 *= d3; this zeroes out the dropped neurons
- $$a3 /= keep\_prob$$ # inverted dropout: keeps the expected total activation the same before and after dropout
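- The same inverted-dropout steps wrapped into a reusable sketch (function name is mine):
```python
import numpy as np

def dropout_forward(A, keep_prob=0.8):
    """Inverted dropout on one layer's activations A (shape (n^[l], m)).
    Returns the dropped-out activations and the mask needed for backprop."""
    D = (np.random.rand(*A.shape) < keep_prob)   # mask: True with probability keep_prob
    A = np.multiply(A, D)                        # zero out the dropped units
    A = A / keep_prob                            # scale up to keep E[A] unchanged
    return A, D
```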
- Why it works: a unit can't rely on any one feature, so it has to spread out the weights (an effect similar to shrinking the weights).
- First make sure $$J$$ is decreasing over iterations with dropout turned off, then turn dropout on.
- Data augmentation for images: random crops, flips, distortions, ...
- Early stopping: stop training while $$||w||_F^2$$ is still mid-size.
- Downside: it couples optimizing the cost function with trying not to overfit.
- It is cleaner to handle one task at a time: first optimize the cost function, then separately address overfitting.
2.1.3.3 Setting up your optimization problem
- Normalizing training sets
- $$x=\begin{bmatrix}x_1 \\ x_2\end{bmatrix}$$
- Subtract mean:
- $$\mu=\dfrac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}$$
- $$x:=x-\mu$$
- Normalize variance:
- $$\sigma^2=\dfrac{1}{m}\sum\limits_{i=1}^{m}(x^{(i)})^2$$ (element-wise squaring, applied after the mean has been subtracted)
- $$x/=\sigma^2$$
- Use same \mu,\sigma^2 to normalize test set.
- Why normalize inputs?
- When input features are on very different scales, normalization makes the cost function more symmetric, so gradient descent can use a larger learning rate and converges faster.
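- A sketch of input normalization that reuses the training-set statistics on the test set (here I divide by the standard deviation $$\sqrt{\sigma^2}$$, the usual convention; the small epsilon is my addition for numerical safety):
```python
import numpy as np

def normalize(X_train, X_test):
    """Zero-mean, unit-variance normalization; the test set must reuse the
    training set's mu and sigma^2. X_*: (n_x, m)."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma2 = np.var(X_train, axis=1, keepdims=True)
    X_train = (X_train - mu) / np.sqrt(sigma2 + 1e-8)
    X_test = (X_test - mu) / np.sqrt(sigma2 + 1e-8)
    return X_train, X_test
```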
- Vanishing/exploding gradients: if each $$w^{[l]}$$ is slightly larger than $$I$$, activations and gradients grow exponentially with depth (exploding).
- If each $$w^{[l]}$$ is slightly smaller than $$I$$, they shrink exponentially with depth (vanishing).
- Partial solution through careful weight initialization: the larger $$n$$ (the number of inputs to a neuron), the smaller each $$w_i$$ should be.
- Set $$Var(w_i)=\dfrac{1}{n}$$ for sigmoid/tanh, or $$\dfrac{2}{n}$$ for ReLU. (The variance multiplier can be treated as a hyperparameter, but tuning it is usually not worthwhile.)
- $$w^{[l]}=np.random.randn(shapeOfMatrix)*np.sqrt(\dfrac{1}{n^{[l-1]}})$$ ReLU: $$\dfrac{2}{n^{[l-1]}}$$
- Xavier initialization: $$\sqrt{\dfrac{1}{n^{[l-1]}}}$$; sometimes $$\sqrt{\dfrac{2}{n^{[l-1]}+n^{[l]}}}$$ is used.
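- A sketch of He/Xavier-style initialization for one layer (function name and the `activation` switch are my own):
```python
import numpy as np

def initialize_layer(n_prev, n_curr, activation="relu"):
    """He initialization for ReLU layers (variance 2/n^[l-1]) and
    Xavier-style initialization otherwise (variance 1/n^[l-1])."""
    if activation == "relu":
        W = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)
    else:
        W = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)
    b = np.zeros((n_curr, 1))
    return W, b
```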
- Gradient checking: numerically approximate a derivative with the two-sided difference $$\dfrac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}$$
- Take W^{[1]},b^{[1]},...,W^{[L]},b^{[L]} and reshape into a big vector \theta.
- Take dW^{[1]},db^{[1]},...,dW^{[L]},db^{[L]} and reshape into a big vector d\theta.
- for each i:
- $$d\theta_{approx}[i]=\dfrac{J(\theta_1,\theta_2,...,\theta_i+\epsilon,...)-J(\theta_1,\theta_2,...,\theta_i-\epsilon,...)}{2\epsilon}\approx d\theta[i]=\dfrac{\partial J}{\partial \theta_i}$$
- Check the relative distance $$\dfrac{||d\theta_{approx}-d\theta||_2}{||d\theta_{approx}||_2+||d\theta||_2}$$ ($$||\cdot||_2$$ is the Euclidean norm: the square root of the sum of the squared elements).
- With $$\epsilon=10^{-7}$$: if the ratio is $$\approx10^{-7}$$ or smaller, great.
- If it is around $$10^{-5}$$, double-check the components.
- If it is $$10^{-3}$$ or bigger, worry: there is probably a bug. Check which components $$d\theta_{approx}[i]$$ differ most from $$d\theta[i]$$.
- notes:
- Don't use in training - only to debug.
- If algorithm fails grad check, look at components to try to identify bug.
- Remember regularization (include the $$\dfrac{\lambda}{2m}$$ term in both $$J$$ and the gradients).
- Doesn't work with dropout (dropout is random; run grad check with dropout turned off).
- Run at random initialization, and perhaps again after some training (some bugs only show up once $$w,b$$ move away from $$\approx0$$).
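- A sketch of gradient checking, assuming the parameters and gradients have already been flattened into column vectors (names are mine):
```python
import numpy as np

def grad_check(J, theta, dtheta, epsilon=1e-7):
    """Numerical gradient check. J: a function of the flattened parameter
    vector theta; dtheta: the backprop gradient reshaped the same way."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        theta_plus = np.copy(theta);  theta_plus[i]  += epsilon
        theta_minus = np.copy(theta); theta_minus[i] -= epsilon
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    diff = (np.linalg.norm(dtheta_approx - dtheta) /
            (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)))
    return diff   # ~1e-7 is great; ~1e-5 check; >=1e-3 probably a bug
```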