Deep Learning - Study Notes
ZL Asica • 2023-11-18
These notes are based on https://www.coursera.org/specializations/deep-learning
LaTeX may not render correctly in some places.
1.1 Introduction to Deep Learning
1.1.1 Supervised Learning with Deep Learning
- Structured data: tabular data (e.g., database records with well-defined features).
- Unstructured data: audio, images, text.
1.1.2 Scale drives deep learning progress
- The more data available, the better a large neural network performs compared to a smaller network or traditional learning algorithms.
- Switching the activation from sigmoid to ReLU makes gradient descent much faster, because the gradient no longer saturates toward 0 for large inputs.
1.2 Basics of Neural Network Programming
1.2.1 Binary Classification
- Input: $x \in \mathbb{R}^{n_x}$
- Output: $y \in \{0, 1\}$
1.2.2 Logistic Regression
- Given $x$, want $\hat{y} = P(y = 1 \mid x)$
- Input: $x \in \mathbb{R}^{n_x}$
- Parameters: $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$
- Output: $\hat{y} = \sigma(w^T x + b)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$
    - If $z$ is large, $\sigma(z) \approx 1$
    - If $z$ is a large negative number, $\sigma(z) \approx 0$
- Loss (error) function: $\mathcal{L}(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big)$
    - Want $\mathcal{L}(\hat{y}, y)$ to be as small as possible.
    - If $y = 1$: $\mathcal{L} = -\log \hat{y}$ <- want $\log \hat{y}$ as large as possible, i.e., want $\hat{y}$ large (close to 1).
    - If $y = 0$: $\mathcal{L} = -\log(1 - \hat{y})$ <- want $\log(1 - \hat{y})$ as large as possible, i.e., want $\hat{y}$ small (close to 0).
- Cost function: $J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
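A minimal NumPy sketch of the forward pass and loss above (the helper names `sigmoid`, `predict`, and `logistic_loss` are my own, not from the course):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    """Forward pass: y_hat = sigma(w^T x + b)."""
    return sigmoid(np.dot(w.T, x) + b)

def logistic_loss(y_hat, y):
    """L(y_hat, y) = -(y log y_hat + (1 - y) log(1 - y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Tiny example with n_x = 3 features.
w = np.array([[0.1], [-0.2], [0.3]])   # (n_x, 1)
b = 0.0
x = np.array([[1.0], [2.0], [3.0]])    # (n_x, 1)
y = 1
y_hat = predict(w, b, x)               # (1, 1)
print(y_hat.item(), logistic_loss(y_hat, y).item())
```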
1.2.3 Gradient Descent
- Repeat: $w := w - \alpha \frac{\partial J(w, b)}{\partial w}$, $b := b - \alpha \frac{\partial J(w, b)}{\partial b}$
    - $\alpha$: learning rate
    - On the right side of the minimum the derivative is positive, so $w$ decreases; on the left side the derivative is negative, so $w$ increases. Either way, $w$ moves toward the minimum.
- Logistic regression gradient descent (single example, computation graph view)
    - $z = w^T x + b \rightarrow a = \sigma(z) \rightarrow \mathcal{L}(a, y)$
    - $da = \frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}$
    - $dz = \frac{\partial \mathcal{L}}{\partial z} = a - y$
    - $dw_1 = x_1\, dz$, $dw_2 = x_2\, dz$ (for $n = 2$), $db = dz$
- Gradient descent on $m$ examples (see the sketch after this list)
    - $J = 0$, $dw_1 = 0$, $dw_2 = 0$, $db = 0$
    - for $i = 1$ to $m$:
        - $z^{(i)} = w^T x^{(i)} + b$, $a^{(i)} = \sigma(z^{(i)})$
        - $J$ += $-\big(y^{(i)}\log a^{(i)} + (1 - y^{(i)})\log(1 - a^{(i)})\big)$
        - $dz^{(i)} = a^{(i)} - y^{(i)}$
        - $dw_1$ += $x_1^{(i)} dz^{(i)}$, $dw_2$ += $x_2^{(i)} dz^{(i)}$ (for $n = 2$), $db$ += $dz^{(i)}$
    - $J$ /= $m$, $dw_1$ /= $m$, $dw_2$ /= $m$, $db$ /= $m$
    - $w_1 := w_1 - \alpha\, dw_1$, $w_2 := w_2 - \alpha\, dw_2$, $b := b - \alpha\, db$
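A loop-based NumPy sketch of one gradient-descent step over $m$ examples, following the pseudocode above (the function name `gradient_step_loop` is my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step_loop(w, b, X, Y, alpha):
    """One gradient-descent step, looping explicitly over the m examples.
    X: (n_x, m), Y: (1, m), w: (n_x, 1), b: scalar."""
    n_x, m = X.shape
    J = 0.0
    dw = np.zeros((n_x, 1))
    db = 0.0
    for i in range(m):
        x_i = X[:, i]                       # (n_x,)
        y_i = Y[0, i]
        z_i = np.dot(w[:, 0], x_i) + b      # scalar
        a_i = sigmoid(z_i)
        J += -(y_i * np.log(a_i) + (1 - y_i) * np.log(1 - a_i))
        dz_i = a_i - y_i
        dw[:, 0] += x_i * dz_i
        db += dz_i
    J /= m; dw /= m; db /= m
    w = w - alpha * dw
    b = b - alpha * db
    return w, b, J
```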
1.2.4 Computational Graph
- Example: $J(a, b, c) = 3(a + bc)$, computed step by step as $u = bc$, $v = a + u$, $J = 3v$.
- A forward (left-to-right) pass through the graph computes the output $J$.
- Derivatives with a computation graph
    - A backward (right-to-left) pass computes the derivative of $J$ with respect to each intermediate variable and input.
    - Chain rule: e.g., $\frac{dJ}{da} = \frac{dJ}{dv}\cdot\frac{dv}{da}$ and $\frac{dJ}{db} = \frac{dJ}{dv}\cdot\frac{dv}{du}\cdot\frac{du}{db}$.
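A tiny numeric sketch of the forward and backward passes for the example graph above (the values $a=5$, $b=3$, $c=2$ are my own assumption for illustration; derivative names mirror the $dJ/dv$ notation):

```python
# Forward pass (left to right) for J = 3 * (a + b * c)
a, b, c = 5.0, 3.0, 2.0
u = b * c          # u = 6
v = a + u          # v = 11
J = 3 * v          # J = 33

# Backward pass (right to left) via the chain rule
dJ_dv = 3.0                # dJ/dv
dJ_da = dJ_dv * 1.0        # dv/da = 1
dJ_du = dJ_dv * 1.0        # dv/du = 1
dJ_db = dJ_du * c          # du/db = c
dJ_dc = dJ_du * b          # du/dc = b
print(J, dJ_da, dJ_db, dJ_dc)   # 33.0 3.0 6.0 9.0
```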
1.2.5 Vectorization
- Whenever possible, avoid explicit for-loops.
- Non-vectorized: $z = 0$; for $i = 1$ to $n_x$: $z$ += $w_i x_i$; then $z$ += $b$.
- Vectorized: `z = np.dot(w.T, x) + b`; here `b` is a single real number (a (1, 1) value) that gets broadcast.
- Vectorizing logistic regression (see the sketch after this list)
    - Gets rid of the $dw$ and $db$ accumulations in the for loop.
    - New form: $Z = w^T X + b$ = `np.dot(w.T, X) + b`, $A = \sigma(Z)$
    - $dZ = A - Y$, $dw = \frac{1}{m}X\,dZ^T$, $db = \frac{1}{m}$`np.sum(dZ)`
    - $w := w - \alpha\, dw$, $b := b - \alpha\, db$
- Broadcasting (same as bsxfun in Matlab/Octave)
    - $(m, n)$ matrix $+\,-\,*\,/$ a $(1, n)$ row vector: the row is copied $m$ times, so every row sees the same numbers.
    - $(m, n)$ matrix $+\,-\,*\,/$ an $(m, 1)$ column vector: the column is copied $n$ times, so every column sees the same numbers.
- Don't use "rank 1 arrays"
    - Use `np.random.randn(5, 1)` or `np.random.randn(1, 5)` instead of `np.random.randn(5)`.
    - Check with `assert(a.shape == (5, 1))`.
    - Fix a rank 1 array with `a = a.reshape((5, 1))`.
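A vectorized NumPy sketch of one logistic-regression gradient step over all $m$ examples at once, matching the formulas above, plus a small broadcasting demo (the function name `gradient_step_vectorized` is my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step_vectorized(w, b, X, Y, alpha):
    """X: (n_x, m), Y: (1, m), w: (n_x, 1), b: scalar. No explicit for-loops."""
    m = X.shape[1]
    Z = np.dot(w.T, X) + b              # (1, m); scalar b is broadcast
    A = sigmoid(Z)                      # (1, m)
    J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dZ = A - Y                          # (1, m)
    dw = np.dot(X, dZ.T) / m            # (n_x, 1)
    db = np.sum(dZ) / m                 # scalar
    return w - alpha * dw, b - alpha * db, J

# Broadcasting example: (3, 4) matrix + (1, 4) row vector.
M = np.zeros((3, 4))
row = np.array([[1.0, 2.0, 3.0, 4.0]])
print((M + row).shape)                  # (3, 4): the row is copied down 3 times
```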
- Logistic regression cost function (why this loss?)
    - Loss: interpret $\hat{y} = P(y = 1 \mid x)$.
        - If $y = 1$: $p(y \mid x) = \hat{y}$
        - If $y = 0$: $p(y \mid x) = 1 - \hat{y}$
        - Combined: $p(y \mid x) = \hat{y}^{\,y}(1 - \hat{y})^{1 - y}$, so $\log p(y \mid x) = y\log\hat{y} + (1 - y)\log(1 - \hat{y}) = -\mathcal{L}(\hat{y}, y)$.
    - Cost: use maximum likelihood estimation (MLE). Assuming i.i.d. examples, maximizing $\sum_i \log p(y^{(i)} \mid x^{(i)})$ is the same as minimizing $J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$.
1.3 Shallow Neural Networks
1.3.1 Neural Network Representation
- Input layer, hidden layer, output layer
    - $x \rightarrow a^{[1]} \rightarrow a^{[2]} = \hat{y}$
    - Layers are counted as (# of hidden layers) + (# of output layers); the input layer is "layer 0" and is not counted.
- One hidden layer in detail: $x \rightarrow z^{[1]} = W^{[1]}x + b^{[1]} \rightarrow a^{[1]} = \sigma(z^{[1]})$
    - First hidden node: $z^{[1]}_1 = w^{[1]T}_1 x + b^{[1]}_1$, $a^{[1]}_1 = \sigma(z^{[1]}_1)$
    - Second hidden node: $z^{[1]}_2 = w^{[1]T}_2 x + b^{[1]}_2$, $a^{[1]}_2 = \sigma(z^{[1]}_2)$
    - Third hidden node: $z^{[1]}_3 = w^{[1]T}_3 x + b^{[1]}_3$, $a^{[1]}_3 = \sigma(z^{[1]}_3)$
    - Fourth hidden node: $z^{[1]}_4 = w^{[1]T}_4 x + b^{[1]}_4$, $a^{[1]}_4 = \sigma(z^{[1]}_4)$
- Vectorization (see the sketch after this list)
    - $a^{[l](i)}$: layer $l$, training example $i$.
    - for $i = 1$ to $m$: $z^{[1](i)} = W^{[1]}x^{(i)} + b^{[1]}$, $a^{[1](i)} = \sigma(z^{[1](i)})$, $z^{[2](i)} = W^{[2]}a^{[1](i)} + b^{[2]}$, $a^{[2](i)} = \sigma(z^{[2](i)})$
    - Vectorizing the above for loop: $Z^{[1]} = W^{[1]}X + b^{[1]}$, $A^{[1]} = \sigma(Z^{[1]})$, $Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$, $A^{[2]} = \sigma(Z^{[2]})$
    - Each row of $Z^{[1]}$/$A^{[1]}$ is a different hidden unit; horizontally: training examples; vertically: hidden units.
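A NumPy sketch of the vectorized forward pass for this 2-layer network (the layer sizes, the tanh hidden activation, and the helper name `forward_2layer` are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_2layer(X, W1, b1, W2, b2):
    """X: (n_x, m). Columns are training examples; rows of Z/A are hidden units."""
    Z1 = np.dot(W1, X) + b1      # (n_1, m)
    A1 = np.tanh(Z1)             # hidden-layer activation
    Z2 = np.dot(W2, A1) + b2     # (1, m)
    A2 = sigmoid(Z2)             # output layer: probabilities
    return A1, A2

# Shapes for n_x = 3 inputs, 4 hidden units, m = 5 examples.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
W1, b1 = rng.standard_normal((4, 3)) * 0.01, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1))
A1, A2 = forward_2layer(X, W1, b1, W2, b2)
print(A1.shape, A2.shape)        # (4, 5) (1, 5)
```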
1.3.2 Activation Functions
- $g^{[l]}$: activation function of layer $l$, $a^{[l]} = g^{[l]}(z^{[l]})$
    - Sigmoid: $a = \frac{1}{1 + e^{-z}}$
    - Tanh: $a = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
    - ReLU: $a = \max(0, z)$
    - Leaky ReLU: $a = \max(0.01z, z)$
- Rules to choose an activation function
    - If the output is between {0, 1} (binary classification), choose sigmoid for the output layer.
    - Otherwise, ReLU is the default choice for hidden layers.
- Why a non-linear activation function is needed (see the sketch after this list)
    - With linear hidden layers, having multiple hidden layers is useless: the composition collapses to a single linear function $a = W'x + b'$.
    - A linear activation may sometimes be used at the output layer (e.g., regression), but the hidden layers should still be non-linear.
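A short NumPy sketch of the four activation functions plus a check that two stacked linear layers collapse into one (all variable names here are my own):

```python
import numpy as np

def sigmoid(z):    return 1.0 / (1.0 + np.exp(-z))
def tanh(z):       return np.tanh(z)
def relu(z):       return np.maximum(0.0, z)
def leaky_relu(z): return np.maximum(0.01 * z, z)

# Two *linear* layers collapse into a single linear map W' x + b'.
rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x = rng.standard_normal((3, 1))
two_linear_layers = W2 @ (W1 @ x + b1) + b2
W_prime, b_prime = W2 @ W1, W2 @ b1 + b2
print(np.allclose(two_linear_layers, W_prime @ x + b_prime))   # True
```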
1.3.3 Forward and Backward Propagation
- Derivatives of the activation functions
    - Sigmoid: $g'(z) = a(1 - a)$
    - Tanh: $g'(z) = 1 - a^2$
    - ReLU: $g'(z) = 0$ if $z < 0$, $1$ if $z \geq 0$
    - Leaky ReLU: $g'(z) = 0.01$ if $z < 0$, $1$ if $z \geq 0$
- Gradient descent for neural networks (one hidden layer); a sketch of one full iteration follows this list.
    - Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$
    - Cost function: $J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
    - Forward propagation: $Z^{[1]} = W^{[1]}X + b^{[1]}$, $A^{[1]} = g^{[1]}(Z^{[1]})$, $Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$, $A^{[2]} = g^{[2]}(Z^{[2]}) = \sigma(Z^{[2]})$
    - Back propagation: $dZ^{[2]} = A^{[2]} - Y$, $dW^{[2]} = \frac{1}{m}dZ^{[2]}A^{[1]T}$, $db^{[2]} = \frac{1}{m}$`np.sum(dZ2, axis=1, keepdims=True)`, $dZ^{[1]} = W^{[2]T}dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$, $dW^{[1]} = \frac{1}{m}dZ^{[1]}X^{T}$, $db^{[1]} = \frac{1}{m}$`np.sum(dZ1, axis=1, keepdims=True)`
- Random initialization
    - Initializing all weights to 0 makes every hidden unit compute the same function (the symmetry is never broken), so initialize $W^{[1]}$, $W^{[2]}$ randomly with small values, e.g. `np.random.randn(shape) * 0.01`; $b$ can be initialized to 0.
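A NumPy sketch of one full gradient-descent iteration for this one-hidden-layer network, combining the forward and backward formulas above (the function name `one_iteration` and the tanh hidden activation are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_iteration(params, X, Y, alpha=0.01):
    """One forward + backward pass and parameter update.
    X: (n_x, m), Y: (1, m); hidden activation g1 = tanh, output g2 = sigmoid."""
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
    m = X.shape[1]

    # Forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    J = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

    # Back propagation
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)       # tanh'(z) = 1 - a^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient-descent update
    params = {"W1": W1 - alpha * dW1, "b1": b1 - alpha * db1,
              "W2": W2 - alpha * dW2, "b2": b2 - alpha * db2}
    return params, J

# Random (small) initialization breaks symmetry between hidden units.
rng = np.random.default_rng(0)
params = {"W1": rng.standard_normal((4, 3)) * 0.01, "b1": np.zeros((4, 1)),
          "W2": rng.standard_normal((1, 4)) * 0.01, "b2": np.zeros((1, 1))}
X, Y = rng.standard_normal((3, 8)), (rng.random((1, 8)) > 0.5).astype(float)
params, J = one_iteration(params, X, Y)
print(J)
```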
1.4 Deep Neural Networks
1.4.1 Deep L-Layer Neural Network
- Deep neural network notation
    - $L$ = # of layers; $n^{[l]}$ = # of units in layer $l$.
    - $a^{[l]} = g^{[l]}(z^{[l]})$ = activations in layer $l$; $W^{[l]}$, $b^{[l]}$ = parameters for $z^{[l]}$; $a^{[0]} = x$, $a^{[L]} = \hat{y}$.
1.4.2 Forward Propagation in a Deep Network
- General rule for layer $l$: $z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$, $a^{[l]} = g^{[l]}(z^{[l]})$
    - Start from $a^{[0]} = x$, apply the rule for $l = 1, 2, \dots, L$; $\hat{y} = a^{[L]}$.
- Vectorized: $Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$, $A^{[l]} = g^{[l]}(Z^{[l]})$, with $A^{[0]} = X$. An explicit for loop over the layers $l = 1, \dots, L$ is unavoidable here and is fine. (See the sketch after this list.)
- Matrix dimensions
    - $W^{[l]}$ and $dW^{[l]}$: $(n^{[l]}, n^{[l-1]})$; $b^{[l]}$ and $db^{[l]}$: $(n^{[l]}, 1)$.
    - $Z^{[l]}$, $A^{[l]}$ (and $dZ^{[l]}$, $dA^{[l]}$): $(n^{[l]}, m)$; for a single example, $z^{[l]}$, $a^{[l]}$: $(n^{[l]}, 1)$.
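A NumPy sketch of vectorized forward propagation over $L$ layers, with shape checks matching the dimension rules above (the parameter-dictionary layout and the name `L_layer_forward` are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def L_layer_forward(X, parameters):
    """ReLU for layers 1..L-1, sigmoid for layer L. X: (n_0, m)."""
    L = len(parameters) // 2          # parameters holds W1, b1, ..., WL, bL
    A = X
    m = X.shape[1]
    for l in range(1, L + 1):
        W, b = parameters[f"W{l}"], parameters[f"b{l}"]
        assert W.shape[1] == A.shape[0] and b.shape == (W.shape[0], 1)
        Z = W @ A + b                 # (n_l, m)
        assert Z.shape == (W.shape[0], m)
        A = sigmoid(Z) if l == L else np.maximum(0.0, Z)
    return A

# Layer sizes [n_0, n_1, ..., n_L]: W_l is (n_l, n_{l-1}), b_l is (n_l, 1).
layer_dims = [3, 5, 4, 1]
rng = np.random.default_rng(0)
parameters = {}
for l in range(1, len(layer_dims)):
    parameters[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
    parameters[f"b{l}"] = np.zeros((layer_dims[l], 1))

X = rng.standard_normal((3, 7))
print(L_layer_forward(X, parameters).shape)   # (1, 7)
```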
- Why deep representations?
    - Earlier layers learn simple features; later, deeper layers put them together to detect more complex things.
    - Circuit theory and deep learning, informally: there are functions that a "small" L-layer deep neural network can compute that shallower networks require exponentially more hidden units to compute.
1.4.3 Building Blocks of Deep Neural Networks
- Forward and backward functions
    - For layer $l$: parameters $W^{[l]}$, $b^{[l]}$.
    - Forward: input $a^{[l-1]}$, output $a^{[l]}$, cache $z^{[l]}$ (along with $W^{[l]}$, $b^{[l]}$).
    - Backward: input $da^{[l]}$ (and the cache), output $da^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$.
- One iteration of gradient descent for the network: a forward pass through all layers, a backward pass through all layers, then update $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$, $b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$.
- How to implement? (see the sketch after this list)
    - Forward propagation for layer $l$
        - Input $a^{[l-1]}$, output $a^{[l]}$, cache $z^{[l]}$.
        - $z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$, $a^{[l]} = g^{[l]}(z^{[l]})$
        - Vectorized: $Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$, $A^{[l]} = g^{[l]}(Z^{[l]})$
    - Backward propagation for layer $l$
        - Input $da^{[l]}$, output $da^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$.
        - $dz^{[l]} = da^{[l]} * g^{[l]\prime}(z^{[l]})$, $dW^{[l]} = dz^{[l]}a^{[l-1]T}$, $db^{[l]} = dz^{[l]}$, $da^{[l-1]} = W^{[l]T}dz^{[l]}$
        - Vectorized: $dZ^{[l]} = dA^{[l]} * g^{[l]\prime}(Z^{[l]})$, $dW^{[l]} = \frac{1}{m}dZ^{[l]}A^{[l-1]T}$, $db^{[l]} = \frac{1}{m}$`np.sum(dZ_l, axis=1, keepdims=True)`, $dA^{[l-1]} = W^{[l]T}dZ^{[l]}$
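A NumPy sketch of the per-layer forward/backward building blocks described above, assuming a ReLU activation for the layer (the function names `layer_forward` and `layer_backward` are my own):

```python
import numpy as np

def layer_forward(A_prev, W, b):
    """Forward for one layer with ReLU: input A_prev, output A, cache (A_prev, W, Z)."""
    Z = W @ A_prev + b
    A = np.maximum(0.0, Z)                 # g(z) = ReLU
    cache = (A_prev, W, Z)
    return A, cache

def layer_backward(dA, cache):
    """Backward for one layer: input dA (and cache), output dA_prev, dW, db."""
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = dA * (Z > 0)                      # g'(z) for ReLU: 1 if z > 0 else 0
    dW = dZ @ A_prev.T / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

# Shape check for one layer with 4 units, 3 inputs, m = 6 examples.
rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 6))
W, b = rng.standard_normal((4, 3)) * 0.01, np.zeros((4, 1))
A, cache = layer_forward(A_prev, W, b)
dA_prev, dW, db = layer_backward(rng.standard_normal(A.shape), cache)
print(A.shape, dA_prev.shape, dW.shape, db.shape)   # (4, 6) (3, 6) (4, 3) (4, 1)
```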
1.4.4 Parameters vs. Hyperparameters
- Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, \dots$
- Hyperparameters (they affect/control/determine the parameters):
    - Learning rate $\alpha$
    - # of iterations
    - # of hidden units $n^{[l]}$
    - # of hidden layers $L$
    - Choice of activation function
- Later: momentum, mini-batch size, regularization parameters, ...
2.1 Practical Aspects of Deep Learning
2.1.1 Train / Dev / Test sets
- With big data, the dev/test sets may need only 1% of the data or even less.
- Mismatched distributions: make sure the dev and test sets come from the same distribution.
- Not having a test set might be okay (only a dev set).
2.1.2 Bias / Variance
- Assume the optimal (Bayes) error is roughly 0%.
- High bias (underfitting): the model cannot classify the training examples the way we want.
    - Training set error is high, dev set error is about as high.
    - Training set error is high, dev set error is much higher still (high bias and high variance at the same time).
- "Just right": the model classifies well and generalizes.
    - Training set error is low, dev set error is only slightly higher.
- High variance (overfitting): the model fits the training set (almost) perfectly but does not generalize.
    - Training set error is low, dev set error is much higher.
2.1.3 Basic Recipe for Machine Learning
2.1.3.1 Basic Recipe
- High bias? (look at training data performance)
    - Bigger network
    - Train longer
    - (NN architecture search)
- High variance? (look at dev set performance)
    - More data
    - Regularization
    - (NN architecture search)
2.1.3.2 Regularization
- Logistic regression: $J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\lVert w \rVert_2^2$
    - L2 regularization: $\frac{\lambda}{2m}\lVert w \rVert_2^2 = \frac{\lambda}{2m}\sum_{j=1}^{n_x}w_j^2 = \frac{\lambda}{2m}w^T w$
    - L1 regularization: $\frac{\lambda}{2m}\lVert w \rVert_1 = \frac{\lambda}{2m}\sum_{j=1}^{n_x}\lvert w_j\rvert$
    - With L1, $w$ ends up sparse (lots of zeros in it), which only helps a little.
- Neural network
    - $J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\lVert W^{[l]}\rVert_F^2$
    - Frobenius norm: the square root of the sum of squares of all elements in a matrix, so $\lVert W^{[l]}\rVert_F^2 = \sum_i\sum_j\big(W^{[l]}_{ij}\big)^2$.
    - $dW^{[l]} = (\text{from backprop}) + \frac{\lambda}{m}W^{[l]}$; the update for $b^{[l]}$ keeps the same form.
    - Weight decay: $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]} = \big(1 - \frac{\alpha\lambda}{m}\big)W^{[l]} - \alpha(\text{backprop term})$, so each update shrinks $W^{[l]}$ by a factor slightly less than 1.
- How does regularization prevent overfitting? Bigger $\lambda$ -> smaller $W^{[l]}$ -> smaller $z^{[l]}$, which keeps the activation function in its nearly linear region (take tanh as an example). This makes the whole network closer to a linear model, so it becomes really hard for it to draw highly curved decision boundaries.
- Dropout regularization
    - Implementing dropout ("inverted dropout"), illustrated with layer 3 and `keep_prob = 0.8` (each unit has a 0.2 chance of being dropped / zeroed out); full sketch after this list.
        - `d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob` — a boolean mask with the same shape as `a3` (True = keep, False = drop).
        - `a3 = a3 * d3` — zero out the dropped neurons.
        - `a3 /= keep_prob` — inverted dropout: keeps the expected total activation the same before and after dropout.
    - Why it works: a unit can't rely on any one feature, so it has to spread out its weights (shrinks the weights).
    - First make sure $J$ is decreasing during iterations (with dropout turned off), then turn dropout on.
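A NumPy sketch of inverted dropout applied to one layer's activations, following the steps above (the function name `inverted_dropout` and its `training` flag are my own):

```python
import numpy as np

def inverted_dropout(a, keep_prob=0.8, training=True):
    """Apply inverted dropout to activations a (any shape).
    When training=False (test time) the activations pass through unchanged."""
    if not training:
        return a
    d = np.random.rand(*a.shape) < keep_prob   # True = keep, False = drop
    a = a * d                                  # zero out dropped neurons
    a = a / keep_prob                          # rescale so the expected activation is unchanged
    return a

a3 = np.ones((4, 5))
a3_dropped = inverted_dropout(a3, keep_prob=0.8)
print(a3_dropped)            # kept entries are scaled to 1 / 0.8 = 1.25, dropped are 0
print(a3_dropped.mean())     # on average close to 1.0, the pre-dropout mean
```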
- Data augmentation
    - Images: random crops, flips, slight distortions/rotations, ...
- Early stopping
    - Stop training where the dev set error starts going back up; the weights are then "mid-size" $\lVert W\rVert$, which acts like regularization.
    - Downside: it couples optimizing the cost function with trying not to overfit, handling both at the same time.
- Orthogonalization
    - Prefer to handle one concern at a time: first optimize the cost function, then separately deal with overfitting.
2.1.3.3 Setting up your optimization problem
- Normalizing training sets (see the sketch after this list)
    - Subtract the mean: $\mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}$, $x := x - \mu$.
    - Normalize the variance: $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\big(x^{(i)}\big)^2$ (element-wise squaring, `**` in NumPy), $x$ /= $\sigma$.
    - Use the same $\mu$ and $\sigma^2$ to normalize the test set.
- Why normalize inputs?
    - When the inputs are on very different scales, the cost function is elongated and gradient descent needs a small learning rate; normalizing makes the cost function more symmetric, so gradient descent converges much faster.
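A NumPy sketch of normalizing a training set and reusing the same statistics on the test set (the function name `normalize` is my own):

```python
import numpy as np

def normalize(X, mu=None, sigma=None):
    """Normalize features of X (n_x, m) to zero mean, unit variance per feature.
    Pass the training-set mu/sigma back in to normalize dev/test data the same way."""
    if mu is None:
        mu = np.mean(X, axis=1, keepdims=True)        # (n_x, 1)
        sigma = np.std(X, axis=1, keepdims=True)      # (n_x, 1)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(2, 100))
X_test = rng.normal(loc=5.0, scale=3.0, size=(2, 20))

X_train_norm, mu, sigma = normalize(X_train)
X_test_norm, _, _ = normalize(X_test, mu, sigma)       # reuse the training mu and sigma
print(X_train_norm.mean(axis=1), X_train_norm.std(axis=1))   # ~0 and ~1 per feature
```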
- Vanishing/exploding gradients
    - In a very deep network, weights just slightly larger than the identity (e.g., $1.5I$) make the activations/gradients grow exponentially with depth (exploding).
    - Weights just slightly smaller than the identity (e.g., $0.5I$) make them shrink exponentially with depth (vanishing).
- Weight initialization (single neuron view); see the sketch after this list.
    - Large $n$ (number of input features) -> want smaller $w_i$, so set $\text{Var}(w_i) = \frac{1}{n}$.
    - Sigmoid/tanh: $W^{[l]}$ = `np.random.randn(shape) * np.sqrt(1 / n^[l-1])`; ReLU: use $\frac{2}{n^{[l-1]}}$ instead (the variance scale can be treated as a hyperparameter, but DO NOT spend time tuning it).
    - Xavier initialization (for tanh): sometimes $\sqrt{\frac{2}{n^{[l-1]} + n^{[l]}}}$ is used instead of $\sqrt{\frac{1}{n^{[l-1]}}}$.
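A NumPy sketch of these initialization scales for an L-layer network (the function name `initialize_parameters` is my own; it uses the ReLU scale $\sqrt{2/n^{[l-1]}}$ by default):

```python
import numpy as np

def initialize_parameters(layer_dims, method="he", seed=0):
    """layer_dims = [n_0, n_1, ..., n_L].
    'he': sqrt(2 / n_prev) for ReLU; 'xavier': sqrt(1 / n_prev) for tanh/sigmoid."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        scale = np.sqrt(2.0 / n_prev) if method == "he" else np.sqrt(1.0 / n_prev)
        parameters[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * scale
        parameters[f"b{l}"] = np.zeros((n_curr, 1))
    return parameters

params = initialize_parameters([1024, 256, 64, 1], method="he")
print(params["W1"].std())   # roughly sqrt(2 / 1024) ~= 0.044
```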
- Numerical approximation of gradients
    - Two-sided difference: $g(\theta) \approx \frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}$, with error $O(\varepsilon^2)$; much better than the one-sided $\frac{f(\theta + \varepsilon) - f(\theta)}{\varepsilon}$ with error $O(\varepsilon)$.
- Gradient checking (grad check); see the sketch after this list.
    - Take $W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}$ and reshape them into one big vector $\theta$.
    - Take $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ and reshape them into one big vector $d\theta$.
    - for each $i$: $d\theta_{\text{approx}}[i] = \frac{J(\theta_1, \dots, \theta_i + \varepsilon, \dots) - J(\theta_1, \dots, \theta_i - \varepsilon, \dots)}{2\varepsilon}$
    - Check the relative Euclidean distance $\frac{\lVert d\theta_{\text{approx}} - d\theta\rVert_2}{\lVert d\theta_{\text{approx}}\rVert_2 + \lVert d\theta\rVert_2}$ ($\lVert\cdot\rVert_2$ is the Euclidean norm: the square root of the sum of the squares of the elements).
    - With $\varepsilon = 10^{-7}$: if the distance above is $10^{-7}$ or smaller, great.
    - If it is around $10^{-5}$, take a careful look.
    - If it is $10^{-3}$ or bigger, worry; there is probably a bug. Check which components $i$ of the approximation differ most from the real value.
    - Notes:
        - Don't run grad check in training - use it only to debug.
        - If the algorithm fails grad check, look at the individual components to try to identify the bug.
        - Remember regularization (include the regularization term in $J$).
        - Doesn't work with dropout (dropout is random; run grad check with dropout turned off, `keep_prob = 1`).
        - Run at random initialization, and perhaps again after some training (some bugs only show up once $w$, $b$ move away from values near 0).
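A NumPy sketch of gradient checking for a generic cost function of a flattened parameter vector $\theta$ (the function name `grad_check` and the quadratic toy cost are my own):

```python
import numpy as np

def grad_check(J, theta, dtheta, epsilon=1e-7):
    """Compare the analytic gradient dtheta against a two-sided numerical estimate.
    J: callable taking a flat parameter vector; theta, dtheta: flat (n,) vectors."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(dtheta_approx - dtheta)
    denominator = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return numerator / denominator

# Toy check: J(theta) = sum(theta^2) has gradient 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
J = lambda t: np.sum(t ** 2)
print(grad_check(J, theta, 2 * theta))        # ~1e-10: great
print(grad_check(J, theta, 2 * theta + 0.1))  # ~1e-2: probably a bug
```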
2.2 Optimization Algorithms
2.2.1 Mini-batch gradient descent
- Batch vs. mini-batch gradient descent
    - Batch gradient descent processes the whole training set, which may be huge (e.g., millions of examples), before taking a single step.
    - Split $X$ (shape $(n_x, m)$) and $Y$ (shape $(1, m)$) into mini-batches of, say, 1,000 examples each: $X^{\{1\}}, Y^{\{1\}}, \dots, X^{\{t\}}, Y^{\{t\}}, \dots$
    - Notation: $x^{(i)}$ = $i$-th example in the training set; $z^{[l]}$ = layer $l$ of the network; $X^{\{t\}}, Y^{\{t\}}$ = $t$-th mini-batch.
- Mini-batch gradient descent (see the sketch after this list)
    - for $t = 1, \dots,$ (# of mini-batches):
        - Forward prop on $X^{\{t\}}$ (vectorized over the 1,000 examples in the batch), compute the cost $J^{\{t\}}$, backprop using $(X^{\{t\}}, Y^{\{t\}})$, then update $W^{[l]}$, $b^{[l]}$.
        - This is 1 step of gradient descent using 1,000 examples.
    - 1 epoch: a single pass through the training set (with mini-batches, one epoch takes many gradient steps instead of just one).
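A NumPy sketch of splitting the training set into mini-batches and running one epoch of mini-batch gradient descent (the helper `random_mini_batches` and the `update` callback are my own; `update` stands in for one forward/backward/update step on a batch):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=1000, seed=0):
    """Shuffle the m training columns and split X (n_x, m), Y (1, m) into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    permutation = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    batches = []
    for start in range(0, m, batch_size):
        end = min(start + batch_size, m)
        batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return batches

def run_one_epoch(X, Y, update, batch_size=1000):
    """One epoch = one pass over all mini-batches; 'update' does one gradient step."""
    for X_t, Y_t in random_mini_batches(X, Y, batch_size):
        update(X_t, Y_t)     # forward prop on X^{t}, compute J^{t}, backprop, update W, b

rng = np.random.default_rng(1)
X, Y = rng.standard_normal((5, 5000)), (rng.random((1, 5000)) > 0.5).astype(float)
run_one_epoch(X, Y, update=lambda X_t, Y_t: None, batch_size=1000)  # 5 gradient steps
```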