Batch Normalization (BN)
When I first tried to understand Batch Normalization (BN), I went through a lot of resources, but I still found the specific implementation and purpose somewhat unclear. A few days ago, I asked my teacher for an explanation, and he gave me an example that made it much easier to grasp. I found it very helpful, and after going back to review other materials, everything started to make sense.
Principle
The core idea of Batch Normalization is to compute, for each layer’s inputs, the mean and standard deviation over the current mini-batch, and then normalize the data so that its distribution is closer to a standard normal distribution (mean 0, variance 1). To make sure this normalization doesn’t hinder the model’s ability to learn, BN introduces two trainable parameters: a scaling parameter (\( \gamma \)) and a shift parameter (\( \beta \)). These let the network rescale and shift the normalized data, so each layer can still learn whatever feature distribution it needs.
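In code, the whole transform fits in a few lines. Below is a minimal NumPy sketch of the training-time computation, not any particular framework’s API; the function name `batch_norm`, the small constant `eps` (added so the division never hits a zero variance), and the array shapes are assumptions made just for this illustration.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch of activations, then scale and shift.

    x: shape (batch_size, num_features); gamma, beta: shape (num_features,)
    """
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # roughly mean 0, variance 1
    return gamma * x_hat + beta              # trainable rescale and shift

# Quick check: a batch of 4 samples with 3 features, far from mean 0 / variance 1
x = np.random.randn(4, 3) * 5.0 + 10.0
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0), y.var(axis=0))         # close to 0 and 1 per feature
```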
How to Understand It?
Batch Normalization (BN) can be understood through a simple analogy:
Imagine you’re in a class where everyone has to take a test, but before the test, everyone’s study conditions are different: some are in a good state, and others are not. If everyone takes the test directly, the results might vary greatly, affecting the overall performance.
To make the test fairer, the teacher decides to have everyone do some preparation exercises beforehand, such as deep breathing, relaxing, and getting everyone in a similar mental state. This way, each student’s initial state is closer to the others, and the test results are more stable.
In deep learning, Batch Normalization is similar to this preparation. It organizes the data before each layer of the network, making the data more consistent. Specifically, it normalizes the batch of data input to each layer, adjusting its distribution so that each layer of the network can process the data more stably and efficiently.
Let’s Look at an Example:
Suppose you have five students in your class, and their study conditions before the test (think of these five numbers as one batch of input values) are as follows:
- Xiaoming: Condition 90
- Xiaohong: Condition 70
- Xiaogang: Condition 50
- Xiaoli: Condition 60
- Xiaoqiang: Condition 80
If they take the test directly, the final scores might vary greatly due to the differences in their initial conditions. To make the test fairer, the teacher decides to do some “Batch Normalization” preparation activities before the test.
The Process of Batch Normalization
1. Calculate the Mean:
First, the teacher calculates the average condition:
\[ \text{Average Condition} = \frac{90 + 70 + 50 + 60 + 80}{5} = 70 \]
2. Calculate the Standard Deviation:
Next, the teacher calculates the standard deviation of these conditions (which measures how spread out the data is):
\[ \text{Standard Deviation} = \sqrt{\frac{(90-70)^2 + (70-70)^2 + (50-70)^2 + (60-70)^2 + (80-70)^2}{5}} \approx 14.14 \]
3. Normalize the Conditions:
Then, the teacher adjusts each student’s condition to make their distribution more uniform. This is done by subtracting the mean from each condition and then dividing by the standard deviation, making the data’s mean 0 and standard deviation 1.
- Xiaoming’s new condition: \(\frac{90 - 70}{14.14} \approx 1.41\)
- Xiaohong’s new condition: \(\frac{70 - 70}{14.14} = 0\)
- Xiaogang’s new condition: \(\frac{50 - 70}{14.14} \approx -1.41\)
- Xiaoli’s new condition: \(\frac{60 - 70}{14.14} \approx -0.71\)
- Xiaoqiang’s new condition: \(\frac{80 - 70}{14.14} \approx 0.71\)
Now, everyone’s conditions are concentrated in a narrower range (with the normalized distribution having a mean of 0 and a standard deviation of 1).
4. Scale and Shift with Trainable Parameters:
To ensure that this uniform adjustment doesn’t affect their actual performance, the teacher introduces two “fine-tuning” parameters to adjust each student’s new condition back to an appropriate range. This is similar to the scaling parameter (\( \gamma \)) and shift parameter (\( \beta \)) in deep learning.
For example, if the teacher sets \( \gamma = 10 \) and \( \beta = 50 \), each student’s final condition becomes:
- Xiaoming: \(1.41 \times 10 + 50 = 64.1\)
- Xiaohong: \(0 \times 10 + 50 = 50\)
- Xiaogang: \(-1.41 \times 10 + 50 = 35.9\)
- Xiaoli: \(-0.71 \times 10 + 50 = 42.9\)
- Xiaoqiang: \(0.71 \times 10 + 50 = 57.1\)
This way, the students’ conditions have been normalized, but their original distribution characteristics are preserved, with just a bit of scaling and adjustment to ensure fairness and room for individual performance.
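To double-check the numbers above, here is a short NumPy reproduction of the teacher’s four steps. The variable names are just for this walkthrough, and the standard deviation is computed over the whole group (the population form), exactly as in the formula above.

```python
import numpy as np

# The five students' conditions, treated as one small batch of inputs.
conditions = np.array([90.0, 70.0, 50.0, 60.0, 80.0])

mean = conditions.mean()           # step 1: 70.0
std = conditions.std()             # step 2: sqrt(200) ≈ 14.14
x_hat = (conditions - mean) / std  # step 3: ≈ [1.41, 0.00, -1.41, -0.71, 0.71]

gamma, beta = 10.0, 50.0           # step 4: the teacher's "fine-tuning" parameters
final = gamma * x_hat + beta       # ≈ [64.1, 50.0, 35.9, 42.9, 57.1]

print(round(mean, 2), round(std, 2))
print(np.round(x_hat, 2))
print(np.round(final, 1))
```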
Purpose
After this series of adjustments, the students’ test conditions become more consistent, while still retaining their original differences. This process makes it less likely that a student’s score will fluctuate greatly due to varying conditions.
In deep learning, Batch Normalization serves a similar purpose. It stabilizes the distribution of input data for each layer, improving training efficiency and stability, while still preserving the original data characteristics.
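As a rough illustration of where this sits inside a model, here is a small PyTorch sketch that places a BatchNorm1d layer between a linear layer and its activation. The layer sizes and batch size are arbitrary choices for the example, not anything prescribed by the discussion above.

```python
import torch
import torch.nn as nn

# A small fully connected block with Batch Normalization inserted between
# the linear layer and its activation. All sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),  # normalize over the batch, then apply learned gamma and beta
    nn.ReLU(),
    nn.Linear(64, 10),
)

x = torch.randn(32, 20)  # a batch of 32 samples with 20 input features
out = model(x)           # in training mode, BN uses this batch's own statistics
print(out.shape)         # torch.Size([32, 10])
```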
Summary
The benefits of Batch Normalization include:
- Accelerated Training: Because the data distribution is more consistent, the network converges faster, speeding up the training process.
- Stability: It helps mitigate issues like exploding or vanishing gradients, making the training process more stable.
- Reduced Sensitivity to Initial Parameters: Even if the network’s initial parameters aren’t ideal, BN helps the network train more effectively.
- Larger Learning Rates: With a better-normalized data distribution, larger learning rates can be used, further accelerating training.
Batch Normalization is like helping each layer of the network “breathe better,” putting them in a better state to process the input data, thus making the entire learning process more efficient and stable.