Recently, I have been trying to learn more about image recognition. One of the topics I am interested in is LeNet, a classic convolutional neural network. In this article, I will re-implement LeNet using PyTorch and then try to understand, at a high level, some of the parts that I do not yet fully understand. The contents of this article may be inaccurate, and I will keep updating the details. You are welcome to discuss this article with me.
LeNet5 is one of the earliest Convolutional Neural Networks (CNNs). It was proposed by Yann LeCun and others in 1998.
    import torch
    import torch.nn as nn
    import torchvision
    import torchvision.transforms as transforms

    # Batch size: the number of samples passed to the network in each training step.
    batch_size = 64

    # The number of classes is 10 (the digits 0 to 9).
    num_classes = 10

    # Learning rate: the size of the step taken in each iteration during training.
    learning_rate = 0.001

    # Number of epochs: how many times the entire dataset is passed through the network.
    num_epochs = 10

    # Device: the hardware used for training.
    # On a CPU-only machine, use "cpu" instead of "mps".
    # On a machine with a CUDA GPU, use "cuda" instead of "mps".
    device = torch.device('mps')
    # Load the MNIST dataset.
    train_dataset = torchvision.datasets.MNIST(
        root='./data',
        train=True,  # Load the training split.
        transform=transforms.Compose([  # How each image is preprocessed.
            # Resize the image to 32x32 pixels. The original MNIST images are 28x28;
            # we enlarge them to match the size of CIFAR-10 images for easier processing later on.
            # 32x32 is also the input size LeNet-5 was designed for.
            transforms.Resize((32, 32)),
            # Convert the image to a tensor and scale its pixel values to [0, 1].
            transforms.ToTensor(),
            # Normalize the pixel values so that they have a mean of roughly zero
            # and a standard deviation of roughly one.
            transforms.Normalize(mean=(0.1307,), std=(0.3081,)),
        ]),
        download=True)

    test_dataset = torchvision.datasets.MNIST(
        root='./data',
        train=False,  # Load the test split.
        transform=transforms.Compose([
            transforms.Resize((32, 32)),
            transforms.ToTensor(),
            # Note: the test split is normalized with slightly different statistics here.
            transforms.Normalize(mean=(0.1325,), std=(0.3105,)),
        ]),
        download=True)

    # Build PyTorch DataLoader objects that load and iterate over the data in batches.
    # 'batch_size' sets the number of samples in each batch, and 'shuffle=True' means
    # that the data is reshuffled at every epoch to increase randomness.
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=True)
MNIST is a dataset of grayscale images whose pixel values are in the range 0 to 255. Generally, we want to normalize the data so that the mean value of each channel is zero and its standard deviation is one. This tends to make training easier and can improve accuracy.
transforms.Normalize(mean=(...), std=(...)) is a transform provided by torchvision that standardizes the data. It accepts two values, the mean and the standard deviation of each channel. Here we pass the mean and standard deviation of MNIST.
Generally, the mean and standard deviation of each channel are computed over all the pixels of all the images in the training set.
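For reference, the usual definitions can be written as below. The notation is my own assumption rather than something taken from the course or the PyTorch documentation: $x_i$ is one pixel value of the channel (after scaling to [0, 1]) and $N$ is the total number of such pixels across the training set.

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i,
\qquad
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \mu \right)^2}
```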
Normalize applies the following transformation to every pixel:
normalized_image = (image - mean) / std
Here, image is the pixel value of each image, and mean and std are the mean and standard deviation we provide. This transformation rescales all the features (in this case, the pixels) so that they have a mean of 0 and a standard deviation of 1.
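To convince myself that the formula really produces a mean of 0 and a standard deviation of 1, here is a minimal sketch in plain Python. The pixel values and the helper names mean_and_std and normalize are made up for illustration; this is not how torchvision implements Normalize internally.

```python
import math


def mean_and_std(values):
    """Return the population mean and standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def normalize(values, mean, std):
    """Apply (value - mean) / std to every value."""
    return [(v - mean) / std for v in values]


# Hypothetical pixel values, already scaled to [0, 1] as ToTensor would do.
pixels = [0.0, 0.0, 0.2, 0.4, 0.9, 1.0]

mean, std = mean_and_std(pixels)
normalized = normalize(pixels, mean, std)
new_mean, new_std = mean_and_std(normalized)

print(f"before: mean={mean:.4f}, std={std:.4f}")
print(f"after:  mean={new_mean:.4f}, std={new_std:.4f}")
```

Note that the normalized values are no longer confined to [0, 1]; only their mean and standard deviation are fixed.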
This is very important for training neural networks, because the scale of the features affects the optimization algorithm. For example, if one feature ranges from 0 to 100 while another ranges from 0 to 1, the feature with the larger range will dominate the training process: an optimization algorithm such as gradient descent uses the same learning rate for all features, so the larger feature produces much larger gradients.
With standardization, we can ensure that different batches of MNIST have a similar distribution. It is a common technique when training neural networks, and it helps the model converge faster and perform better.
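The effect of feature scale on the gradients can be sketched with a tiny linear model. Everything here (the two-feature model, the sample values, the gradients helper) is a hypothetical example of my own, not part of the LeNet code.

```python
def gradients(w1, w2, x1, x2, target):
    """Gradient of the squared error (w1*x1 + w2*x2 - target)**2 w.r.t. w1 and w2."""
    error = w1 * x1 + w2 * x2 - target
    return 2 * error * x1, 2 * error * x2


# Hypothetical sample: x1 lives on a 0-100 scale, x2 on a 0-1 scale.
g1, g2 = gradients(w1=0.0, w2=0.0, x1=80.0, x2=0.8, target=1.0)
print("unscaled gradients:", g1, g2)

# The same sample after x1 has been rescaled to the 0-1 range.
s1, s2 = gradients(w1=0.0, w2=0.0, x1=0.8, x2=0.8, target=1.0)
print("scaled gradients:  ", s1, s2)
```

With the unscaled input, the gradient for w1 is about a hundred times larger than the one for w2, so a single learning rate cannot suit both weights. After rescaling, the two gradients are the same size.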
    class LeNet5(nn.Module):
        def __init__(self, num_classes):
            super(LeNet5, self).__init__()
            # self.layer1: one Conv2d, one BatchNorm2d, one ReLU and one MaxPool2d.
            # BatchNorm2d normalizes the 6 output channels, ReLU is the activation function,
            # and MaxPool2d does max pooling with a 2x2 window and a stride of 2.
            # There are some common ways to do pooling, like average pooling, max pooling and K-max pooling.
            # Max pooling: take the maximum value in each sub-block of the input feature map.
            # Average pooling: take the average value in each sub-block of the input feature map.
            # K-max pooling: keep the k largest values rather than only the single largest.
            self.layer1 = nn.Sequential(
                nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),
                nn.BatchNorm2d(6),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2))
            # self.layer2: the same structure as layer1. BatchNorm2d normalizes the 16 output channels,
            # followed by ReLU and max pooling (2x2 window, stride 2).
            self.layer2 = nn.Sequential(
                nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
                nn.BatchNorm2d(16),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2))
            # Three fully connected layers (self.fc, self.fc1, self.fc2) and two ReLU activation functions.
            # A fully connected layer maps the output of the previous layer into a space of a given
            # dimension for classification or regression tasks.
            # The first one converts the 400 features (16 channels x 5 x 5) obtained from the
            # convolution and pooling operations to 120 features.
            # The second one converts the output of the first into an 84-dimensional space.
            # The last one produces one score per class.
            self.fc = nn.Linear(400, 120)
            self.relu = nn.ReLU()
            self.fc1 = nn.Linear(120, 84)
            self.relu1 = nn.ReLU()
            self.fc2 = nn.Linear(84, num_classes)

        def forward(self, x):
            # Define how the input data flows through the network to compute the predictions.
            # The input passes through the first and second layers, and is then flattened
            # into a one-dimensional vector per sample (the reshape operation).
            out = self.layer1(x)
            out = self.layer2(out)
            out = out.reshape(out.size(0), -1)
            # Then the data passes through the fully connected layers with ReLU activations,
            # and the final prediction layer computes the output.
            out = self.fc(out)
            out = self.relu(out)
            out = self.fc1(out)
            out = self.relu1(out)
            out = self.fc2(out)
            return out
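One thing that puzzled me at first is where the 400 in nn.Linear(400, 120) comes from. A small sketch of the size arithmetic, using the standard output-size formula for convolution and pooling on square inputs, makes it clearer. The helper name conv_output_size is my own and is not a PyTorch function.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (square input and kernel)."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 32                                                # input image: 1 x 32 x 32
size = conv_output_size(size, kernel_size=5)             # layer1 Conv2d:    6 x 28 x 28
size = conv_output_size(size, kernel_size=2, stride=2)   # layer1 MaxPool2d: 6 x 14 x 14
size = conv_output_size(size, kernel_size=5)             # layer2 Conv2d:    16 x 10 x 10
size = conv_output_size(size, kernel_size=2, stride=2)   # layer2 MaxPool2d: 16 x 5 x 5

channels = 16
flattened = channels * size * size
print(flattened)  # 400
```

So the reshape in forward turns each 16 x 5 x 5 feature map into a vector of 400 values, which is exactly what the first fully connected layer expects.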
    # Create an instance of the LeNet5 model and move it to the device.
    model = LeNet5(num_classes).to(device)

    # Define the cross-entropy loss function.
    # The cross-entropy loss measures the distance between two probability distributions p and q.
    # According to the formula, the loss is smaller when the two distributions are closer,
    # that is, when the observed distribution p is close to the target distribution q.
    cost = nn.CrossEntropyLoss()

    # Define the Adam optimizer to update the model weights. model.parameters() returns
    # an iterator over all learnable parameters of the model, which we pass to the optimizer.
    # 'lr=learning_rate' sets the learning rate (lr), which is the size of the step taken
    # by the optimizer when it updates the weights during training.
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    total_step = len(train_loader)
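To get a feeling for what the cross-entropy loss computes, here is a minimal sketch for a single sample in plain Python. As far as I understand, nn.CrossEntropyLoss takes raw scores, applies a softmax internally and averages over the batch; the code below only mimics the per-sample part, and the scores and function names are made up.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability that the model assigns to the correct class."""
    return -math.log(softmax(logits)[target_index])


# Hypothetical scores for a 3-class problem in which class 0 is the correct answer.
confident_right = cross_entropy([5.0, 0.0, 0.0], target_index=0)
unsure = cross_entropy([0.0, 0.0, 0.0], target_index=0)
confident_wrong = cross_entropy([0.0, 5.0, 0.0], target_index=0)

print(confident_right, unsure, confident_wrong)
```

The loss is smallest when the predicted distribution puts most of its probability on the correct class, and it grows as the prediction moves away from the target.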
Adam is a common optimizer. It combines the advantages of momentum and RMSProp.
That is what I found after searching. Does it sound like a mystery? Ha ha, it does to me too!
And someone told me that:
Adam uses momentum and an adaptive learning rate to accelerate convergence. SGD-M adds a first-order moment to SGD, while AdaGrad and AdaDelta add a second-order moment (second-moment estimation) to SGD. Using both of them together gives Adam: Adaptive + Momentum.
Here I read chapter 3.3.3 of Professor Li’s deep learning course. The first two subsections are about adaptive learning rates, where I read about the AdaGrad and RMSProp algorithms. The third subsection is about the Adam optimizer.
momentum method
We should first understand what momentum is. It was not very clear to me before.
The course explains momentum from a physical point of view, which is beautiful and makes the concept easy to understand. The sum of the past gradients acts like a force that pushes the whole model forward, and momentum behaves like a kind of acceleration. So, with momentum, the updates become more stable and training can converge faster.
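The idea can be sketched on a toy problem in plain Python. This is only my own illustration under simple assumptions (the loss f(w) = w**2, a small learning rate of 0.01, a momentum factor of 0.9); it is not the code from the course.

```python
def gradient(w):
    """Gradient of the toy loss f(w) = w**2, whose minimum is at w = 0."""
    return 2 * w


def plain_sgd(w, lr=0.01, steps=50):
    """Gradient descent that only looks at the current gradient."""
    for _ in range(steps):
        w = w - lr * gradient(w)
    return w


def sgd_with_momentum(w, lr=0.01, beta=0.9, steps=50):
    """Gradient descent that follows an accumulated sum of past gradients."""
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + gradient(w)  # past gradients keep pushing
        w = w - lr * velocity                     # move along the accumulated direction
    return w


w_plain = plain_sgd(1.0)
w_momentum = sgd_with_momentum(1.0)
print("plain SGD:        ", w_plain)
print("SGD with momentum:", w_momentum)
```

In this particular setting the momentum version ends up closer to the minimum after the same number of steps. With a larger learning rate the accumulated velocity can overshoot and oscillate, so the result depends on the hyperparameters.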
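Here we can see that, unlike plain SGD, Adam is more like a combination of SGD with momentum and an adaptive learning rate. It uses momentum to accelerate convergence, and it also scales down the step for parameters whose gradients are large in order to avoid oscillation.
AdaGrad made me realize that it is a method for adjusting the learning rate of each parameter based on its past gradients. But because it accumulates the squared gradients over the whole history, its learning rate can only shrink, so it cannot react when the size of the gradient changes later in training. So, there is another way to solve this problem: RMSProp, which uses a moving average of the recent squared gradients instead.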
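To compare the two ideas, the sketch below tracks only the factor that multiplies the learning rate in AdaGrad and in RMSProp. The gradient sequence, the decay of 0.9 and the function names are assumptions of mine for illustration.

```python
import math


def adagrad_scales(grads, eps=1e-8):
    """Learning-rate factor 1 / sqrt(sum of all squared gradients so far)."""
    total = 0.0
    scales = []
    for g in grads:
        total += g * g
        scales.append(1.0 / (math.sqrt(total) + eps))
    return scales


def rmsprop_scales(grads, decay=0.9, eps=1e-8):
    """Learning-rate factor 1 / sqrt(moving average of recent squared gradients)."""
    average = 0.0
    scales = []
    for g in grads:
        average = decay * average + (1 - decay) * g * g
        scales.append(1.0 / (math.sqrt(average) + eps))
    return scales


# Hypothetical gradients: large at first, then ten times smaller.
grads = [1.0] * 50 + [0.1] * 50

ada = adagrad_scales(grads)
rms = rmsprop_scales(grads)
print("AdaGrad factor at the end:", ada[-1])
print("RMSProp factor at the end:", rms[-1])
```

The AdaGrad factor never increases, because the sum only grows. The RMSProp factor rises again once the gradients become small, since old gradients are gradually forgotten.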
Adaptive learning rates: Adam computes individual learning rates for each parameter, speeding up convergence and improving the quality of the final solution.
Suitable for noisy gradients: Adam performs well in cases with noisy gradients, such as training deep learning models with mini-batches.
Low memory requirements: Adam requires only two additional variables for each parameter, making it memory-efficient.
Robust to the choice of hyperparameters: Adam is relatively insensitive to the choice of hyperparameters, making it easy to use in practice.
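Putting momentum and the adaptive learning rate together, the Adam update for a single parameter can be sketched as follows. This follows the update rule as I understand it from the Adam paper, written in plain Python. The function name adam_minimize and the toy loss are made up, and torch.optim.Adam has more options than shown here.

```python
import math


def adam_minimize(gradient, w, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a one-dimensional function with the Adam update rule."""
    m = 0.0  # first moment: moving average of the gradients (the momentum part)
    v = 0.0  # second moment: moving average of the squared gradients (the adaptive part)
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction, because m and v start at zero
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy loss f(w) = (w - 3)**2 with gradient 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimize(lambda w: 2 * (w - 3), w=0.0, lr=0.1, steps=500)
print(w_final)
```

The two variables m and v are the "two additional variables for each parameter" mentioned above.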
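    # Evaluate the model on the test set. Gradients are not needed for evaluation,
    # so we disable them with torch.no_grad().
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in test_loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            # The predicted class is the index of the largest output score.
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))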
Output:
Accuracy of the network on the 10000 test images: 98.96 %
LeNet is a classic convolutional neural network, first published in 1989, and it is a good starting point for deep learning. After researching the details of the model, I learned a lot about it. But I only re-implemented LeNet-5, and my version does not match the original paper in every detail.
Thanks for reading this article. If you have any questions or suggestions, please email me.